Sample records for text mining approaches

  1. Adaptive semantic tag mining from heterogeneous clinical research texts.

    PubMed

    Hao, T; Weng, C

    2015-01-01

    To develop an adaptive approach to mine frequent semantic tags (FSTs) from heterogeneous clinical research texts. We develop a "plug-n-play" framework that integrates replaceable unsupervised kernel algorithms with formatting, functional, and utility wrappers for FST mining. Temporal information identification and semantic equivalence detection were two example functional wrappers. We first compared this approach's recall and efficiency for mining FSTs from ClinicalTrials.gov to that of a recently published tag-mining algorithm. Then we assessed this approach's adaptability to two other types of clinical research texts: clinical data requests and clinical trial protocols, by comparing the prevalence trends of FSTs across three texts. Our approach increased the average recall and speed by 12.8% and 47.02% respectively upon the baseline when mining FSTs from ClinicalTrials.gov, and maintained an overlap in relevant FSTs with the base- line ranging between 76.9% and 100% for varying FST frequency thresholds. The FSTs saturated when the data size reached 200 documents. Consistent trends in the prevalence of FST were observed across the three texts as the data size or frequency threshold changed. This paper contributes an adaptive tag-mining framework that is scalable and adaptable without sacrificing its recall. This component-based architectural design can be potentially generalizable to improve the adaptability of other clinical text mining methods.

  2. Compatibility between Text Mining and Qualitative Research in the Perspectives of Grounded Theory, Content Analysis, and Reliability

    ERIC Educational Resources Information Center

    Yu, Chong Ho; Jannasch-Pennell, Angel; DiGangi, Samuel

    2011-01-01

    The objective of this article is to illustrate that text mining and qualitative research are epistemologically compatible. First, like many qualitative research approaches, such as grounded theory, text mining encourages open-mindedness and discourages preconceptions. Contrary to the popular belief that text mining is a linear and fully automated…

  3. Gene prioritization and clustering by multi-view text mining

    PubMed Central

    2010-01-01

    Background Text mining has become a useful tool for biologists trying to understand the genetics of diseases. In particular, it can help identify the most interesting candidate genes for a disease for further experimental analysis. Many text mining approaches have been introduced, but the effect of disease-gene identification varies in different text mining models. Thus, the idea of incorporating more text mining models may be beneficial to obtain more refined and accurate knowledge. However, how to effectively combine these models still remains a challenging question in machine learning. In particular, it is a non-trivial issue to guarantee that the integrated model performs better than the best individual model. Results We present a multi-view approach to retrieve biomedical knowledge using different controlled vocabularies. These controlled vocabularies are selected on the basis of nine well-known bio-ontologies and are applied to index the vast amounts of gene-based free-text information available in the MEDLINE repository. The text mining result specified by a vocabulary is considered as a view and the obtained multiple views are integrated by multi-source learning algorithms. We investigate the effect of integration in two fundamental computational disease gene identification tasks: gene prioritization and gene clustering. The performance of the proposed approach is systematically evaluated and compared on real benchmark data sets. In both tasks, the multi-view approach demonstrates significantly better performance than other comparing methods. Conclusions In practical research, the relevance of specific vocabulary pertaining to the task is usually unknown. In such case, multi-view text mining is a superior and promising strategy for text-based disease gene identification. PMID:20074336

  4. Introducing Text Analytics as a Graduate Business School Course

    ERIC Educational Resources Information Center

    Edgington, Theresa M.

    2011-01-01

    Text analytics refers to the process of analyzing unstructured data from documented sources, including open-ended surveys, blogs, and other types of web dialog. Text analytics has enveloped the concept of text mining, an analysis approach influenced heavily from data mining. While text mining has been covered extensively in various computer…

  5. Working with Data: Discovering Knowledge through Mining and Analysis; Systematic Knowledge Management and Knowledge Discovery; Text Mining; Methodological Approach in Discovering User Search Patterns through Web Log Analysis; Knowledge Discovery in Databases Using Formal Concept Analysis; Knowledge Discovery with a Little Perspective.

    ERIC Educational Resources Information Center

    Qin, Jian; Jurisica, Igor; Liddy, Elizabeth D.; Jansen, Bernard J; Spink, Amanda; Priss, Uta; Norton, Melanie J.

    2000-01-01

    These six articles discuss knowledge discovery in databases (KDD). Topics include data mining; knowledge management systems; applications of knowledge discovery; text and Web mining; text mining and information retrieval; user search patterns through Web log analysis; concept analysis; data collection; and data structure inconsistency. (LRW)

  6. Mining Adverse Drug Reactions in Social Media with Named Entity Recognition and Semantic Methods.

    PubMed

    Chen, Xiaoyi; Deldossi, Myrtille; Aboukhamis, Rim; Faviez, Carole; Dahamna, Badisse; Karapetiantz, Pierre; Guenegou-Arnoux, Armelle; Girardeau, Yannick; Guillemin-Lanne, Sylvie; Lillo-Le-Louët, Agnès; Texier, Nathalie; Burgun, Anita; Katsahian, Sandrine

    2017-01-01

    Suspected adverse drug reactions (ADR) reported by patients through social media can be a complementary source to current pharmacovigilance systems. However, the performance of text mining tools applied to social media text data to discover ADRs needs to be evaluated. In this paper, we introduce the approach developed to mine ADR from French social media. A protocol of evaluation is highlighted, which includes a detailed sample size determination and evaluation corpus constitution. Our text mining approach provided very encouraging preliminary results with F-measures of 0.94 and 0.81 for recognition of drugs and symptoms respectively, and with F-measure of 0.70 for ADR detection. Therefore, this approach is promising for downstream pharmacovigilance analysis.

  7. Citation Mining: Integrating Text Mining and Bibliometrics for Research User Profiling.

    ERIC Educational Resources Information Center

    Kostoff, Ronald N.; del Rio, J. Antonio; Humenik, James A.; Garcia, Esther Ofilia; Ramirez, Ana Maria

    2001-01-01

    Discusses the importance of identifying the users and impact of research, and describes an approach for identifying the pathways through which research can impact other research, technology development, and applications. Describes a study that used citation mining, an integration of citation bibliometrics and text mining, on articles from the…

  8. Chapter 16: text mining for translational bioinformatics.

    PubMed

    Cohen, K Bretonnel; Hunter, Lawrence E

    2013-04-01

    Text mining for translational bioinformatics is a new field with tremendous research potential. It is a subfield of biomedical natural language processing that concerns itself directly with the problem of relating basic biomedical research to clinical practice, and vice versa. Applications of text mining fall both into the category of T1 translational research-translating basic science results into new interventions-and T2 translational research, or translational research for public health. Potential use cases include better phenotyping of research subjects, and pharmacogenomic research. A variety of methods for evaluating text mining applications exist, including corpora, structured test suites, and post hoc judging. Two basic principles of linguistic structure are relevant for building text mining applications. One is that linguistic structure consists of multiple levels. The other is that every level of linguistic structure is characterized by ambiguity. There are two basic approaches to text mining: rule-based, also known as knowledge-based; and machine-learning-based, also known as statistical. Many systems are hybrids of the two approaches. Shared tasks have had a strong effect on the direction of the field. Like all translational bioinformatics software, text mining software for translational bioinformatics can be considered health-critical and should be subject to the strictest standards of quality assurance and software testing.

  9. A Feature Mining Based Approach for the Classification of Text Documents into Disjoint Classes.

    ERIC Educational Resources Information Center

    Nieto Sanchez, Salvador; Triantaphyllou, Evangelos; Kraft, Donald

    2002-01-01

    Proposes a new approach for classifying text documents into two disjoint classes. Highlights include a brief overview of document clustering; a data mining approach called the One Clause at a Time (OCAT) algorithm which is based on mathematical logic; vector space model (VSM); and comparing the OCAT to the VSM. (Author/LRW)

  10. An Evaluation of Text Mining Tools as Applied to Selected Scientific and Engineering Literature.

    ERIC Educational Resources Information Center

    Trybula, Walter J.; Wyllys, Ronald E.

    2000-01-01

    Addresses an approach to the discovery of scientific knowledge through an examination of data mining and text mining techniques. Presents the results of experiments that investigated knowledge acquisition from a selected set of technical documents by domain experts. (Contains 15 references.) (Author/LRW)

  11. Text Mining in Organizational Research

    PubMed Central

    Kobayashi, Vladimer B.; Berkers, Hannah A.; Kismihók, Gábor; Den Hartog, Deanne N.

    2017-01-01

    Despite the ubiquity of textual data, so far few researchers have applied text mining to answer organizational research questions. Text mining, which essentially entails a quantitative approach to the analysis of (usually) voluminous textual data, helps accelerate knowledge discovery by radically increasing the amount data that can be analyzed. This article aims to acquaint organizational researchers with the fundamental logic underpinning text mining, the analytical stages involved, and contemporary techniques that may be used to achieve different types of objectives. The specific analytical techniques reviewed are (a) dimensionality reduction, (b) distance and similarity computing, (c) clustering, (d) topic modeling, and (e) classification. We describe how text mining may extend contemporary organizational research by allowing the testing of existing or new research questions with data that are likely to be rich, contextualized, and ecologically valid. After an exploration of how evidence for the validity of text mining output may be generated, we conclude the article by illustrating the text mining process in a job analysis setting using a dataset composed of job vacancies. PMID:29881248

  12. Text Mining in Organizational Research.

    PubMed

    Kobayashi, Vladimer B; Mol, Stefan T; Berkers, Hannah A; Kismihók, Gábor; Den Hartog, Deanne N

    2018-07-01

    Despite the ubiquity of textual data, so far few researchers have applied text mining to answer organizational research questions. Text mining, which essentially entails a quantitative approach to the analysis of (usually) voluminous textual data, helps accelerate knowledge discovery by radically increasing the amount data that can be analyzed. This article aims to acquaint organizational researchers with the fundamental logic underpinning text mining, the analytical stages involved, and contemporary techniques that may be used to achieve different types of objectives. The specific analytical techniques reviewed are (a) dimensionality reduction, (b) distance and similarity computing, (c) clustering, (d) topic modeling, and (e) classification. We describe how text mining may extend contemporary organizational research by allowing the testing of existing or new research questions with data that are likely to be rich, contextualized, and ecologically valid. After an exploration of how evidence for the validity of text mining output may be generated, we conclude the article by illustrating the text mining process in a job analysis setting using a dataset composed of job vacancies.

  13. Mining protein function from text using term-based support vector machines

    PubMed Central

    Rice, Simon B; Nenadic, Goran; Stapley, Benjamin J

    2005-01-01

    Background Text mining has spurred huge interest in the domain of biology. The goal of the BioCreAtIvE exercise was to evaluate the performance of current text mining systems. We participated in Task 2, which addressed assigning Gene Ontology terms to human proteins and selecting relevant evidence from full-text documents. We approached it as a modified form of the document classification task. We used a supervised machine-learning approach (based on support vector machines) to assign protein function and select passages that support the assignments. As classification features, we used a protein's co-occurring terms that were automatically extracted from documents. Results The results evaluated by curators were modest, and quite variable for different problems: in many cases we have relatively good assignment of GO terms to proteins, but the selected supporting text was typically non-relevant (precision spanning from 3% to 50%). The method appears to work best when a substantial set of relevant documents is obtained, while it works poorly on single documents and/or short passages. The initial results suggest that our approach can also mine annotations from text even when an explicit statement relating a protein to a GO term is absent. Conclusion A machine learning approach to mining protein function predictions from text can yield good performance only if sufficient training data is available, and significant amount of supporting data is used for prediction. The most promising results are for combined document retrieval and GO term assignment, which calls for the integration of methods developed in BioCreAtIvE Task 1 and Task 2. PMID:15960835

  14. Text Mining for Neuroscience

    NASA Astrophysics Data System (ADS)

    Tirupattur, Naveen; Lapish, Christopher C.; Mukhopadhyay, Snehasis

    2011-06-01

    Text mining, sometimes alternately referred to as text analytics, refers to the process of extracting high-quality knowledge from the analysis of textual data. Text mining has wide variety of applications in areas such as biomedical science, news analysis, and homeland security. In this paper, we describe an approach and some relatively small-scale experiments which apply text mining to neuroscience research literature to find novel associations among a diverse set of entities. Neuroscience is a discipline which encompasses an exceptionally wide range of experimental approaches and rapidly growing interest. This combination results in an overwhelmingly large and often diffuse literature which makes a comprehensive synthesis difficult. Understanding the relations or associations among the entities appearing in the literature not only improves the researchers current understanding of recent advances in their field, but also provides an important computational tool to formulate novel hypotheses and thereby assist in scientific discoveries. We describe a methodology to automatically mine the literature and form novel associations through direct analysis of published texts. The method first retrieves a set of documents from databases such as PubMed using a set of relevant domain terms. In the current study these terms yielded a set of documents ranging from 160,909 to 367,214 documents. Each document is then represented in a numerical vector form from which an Association Graph is computed which represents relationships between all pairs of domain terms, based on co-occurrence. Association graphs can then be subjected to various graph theoretic algorithms such as transitive closure and cycle (circuit) detection to derive additional information, and can also be visually presented to a human researcher for understanding. In this paper, we present three relatively small-scale problem-specific case studies to demonstrate that such an approach is very successful in replicating a neuroscience expert's mental model of object-object associations entirely by means of text mining. These preliminary results provide the confidence that this type of text mining based research approach provides an extremely powerful tool to better understand the literature and drive novel discovery for the neuroscience community.

  15. Redundancy and Novelty Mining in the Business Blogosphere

    ERIC Educational Resources Information Center

    Tsai, Flora S.; Chan, Kap Luk

    2010-01-01

    Purpose: The paper aims to explore the performance of redundancy and novelty mining in the business blogosphere, which has not been studied before. Design/methodology/approach: Novelty mining techniques are implemented to single out novel information out of a massive set of text documents. This paper adopted the mixed metric approach which…

  16. A Review of Recent Advancement in Integrating Omics Data with Literature Mining towards Biomedical Discoveries

    PubMed Central

    Raja, Kalpana; Patrick, Matthew; Gao, Yilin; Madu, Desmond; Yang, Yuyang

    2017-01-01

    In the past decade, the volume of “omics” data generated by the different high-throughput technologies has expanded exponentially. The managing, storing, and analyzing of this big data have been a great challenge for the researchers, especially when moving towards the goal of generating testable data-driven hypotheses, which has been the promise of the high-throughput experimental techniques. Different bioinformatics approaches have been developed to streamline the downstream analyzes by providing independent information to interpret and provide biological inference. Text mining (also known as literature mining) is one of the commonly used approaches for automated generation of biological knowledge from the huge number of published articles. In this review paper, we discuss the recent advancement in approaches that integrate results from omics data and information generated from text mining approaches to uncover novel biomedical information. PMID:28331849

  17. The potential of text mining in data integration and network biology for plant research: a case study on Arabidopsis.

    PubMed

    Van Landeghem, Sofie; De Bodt, Stefanie; Drebert, Zuzanna J; Inzé, Dirk; Van de Peer, Yves

    2013-03-01

    Despite the availability of various data repositories for plant research, a wealth of information currently remains hidden within the biomolecular literature. Text mining provides the necessary means to retrieve these data through automated processing of texts. However, only recently has advanced text mining methodology been implemented with sufficient computational power to process texts at a large scale. In this study, we assess the potential of large-scale text mining for plant biology research in general and for network biology in particular using a state-of-the-art text mining system applied to all PubMed abstracts and PubMed Central full texts. We present extensive evaluation of the textual data for Arabidopsis thaliana, assessing the overall accuracy of this new resource for usage in plant network analyses. Furthermore, we combine text mining information with both protein-protein and regulatory interactions from experimental databases. Clusters of tightly connected genes are delineated from the resulting network, illustrating how such an integrative approach is essential to grasp the current knowledge available for Arabidopsis and to uncover gene information through guilt by association. All large-scale data sets, as well as the manually curated textual data, are made publicly available, hereby stimulating the application of text mining data in future plant biology studies.

  18. Text mining approach to predict hospital admissions using early medical records from the emergency department.

    PubMed

    Lucini, Filipe R; S Fogliatto, Flavio; C da Silveira, Giovani J; L Neyeloff, Jeruza; Anzanello, Michel J; de S Kuchenbecker, Ricardo; D Schaan, Beatriz

    2017-04-01

    Emergency department (ED) overcrowding is a serious issue for hospitals. Early information on short-term inward bed demand from patients receiving care at the ED may reduce the overcrowding problem, and optimize the use of hospital resources. In this study, we use text mining methods to process data from early ED patient records using the SOAP framework, and predict future hospitalizations and discharges. We try different approaches for pre-processing of text records and to predict hospitalization. Sets-of-words are obtained via binary representation, term frequency, and term frequency-inverse document frequency. Unigrams, bigrams and trigrams are tested for feature formation. Feature selection is based on χ 2 and F-score metrics. In the prediction module, eight text mining methods are tested: Decision Tree, Random Forest, Extremely Randomized Tree, AdaBoost, Logistic Regression, Multinomial Naïve Bayes, Support Vector Machine (Kernel linear) and Nu-Support Vector Machine (Kernel linear). Prediction performance is evaluated by F1-scores. Precision and Recall values are also informed for all text mining methods tested. Nu-Support Vector Machine was the text mining method with the best overall performance. Its average F1-score in predicting hospitalization was 77.70%, with a standard deviation (SD) of 0.66%. The method could be used to manage daily routines in EDs such as capacity planning and resource allocation. Text mining could provide valuable information and facilitate decision-making by inward bed management teams. Copyright © 2017 Elsevier Ireland Ltd. All rights reserved.

  19. A sentence sliding window approach to extract protein annotations from biomedical articles

    PubMed Central

    Krallinger, Martin; Padron, Maria; Valencia, Alfonso

    2005-01-01

    Background Within the emerging field of text mining and statistical natural language processing (NLP) applied to biomedical articles, a broad variety of techniques have been developed during the past years. Nevertheless, there is still a great ned of comparative assessment of the performance of the proposed methods and the development of common evaluation criteria. This issue was addressed by the Critical Assessment of Text Mining Methods in Molecular Biology (BioCreative) contest. The aim of this contest was to assess the performance of text mining systems applied to biomedical texts including tools which recognize named entities such as genes and proteins, and tools which automatically extract protein annotations. Results The "sentence sliding window" approach proposed here was found to efficiently extract text fragments from full text articles containing annotations on proteins, providing the highest number of correctly predicted annotations. Moreover, the number of correct extractions of individual entities (i.e. proteins and GO terms) involved in the relationships used for the annotations was significantly higher than the correct extractions of the complete annotations (protein-function relations). Conclusion We explored the use of averaging sentence sliding windows for information extraction, especially in a context where conventional training data is unavailable. The combination of our approach with more refined statistical estimators and machine learning techniques might be a way to improve annotation extraction for future biomedical text mining applications. PMID:15960831

  20. Vaccine adverse event text mining system for extracting features from vaccine safety reports.

    PubMed

    Botsis, Taxiarchis; Buttolph, Thomas; Nguyen, Michael D; Winiecki, Scott; Woo, Emily Jane; Ball, Robert

    2012-01-01

    To develop and evaluate a text mining system for extracting key clinical features from vaccine adverse event reporting system (VAERS) narratives to aid in the automated review of adverse event reports. Based upon clinical significance to VAERS reviewing physicians, we defined the primary (diagnosis and cause of death) and secondary features (eg, symptoms) for extraction. We built a novel vaccine adverse event text mining (VaeTM) system based on a semantic text mining strategy. The performance of VaeTM was evaluated using a total of 300 VAERS reports in three sequential evaluations of 100 reports each. Moreover, we evaluated the VaeTM contribution to case classification; an information retrieval-based approach was used for the identification of anaphylaxis cases in a set of reports and was compared with two other methods: a dedicated text classifier and an online tool. The performance metrics of VaeTM were text mining metrics: recall, precision and F-measure. We also conducted a qualitative difference analysis and calculated sensitivity and specificity for classification of anaphylaxis cases based on the above three approaches. VaeTM performed best in extracting diagnosis, second level diagnosis, drug, vaccine, and lot number features (lenient F-measure in the third evaluation: 0.897, 0.817, 0.858, 0.874, and 0.914, respectively). In terms of case classification, high sensitivity was achieved (83.1%); this was equal and better compared to the text classifier (83.1%) and the online tool (40.7%), respectively. Our VaeTM implementation of a semantic text mining strategy shows promise in providing accurate and efficient extraction of key features from VAERS narratives.

  1. The Potential of Text Mining in Data Integration and Network Biology for Plant Research: A Case Study on Arabidopsis[C][W

    PubMed Central

    Van Landeghem, Sofie; De Bodt, Stefanie; Drebert, Zuzanna J.; Inzé, Dirk; Van de Peer, Yves

    2013-01-01

    Despite the availability of various data repositories for plant research, a wealth of information currently remains hidden within the biomolecular literature. Text mining provides the necessary means to retrieve these data through automated processing of texts. However, only recently has advanced text mining methodology been implemented with sufficient computational power to process texts at a large scale. In this study, we assess the potential of large-scale text mining for plant biology research in general and for network biology in particular using a state-of-the-art text mining system applied to all PubMed abstracts and PubMed Central full texts. We present extensive evaluation of the textual data for Arabidopsis thaliana, assessing the overall accuracy of this new resource for usage in plant network analyses. Furthermore, we combine text mining information with both protein–protein and regulatory interactions from experimental databases. Clusters of tightly connected genes are delineated from the resulting network, illustrating how such an integrative approach is essential to grasp the current knowledge available for Arabidopsis and to uncover gene information through guilt by association. All large-scale data sets, as well as the manually curated textual data, are made publicly available, hereby stimulating the application of text mining data in future plant biology studies. PMID:23532071

  2. OntoGene web services for biomedical text mining.

    PubMed

    Rinaldi, Fabio; Clematide, Simon; Marques, Hernani; Ellendorff, Tilia; Romacker, Martin; Rodriguez-Esteban, Raul

    2014-01-01

    Text mining services are rapidly becoming a crucial component of various knowledge management pipelines, for example in the process of database curation, or for exploration and enrichment of biomedical data within the pharmaceutical industry. Traditional architectures, based on monolithic applications, do not offer sufficient flexibility for a wide range of use case scenarios, and therefore open architectures, as provided by web services, are attracting increased interest. We present an approach towards providing advanced text mining capabilities through web services, using a recently proposed standard for textual data interchange (BioC). The web services leverage a state-of-the-art platform for text mining (OntoGene) which has been tested in several community-organized evaluation challenges,with top ranked results in several of them.

  3. OntoGene web services for biomedical text mining

    PubMed Central

    2014-01-01

    Text mining services are rapidly becoming a crucial component of various knowledge management pipelines, for example in the process of database curation, or for exploration and enrichment of biomedical data within the pharmaceutical industry. Traditional architectures, based on monolithic applications, do not offer sufficient flexibility for a wide range of use case scenarios, and therefore open architectures, as provided by web services, are attracting increased interest. We present an approach towards providing advanced text mining capabilities through web services, using a recently proposed standard for textual data interchange (BioC). The web services leverage a state-of-the-art platform for text mining (OntoGene) which has been tested in several community-organized evaluation challenges, with top ranked results in several of them. PMID:25472638

  4. Text mining of rheumatoid arthritis and diabetes mellitus to understand the mechanisms of Chinese medicine in different diseases with same treatment.

    PubMed

    Zhao, Ning; Zheng, Guang; Li, Jian; Zhao, Hong-Yan; Lu, Cheng; Jiang, Miao; Zhang, Chi; Guo, Hong-Tao; Lu, Ai-Ping

    2018-01-09

    To identify the commonalities between rheumatoid arthritis (RA) and diabetes mellitus (DM) to understand the mechanisms of Chinese medicine (CM) in different diseases with the same treatment. A text mining approach was adopted to analyze the commonalities between RA and DM according to CM and biological elements. The major commonalities were subsequently verifified in RA and DM rat models, in which herbal formula for the treatment of both RA and DM identifified via text mining was used as the intervention. Similarities were identifified between RA and DM regarding the CM approach used for diagnosis and treatment, as well as the networks of biological activities affected by each disease, including the involvement of adhesion molecules, oxidative stress, cytokines, T-lymphocytes, apoptosis, and inflfl ammation. The Ramulus Cinnamomi-Radix Paeoniae Alba-Rhizoma Anemarrhenae is an herbal combination used to treat RA and DM. This formula demonstrated similar effects on oxidative stress and inflfl ammation in rats with collagen-induced arthritis, which supports the text mining results regarding the commonalities between RA and DM. Commonalities between the biological activities involved in RA and DM were identifified through text mining, and both RA and DM might be responsive to the same intervention at a specifific stage.

  5. Automated Assessment of Patients' Self-Narratives for Posttraumatic Stress Disorder Screening Using Natural Language Processing and Text Mining.

    PubMed

    He, Qiwei; Veldkamp, Bernard P; Glas, Cees A W; de Vries, Theo

    2017-03-01

    Patients' narratives about traumatic experiences and symptoms are useful in clinical screening and diagnostic procedures. In this study, we presented an automated assessment system to screen patients for posttraumatic stress disorder via a natural language processing and text-mining approach. Four machine-learning algorithms-including decision tree, naive Bayes, support vector machine, and an alternative classification approach called the product score model-were used in combination with n-gram representation models to identify patterns between verbal features in self-narratives and psychiatric diagnoses. With our sample, the product score model with unigrams attained the highest prediction accuracy when compared with practitioners' diagnoses. The addition of multigrams contributed most to balancing the metrics of sensitivity and specificity. This article also demonstrates that text mining is a promising approach for analyzing patients' self-expression behavior, thus helping clinicians identify potential patients from an early stage.

  6. Using ontology network structure in text mining.

    PubMed

    Berndt, Donald J; McCart, James A; Luther, Stephen L

    2010-11-13

    Statistical text mining treats documents as bags of words, with a focus on term frequencies within documents and across document collections. Unlike natural language processing (NLP) techniques that rely on an engineered vocabulary or a full-featured ontology, statistical approaches do not make use of domain-specific knowledge. The freedom from biases can be an advantage, but at the cost of ignoring potentially valuable knowledge. The approach proposed here investigates a hybrid strategy based on computing graph measures of term importance over an entire ontology and injecting the measures into the statistical text mining process. As a starting point, we adapt existing search engine algorithms such as PageRank and HITS to determine term importance within an ontology graph. The graph-theoretic approach is evaluated using a smoking data set from the i2b2 National Center for Biomedical Computing, cast as a simple binary classification task for categorizing smoking-related documents, demonstrating consistent improvements in accuracy.

  7. Mining Consumer Health Vocabulary from Community-Generated Text

    PubMed Central

    Vydiswaran, V.G. Vinod; Mei, Qiaozhu; Hanauer, David A.; Zheng, Kai

    2014-01-01

    Community-generated text corpora can be a valuable resource to extract consumer health vocabulary (CHV) and link them to professional terminologies and alternative variants. In this research, we propose a pattern-based text-mining approach to identify pairs of CHV and professional terms from Wikipedia, a large text corpus created and maintained by the community. A novel measure, leveraging the ratio of frequency of occurrence, was used to differentiate consumer terms from professional terms. We empirically evaluated the applicability of this approach using a large data sample consisting of MedLine abstracts and all posts from an online health forum, MedHelp. The results show that the proposed approach is able to identify synonymous pairs and label the terms as either consumer or professional term with high accuracy. We conclude that the proposed approach provides great potential to produce a high quality CHV to improve the performance of computational applications in processing consumer-generated health text. PMID:25954426

  8. An Enhanced Text-Mining Framework for Extracting Disaster Relevant Data through Social Media and Remote Sensing Data Fusion

    NASA Astrophysics Data System (ADS)

    Scheele, C. J.; Huang, Q.

    2016-12-01

    In the past decade, the rise in social media has led to the development of a vast number of social media services and applications. Disaster management represents one of such applications leveraging massive data generated for event detection, response, and recovery. In order to find disaster relevant social media data, current approaches utilize natural language processing (NLP) methods based on keywords, or machine learning algorithms relying on text only. However, these approaches cannot be perfectly accurate due to the variability and uncertainty in language used on social media. To improve current methods, the enhanced text-mining framework is proposed to incorporate location information from social media and authoritative remote sensing datasets for detecting disaster relevant social media posts, which are determined by assessing the textual content using common text mining methods and how the post relates spatiotemporally to the disaster event. To assess the framework, geo-tagged Tweets were collected for three different spatial and temporal disaster events: hurricane, flood, and tornado. Remote sensing data and products for each event were then collected using RealEarthTM. Both Naive Bayes and Logistic Regression classifiers were used to compare the accuracy within the enhanced text-mining framework. Finally, the accuracies from the enhanced text-mining framework were compared to the current text-only methods for each of the case study disaster events. The results from this study address the need for more authoritative data when using social media in disaster management applications.

  9. Agile Text Mining for the 2014 i2b2/UTHealth Cardiac Risk Factors Challenge

    PubMed Central

    Cormack, James; Nath, Chinmoy; Milward, David; Raja, Kalpana; Jonnalagadda, Siddhartha R

    2016-01-01

    This paper describes the use of an agile text mining platform (Linguamatics’ Interactive Information Extraction Platform, I2E) to extract document-level cardiac risk factors in patient records as defined in the i2b2/UTHealth 2014 Challenge. The approach uses a data-driven rule-based methodology with the addition of a simple supervised classifier. We demonstrate that agile text mining allows for rapid optimization of extraction strategies, while post-processing can leverage annotation guidelines, corpus statistics and logic inferred from the gold standard data. We also show how data imbalance in a training set affects performance. Evaluation of this approach on the test data gave an F-Score of 91.7%, one percent behind the top performing system. PMID:26209007

  10. The systematic assessment of traditional evidence from the premodern Chinese medical literature: a text-mining approach.

    PubMed

    May, Brian H; Zhang, Anthony; Lu, Yubo; Lu, Chuanjian; Xue, Charlie C L

    2014-12-01

    This project aimed to develop an approach to evaluating information contained in the premodern Traditional Chinese Medicine (TCM) literature that was (1) comprehensive, systematic, and replicable and (2) able to produce quantifiable output that could be used to answer specific research questions in order to identify natural products for clinical and experimental research. The project involved two stages. In stage 1, 14 TCM collections and compendia were evaluated for suitability as sources for searching; 8 of these were compared in detail. The results were published in the Journal of Alternative and Complementary Medicine. Stage 2 developed a text-mining approach for two of these sources. The text-mining approach was developed for Zhong Hua Yi Dian; Encyclopaedia of Traditional Chinese Medicine, 4th edition) and Zhong Yi Fang Ji Da Ci Dian; Great Compendium of Chinese Medical Formulae). This approach developed procedures for search term selection; methods for screening, classifying, and scoring data; procedures for systematic searching and data extraction; data checking procedures; and approaches for analyzing results. Examples are provided for studies of memory impairment and diabetic nephropathy, and issues relating to data interpretation are discussed. This approach to the analysis of large collections of the premodern TCM literature uses widely available sources and provides a text-mining approach that is systematic, replicable, and adaptable to the requirements of the particular project. Researchers can use these methods to explore changes in the names and conceptions of a disease over time, to identify which therapeutic methods have been more or less frequently used in different eras for particular disorders, and to assist in the selection of natural products for research efforts.

  11. Knowledge based word-concept model estimation and refinement for biomedical text mining.

    PubMed

    Jimeno Yepes, Antonio; Berlanga, Rafael

    2015-02-01

    Text mining of scientific literature has been essential for setting up large public biomedical databases, which are being widely used by the research community. In the biomedical domain, the existence of a large number of terminological resources and knowledge bases (KB) has enabled a myriad of machine learning methods for different text mining related tasks. Unfortunately, KBs have not been devised for text mining tasks but for human interpretation, thus performance of KB-based methods is usually lower when compared to supervised machine learning methods. The disadvantage of supervised methods though is they require labeled training data and therefore not useful for large scale biomedical text mining systems. KB-based methods do not have this limitation. In this paper, we describe a novel method to generate word-concept probabilities from a KB, which can serve as a basis for several text mining tasks. This method not only takes into account the underlying patterns within the descriptions contained in the KB but also those in texts available from large unlabeled corpora such as MEDLINE. The parameters of the model have been estimated without training data. Patterns from MEDLINE have been built using MetaMap for entity recognition and related using co-occurrences. The word-concept probabilities were evaluated on the task of word sense disambiguation (WSD). The results showed that our method obtained a higher degree of accuracy than other state-of-the-art approaches when evaluated on the MSH WSD data set. We also evaluated our method on the task of document ranking using MEDLINE citations. These results also showed an increase in performance over existing baseline retrieval approaches. Copyright © 2014 Elsevier Inc. All rights reserved.

  12. Automatic target validation based on neuroscientific literature mining for tractography

    PubMed Central

    Vasques, Xavier; Richardet, Renaud; Hill, Sean L.; Slater, David; Chappelier, Jean-Cedric; Pralong, Etienne; Bloch, Jocelyne; Draganski, Bogdan; Cif, Laura

    2015-01-01

    Target identification for tractography studies requires solid anatomical knowledge validated by an extensive literature review across species for each seed structure to be studied. Manual literature review to identify targets for a given seed region is tedious and potentially subjective. Therefore, complementary approaches would be useful. We propose to use text-mining models to automatically suggest potential targets from the neuroscientific literature, full-text articles and abstracts, so that they can be used for anatomical connection studies and more specifically for tractography. We applied text-mining models to three structures: two well-studied structures, since validated deep brain stimulation targets, the internal globus pallidus and the subthalamic nucleus and, the nucleus accumbens, an exploratory target for treating psychiatric disorders. We performed a systematic review of the literature to document the projections of the three selected structures and compared it with the targets proposed by text-mining models, both in rat and primate (including human). We ran probabilistic tractography on the nucleus accumbens and compared the output with the results of the text-mining models and literature review. Overall, text-mining the literature could find three times as many targets as two man-weeks of curation could. The overall efficiency of the text-mining against literature review in our study was 98% recall (at 36% precision), meaning that over all the targets for the three selected seeds, only one target has been missed by text-mining. We demonstrate that connectivity for a structure of interest can be extracted from a very large amount of publications and abstracts. We believe this tool will be useful in helping the neuroscience community to facilitate connectivity studies of particular brain regions. The text mining tools used for the study are part of the HBP Neuroinformatics Platform, publicly available at http://connectivity-brainer.rhcloud.com/. PMID:26074781

  13. Cognition-Based Approaches for High-Precision Text Mining

    ERIC Educational Resources Information Center

    Shannon, George John

    2017-01-01

    This research improves the precision of information extraction from free-form text via the use of cognitive-based approaches to natural language processing (NLP). Cognitive-based approaches are an important, and relatively new, area of research in NLP and search, as well as linguistics. Cognitive approaches enable significant improvements in both…

  14. Agile text mining for the 2014 i2b2/UTHealth Cardiac risk factors challenge.

    PubMed

    Cormack, James; Nath, Chinmoy; Milward, David; Raja, Kalpana; Jonnalagadda, Siddhartha R

    2015-12-01

    This paper describes the use of an agile text mining platform (Linguamatics' Interactive Information Extraction Platform, I2E) to extract document-level cardiac risk factors in patient records as defined in the i2b2/UTHealth 2014 challenge. The approach uses a data-driven rule-based methodology with the addition of a simple supervised classifier. We demonstrate that agile text mining allows for rapid optimization of extraction strategies, while post-processing can leverage annotation guidelines, corpus statistics and logic inferred from the gold standard data. We also show how data imbalance in a training set affects performance. Evaluation of this approach on the test data gave an F-Score of 91.7%, one percent behind the top performing system. Copyright © 2015 Elsevier Inc. All rights reserved.

  15. Text mining for traditional Chinese medical knowledge discovery: a survey.

    PubMed

    Zhou, Xuezhong; Peng, Yonghong; Liu, Baoyan

    2010-08-01

    Extracting meaningful information and knowledge from free text is the subject of considerable research interest in the machine learning and data mining fields. Text data mining (or text mining) has become one of the most active research sub-fields in data mining. Significant developments in the area of biomedical text mining during the past years have demonstrated its great promise for supporting scientists in developing novel hypotheses and new knowledge from the biomedical literature. Traditional Chinese medicine (TCM) provides a distinct methodology with which to view human life. It is one of the most complete and distinguished traditional medicines with a history of several thousand years of studying and practicing the diagnosis and treatment of human disease. It has been shown that the TCM knowledge obtained from clinical practice has become a significant complementary source of information for modern biomedical sciences. TCM literature obtained from the historical period and from modern clinical studies has recently been transformed into digital data in the form of relational databases or text documents, which provide an effective platform for information sharing and retrieval. This motivates and facilitates research and development into knowledge discovery approaches and to modernize TCM. In order to contribute to this still growing field, this paper presents (1) a comparative introduction to TCM and modern biomedicine, (2) a survey of the related information sources of TCM, (3) a review and discussion of the state of the art and the development of text mining techniques with applications to TCM, (4) a discussion of the research issues around TCM text mining and its future directions. Copyright 2010 Elsevier Inc. All rights reserved.

  16. Chemical named entities recognition: a review on approaches and applications.

    PubMed

    Eltyeb, Safaa; Salim, Naomie

    2014-01-01

    The rapid increase in the flow rate of published digital information in all disciplines has resulted in a pressing need for techniques that can simplify the use of this information. The chemistry literature is very rich with information about chemical entities. Extracting molecules and their related properties and activities from the scientific literature to "text mine" these extracted data and determine contextual relationships helps research scientists, particularly those in drug development. One of the most important challenges in chemical text mining is the recognition of chemical entities mentioned in the texts. In this review, the authors briefly introduce the fundamental concepts of chemical literature mining, the textual contents of chemical documents, and the methods of naming chemicals in documents. We sketch out dictionary-based, rule-based and machine learning, as well as hybrid chemical named entity recognition approaches with their applied solutions. We end with an outlook on the pros and cons of these approaches and the types of chemical entities extracted.

  17. Mining Tasks from the Web Anchor Text Graph: MSR Notebook Paper for the TREC 2015 Tasks Track

    DTIC Science & Technology

    2015-11-20

    Mining Tasks from the Web Anchor Text Graph: MSR Notebook Paper for the TREC 2015 Tasks Track Paul N. Bennett Microsoft Research Redmond, USA pauben...anchor text graph has proven useful in the general realm of query reformulation [2], we sought to quantify the value of extracting key phrases from...anchor text in the broader setting of the task understanding track. Given a query, our approach considers a simple method for identifying a relevant

  18. Building a glaucoma interaction network using a text mining approach.

    PubMed

    Soliman, Maha; Nasraoui, Olfa; Cooper, Nigel G F

    2016-01-01

    The volume of biomedical literature and its underlying knowledge base is rapidly expanding, making it beyond the ability of a single human being to read through all the literature. Several automated methods have been developed to help make sense of this dilemma. The present study reports on the results of a text mining approach to extract gene interactions from the data warehouse of published experimental results which are then used to benchmark an interaction network associated with glaucoma. To the best of our knowledge, there is, as yet, no glaucoma interaction network derived solely from text mining approaches. The presence of such a network could provide a useful summative knowledge base to complement other forms of clinical information related to this disease. A glaucoma corpus was constructed from PubMed Central and a text mining approach was applied to extract genes and their relations from this corpus. The extracted relations between genes were checked using reference interaction databases and classified generally as known or new relations. The extracted genes and relations were then used to construct a glaucoma interaction network. Analysis of the resulting network indicated that it bears the characteristics of a small world interaction network. Our analysis showed the presence of seven glaucoma linked genes that defined the network modularity. A web-based system for browsing and visualizing the extracted glaucoma related interaction networks is made available at http://neurogene.spd.louisville.edu/GlaucomaINViewer/Form1.aspx. This study has reported the first version of a glaucoma interaction network using a text mining approach. The power of such an approach is in its ability to cover a wide range of glaucoma related studies published over many years. Hence, a bigger picture of the disease can be established. To the best of our knowledge, this is the first glaucoma interaction network to summarize the known literature. The major findings were a set of relations that could not be found in existing interaction databases and that were found to be new, in addition to a smaller subnetwork consisting of interconnected clusters of seven glaucoma genes. Future improvements can be applied towards obtaining a better version of this network.

  19. DISEASES: text mining and data integration of disease-gene associations.

    PubMed

    Pletscher-Frankild, Sune; Pallejà, Albert; Tsafou, Kalliopi; Binder, Janos X; Jensen, Lars Juhl

    2015-03-01

    Text mining is a flexible technology that can be applied to numerous different tasks in biology and medicine. We present a system for extracting disease-gene associations from biomedical abstracts. The system consists of a highly efficient dictionary-based tagger for named entity recognition of human genes and diseases, which we combine with a scoring scheme that takes into account co-occurrences both within and between sentences. We show that this approach is able to extract half of all manually curated associations with a false positive rate of only 0.16%. Nonetheless, text mining should not stand alone, but be combined with other types of evidence. For this reason, we have developed the DISEASES resource, which integrates the results from text mining with manually curated disease-gene associations, cancer mutation data, and genome-wide association studies from existing databases. The DISEASES resource is accessible through a web interface at http://diseases.jensenlab.org/, where the text-mining software and all associations are also freely available for download. Copyright © 2014 The Authors. Published by Elsevier Inc. All rights reserved.

  20. Recent Advances and Emerging Applications in Text and Data Mining for Biomedical Discovery.

    PubMed

    Gonzalez, Graciela H; Tahsin, Tasnia; Goodale, Britton C; Greene, Anna C; Greene, Casey S

    2016-01-01

    Precision medicine will revolutionize the way we treat and prevent disease. A major barrier to the implementation of precision medicine that clinicians and translational scientists face is understanding the underlying mechanisms of disease. We are starting to address this challenge through automatic approaches for information extraction, representation and analysis. Recent advances in text and data mining have been applied to a broad spectrum of key biomedical questions in genomics, pharmacogenomics and other fields. We present an overview of the fundamental methods for text and data mining, as well as recent advances and emerging applications toward precision medicine. © The Author 2015. Published by Oxford University Press.

  1. Recent Advances and Emerging Applications in Text and Data Mining for Biomedical Discovery

    PubMed Central

    Gonzalez, Graciela H.; Tahsin, Tasnia; Goodale, Britton C.; Greene, Anna C.

    2016-01-01

    Precision medicine will revolutionize the way we treat and prevent disease. A major barrier to the implementation of precision medicine that clinicians and translational scientists face is understanding the underlying mechanisms of disease. We are starting to address this challenge through automatic approaches for information extraction, representation and analysis. Recent advances in text and data mining have been applied to a broad spectrum of key biomedical questions in genomics, pharmacogenomics and other fields. We present an overview of the fundamental methods for text and data mining, as well as recent advances and emerging applications toward precision medicine. PMID:26420781

  2. Mining of relations between proteins over biomedical scientific literature using a deep-linguistic approach.

    PubMed

    Rinaldi, Fabio; Schneider, Gerold; Kaljurand, Kaarel; Hess, Michael; Andronis, Christos; Konstandi, Ourania; Persidis, Andreas

    2007-02-01

    The amount of new discoveries (as published in the scientific literature) in the biomedical area is growing at an exponential rate. This growth makes it very difficult to filter the most relevant results, and thus the extraction of the core information becomes very expensive. Therefore, there is a growing interest in text processing approaches that can deliver selected information from scientific publications, which can limit the amount of human intervention normally needed to gather those results. This paper presents and evaluates an approach aimed at automating the process of extracting functional relations (e.g. interactions between genes and proteins) from scientific literature in the biomedical domain. The approach, using a novel dependency-based parser, is based on a complete syntactic analysis of the corpus. We have implemented a state-of-the-art text mining system for biomedical literature, based on a deep-linguistic, full-parsing approach. The results are validated on two different corpora: the manually annotated genomics information access (GENIA) corpus and the automatically annotated arabidopsis thaliana circadian rhythms (ATCR) corpus. We show how a deep-linguistic approach (contrary to common belief) can be used in a real world text mining application, offering high-precision relation extraction, while at the same time retaining a sufficient recall.

  3. Imitating manual curation of text-mined facts in biomedicine.

    PubMed

    Rodriguez-Esteban, Raul; Iossifov, Ivan; Rzhetsky, Andrey

    2006-09-08

    Text-mining algorithms make mistakes in extracting facts from natural-language texts. In biomedical applications, which rely on use of text-mined data, it is critical to assess the quality (the probability that the message is correctly extracted) of individual facts--to resolve data conflicts and inconsistencies. Using a large set of almost 100,000 manually produced evaluations (most facts were independently reviewed more than once, producing independent evaluations), we implemented and tested a collection of algorithms that mimic human evaluation of facts provided by an automated information-extraction system. The performance of our best automated classifiers closely approached that of our human evaluators (ROC score close to 0.95). Our hypothesis is that, were we to use a larger number of human experts to evaluate any given sentence, we could implement an artificial-intelligence curator that would perform the classification job at least as accurately as an average individual human evaluator. We illustrated our analysis by visualizing the predicted accuracy of the text-mined relations involving the term cocaine.

  4. Automatic detection of adverse events to predict drug label changes using text and data mining techniques.

    PubMed

    Gurulingappa, Harsha; Toldo, Luca; Rajput, Abdul Mateen; Kors, Jan A; Taweel, Adel; Tayrouz, Yorki

    2013-11-01

    The aim of this study was to assess the impact of automatically detected adverse event signals from text and open-source data on the prediction of drug label changes. Open-source adverse effect data were collected from FAERS, Yellow Cards and SIDER databases. A shallow linguistic relation extraction system (JSRE) was applied for extraction of adverse effects from MEDLINE case reports. Statistical approach was applied on the extracted datasets for signal detection and subsequent prediction of label changes issued for 29 drugs by the UK Regulatory Authority in 2009. 76% of drug label changes were automatically predicted. Out of these, 6% of drug label changes were detected only by text mining. JSRE enabled precise identification of four adverse drug events from MEDLINE that were undetectable otherwise. Changes in drug labels can be predicted automatically using data and text mining techniques. Text mining technology is mature and well-placed to support the pharmacovigilance tasks. Copyright © 2013 John Wiley & Sons, Ltd.

  5. Assessing semantic similarity of texts - Methods and algorithms

    NASA Astrophysics Data System (ADS)

    Rozeva, Anna; Zerkova, Silvia

    2017-12-01

    Assessing the semantic similarity of texts is an important part of different text-related applications like educational systems, information retrieval, text summarization, etc. This task is performed by sophisticated analysis, which implements text-mining techniques. Text mining involves several pre-processing steps, which provide for obtaining structured representative model of the documents in a corpus by means of extracting and selecting the features, characterizing their content. Generally the model is vector-based and enables further analysis with knowledge discovery approaches. Algorithms and measures are used for assessing texts at syntactical and semantic level. An important text-mining method and similarity measure is latent semantic analysis (LSA). It provides for reducing the dimensionality of the document vector space and better capturing the text semantics. The mathematical background of LSA for deriving the meaning of the words in a given text by exploring their co-occurrence is examined. The algorithm for obtaining the vector representation of words and their corresponding latent concepts in a reduced multidimensional space as well as similarity calculation are presented.

  6. Using text mining for study identification in systematic reviews: a systematic review of current approaches.

    PubMed

    O'Mara-Eves, Alison; Thomas, James; McNaught, John; Miwa, Makoto; Ananiadou, Sophia

    2015-01-14

    The large and growing number of published studies, and their increasing rate of publication, makes the task of identifying relevant studies in an unbiased way for inclusion in systematic reviews both complex and time consuming. Text mining has been offered as a potential solution: through automating some of the screening process, reviewer time can be saved. The evidence base around the use of text mining for screening has not yet been pulled together systematically; this systematic review fills that research gap. Focusing mainly on non-technical issues, the review aims to increase awareness of the potential of these technologies and promote further collaborative research between the computer science and systematic review communities. Five research questions led our review: what is the state of the evidence base; how has workload reduction been evaluated; what are the purposes of semi-automation and how effective are they; how have key contextual problems of applying text mining to the systematic review field been addressed; and what challenges to implementation have emerged? We answered these questions using standard systematic review methods: systematic and exhaustive searching, quality-assured data extraction and a narrative synthesis to synthesise findings. The evidence base is active and diverse; there is almost no replication between studies or collaboration between research teams and, whilst it is difficult to establish any overall conclusions about best approaches, it is clear that efficiencies and reductions in workload are potentially achievable. On the whole, most suggested that a saving in workload of between 30% and 70% might be possible, though sometimes the saving in workload is accompanied by the loss of 5% of relevant studies (i.e. a 95% recall). Using text mining to prioritise the order in which items are screened should be considered safe and ready for use in 'live' reviews. The use of text mining as a 'second screener' may also be used cautiously. The use of text mining to eliminate studies automatically should be considered promising, but not yet fully proven. In highly technical/clinical areas, it may be used with a high degree of confidence; but more developmental and evaluative work is needed in other disciplines.

  7. Gene Prioritization of Resistant Rice Gene against Xanthomas oryzae pv. oryzae by Using Text Mining Technologies

    PubMed Central

    Xia, Jingbo; Zhang, Xing; Yuan, Daojun; Chen, Lingling; Webster, Jonathan; Fang, Alex Chengyu

    2013-01-01

    To effectively assess the possibility of the unknown rice protein resistant to Xanthomonas oryzae pv. oryzae, a hybrid strategy is proposed to enhance gene prioritization by combining text mining technologies with a sequence-based approach. The text mining technique of term frequency inverse document frequency is used to measure the importance of distinguished terms which reflect biomedical activity in rice before candidate genes are screened and vital terms are produced. Afterwards, a built-in classifier under the chaos games representation algorithm is used to sieve the best possible candidate gene. Our experiment results show that the combination of these two methods achieves enhanced gene prioritization. PMID:24371834

  8. Gene prioritization of resistant rice gene against Xanthomas oryzae pv. oryzae by using text mining technologies.

    PubMed

    Xia, Jingbo; Zhang, Xing; Yuan, Daojun; Chen, Lingling; Webster, Jonathan; Fang, Alex Chengyu

    2013-01-01

    To effectively assess the possibility of the unknown rice protein resistant to Xanthomonas oryzae pv. oryzae, a hybrid strategy is proposed to enhance gene prioritization by combining text mining technologies with a sequence-based approach. The text mining technique of term frequency inverse document frequency is used to measure the importance of distinguished terms which reflect biomedical activity in rice before candidate genes are screened and vital terms are produced. Afterwards, a built-in classifier under the chaos games representation algorithm is used to sieve the best possible candidate gene. Our experiment results show that the combination of these two methods achieves enhanced gene prioritization.

  9. Text mining to decipher free-response consumer complaints: insights from the NHTSA vehicle owner's complaint database.

    PubMed

    Ghazizadeh, Mahtab; McDonald, Anthony D; Lee, John D

    2014-09-01

    This study applies text mining to extract clusters of vehicle problems and associated trends from free-response data in the National Highway Traffic Safety Administration's vehicle owner's complaint database. As the automotive industry adopts new technologies, it is important to systematically assess the effect of these changes on traffic safety. Driving simulators, naturalistic driving data, and crash databases all contribute to a better understanding of how drivers respond to changing vehicle technology, but other approaches, such as automated analysis of incident reports, are needed. Free-response data from incidents representing two severity levels (fatal incidents and incidents involving injury) were analyzed using a text mining approach: latent semantic analysis (LSA). LSA and hierarchical clustering identified clusters of complaints for each severity level, which were compared and analyzed across time. Cluster analysis identified eight clusters of fatal incidents and six clusters of incidents involving injury. Comparisons showed that although the airbag clusters across the two severity levels have the same most frequent terms, the circumstances around the incidents differ. The time trends show clear increases in complaints surrounding the Ford/Firestone tire recall and the Toyota unintended acceleration recall. Increases in complaints may be partially driven by these recall announcements and the associated media attention. Text mining can reveal useful information from free-response databases that would otherwise be prohibitively time-consuming and difficult to summarize manually. Text mining can extend human analysis capabilities for large free-response databases to support earlier detection of problems and more timely safety interventions.

  10. A Node Linkage Approach for Sequential Pattern Mining

    PubMed Central

    Navarro, Osvaldo; Cumplido, René; Villaseñor-Pineda, Luis; Feregrino-Uribe, Claudia; Carrasco-Ochoa, Jesús Ariel

    2014-01-01

    Sequential Pattern Mining is a widely addressed problem in data mining, with applications such as analyzing Web usage, examining purchase behavior, and text mining, among others. Nevertheless, with the dramatic increase in data volume, the current approaches prove inefficient when dealing with large input datasets, a large number of different symbols and low minimum supports. In this paper, we propose a new sequential pattern mining algorithm, which follows a pattern-growth scheme to discover sequential patterns. Unlike most pattern growth algorithms, our approach does not build a data structure to represent the input dataset, but instead accesses the required sequences through pseudo-projection databases, achieving better runtime and reducing memory requirements. Our algorithm traverses the search space in a depth-first fashion and only preserves in memory a pattern node linkage and the pseudo-projections required for the branch being explored at the time. Experimental results show that our new approach, the Node Linkage Depth-First Traversal algorithm (NLDFT), has better performance and scalability in comparison with state of the art algorithms. PMID:24933123

  11. Unapparent Information Revelation: Text Mining for Counterterrorism

    NASA Astrophysics Data System (ADS)

    Srihari, Rohini K.

    Unapparent information revelation (UIR) is a special case of text mining that focuses on detecting possible links between concepts across multiple text documents by generating an evidence trail explaining the connection. A traditional search involving, for example, two or more person names will attempt to find documents mentioning both these individuals. This research focuses on a different interpretation of such a query: what is the best evidence trail across documents that explains a connection between these individuals? For example, all may be good golfers. A generalization of this task involves query terms representing general concepts (e.g. indictment, foreign policy). Previous approaches to this problem have focused on graph mining involving hyperlinked documents, and link analysis exploiting named entities. A new robust framework is presented, based on (i) generating concept chain graphs, a hybrid content representation, (ii) performing graph matching to select candidate subgraphs, and (iii) subsequently using graphical models to validate hypotheses using ranked evidence trails. We adapt the DUC data set for cross-document summarization to evaluate evidence trails generated by this approach

  12. DrugQuest - a text mining workflow for drug association discovery.

    PubMed

    Papanikolaou, Nikolas; Pavlopoulos, Georgios A; Theodosiou, Theodosios; Vizirianakis, Ioannis S; Iliopoulos, Ioannis

    2016-06-06

    Text mining and data integration methods are gaining ground in the field of health sciences due to the exponential growth of bio-medical literature and information stored in biological databases. While such methods mostly try to extract bioentity associations from PubMed, very few of them are dedicated in mining other types of repositories such as chemical databases. Herein, we apply a text mining approach on the DrugBank database in order to explore drug associations based on the DrugBank "Description", "Indication", "Pharmacodynamics" and "Mechanism of Action" text fields. We apply Name Entity Recognition (NER) techniques on these fields to identify chemicals, proteins, genes, pathways, diseases, and we utilize the TextQuest algorithm to find additional biologically significant words. Using a plethora of similarity and partitional clustering techniques, we group the DrugBank records based on their common terms and investigate possible scenarios why these records are clustered together. Different views such as clustered chemicals based on their textual information, tag clouds consisting of Significant Terms along with the terms that were used for clustering are delivered to the user through a user-friendly web interface. DrugQuest is a text mining tool for knowledge discovery: it is designed to cluster DrugBank records based on text attributes in order to find new associations between drugs. The service is freely available at http://bioinformatics.med.uoc.gr/drugquest .

  13. Text mining applied to electronic cardiovascular procedure reports to identify patients with trileaflet aortic stenosis and coronary artery disease.

    PubMed

    Small, Aeron M; Kiss, Daniel H; Zlatsin, Yevgeny; Birtwell, David L; Williams, Heather; Guerraty, Marie A; Han, Yuchi; Anwaruddin, Saif; Holmes, John H; Chirinos, Julio A; Wilensky, Robert L; Giri, Jay; Rader, Daniel J

    2017-08-01

    Interrogation of the electronic health record (EHR) using billing codes as a surrogate for diagnoses of interest has been widely used for clinical research. However, the accuracy of this methodology is variable, as it reflects billing codes rather than severity of disease, and depends on the disease and the accuracy of the coding practitioner. Systematic application of text mining to the EHR has had variable success for the detection of cardiovascular phenotypes. We hypothesize that the application of text mining algorithms to cardiovascular procedure reports may be a superior method to identify patients with cardiovascular conditions of interest. We adapted the Oracle product Endeca, which utilizes text mining to identify terms of interest from a NoSQL-like database, for purposes of searching cardiovascular procedure reports and termed the tool "PennSeek". We imported 282,569 echocardiography reports representing 81,164 individuals and 27,205 cardiac catheterization reports representing 14,567 individuals from non-searchable databases into PennSeek. We then applied clinical criteria to these reports in PennSeek to identify patients with trileaflet aortic stenosis (TAS) and coronary artery disease (CAD). Accuracy of patient identification by text mining through PennSeek was compared with ICD-9 billing codes. Text mining identified 7115 patients with TAS and 9247 patients with CAD. ICD-9 codes identified 8272 patients with TAS and 6913 patients with CAD. 4346 patients with AS and 6024 patients with CAD were identified by both approaches. A randomly selected sample of 200-250 patients uniquely identified by text mining was compared with 200-250 patients uniquely identified by billing codes for both diseases. We demonstrate that text mining was superior, with a positive predictive value (PPV) of 0.95 compared to 0.53 by ICD-9 for TAS, and a PPV of 0.97 compared to 0.86 for CAD. These results highlight the superiority of text mining algorithms applied to electronic cardiovascular procedure reports in the identification of phenotypes of interest for cardiovascular research. Copyright © 2017. Published by Elsevier Inc.

  14. Mining biomedical images towards valuable information retrieval in biomedical and life sciences

    PubMed Central

    Ahmed, Zeeshan; Zeeshan, Saman; Dandekar, Thomas

    2016-01-01

    Biomedical images are helpful sources for the scientists and practitioners in drawing significant hypotheses, exemplifying approaches and describing experimental results in published biomedical literature. In last decades, there has been an enormous increase in the amount of heterogeneous biomedical image production and publication, which results in a need for bioimaging platforms for feature extraction and analysis of text and content in biomedical images to take advantage in implementing effective information retrieval systems. In this review, we summarize technologies related to data mining of figures. We describe and compare the potential of different approaches in terms of their developmental aspects, used methodologies, produced results, achieved accuracies and limitations. Our comparative conclusions include current challenges for bioimaging software with selective image mining, embedded text extraction and processing of complex natural language queries. PMID:27538578

  15. Text Mining to Support Gene Ontology Curation and Vice Versa.

    PubMed

    Ruch, Patrick

    2017-01-01

    In this chapter, we explain how text mining can support the curation of molecular biology databases dealing with protein functions. We also show how curated data can play a disruptive role in the developments of text mining methods. We review a decade of efforts to improve the automatic assignment of Gene Ontology (GO) descriptors, the reference ontology for the characterization of genes and gene products. To illustrate the high potential of this approach, we compare the performances of an automatic text categorizer and show a large improvement of +225 % in both precision and recall on benchmarked data. We argue that automatic text categorization functions can ultimately be embedded into a Question-Answering (QA) system to answer questions related to protein functions. Because GO descriptors can be relatively long and specific, traditional QA systems cannot answer such questions. A new type of QA system, so-called Deep QA which uses machine learning methods trained with curated contents, is thus emerging. Finally, future advances of text mining instruments are directly dependent on the availability of high-quality annotated contents at every curation step. Databases workflows must start recording explicitly all the data they curate and ideally also some of the data they do not curate.

  16. pubmed.mineR: an R package with text-mining algorithms to analyse PubMed abstracts.

    PubMed

    Rani, Jyoti; Shah, A B Rauf; Ramachandran, Srinivasan

    2015-10-01

    The PubMed literature database is a valuable source of information for scientific research. It is rich in biomedical literature with more than 24 million citations. Data-mining of voluminous literature is a challenging task. Although several text-mining algorithms have been developed in recent years with focus on data visualization, they have limitations such as speed, are rigid and are not available in the open source. We have developed an R package, pubmed.mineR, wherein we have combined the advantages of existing algorithms, overcome their limitations, and offer user flexibility and link with other packages in Bioconductor and the Comprehensive R Network (CRAN) in order to expand the user capabilities for executing multifaceted approaches. Three case studies are presented, namely, 'Evolving role of diabetes educators', 'Cancer risk assessment' and 'Dynamic concepts on disease and comorbidity' to illustrate the use of pubmed.mineR. The package generally runs fast with small elapsed times in regular workstations even on large corpus sizes and with compute intensive functions. The pubmed.mineR is available at http://cran.rproject. org/web/packages/pubmed.mineR.

  17. Facilitating Decision Making, Re-Use and Collaboration: A Knowledge Management Approach to Acquisition Program Self-Awareness

    DTIC Science & Technology

    2009-06-01

    capabilities: web-based, relational/multi-dimensional, client/server, and metadata (data about data) inclusion (pp. 39-40). Text mining, on the other...and Organizational Systems ( CASOS ) (Carley, 2005). Although AutoMap can be used to conduct text-mining, it was utilized only for its visualization...provides insight into how the GMCOI is using the terms, and where there might be redundant terms and need for de -confliction and standardization

  18. Text Mining Genotype-Phenotype Relationships from Biomedical Literature for Database Curation and Precision Medicine.

    PubMed

    Singhal, Ayush; Simmons, Michael; Lu, Zhiyong

    2016-11-01

    The practice of precision medicine will ultimately require databases of genes and mutations for healthcare providers to reference in order to understand the clinical implications of each patient's genetic makeup. Although the highest quality databases require manual curation, text mining tools can facilitate the curation process, increasing accuracy, coverage, and productivity. However, to date there are no available text mining tools that offer high-accuracy performance for extracting such triplets from biomedical literature. In this paper we propose a high-performance machine learning approach to automate the extraction of disease-gene-variant triplets from biomedical literature. Our approach is unique because we identify the genes and protein products associated with each mutation from not just the local text content, but from a global context as well (from the Internet and from all literature in PubMed). Our approach also incorporates protein sequence validation and disease association using a novel text-mining-based machine learning approach. We extract disease-gene-variant triplets from all abstracts in PubMed related to a set of ten important diseases (breast cancer, prostate cancer, pancreatic cancer, lung cancer, acute myeloid leukemia, Alzheimer's disease, hemochromatosis, age-related macular degeneration (AMD), diabetes mellitus, and cystic fibrosis). We then evaluate our approach in two ways: (1) a direct comparison with the state of the art using benchmark datasets; (2) a validation study comparing the results of our approach with entries in a popular human-curated database (UniProt) for each of the previously mentioned diseases. In the benchmark comparison, our full approach achieves a 28% improvement in F1-measure (from 0.62 to 0.79) over the state-of-the-art results. For the validation study with UniProt Knowledgebase (KB), we present a thorough analysis of the results and errors. Across all diseases, our approach returned 272 triplets (disease-gene-variant) that overlapped with entries in UniProt and 5,384 triplets without overlap in UniProt. Analysis of the overlapping triplets and of a stratified sample of the non-overlapping triplets revealed accuracies of 93% and 80% for the respective categories (cumulative accuracy, 77%). We conclude that our process represents an important and broadly applicable improvement to the state of the art for curation of disease-gene-variant relationships.

  19. Biomedical hypothesis generation by text mining and gene prioritization.

    PubMed

    Petric, Ingrid; Ligeti, Balazs; Gyorffy, Balazs; Pongor, Sandor

    2014-01-01

    Text mining methods can facilitate the generation of biomedical hypotheses by suggesting novel associations between diseases and genes. Previously, we developed a rare-term model called RaJoLink (Petric et al, J. Biomed. Inform. 42(2): 219-227, 2009) in which hypotheses are formulated on the basis of terms rarely associated with a target domain. Since many current medical hypotheses are formulated in terms of molecular entities and molecular mechanisms, here we extend the methodology to proteins and genes, using a standardized vocabulary as well as a gene/protein network model. The proposed enhanced RaJoLink rare-term model combines text mining and gene prioritization approaches. Its utility is illustrated by finding known as well as potential gene-disease associations in ovarian cancer using MEDLINE abstracts and the STRING database.

  20. Mining biomedical images towards valuable information retrieval in biomedical and life sciences.

    PubMed

    Ahmed, Zeeshan; Zeeshan, Saman; Dandekar, Thomas

    2016-01-01

    Biomedical images are helpful sources for the scientists and practitioners in drawing significant hypotheses, exemplifying approaches and describing experimental results in published biomedical literature. In last decades, there has been an enormous increase in the amount of heterogeneous biomedical image production and publication, which results in a need for bioimaging platforms for feature extraction and analysis of text and content in biomedical images to take advantage in implementing effective information retrieval systems. In this review, we summarize technologies related to data mining of figures. We describe and compare the potential of different approaches in terms of their developmental aspects, used methodologies, produced results, achieved accuracies and limitations. Our comparative conclusions include current challenges for bioimaging software with selective image mining, embedded text extraction and processing of complex natural language queries. © The Author(s) 2016. Published by Oxford University Press.

  1. Method to Select Technical Terms for Glossaries in Support of Joint Task Force Operations

    DTIC Science & Technology

    2012-01-01

    have been prohibitively time-consuming. Instead, we identified two publicly available terminology extractor tools: TerMine (NaCTEM, 2011) and Alchemy ...and that from the latter, by high recall. The Alchemy approach contrasts with that used in TerMine in that Alchemy will process the text with...information categories, such as person, location, and organization, in addition to returning topic keywords. Output from both TerMine and Alchemy

  2. Can abstract screening workload be reduced using text mining? User experiences of the tool Rayyan.

    PubMed

    Olofsson, Hanna; Brolund, Agneta; Hellberg, Christel; Silverstein, Rebecca; Stenström, Karin; Österberg, Marie; Dagerhamn, Jessica

    2017-09-01

    One time-consuming aspect of conducting systematic reviews is the task of sifting through abstracts to identify relevant studies. One promising approach for reducing this burden uses text mining technology to identify those abstracts that are potentially most relevant for a project, allowing those abstracts to be screened first. To examine the effectiveness of the text mining functionality of the abstract screening tool Rayyan. User experiences were collected. Rayyan was used to screen abstracts for 6 reviews in 2015. After screening 25%, 50%, and 75% of the abstracts, the screeners logged the relevant references identified. A survey was sent to users. After screening half of the search result with Rayyan, 86% to 99% of the references deemed relevant to the study were identified. Of those studies included in the final reports, 96% to 100% were already identified in the first half of the screening process. Users rated Rayyan 4.5 out of 5. The text mining function in Rayyan successfully helped reviewers identify relevant studies early in the screening process. Copyright © 2017 John Wiley & Sons, Ltd.

  3. Building a protein name dictionary from full text: a machine learning term extraction approach.

    PubMed

    Shi, Lei; Campagne, Fabien

    2005-04-07

    The majority of information in the biological literature resides in full text articles, instead of abstracts. Yet, abstracts remain the focus of many publicly available literature data mining tools. Most literature mining tools rely on pre-existing lexicons of biological names, often extracted from curated gene or protein databases. This is a limitation, because such databases have low coverage of the many name variants which are used to refer to biological entities in the literature. We present an approach to recognize named entities in full text. The approach collects high frequency terms in an article, and uses support vector machines (SVM) to identify biological entity names. It is also computationally efficient and robust to noise commonly found in full text material. We use the method to create a protein name dictionary from a set of 80,528 full text articles. Only 8.3% of the names in this dictionary match SwissProt description lines. We assess the quality of the dictionary by studying its protein name recognition performance in full text. This dictionary term lookup method compares favourably to other published methods, supporting the significance of our direct extraction approach. The method is strong in recognizing name variants not found in SwissProt.

  4. Building a protein name dictionary from full text: a machine learning term extraction approach

    PubMed Central

    Shi, Lei; Campagne, Fabien

    2005-01-01

    Background The majority of information in the biological literature resides in full text articles, instead of abstracts. Yet, abstracts remain the focus of many publicly available literature data mining tools. Most literature mining tools rely on pre-existing lexicons of biological names, often extracted from curated gene or protein databases. This is a limitation, because such databases have low coverage of the many name variants which are used to refer to biological entities in the literature. Results We present an approach to recognize named entities in full text. The approach collects high frequency terms in an article, and uses support vector machines (SVM) to identify biological entity names. It is also computationally efficient and robust to noise commonly found in full text material. We use the method to create a protein name dictionary from a set of 80,528 full text articles. Only 8.3% of the names in this dictionary match SwissProt description lines. We assess the quality of the dictionary by studying its protein name recognition performance in full text. Conclusion This dictionary term lookup method compares favourably to other published methods, supporting the significance of our direct extraction approach. The method is strong in recognizing name variants not found in SwissProt. PMID:15817129

  5. Ion Channel ElectroPhysiology Ontology (ICEPO) - a case study of text mining assisted ontology development.

    PubMed

    Elayavilli, Ravikumar Komandur; Liu, Hongfang

    2016-01-01

    Computational modeling of biological cascades is of great interest to quantitative biologists. Biomedical text has been a rich source for quantitative information. Gathering quantitative parameters and values from biomedical text is one significant challenge in the early steps of computational modeling as it involves huge manual effort. While automatically extracting such quantitative information from bio-medical text may offer some relief, lack of ontological representation for a subdomain serves as impedance in normalizing textual extractions to a standard representation. This may render textual extractions less meaningful to the domain experts. In this work, we propose a rule-based approach to automatically extract relations involving quantitative data from biomedical text describing ion channel electrophysiology. We further translated the quantitative assertions extracted through text mining to a formal representation that may help in constructing ontology for ion channel events using a rule based approach. We have developed Ion Channel ElectroPhysiology Ontology (ICEPO) by integrating the information represented in closely related ontologies such as, Cell Physiology Ontology (CPO), and Cardiac Electro Physiology Ontology (CPEO) and the knowledge provided by domain experts. The rule-based system achieved an overall F-measure of 68.93% in extracting the quantitative data assertions system on an independently annotated blind data set. We further made an initial attempt in formalizing the quantitative data assertions extracted from the biomedical text into a formal representation that offers potential to facilitate the integration of text mining into ontological workflow, a novel aspect of this study. This work is a case study where we created a platform that provides formal interaction between ontology development and text mining. We have achieved partial success in extracting quantitative assertions from the biomedical text and formalizing them in ontological framework. The ICEPO ontology is available for download at http://openbionlp.org/mutd/supplementarydata/ICEPO/ICEPO.owl.

  6. Conceptual biology, hypothesis discovery, and text mining: Swanson's legacy.

    PubMed

    Bekhuis, Tanja

    2006-04-03

    Innovative biomedical librarians and information specialists who want to expand their roles as expert searchers need to know about profound changes in biology and parallel trends in text mining. In recent years, conceptual biology has emerged as a complement to empirical biology. This is partly in response to the availability of massive digital resources such as the network of databases for molecular biologists at the National Center for Biotechnology Information. Developments in text mining and hypothesis discovery systems based on the early work of Swanson, a mathematician and information scientist, are coincident with the emergence of conceptual biology. Very little has been written to introduce biomedical digital librarians to these new trends. In this paper, background for data and text mining, as well as for knowledge discovery in databases (KDD) and in text (KDT) is presented, then a brief review of Swanson's ideas, followed by a discussion of recent approaches to hypothesis discovery and testing. 'Testing' in the context of text mining involves partially automated methods for finding evidence in the literature to support hypothetical relationships. Concluding remarks follow regarding (a) the limits of current strategies for evaluation of hypothesis discovery systems and (b) the role of literature-based discovery in concert with empirical research. Report of an informatics-driven literature review for biomarkers of systemic lupus erythematosus is mentioned. Swanson's vision of the hidden value in the literature of science and, by extension, in biomedical digital databases, is still remarkably generative for information scientists, biologists, and physicians.

  7. A Text-Mining Framework for Supporting Systematic Reviews.

    PubMed

    Li, Dingcheng; Wang, Zhen; Wang, Liwei; Sohn, Sunghwan; Shen, Feichen; Murad, Mohammad Hassan; Liu, Hongfang

    2016-11-01

    Systematic reviews (SRs) involve the identification, appraisal, and synthesis of all relevant studies for focused questions in a structured reproducible manner. High-quality SRs follow strict procedures and require significant resources and time. We investigated advanced text-mining approaches to reduce the burden associated with abstract screening in SRs and provide high-level information summary. A text-mining SR supporting framework consisting of three self-defined semantics-based ranking metrics was proposed, including keyword relevance, indexed-term relevance and topic relevance. Keyword relevance is based on the user-defined keyword list used in the search strategy. Indexed-term relevance is derived from indexed vocabulary developed by domain experts used for indexing journal articles and books. Topic relevance is defined as the semantic similarity among retrieved abstracts in terms of topics generated by latent Dirichlet allocation, a Bayesian-based model for discovering topics. We tested the proposed framework using three published SRs addressing a variety of topics (Mass Media Interventions, Rectal Cancer and Influenza Vaccine). The results showed that when 91.8%, 85.7%, and 49.3% of the abstract screening labor was saved, the recalls were as high as 100% for the three cases; respectively. Relevant studies identified manually showed strong topic similarity through topic analysis, which supported the inclusion of topic analysis as relevance metric. It was demonstrated that advanced text mining approaches can significantly reduce the abstract screening labor of SRs and provide an informative summary of relevant studies.

  8. Identifying Engineering Students' English Sentence Reading Comprehension Errors: Applying a Data Mining Technique

    ERIC Educational Resources Information Center

    Tsai, Yea-Ru; Ouyang, Chen-Sen; Chang, Yukon

    2016-01-01

    The purpose of this study is to propose a diagnostic approach to identify engineering students' English reading comprehension errors. Student data were collected during the process of reading texts of English for science and technology on a web-based cumulative sentence analysis system. For the analysis, the association-rule, data mining technique…

  9. New directions in biomedical text annotation: definitions, guidelines and corpus construction

    PubMed Central

    Wilbur, W John; Rzhetsky, Andrey; Shatkay, Hagit

    2006-01-01

    Background While biomedical text mining is emerging as an important research area, practical results have proven difficult to achieve. We believe that an important first step towards more accurate text-mining lies in the ability to identify and characterize text that satisfies various types of information needs. We report here the results of our inquiry into properties of scientific text that have sufficient generality to transcend the confines of a narrow subject area, while supporting practical mining of text for factual information. Our ultimate goal is to annotate a significant corpus of biomedical text and train machine learning methods to automatically categorize such text along certain dimensions that we have defined. Results We have identified five qualitative dimensions that we believe characterize a broad range of scientific sentences, and are therefore useful for supporting a general approach to text-mining: focus, polarity, certainty, evidence, and directionality. We define these dimensions and describe the guidelines we have developed for annotating text with regard to them. To examine the effectiveness of the guidelines, twelve annotators independently annotated the same set of 101 sentences that were randomly selected from current biomedical periodicals. Analysis of these annotations shows 70–80% inter-annotator agreement, suggesting that our guidelines indeed present a well-defined, executable and reproducible task. Conclusion We present our guidelines defining a text annotation task, along with annotation results from multiple independently produced annotations, demonstrating the feasibility of the task. The annotation of a very large corpus of documents along these guidelines is currently ongoing. These annotations form the basis for the categorization of text along multiple dimensions, to support viable text mining for experimental results, methodology statements, and other forms of information. We are currently developing machine learning methods, to be trained and tested on the annotated corpus, that would allow for the automatic categorization of biomedical text along the general dimensions that we have presented. The guidelines in full detail, along with annotated examples, are publicly available. PMID:16867190

  10. Text Mining Improves Prediction of Protein Functional Sites

    PubMed Central

    Cohn, Judith D.; Ravikumar, Komandur E.

    2012-01-01

    We present an approach that integrates protein structure analysis and text mining for protein functional site prediction, called LEAP-FS (Literature Enhanced Automated Prediction of Functional Sites). The structure analysis was carried out using Dynamics Perturbation Analysis (DPA), which predicts functional sites at control points where interactions greatly perturb protein vibrations. The text mining extracts mentions of residues in the literature, and predicts that residues mentioned are functionally important. We assessed the significance of each of these methods by analyzing their performance in finding known functional sites (specifically, small-molecule binding sites and catalytic sites) in about 100,000 publicly available protein structures. The DPA predictions recapitulated many of the functional site annotations and preferentially recovered binding sites annotated as biologically relevant vs. those annotated as potentially spurious. The text-based predictions were also substantially supported by the functional site annotations: compared to other residues, residues mentioned in text were roughly six times more likely to be found in a functional site. The overlap of predictions with annotations improved when the text-based and structure-based methods agreed. Our analysis also yielded new high-quality predictions of many functional site residues that were not catalogued in the curated data sources we inspected. We conclude that both DPA and text mining independently provide valuable high-throughput protein functional site predictions, and that integrating the two methods using LEAP-FS further improves the quality of these predictions. PMID:22393388

  11. Chemical named entities recognition: a review on approaches and applications

    PubMed Central

    2014-01-01

    The rapid increase in the flow rate of published digital information in all disciplines has resulted in a pressing need for techniques that can simplify the use of this information. The chemistry literature is very rich with information about chemical entities. Extracting molecules and their related properties and activities from the scientific literature to “text mine” these extracted data and determine contextual relationships helps research scientists, particularly those in drug development. One of the most important challenges in chemical text mining is the recognition of chemical entities mentioned in the texts. In this review, the authors briefly introduce the fundamental concepts of chemical literature mining, the textual contents of chemical documents, and the methods of naming chemicals in documents. We sketch out dictionary-based, rule-based and machine learning, as well as hybrid chemical named entity recognition approaches with their applied solutions. We end with an outlook on the pros and cons of these approaches and the types of chemical entities extracted. PMID:24834132

  12. Clustering and Dimensionality Reduction to Discover Interesting Patterns in Binary Data

    NASA Astrophysics Data System (ADS)

    Palumbo, Francesco; D'Enza, Alfonso Iodice

    The attention towards binary data coding increased consistently in the last decade due to several reasons. The analysis of binary data characterizes several fields of application, such as market basket analysis, DNA microarray data, image mining, text mining and web-clickstream mining. The paper illustrates two different approaches exploiting a profitable combination of clustering and dimensionality reduction for the identification of non-trivial association structures in binary data. An application in the Association Rules framework supports the theory with the empirical evidence.

  13. A text-based data mining and toxicity prediction modeling system for a clinical decision support in radiation oncology: A preliminary study

    NASA Astrophysics Data System (ADS)

    Kim, Kwang Hyeon; Lee, Suk; Shim, Jang Bo; Chang, Kyung Hwan; Yang, Dae Sik; Yoon, Won Sup; Park, Young Je; Kim, Chul Yong; Cao, Yuan Jie

    2017-08-01

    The aim of this study is an integrated research for text-based data mining and toxicity prediction modeling system for clinical decision support system based on big data in radiation oncology as a preliminary research. The structured and unstructured data were prepared by treatment plans and the unstructured data were extracted by dose-volume data image pattern recognition of prostate cancer for research articles crawling through the internet. We modeled an artificial neural network to build a predictor model system for toxicity prediction of organs at risk. We used a text-based data mining approach to build the artificial neural network model for bladder and rectum complication predictions. The pattern recognition method was used to mine the unstructured toxicity data for dose-volume at the detection accuracy of 97.9%. The confusion matrix and training model of the neural network were achieved with 50 modeled plans (n = 50) for validation. The toxicity level was analyzed and the risk factors for 25% bladder, 50% bladder, 20% rectum, and 50% rectum were calculated by the artificial neural network algorithm. As a result, 32 plans could cause complication but 18 plans were designed as non-complication among 50 modeled plans. We integrated data mining and a toxicity modeling method for toxicity prediction using prostate cancer cases. It is shown that a preprocessing analysis using text-based data mining and prediction modeling can be expanded to personalized patient treatment decision support based on big data.

  14. Experiences with Text Mining Large Collections of Unstructured Systems Development Artifacts at JPL

    NASA Technical Reports Server (NTRS)

    Port, Dan; Nikora, Allen; Hihn, Jairus; Huang, LiGuo

    2011-01-01

    Often repositories of systems engineering artifacts at NASA's Jet Propulsion Laboratory (JPL) are so large and poorly structured that they have outgrown our capability to effectively manually process their contents to extract useful information. Sophisticated text mining methods and tools seem a quick, low-effort approach to automating our limited manual efforts. Our experiences of exploring such methods mainly in three areas including historical risk analysis, defect identification based on requirements analysis, and over-time analysis of system anomalies at JPL, have shown that obtaining useful results requires substantial unanticipated efforts - from preprocessing the data to transforming the output for practical applications. We have not observed any quick 'wins' or realized benefit from short-term effort avoidance through automation in this area. Surprisingly we have realized a number of unexpected long-term benefits from the process of applying text mining to our repositories. This paper elaborates some of these benefits and our important lessons learned from the process of preparing and applying text mining to large unstructured system artifacts at JPL aiming to benefit future TM applications in similar problem domains and also in hope for being extended to broader areas of applications.

  15. Semantic Theme Analysis of Pilot Incident Reports

    NASA Technical Reports Server (NTRS)

    Thirumalainambi, Rajkumar

    2009-01-01

    Pilots report accidents or incidents during take-off, on flight and landing to airline authorities and Federal aviation authority as well. The description of pilot reports for an incident contains technical terms related to Flight instruments and operations. Normal text mining approaches collect keywords from text documents and relate them among documents that are stored in database. Present approach will extract specific theme analysis of incident reports and semantically relate hierarchy of terms assigning weights of themes. Once the theme extraction has been performed for a given document, a unique key can be assigned to that document to cross linking the documents. Semantic linking will be used to categorize the documents based on specific rules that can help an end-user to analyze certain types of accidents. This presentation outlines the architecture of text mining for pilot incident reports for autonomous categorization of pilot incident reports using semantic theme analysis.

  16. Neural networks for data mining electronic text collections

    NASA Astrophysics Data System (ADS)

    Walker, Nicholas; Truman, Gregory

    1997-04-01

    The use of neural networks in information retrieval and text analysis has primarily suffered from the issues of adequate document representation, the ability to scale to very large collections, dynamism in the face of new information and the practical difficulties of basing the design on the use of supervised training sets. Perhaps the most important approach to begin solving these problems is the use of `intermediate entities' which reduce the dimensionality of document representations and the size of documents collections to manageable levels coupled with the use of unsupervised neural network paradigms. This paper describes the issues, a fully configured neural network-based text analysis system--dataHARVEST--aimed at data mining text collections which begins this process, along with the remaining difficulties and potential ways forward.

  17. Machine learning approaches to analysing textual injury surveillance data: a systematic review.

    PubMed

    Vallmuur, Kirsten

    2015-06-01

    To synthesise recent research on the use of machine learning approaches to mining textual injury surveillance data. Systematic review. The electronic databases which were searched included PubMed, Cinahl, Medline, Google Scholar, and Proquest. The bibliography of all relevant articles was examined and associated articles were identified using a snowballing technique. For inclusion, articles were required to meet the following criteria: (a) used a health-related database, (b) focused on injury-related cases, AND used machine learning approaches to analyse textual data. The papers identified through the search were screened resulting in 16 papers selected for review. Articles were reviewed to describe the databases and methodology used, the strength and limitations of different techniques, and quality assurance approaches used. Due to heterogeneity between studies meta-analysis was not performed. Occupational injuries were the focus of half of the machine learning studies and the most common methods described were Bayesian probability or Bayesian network based methods to either predict injury categories or extract common injury scenarios. Models were evaluated through either comparison with gold standard data or content expert evaluation or statistical measures of quality. Machine learning was found to provide high precision and accuracy when predicting a small number of categories, was valuable for visualisation of injury patterns and prediction of future outcomes. However, difficulties related to generalizability, source data quality, complexity of models and integration of content and technical knowledge were discussed. The use of narrative text for injury surveillance has grown in popularity, complexity and quality over recent years. With advances in data mining techniques, increased capacity for analysis of large databases, and involvement of computer scientists in the injury prevention field, along with more comprehensive use and description of quality assurance methods in text mining approaches, it is likely that we will see a continued growth and advancement in knowledge of text mining in the injury field. Copyright © 2015 Elsevier Ltd. All rights reserved.

  18. Text Mining in Biomedical Domain with Emphasis on Document Clustering.

    PubMed

    Renganathan, Vinaitheerthan

    2017-07-01

    With the exponential increase in the number of articles published every year in the biomedical domain, there is a need to build automated systems to extract unknown information from the articles published. Text mining techniques enable the extraction of unknown knowledge from unstructured documents. This paper reviews text mining processes in detail and the software tools available to carry out text mining. It also reviews the roles and applications of text mining in the biomedical domain. Text mining processes, such as search and retrieval of documents, pre-processing of documents, natural language processing, methods for text clustering, and methods for text classification are described in detail. Text mining techniques can facilitate the mining of vast amounts of knowledge on a given topic from published biomedical research articles and draw meaningful conclusions that are not possible otherwise.

  19. Text Mining in Biomedical Domain with Emphasis on Document Clustering

    PubMed Central

    2017-01-01

    Objectives With the exponential increase in the number of articles published every year in the biomedical domain, there is a need to build automated systems to extract unknown information from the articles published. Text mining techniques enable the extraction of unknown knowledge from unstructured documents. Methods This paper reviews text mining processes in detail and the software tools available to carry out text mining. It also reviews the roles and applications of text mining in the biomedical domain. Results Text mining processes, such as search and retrieval of documents, pre-processing of documents, natural language processing, methods for text clustering, and methods for text classification are described in detail. Conclusions Text mining techniques can facilitate the mining of vast amounts of knowledge on a given topic from published biomedical research articles and draw meaningful conclusions that are not possible otherwise. PMID:28875048

  20. Complex Event Processing for Content-Based Text, Image, and Video Retrieval

    DTIC Science & Technology

    2016-06-01

    NY): Wiley- Interscience; 2000. Feldman R, Sanger J. The text mining handbook: advanced approaches in analyzing unstructured data. New York (NY...ARL-TR-7705 ● JUNE 2016 US Army Research Laboratory Complex Event Processing for Content-Based Text , Image, and Video Retrieval...ARL-TR-7705 ● JUNE 2016 US Army Research Laboratory Complex Event Processing for Content-Based Text , Image, and Video Retrieval

  1. Intelligent bar chart plagiarism detection in documents.

    PubMed

    Al-Dabbagh, Mohammed Mumtaz; Salim, Naomie; Rehman, Amjad; Alkawaz, Mohammed Hazim; Saba, Tanzila; Al-Rodhaan, Mznah; Al-Dhelaan, Abdullah

    2014-01-01

    This paper presents a novel features mining approach from documents that could not be mined via optical character recognition (OCR). By identifying the intimate relationship between the text and graphical components, the proposed technique pulls out the Start, End, and Exact values for each bar. Furthermore, the word 2-gram and Euclidean distance methods are used to accurately detect and determine plagiarism in bar charts.

  2. Intelligent Bar Chart Plagiarism Detection in Documents

    PubMed Central

    Al-Dabbagh, Mohammed Mumtaz; Salim, Naomie; Alkawaz, Mohammed Hazim; Saba, Tanzila; Al-Rodhaan, Mznah; Al-Dhelaan, Abdullah

    2014-01-01

    This paper presents a novel features mining approach from documents that could not be mined via optical character recognition (OCR). By identifying the intimate relationship between the text and graphical components, the proposed technique pulls out the Start, End, and Exact values for each bar. Furthermore, the word 2-gram and Euclidean distance methods are used to accurately detect and determine plagiarism in bar charts. PMID:25309952

  3. @Note: a workbench for biomedical text mining.

    PubMed

    Lourenço, Anália; Carreira, Rafael; Carneiro, Sónia; Maia, Paulo; Glez-Peña, Daniel; Fdez-Riverola, Florentino; Ferreira, Eugénio C; Rocha, Isabel; Rocha, Miguel

    2009-08-01

    Biomedical Text Mining (BioTM) is providing valuable approaches to the automated curation of scientific literature. However, most efforts have addressed the benchmarking of new algorithms rather than user operational needs. Bridging the gap between BioTM researchers and biologists' needs is crucial to solve real-world problems and promote further research. We present @Note, a platform for BioTM that aims at the effective translation of the advances between three distinct classes of users: biologists, text miners and software developers. Its main functional contributions are the ability to process abstracts and full-texts; an information retrieval module enabling PubMed search and journal crawling; a pre-processing module with PDF-to-text conversion, tokenisation and stopword removal; a semantic annotation schema; a lexicon-based annotator; a user-friendly annotation view that allows to correct annotations and a Text Mining Module supporting dataset preparation and algorithm evaluation. @Note improves the interoperability, modularity and flexibility when integrating in-home and open-source third-party components. Its component-based architecture allows the rapid development of new applications, emphasizing the principles of transparency and simplicity of use. Although it is still on-going, it has already allowed the development of applications that are currently being used.

  4. Graph-based biomedical text summarization: An itemset mining and sentence clustering approach.

    PubMed

    Nasr Azadani, Mozhgan; Ghadiri, Nasser; Davoodijam, Ensieh

    2018-06-12

    Automatic text summarization offers an efficient solution to access the ever-growing amounts of both scientific and clinical literature in the biomedical domain by summarizing the source documents while maintaining their most informative contents. In this paper, we propose a novel graph-based summarization method that takes advantage of the domain-specific knowledge and a well-established data mining technique called frequent itemset mining. Our summarizer exploits the Unified Medical Language System (UMLS) to construct a concept-based model of the source document and mapping the document to the concepts. Then, it discovers frequent itemsets to take the correlations among multiple concepts into account. The method uses these correlations to propose a similarity function based on which a represented graph is constructed. The summarizer then employs a minimum spanning tree based clustering algorithm to discover various subthemes of the document. Eventually, it generates the final summary by selecting the most informative and relative sentences from all subthemes within the text. We perform an automatic evaluation over a large number of summaries using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics. The results demonstrate that the proposed summarization system outperforms various baselines and benchmark approaches. The carried out research suggests that the incorporation of domain-specific knowledge and frequent itemset mining equips the summarization system in a better way to address the informativeness measurement of the sentences. Moreover, clustering the graph nodes (sentences) can enable the summarizer to target different main subthemes of a source document efficiently. The evaluation results show that the proposed approach can significantly improve the performance of the summarization systems in the biomedical domain. Copyright © 2018. Published by Elsevier Inc.

  5. Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges

    PubMed Central

    Singhal, Ayush; Leaman, Robert; Catlett, Natalie; Lemberger, Thomas; McEntyre, Johanna; Polson, Shawn; Xenarios, Ioannis; Arighi, Cecilia; Lu, Zhiyong

    2016-01-01

    Text mining in the biomedical sciences is rapidly transitioning from small-scale evaluation to large-scale application. In this article, we argue that text-mining technologies have become essential tools in real-world biomedical research. We describe four large scale applications of text mining, as showcased during a recent panel discussion at the BioCreative V Challenge Workshop. We draw on these applications as case studies to characterize common requirements for successfully applying text-mining techniques to practical biocuration needs. We note that system ‘accuracy’ remains a challenge and identify several additional common difficulties and potential research directions including (i) the ‘scalability’ issue due to the increasing need of mining information from millions of full-text articles, (ii) the ‘interoperability’ issue of integrating various text-mining systems into existing curation workflows and (iii) the ‘reusability’ issue on the difficulty of applying trained systems to text genres that are not seen previously during development. We then describe related efforts within the text-mining community, with a special focus on the BioCreative series of challenge workshops. We believe that focusing on the near-term challenges identified in this work will amplify the opportunities afforded by the continued adoption of text-mining tools. Finally, in order to sustain the curation ecosystem and have text-mining systems adopted for practical benefits, we call for increased collaboration between text-mining researchers and various stakeholders, including researchers, publishers and biocurators. PMID:28025348

  6. Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges

    DOE PAGES

    Singhal, Ayush; Leaman, Robert; Catlett, Natalie; ...

    2016-12-26

    Text mining in the biomedical sciences is rapidly transitioning from small-scale evaluation to large-scale application. In this article, we argue that text-mining technologies have become essential tools in real-world biomedical research. We describe four large scale applications of text mining, as showcased during a recent panel discussion at the BioCreative V Challenge Workshop. We draw on these applications as case studies to characterize common requirements for successfully applying text-mining techniques to practical biocuration needs. We note that system ‘accuracy’ remains a challenge and identify several additional common difficulties and potential research directions including (i) the ‘scalability’ issue due to themore » increasing need of mining information from millions of full-text articles, (ii) the ‘interoperability’ issue of integrating various text-mining systems into existing curation workflows and (iii) the ‘reusability’ issue on the difficulty of applying trained systems to text genres that are not seen previously during development. We then describe related efforts within the text-mining community, with a special focus on the BioCreative series of challenge workshops. We believe that focusing on the near-term challenges identified in this work will amplify the opportunities afforded by the continued adoption of text-mining tools. In conclusion, in order to sustain the curation ecosystem and have text-mining systems adopted for practical benefits, we call for increased collaboration between text-mining researchers and various stakeholders, including researchers, publishers and biocurators.« less

  7. Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Singhal, Ayush; Leaman, Robert; Catlett, Natalie

    Text mining in the biomedical sciences is rapidly transitioning from small-scale evaluation to large-scale application. In this article, we argue that text-mining technologies have become essential tools in real-world biomedical research. We describe four large scale applications of text mining, as showcased during a recent panel discussion at the BioCreative V Challenge Workshop. We draw on these applications as case studies to characterize common requirements for successfully applying text-mining techniques to practical biocuration needs. We note that system ‘accuracy’ remains a challenge and identify several additional common difficulties and potential research directions including (i) the ‘scalability’ issue due to themore » increasing need of mining information from millions of full-text articles, (ii) the ‘interoperability’ issue of integrating various text-mining systems into existing curation workflows and (iii) the ‘reusability’ issue on the difficulty of applying trained systems to text genres that are not seen previously during development. We then describe related efforts within the text-mining community, with a special focus on the BioCreative series of challenge workshops. We believe that focusing on the near-term challenges identified in this work will amplify the opportunities afforded by the continued adoption of text-mining tools. In conclusion, in order to sustain the curation ecosystem and have text-mining systems adopted for practical benefits, we call for increased collaboration between text-mining researchers and various stakeholders, including researchers, publishers and biocurators.« less

  8. Commercial Online Social Network Data and Statin Side-Effect Surveillance: A Pilot Observational Study of Aggregate Mentions on Facebook.

    PubMed

    Huesch, Marco D

    2017-12-01

    Surveillance of the safety of prescribed drugs after marketing approval has been secured remains fraught with complications. Formal ascertainment by providers and reporting to adverse-event registries, formal surveys by manufacturers, and mining of electronic medical records are all well-known approaches with varying degrees of difficulty, cost, and success. Novel approaches may be a useful adjunct, especially approaches that mine or sample internet-based methods such as online social networks. A novel commercial software-as-a-service data-mining product supplied by Sysomos from Datasift/Facebook was used to mine all mentions on Facebook of statins and stain-related side effects in the US in the 1-month period 9 January 2017 through 8 February 2017. A total of 4.3% of all 25,700 mentions of statins also mentioned typical stain-related side effects. Multiple methodological weaknesses stymie interpretation of this percentage, which is however not inconsistent with estimates that 5-20% of patients taking statins will experience typical side effects at some time. Future work on pharmacovigilance may be informed by this novel commercial tool, but the inability to mine the full text of a posting poses serious challenges to content categorization.

  9. Text mining a self-report back-translation.

    PubMed

    Blanch, Angel; Aluja, Anton

    2016-06-01

    There are several recommendations about the routine to undertake when back translating self-report instruments in cross-cultural research. However, text mining methods have been generally ignored within this field. This work describes a text mining innovative application useful to adapt a personality questionnaire to 12 different languages. The method is divided in 3 different stages, a descriptive analysis of the available back-translated instrument versions, a dissimilarity assessment between the source language instrument and the 12 back-translations, and an item assessment of item meaning equivalence. The suggested method contributes to improve the back-translation process of self-report instruments for cross-cultural research in 2 significant intertwined ways. First, it defines a systematic approach to the back translation issue, allowing for a more orderly and informed evaluation concerning the equivalence of different versions of the same instrument in different languages. Second, it provides more accurate instrument back-translations, which has direct implications for the reliability and validity of the instrument's test scores when used in different cultures/languages. In addition, this procedure can be extended to the back-translation of self-reports measuring psychological constructs in clinical assessment. Future research works could refine the suggested methodology and use additional available text mining tools. (PsycINFO Database Record (c) 2016 APA, all rights reserved).

  10. BioC: a minimalist approach to interoperability for biomedical text processing

    PubMed Central

    Comeau, Donald C.; Islamaj Doğan, Rezarta; Ciccarese, Paolo; Cohen, Kevin Bretonnel; Krallinger, Martin; Leitner, Florian; Lu, Zhiyong; Peng, Yifan; Rinaldi, Fabio; Torii, Manabu; Valencia, Alfonso; Verspoor, Karin; Wiegers, Thomas C.; Wu, Cathy H.; Wilbur, W. John

    2013-01-01

    A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/. Database URL: http://bioc.sourceforge.net/ PMID:24048470

  11. Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research.

    PubMed

    Bravo, Àlex; Piñero, Janet; Queralt-Rosinach, Núria; Rautschka, Michael; Furlong, Laura I

    2015-02-21

    Current biomedical research needs to leverage and exploit the large amount of information reported in scientific publications. Automated text mining approaches, in particular those aimed at finding relationships between entities, are key for identification of actionable knowledge from free text repositories. We present the BeFree system aimed at identifying relationships between biomedical entities with a special focus on genes and their associated diseases. By exploiting morpho-syntactic information of the text, BeFree is able to identify gene-disease, drug-disease and drug-target associations with state-of-the-art performance. The application of BeFree to real-case scenarios shows its effectiveness in extracting information relevant for translational research. We show the value of the gene-disease associations extracted by BeFree through a number of analyses and integration with other data sources. BeFree succeeds in identifying genes associated to a major cause of morbidity worldwide, depression, which are not present in other public resources. Moreover, large-scale extraction and analysis of gene-disease associations, and integration with current biomedical knowledge, provided interesting insights on the kind of information that can be found in the literature, and raised challenges regarding data prioritization and curation. We found that only a small proportion of the gene-disease associations discovered by using BeFree is collected in expert-curated databases. Thus, there is a pressing need to find alternative strategies to manual curation, in order to review, prioritize and curate text-mining data and incorporate it into domain-specific databases. We present our strategy for data prioritization and discuss its implications for supporting biomedical research and applications. BeFree is a novel text mining system that performs competitively for the identification of gene-disease, drug-disease and drug-target associations. Our analyses show that mining only a small fraction of MEDLINE results in a large dataset of gene-disease associations, and only a small proportion of this dataset is actually recorded in curated resources (2%), raising several issues on data prioritization and curation. We propose that joint analysis of text mined data with data curated by experts appears as a suitable approach to both assess data quality and highlight novel and interesting information.

  12. Biomedical text mining for research rigor and integrity: tasks, challenges, directions.

    PubMed

    Kilicoglu, Halil

    2017-06-13

    An estimated quarter of a trillion US dollars is invested in the biomedical research enterprise annually. There is growing alarm that a significant portion of this investment is wasted because of problems in reproducibility of research findings and in the rigor and integrity of research conduct and reporting. Recent years have seen a flurry of activities focusing on standardization and guideline development to enhance the reproducibility and rigor of biomedical research. Research activity is primarily communicated via textual artifacts, ranging from grant applications to journal publications. These artifacts can be both the source and the manifestation of practices leading to research waste. For example, an article may describe a poorly designed experiment, or the authors may reach conclusions not supported by the evidence presented. In this article, we pose the question of whether biomedical text mining techniques can assist the stakeholders in the biomedical research enterprise in doing their part toward enhancing research integrity and rigor. In particular, we identify four key areas in which text mining techniques can make a significant contribution: plagiarism/fraud detection, ensuring adherence to reporting guidelines, managing information overload and accurate citation/enhanced bibliometrics. We review the existing methods and tools for specific tasks, if they exist, or discuss relevant research that can provide guidance for future work. With the exponential increase in biomedical research output and the ability of text mining approaches to perform automatic tasks at large scale, we propose that such approaches can support tools that promote responsible research practices, providing significant benefits for the biomedical research enterprise. Published by Oxford University Press 2017. This work is written by a US Government employee and is in the public domain in the US.

  13. Development and testing of a text-mining approach to analyse patients' comments on their experiences of colorectal cancer care.

    PubMed

    Wagland, Richard; Recio-Saucedo, Alejandra; Simon, Michael; Bracher, Michael; Hunt, Katherine; Foster, Claire; Downing, Amy; Glaser, Adam; Corner, Jessica

    2016-08-01

    Quality of cancer care may greatly impact on patients' health-related quality of life (HRQoL). Free-text responses to patient-reported outcome measures (PROMs) provide rich data but analysis is time and resource-intensive. This study developed and tested a learning-based text-mining approach to facilitate analysis of patients' experiences of care and develop an explanatory model illustrating impact on HRQoL. Respondents to a population-based survey of colorectal cancer survivors provided free-text comments regarding their experience of living with and beyond cancer. An existing coding framework was tested and adapted, which informed learning-based text mining of the data. Machine-learning algorithms were trained to identify comments relating to patients' specific experiences of service quality, which were verified by manual qualitative analysis. Comparisons between coded retrieved comments and a HRQoL measure (EQ5D) were explored. The survey response rate was 63.3% (21 802/34 467), of which 25.8% (n=5634) participants provided free-text comments. Of retrieved comments on experiences of care (n=1688), over half (n=1045, 62%) described positive care experiences. Most negative experiences concerned a lack of post-treatment care (n=191, 11% of retrieved comments) and insufficient information concerning self-management strategies (n=135, 8%) or treatment side effects (n=160, 9%). Associations existed between HRQoL scores and coded algorithm-retrieved comments. Analysis indicated that the mechanism by which service quality impacted on HRQoL was the extent to which services prevented or alleviated challenges associated with disease and treatment burdens. Learning-based text mining techniques were found useful and practical tools to identify specific free-text comments within a large dataset, facilitating resource-efficient qualitative analysis. This method should be considered for future PROM analysis to inform policy and practice. Study findings indicated that perceived care quality directly impacts on HRQoL. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://www.bmj.com/company/products-services/rights-and-licensing/

  14. Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges.

    PubMed

    Singhal, Ayush; Leaman, Robert; Catlett, Natalie; Lemberger, Thomas; McEntyre, Johanna; Polson, Shawn; Xenarios, Ioannis; Arighi, Cecilia; Lu, Zhiyong

    2016-01-01

    Text mining in the biomedical sciences is rapidly transitioning from small-scale evaluation to large-scale application. In this article, we argue that text-mining technologies have become essential tools in real-world biomedical research. We describe four large scale applications of text mining, as showcased during a recent panel discussion at the BioCreative V Challenge Workshop. We draw on these applications as case studies to characterize common requirements for successfully applying text-mining techniques to practical biocuration needs. We note that system 'accuracy' remains a challenge and identify several additional common difficulties and potential research directions including (i) the 'scalability' issue due to the increasing need of mining information from millions of full-text articles, (ii) the 'interoperability' issue of integrating various text-mining systems into existing curation workflows and (iii) the 'reusability' issue on the difficulty of applying trained systems to text genres that are not seen previously during development. We then describe related efforts within the text-mining community, with a special focus on the BioCreative series of challenge workshops. We believe that focusing on the near-term challenges identified in this work will amplify the opportunities afforded by the continued adoption of text-mining tools. Finally, in order to sustain the curation ecosystem and have text-mining systems adopted for practical benefits, we call for increased collaboration between text-mining researchers and various stakeholders, including researchers, publishers and biocurators. Published by Oxford University Press 2016. This work is written by US Government employees and is in the public domain in the US.

  15. Systematic analysis of molecular mechanisms for HCC metastasis via text mining approach.

    PubMed

    Zhen, Cheng; Zhu, Caizhong; Chen, Haoyang; Xiong, Yiru; Tan, Junyuan; Chen, Dong; Li, Jin

    2017-02-21

    To systematically explore the molecular mechanism for hepatocellular carcinoma (HCC) metastasis and identify regulatory genes with text mining methods. Genes with highest frequencies and significant pathways related to HCC metastasis were listed. A handful of proteins such as EGFR, MDM2, TP53 and APP, were identified as hub nodes in PPI (protein-protein interaction) network. Compared with unique genes for HBV-HCCs, genes particular to HCV-HCCs were less, but may participate in more extensive signaling processes. VEGFA, PI3KCA, MAPK1, MMP9 and other genes may play important roles in multiple phenotypes of metastasis. Genes in abstracts of HCC-metastasis literatures were identified. Word frequency analysis, KEGG pathway and PPI network analysis were performed. Then co-occurrence analysis between genes and metastasis-related phenotypes were carried out. Text mining is effective for revealing potential regulators or pathways, but the purpose of it should be specific, and the combination of various methods will be more useful.

  16. Text Mining for Adverse Drug Events: the Promise, Challenges, and State of the Art

    PubMed Central

    Harpaz, Rave; Callahan, Alison; Tamang, Suzanne; Low, Yen; Odgers, David; Finlayson, Sam; Jung, Kenneth; LePendu, Paea; Shah, Nigam H.

    2014-01-01

    Text mining is the computational process of extracting meaningful information from large amounts of unstructured text. Text mining is emerging as a tool to leverage underutilized data sources that can improve pharmacovigilance, including the objective of adverse drug event detection and assessment. This article provides an overview of recent advances in pharmacovigilance driven by the application of text mining, and discusses several data sources—such as biomedical literature, clinical narratives, product labeling, social media, and Web search logs—that are amenable to text-mining for pharmacovigilance. Given the state of the art, it appears text mining can be applied to extract useful ADE-related information from multiple textual sources. Nonetheless, further research is required to address remaining technical challenges associated with the text mining methodologies, and to conclusively determine the relative contribution of each textual source to improving pharmacovigilance. PMID:25151493

  17. Text Mining.

    ERIC Educational Resources Information Center

    Trybula, Walter J.

    1999-01-01

    Reviews the state of research in text mining, focusing on newer developments. The intent is to describe the disparate investigations currently included under the term text mining and provide a cohesive structure for these efforts. A summary of research identifies key organizations responsible for pushing the development of text mining. A section…

  18. DiMeX: A Text Mining System for Mutation-Disease Association Extraction.

    PubMed

    Mahmood, A S M Ashique; Wu, Tsung-Jung; Mazumder, Raja; Vijay-Shanker, K

    2016-01-01

    The number of published articles describing associations between mutations and diseases is increasing at a fast pace. There is a pressing need to gather such mutation-disease associations into public knowledge bases, but manual curation slows down the growth of such databases. We have addressed this problem by developing a text-mining system (DiMeX) to extract mutation to disease associations from publication abstracts. DiMeX consists of a series of natural language processing modules that preprocess input text and apply syntactic and semantic patterns to extract mutation-disease associations. DiMeX achieves high precision and recall with F-scores of 0.88, 0.91 and 0.89 when evaluated on three different datasets for mutation-disease associations. DiMeX includes a separate component that extracts mutation mentions in text and associates them with genes. This component has been also evaluated on different datasets and shown to achieve state-of-the-art performance. The results indicate that our system outperforms the existing mutation-disease association tools, addressing the low precision problems suffered by most approaches. DiMeX was applied on a large set of abstracts from Medline to extract mutation-disease associations, as well as other relevant information including patient/cohort size and population data. The results are stored in a database that can be queried and downloaded at http://biotm.cis.udel.edu/dimex/. We conclude that this high-throughput text-mining approach has the potential to significantly assist researchers and curators to enrich mutation databases.

  19. DiMeX: A Text Mining System for Mutation-Disease Association Extraction

    PubMed Central

    Mahmood, A. S. M. Ashique; Wu, Tsung-Jung; Mazumder, Raja; Vijay-Shanker, K.

    2016-01-01

    The number of published articles describing associations between mutations and diseases is increasing at a fast pace. There is a pressing need to gather such mutation-disease associations into public knowledge bases, but manual curation slows down the growth of such databases. We have addressed this problem by developing a text-mining system (DiMeX) to extract mutation to disease associations from publication abstracts. DiMeX consists of a series of natural language processing modules that preprocess input text and apply syntactic and semantic patterns to extract mutation-disease associations. DiMeX achieves high precision and recall with F-scores of 0.88, 0.91 and 0.89 when evaluated on three different datasets for mutation-disease associations. DiMeX includes a separate component that extracts mutation mentions in text and associates them with genes. This component has been also evaluated on different datasets and shown to achieve state-of-the-art performance. The results indicate that our system outperforms the existing mutation-disease association tools, addressing the low precision problems suffered by most approaches. DiMeX was applied on a large set of abstracts from Medline to extract mutation-disease associations, as well as other relevant information including patient/cohort size and population data. The results are stored in a database that can be queried and downloaded at http://biotm.cis.udel.edu/dimex/. We conclude that this high-throughput text-mining approach has the potential to significantly assist researchers and curators to enrich mutation databases. PMID:27073839

  20. Biomedical text mining and its applications in cancer research.

    PubMed

    Zhu, Fei; Patumcharoenpol, Preecha; Zhang, Cheng; Yang, Yang; Chan, Jonathan; Meechai, Asawin; Vongsangnak, Wanwipa; Shen, Bairong

    2013-04-01

    Cancer is a malignant disease that has caused millions of human deaths. Its study has a long history of well over 100years. There have been an enormous number of publications on cancer research. This integrated but unstructured biomedical text is of great value for cancer diagnostics, treatment, and prevention. The immense body and rapid growth of biomedical text on cancer has led to the appearance of a large number of text mining techniques aimed at extracting novel knowledge from scientific text. Biomedical text mining on cancer research is computationally automatic and high-throughput in nature. However, it is error-prone due to the complexity of natural language processing. In this review, we introduce the basic concepts underlying text mining and examine some frequently used algorithms, tools, and data sets, as well as assessing how much these algorithms have been utilized. We then discuss the current state-of-the-art text mining applications in cancer research and we also provide some resources for cancer text mining. With the development of systems biology, researchers tend to understand complex biomedical systems from a systems biology viewpoint. Thus, the full utilization of text mining to facilitate cancer systems biology research is fast becoming a major concern. To address this issue, we describe the general workflow of text mining in cancer systems biology and each phase of the workflow. We hope that this review can (i) provide a useful overview of the current work of this field; (ii) help researchers to choose text mining tools and datasets; and (iii) highlight how to apply text mining to assist cancer systems biology research. Copyright © 2012 Elsevier Inc. All rights reserved.

  1. Stopping Antidepressants and Anxiolytics as Major Concerns Reported in Online Health Communities: A Text Mining Approach.

    PubMed

    Abbe, Adeline; Falissard, Bruno

    2017-10-23

    Internet is a particularly dynamic way to quickly capture the perceptions of a population in real time. Complementary to traditional face-to-face communication, online social networks help patients to improve self-esteem and self-help. The aim of this study was to use text mining on material from an online forum exploring patients' concerns about treatment (antidepressants and anxiolytics). Concerns about treatment were collected from discussion titles in patients' online community related to antidepressants and anxiolytics. To examine the content of these titles automatically, we used text mining methods, such as word frequency in a document-term matrix and co-occurrence of words using a network analysis. It was thus possible to identify topics discussed on the forum. The forum included 2415 discussions on antidepressants and anxiolytics over a period of 3 years. After a preprocessing step, the text mining algorithm identified the 99 most frequently occurring words in titles, among which were escitalopram, withdrawal, antidepressant, venlafaxine, paroxetine, and effect. Patients' concerns were related to antidepressant withdrawal, the need to share experience about symptoms, effects, and questions on weight gain with some drugs. Patients' expression on the Internet is a potential additional resource in addressing patients' concerns about treatment. Patient profiles are close to that of patients treated in psychiatry. ©Adeline Abbe, Bruno Falissard. Originally published in JMIR Mental Health (http://mental.jmir.org), 23.10.2017.

  2. Coronary artery disease risk assessment from unstructured electronic health records using text mining.

    PubMed

    Jonnagaddala, Jitendra; Liaw, Siaw-Teng; Ray, Pradeep; Kumar, Manish; Chang, Nai-Wen; Dai, Hong-Jie

    2015-12-01

    Coronary artery disease (CAD) often leads to myocardial infarction, which may be fatal. Risk factors can be used to predict CAD, which may subsequently lead to prevention or early intervention. Patient data such as co-morbidities, medication history, social history and family history are required to determine the risk factors for a disease. However, risk factor data are usually embedded in unstructured clinical narratives if the data is not collected specifically for risk assessment purposes. Clinical text mining can be used to extract data related to risk factors from unstructured clinical notes. This study presents methods to extract Framingham risk factors from unstructured electronic health records using clinical text mining and to calculate 10-year coronary artery disease risk scores in a cohort of diabetic patients. We developed a rule-based system to extract risk factors: age, gender, total cholesterol, HDL-C, blood pressure, diabetes history and smoking history. The results showed that the output from the text mining system was reliable, but there was a significant amount of missing data to calculate the Framingham risk score. A systematic approach for understanding missing data was followed by implementation of imputation strategies. An analysis of the 10-year Framingham risk scores for coronary artery disease in this cohort has shown that the majority of the diabetic patients are at moderate risk of CAD. Copyright © 2015 Elsevier Inc. All rights reserved.

  3. Combining Natural Language Processing and Statistical Text Mining: A Study of Specialized versus Common Languages

    ERIC Educational Resources Information Center

    Jarman, Jay

    2011-01-01

    This dissertation focuses on developing and evaluating hybrid approaches for analyzing free-form text in the medical domain. This research draws on natural language processing (NLP) techniques that are used to parse and extract concepts based on a controlled vocabulary. Once important concepts are extracted, additional machine learning algorithms,…

  4. Text mining meets workflow: linking U-Compare with Taverna

    PubMed Central

    Kano, Yoshinobu; Dobson, Paul; Nakanishi, Mio; Tsujii, Jun'ichi; Ananiadou, Sophia

    2010-01-01

    Summary: Text mining from the biomedical literature is of increasing importance, yet it is not easy for the bioinformatics community to create and run text mining workflows due to the lack of accessibility and interoperability of the text mining resources. The U-Compare system provides a wide range of bio text mining resources in a highly interoperable workflow environment where workflows can very easily be created, executed, evaluated and visualized without coding. We have linked U-Compare to Taverna, a generic workflow system, to expose text mining functionality to the bioinformatics community. Availability: http://u-compare.org/taverna.html, http://u-compare.org Contact: kano@is.s.u-tokyo.ac.jp Supplementary information: Supplementary data are available at Bioinformatics online. PMID:20709690

  5. Survey of Natural Language Processing Techniques in Bioinformatics.

    PubMed

    Zeng, Zhiqiang; Shi, Hua; Wu, Yun; Hong, Zhiling

    2015-01-01

    Informatics methods, such as text mining and natural language processing, are always involved in bioinformatics research. In this study, we discuss text mining and natural language processing methods in bioinformatics from two perspectives. First, we aim to search for knowledge on biology, retrieve references using text mining methods, and reconstruct databases. For example, protein-protein interactions and gene-disease relationship can be mined from PubMed. Then, we analyze the applications of text mining and natural language processing techniques in bioinformatics, including predicting protein structure and function, detecting noncoding RNA. Finally, numerous methods and applications, as well as their contributions to bioinformatics, are discussed for future use by text mining and natural language processing researchers.

  6. Combining QSAR Modeling and Text-Mining Techniques to Link Chemical Structures and Carcinogenic Modes of Action.

    PubMed

    Papamokos, George; Silins, Ilona

    2016-01-01

    There is an increasing need for new reliable non-animal based methods to predict and test toxicity of chemicals. Quantitative structure-activity relationship (QSAR), a computer-based method linking chemical structures with biological activities, is used in predictive toxicology. In this study, we tested the approach to combine QSAR data with literature profiles of carcinogenic modes of action automatically generated by a text-mining tool. The aim was to generate data patterns to identify associations between chemical structures and biological mechanisms related to carcinogenesis. Using these two methods, individually and combined, we evaluated 96 rat carcinogens of the hematopoietic system, liver, lung, and skin. We found that skin and lung rat carcinogens were mainly mutagenic, while the group of carcinogens affecting the hematopoietic system and the liver also included a large proportion of non-mutagens. The automatic literature analysis showed that mutagenicity was a frequently reported endpoint in the literature of these carcinogens, however, less common endpoints such as immunosuppression and hormonal receptor-mediated effects were also found in connection with some of the carcinogens, results of potential importance for certain target organs. The combined approach, using QSAR and text-mining techniques, could be useful for identifying more detailed information on biological mechanisms and the relation with chemical structures. The method can be particularly useful in increasing the understanding of structure and activity relationships for non-mutagens.

  7. Combining QSAR Modeling and Text-Mining Techniques to Link Chemical Structures and Carcinogenic Modes of Action

    PubMed Central

    Papamokos, George; Silins, Ilona

    2016-01-01

    There is an increasing need for new reliable non-animal based methods to predict and test toxicity of chemicals. Quantitative structure-activity relationship (QSAR), a computer-based method linking chemical structures with biological activities, is used in predictive toxicology. In this study, we tested the approach to combine QSAR data with literature profiles of carcinogenic modes of action automatically generated by a text-mining tool. The aim was to generate data patterns to identify associations between chemical structures and biological mechanisms related to carcinogenesis. Using these two methods, individually and combined, we evaluated 96 rat carcinogens of the hematopoietic system, liver, lung, and skin. We found that skin and lung rat carcinogens were mainly mutagenic, while the group of carcinogens affecting the hematopoietic system and the liver also included a large proportion of non-mutagens. The automatic literature analysis showed that mutagenicity was a frequently reported endpoint in the literature of these carcinogens, however, less common endpoints such as immunosuppression and hormonal receptor-mediated effects were also found in connection with some of the carcinogens, results of potential importance for certain target organs. The combined approach, using QSAR and text-mining techniques, could be useful for identifying more detailed information on biological mechanisms and the relation with chemical structures. The method can be particularly useful in increasing the understanding of structure and activity relationships for non-mutagens. PMID:27625608

  8. Mining concepts of health responsibility using text mining and exploratory graph analysis.

    PubMed

    Kjellström, Sofia; Golino, Hudson

    2018-05-24

    Occupational therapists need to know about people's beliefs about personal responsibility for health to help them pursue everyday activities. The study aims to employ state-of-the-art quantitative approaches to understand people's views of health and responsibility at different ages. A mixed method approach was adopted, using text mining to extract information from 233 interviews with participants aged 5 to 96 years, and then exploratory graph analysis to estimate the number of latent variables. The fit of the structure estimated via the exploratory graph analysis was verified using confirmatory factor analysis. Exploratory graph analysis estimated three dimensions of health responsibility: (1) creating good health habits and feeling good; (2) thinking about one's own health and wanting to improve it; and 3) adopting explicitly normative attitudes to take care of one's health. The comparison between the three dimensions among age groups showed, in general, that children and adolescents, as well as the old elderly (>73 years old) expressed ideas about personal responsibility for health less than young adults, adults and young elderly. Occupational therapists' knowledge of the concepts of health responsibility is of value when working with a patient's health, but an identified challenge is how to engage children and older persons.

  9. Text mining for adverse drug events: the promise, challenges, and state of the art.

    PubMed

    Harpaz, Rave; Callahan, Alison; Tamang, Suzanne; Low, Yen; Odgers, David; Finlayson, Sam; Jung, Kenneth; LePendu, Paea; Shah, Nigam H

    2014-10-01

    Text mining is the computational process of extracting meaningful information from large amounts of unstructured text. It is emerging as a tool to leverage underutilized data sources that can improve pharmacovigilance, including the objective of adverse drug event (ADE) detection and assessment. This article provides an overview of recent advances in pharmacovigilance driven by the application of text mining, and discusses several data sources-such as biomedical literature, clinical narratives, product labeling, social media, and Web search logs-that are amenable to text mining for pharmacovigilance. Given the state of the art, it appears text mining can be applied to extract useful ADE-related information from multiple textual sources. Nonetheless, further research is required to address remaining technical challenges associated with the text mining methodologies, and to conclusively determine the relative contribution of each textual source to improving pharmacovigilance.

  10. A semantic-based method for extracting concept definitions from scientific publications: evaluation in the autism phenotype domain.

    PubMed

    Hassanpour, Saeed; O'Connor, Martin J; Das, Amar K

    2013-08-12

    A variety of informatics approaches have been developed that use information retrieval, NLP and text-mining techniques to identify biomedical concepts and relations within scientific publications or their sentences. These approaches have not typically addressed the challenge of extracting more complex knowledge such as biomedical definitions. In our efforts to facilitate knowledge acquisition of rule-based definitions of autism phenotypes, we have developed a novel semantic-based text-mining approach that can automatically identify such definitions within text. Using an existing knowledge base of 156 autism phenotype definitions and an annotated corpus of 26 source articles containing such definitions, we evaluated and compared the average rank of correctly identified rule definition or corresponding rule template using both our semantic-based approach and a standard term-based approach. We examined three separate scenarios: (1) the snippet of text contained a definition already in the knowledge base; (2) the snippet contained an alternative definition for a concept in the knowledge base; and (3) the snippet contained a definition not in the knowledge base. Our semantic-based approach had a higher average rank than the term-based approach for each of the three scenarios (scenario 1: 3.8 vs. 5.0; scenario 2: 2.8 vs. 4.9; and scenario 3: 4.5 vs. 6.2), with each comparison significant at the p-value of 0.05 using the Wilcoxon signed-rank test. Our work shows that leveraging existing domain knowledge in the information extraction of biomedical definitions significantly improves the correct identification of such knowledge within sentences. Our method can thus help researchers rapidly acquire knowledge about biomedical definitions that are specified and evolving within an ever-growing corpus of scientific publications.

  11. Health Terrain: Visualizing Large Scale Health Data

    DTIC Science & Technology

    2015-12-01

    Text mining ; Data mining . 16. SECURITY  CLASSIFICATION  OF: 17... text   mining  algorithms  to  construct  a  concept  space.  A   browser-­‐based  user  interface  is  developed  to...Public  health  data,  Notifiable  condition  detector,   Text   mining ,  Data   mining   4 of 29 Disease Patient Location Term

  12. Integration of Text- and Data-Mining Technologies for Use in Banking Applications

    NASA Astrophysics Data System (ADS)

    Maslankowski, Jacek

    Unstructured data, most of it in the form of text files, typically accounts for 85% of an organization's knowledge stores, but it's not always easy to find, access, analyze or use (Robb 2004). That is why it is important to use solutions based on text and data mining. This solution is known as duo mining. This leads to improve management based on knowledge owned in organization. The results are interesting. Data mining provides to lead with structuralized data, usually powered from data warehouses. Text mining, sometimes called web mining, looks for patterns in unstructured data — memos, document and www. Integrating text-based information with structured data enriches predictive modeling capabilities and provides new stores of insightful and valuable information for driving business and research initiatives forward.

  13. Fuzzy Document Clustering Approach using WordNet Lexical Categories

    NASA Astrophysics Data System (ADS)

    Gharib, Tarek F.; Fouad, Mohammed M.; Aref, Mostafa M.

    Text mining refers generally to the process of extracting interesting information and knowledge from unstructured text. This area is growing rapidly mainly because of the strong need for analysing the huge and large amount of textual data that reside on internal file systems and the Web. Text document clustering provides an effective navigation mechanism to organize this large amount of data by grouping their documents into a small number of meaningful classes. In this paper we proposed a fuzzy text document clustering approach using WordNet lexical categories and Fuzzy c-Means algorithm. Some experiments are performed to compare efficiency of the proposed approach with the recently reported approaches. Experimental results show that Fuzzy clustering leads to great performance results. Fuzzy c-means algorithm overcomes other classical clustering algorithms like k-means and bisecting k-means in both clustering quality and running time efficiency.

  14. SparkText: Biomedical Text Mining on Big Data Framework.

    PubMed

    Ye, Zhan; Tafti, Ahmad P; He, Karen Y; Wang, Kai; He, Max M

    Many new biomedical research articles are published every day, accumulating rich information, such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment. In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance for classifying cancer types, we extracted information (e.g., breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes. This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time update from new articles published daily. SparkText can be extended to other areas of biomedical research.

  15. SparkText: Biomedical Text Mining on Big Data Framework

    PubMed Central

    He, Karen Y.; Wang, Kai

    2016-01-01

    Background Many new biomedical research articles are published every day, accumulating rich information, such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment. Results In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance for classifying cancer types, we extracted information (e.g., breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes. Conclusions This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time update from new articles published daily. SparkText can be extended to other areas of biomedical research. PMID:27685652

  16. Text mining in livestock animal science: introducing the potential of text mining to animal sciences.

    PubMed

    Sahadevan, S; Hofmann-Apitius, M; Schellander, K; Tesfaye, D; Fluck, J; Friedrich, C M

    2012-10-01

    In biological research, establishing the prior art by searching and collecting information already present in the domain has equal importance as the experiments done. To obtain a complete overview about the relevant knowledge, researchers mainly rely on 2 major information sources: i) various biological databases and ii) scientific publications in the field. The major difference between the 2 information sources is that information from databases is available, typically well structured and condensed. The information content in scientific literature is vastly unstructured; that is, dispersed among the many different sections of scientific text. The traditional method of information extraction from scientific literature occurs by generating a list of relevant publications in the field of interest and manually scanning these texts for relevant information, which is very time consuming. It is more than likely that in using this "classical" approach the researcher misses some relevant information mentioned in the literature or has to go through biological databases to extract further information. Text mining and named entity recognition methods have already been used in human genomics and related fields as a solution to this problem. These methods can process and extract information from large volumes of scientific text. Text mining is defined as the automatic extraction of previously unknown and potentially useful information from text. Named entity recognition (NER) is defined as the method of identifying named entities (names of real world objects; for example, gene/protein names, drugs, enzymes) in text. In animal sciences, text mining and related methods have been briefly used in murine genomics and associated fields, leaving behind other fields of animal sciences, such as livestock genomics. The aim of this work was to develop an information retrieval platform in the livestock domain focusing on livestock publications and the recognition of relevant data from cattle and pigs. For this purpose, the rather noncomprehensive resources of pig and cattle gene and protein terminologies were enriched with orthologue synonyms, integrated in the NER platform, ProMiner, which is successfully used in human genomics domain. Based on the performance tests done, the present system achieved a fair performance with precision 0.64, recall 0.74, and F(1) measure of 0.69 in a test scenario based on cattle literature.

  17. A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts.

    PubMed

    Westergaard, David; Stærfeldt, Hans-Henrik; Tønsberg, Christian; Jensen, Lars Juhl; Brunak, Søren

    2018-02-01

    Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823-2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein-protein, disease-gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.

  18. A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts

    PubMed Central

    Westergaard, David; Stærfeldt, Hans-Henrik

    2018-01-01

    Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823–2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein–protein, disease–gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only. PMID:29447159

  19. PubRunner: A light-weight framework for updating text mining results.

    PubMed

    Anekalla, Kishore R; Courneya, J P; Fiorini, Nicolas; Lever, Jake; Muchow, Michael; Busby, Ben

    2017-01-01

    Biomedical text mining promises to assist biologists in quickly navigating the combined knowledge in their domain. This would allow improved understanding of the complex interactions within biological systems and faster hypothesis generation. New biomedical research articles are published daily and text mining tools are only as good as the corpus from which they work. Many text mining tools are underused because their results are static and do not reflect the constantly expanding knowledge in the field. In order for biomedical text mining to become an indispensable tool used by researchers, this problem must be addressed. To this end, we present PubRunner, a framework for regularly running text mining tools on the latest publications. PubRunner is lightweight, simple to use, and can be integrated with an existing text mining tool. The workflow involves downloading the latest abstracts from PubMed, executing a user-defined tool, pushing the resulting data to a public FTP or Zenodo dataset, and publicizing the location of these results on the public PubRunner website. We illustrate the use of this tool by re-running the commonly used word2vec tool on the latest PubMed abstracts to generate up-to-date word vector representations for the biomedical domain. This shows a proof of concept that we hope will encourage text mining developers to build tools that truly will aid biologists in exploring the latest publications.

  20. Climate policy: Uncovering ocean-related priorities

    NASA Astrophysics Data System (ADS)

    Barkemeyer, Ralf

    2017-11-01

    Given the complexity and multi-faceted nature of policy processes, national-level policy preferences are notoriously difficult to capture. Now, research applying an automated text mining approach helps to shed light on country-level differences and priorities in the context of marine climate issues.

  1. Text mining for the biocuration workflow

    PubMed Central

    Hirschman, Lynette; Burns, Gully A. P. C; Krallinger, Martin; Arighi, Cecilia; Cohen, K. Bretonnel; Valencia, Alfonso; Wu, Cathy H.; Chatr-Aryamontri, Andrew; Dowell, Karen G.; Huala, Eva; Lourenço, Anália; Nash, Robert; Veuthey, Anne-Lise; Wiegers, Thomas; Winter, Andrew G.

    2012-01-01

    Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations took place for the workshop on ‘Text Mining for the BioCuration Workflow’ at the third International Biocuration Conference (Berlin, 2009). We interviewed biocurators to obtain workflows from eight biological databases. This initial study revealed high-level commonalities, including (i) selection of documents for curation; (ii) indexing of documents with biologically relevant entities (e.g. genes); and (iii) detailed curation of specific relations (e.g. interactions); however, the detailed workflows also showed many variabilities. Following the workshop, we conducted a survey of biocurators. The survey identified biocurator priorities, including the handling of full text indexed with biological entities and support for the identification and prioritization of documents for curation. It also indicated that two-thirds of the biocuration teams had experimented with text mining and almost half were using text mining at that time. Analysis of our interviews and survey provide a set of requirements for the integration of text mining into the biocuration workflow. These can guide the identification of common needs across curated databases and encourage joint experimentation involving biocurators, text mining developers and the larger biomedical research community. PMID:22513129

  2. Text mining for the biocuration workflow.

    PubMed

    Hirschman, Lynette; Burns, Gully A P C; Krallinger, Martin; Arighi, Cecilia; Cohen, K Bretonnel; Valencia, Alfonso; Wu, Cathy H; Chatr-Aryamontri, Andrew; Dowell, Karen G; Huala, Eva; Lourenço, Anália; Nash, Robert; Veuthey, Anne-Lise; Wiegers, Thomas; Winter, Andrew G

    2012-01-01

    Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations took place for the workshop on 'Text Mining for the BioCuration Workflow' at the third International Biocuration Conference (Berlin, 2009). We interviewed biocurators to obtain workflows from eight biological databases. This initial study revealed high-level commonalities, including (i) selection of documents for curation; (ii) indexing of documents with biologically relevant entities (e.g. genes); and (iii) detailed curation of specific relations (e.g. interactions); however, the detailed workflows also showed many variabilities. Following the workshop, we conducted a survey of biocurators. The survey identified biocurator priorities, including the handling of full text indexed with biological entities and support for the identification and prioritization of documents for curation. It also indicated that two-thirds of the biocuration teams had experimented with text mining and almost half were using text mining at that time. Analysis of our interviews and survey provide a set of requirements for the integration of text mining into the biocuration workflow. These can guide the identification of common needs across curated databases and encourage joint experimentation involving biocurators, text mining developers and the larger biomedical research community.

  3. GoWeb: a semantic search engine for the life science web.

    PubMed

    Dietze, Heiko; Schroeder, Michael

    2009-10-01

    Current search engines are keyword-based. Semantic technologies promise a next generation of semantic search engines, which will be able to answer questions. Current approaches either apply natural language processing to unstructured text or they assume the existence of structured statements over which they can reason. Here, we introduce a third approach, GoWeb, which combines classical keyword-based Web search with text-mining and ontologies to navigate large results sets and facilitate question answering. We evaluate GoWeb on three benchmarks of questions on genes and functions, on symptoms and diseases, and on proteins and diseases. The first benchmark is based on the BioCreAtivE 1 Task 2 and links 457 gene names with 1352 functions. GoWeb finds 58% of the functional GeneOntology annotations. The second benchmark is based on 26 case reports and links symptoms with diseases. GoWeb achieves 77% success rate improving an existing approach by nearly 20%. The third benchmark is based on 28 questions in the TREC genomics challenge and links proteins to diseases. GoWeb achieves a success rate of 79%. GoWeb's combination of classical Web search with text-mining and ontologies is a first step towards answering questions in the biomedical domain. GoWeb is online at: http://www.gopubmed.org/goweb.

  4. Frontiers of biomedical text mining: current progress

    PubMed Central

    Zweigenbaum, Pierre; Demner-Fushman, Dina; Yu, Hong; Cohen, Kevin B.

    2008-01-01

    It is now almost 15 years since the publication of the first paper on text mining in the genomics domain, and decades since the first paper on text mining in the medical domain. Enormous progress has been made in the areas of information retrieval, evaluation methodologies and resource construction. Some problems, such as abbreviation-handling, can essentially be considered solved problems, and others, such as identification of gene mentions in text, seem likely to be solved soon. However, a number of problems at the frontiers of biomedical text mining continue to present interesting challenges and opportunities for great improvements and interesting research. In this article we review the current state of the art in biomedical text mining or ‘BioNLP’ in general, focusing primarily on papers published within the past year. PMID:17977867

  5. Sentiment analysis of Arabic tweets using text mining techniques

    NASA Astrophysics Data System (ADS)

    Al-Horaibi, Lamia; Khan, Muhammad Badruddin

    2016-07-01

    Sentiment analysis has become a flourishing field of text mining and natural language processing. Sentiment analysis aims to determine whether the text is written to express positive, negative, or neutral emotions about a certain domain. Most sentiment analysis researchers focus on English texts, with very limited resources available for other complex languages, such as Arabic. In this study, the target was to develop an initial model that performs satisfactorily and measures Arabic Twitter sentiment by using machine learning approach, Naïve Bayes and Decision Tree for classification algorithms. The datasets used contains more than 2,000 Arabic tweets collected from Twitter. We performed several experiments to check the performance of the two algorithms classifiers using different combinations of text-processing functions. We found that available facilities for Arabic text processing need to be made from scratch or improved to develop accurate classifiers. The small functionalities developed by us in a Python language environment helped improve the results and proved that sentiment analysis in the Arabic domain needs lot of work on the lexicon side.

  6. Automated detection of follow-up appointments using text mining of discharge records.

    PubMed

    Ruud, Kari L; Johnson, Matthew G; Liesinger, Juliette T; Grafft, Carrie A; Naessens, James M

    2010-06-01

    To determine whether text mining can accurately detect specific follow-up appointment criteria in free-text hospital discharge records. Cross-sectional study. Mayo Clinic Rochester hospitals. Inpatients discharged from general medicine services in 2006 (n = 6481). Textual hospital dismissal summaries were manually reviewed to determine whether the records contained specific follow-up appointment arrangement elements: date, time and either physician or location for an appointment. The data set was evaluated for the same criteria using SAS Text Miner software. The two assessments were compared to determine the accuracy of text mining for detecting records containing follow-up appointment arrangements. Agreement of text-mined appointment findings with gold standard (manual abstraction) including sensitivity, specificity, positive predictive and negative predictive values (PPV and NPV). About 55.2% (3576) of discharge records contained all criteria for follow-up appointment arrangements according to the manual review, 3.2% (113) of which were missed through text mining. Text mining incorrectly identified 3.7% (107) follow-up appointments that were not considered valid through manual review. Therefore, the text mining analysis concurred with the manual review in 96.6% of the appointment findings. Overall sensitivity and specificity were 96.8 and 96.3%, respectively; and PPV and NPV were 97.0 and 96.1%, respectively. of individual appointment criteria resulted in accuracy rates of 93.5% for date, 97.4% for time, 97.5% for physician and 82.9% for location. Text mining of unstructured hospital dismissal summaries can accurately detect documentation of follow-up appointment arrangement elements, thus saving considerable resources for performance assessment and quality-related research.

  7. Computational systems biology analysis of biomarkers in lung cancer; unravelling genomic regions which frequently encode biomarkers, enriched pathways, and new candidates.

    PubMed

    Alanazi, Ibrahim O; AlYahya, Sami A; Ebrahimie, Esmaeil; Mohammadi-Dehcheshmeh, Manijeh

    2018-06-15

    Exponentially growing scientific knowledge in scientific publications has resulted in the emergence of a new interdisciplinary science of literature mining. In text mining, the machine reads the published literature and transfers the discovered knowledge to mathematical-like formulas. In an integrative approach in this study, we used text mining in combination with network discovery, pathway analysis, and enrichment analysis of genomic regions for better understanding of biomarkers in lung cancer. Particular attention was paid to non-coding biomarkers. In total, 60 MicroRNA biomarkers were reported for lung cancer, including some prognostic biomarkers. MIR21, MIR155, MALAT1, and MIR31 were the top non-coding RNA biomarkers of lung cancer. Text mining identified 447 proteins which have been studied as biomarkers in lung cancer. EGFR (receptor), TP53 (transcription factor), KRAS, CDKN2A, ENO2, KRT19, RASSF1, GRP (ligand), SHOX2 (transcription factor), and ERBB2 (receptor) were the most studied proteins. Within small molecules, thymosin-a1, oestrogen, and 8-OHdG have received more attention. We found some chromosomal bands, such as 7q32.2, 18q12.1, 6p12, 11p15.5, and 3p21.3 that are highly involved in deriving lung cancer biomarkers. Copyright © 2018 Elsevier B.V. All rights reserved.

  8. Trends in Fetal Medicine: A 10-Year Bibliometric Analysis of Prenatal Diagnosis

    PubMed Central

    Dhombres, Ferdinand; Bodenreider, Olivier

    2018-01-01

    The objective is to automatically identify trends in Fetal Medicine over the past 10 years through a bibliometric analysis of articles published in Prenatal Diagnosis, using text mining techniques. We processed 2,423 full-text articles published in Prenatal Diagnosis between 2006 and 2015. We extracted salient terms, calculated their frequencies over time, and established evolution profiles for terms, from which we derived falling, stable, and rising trends. We identified 618 terms with a falling trend, 2,142 stable terms, and 839 terms with a rising trend. Terms with increasing frequencies include those related to statistics and medical study design. The most recent of these terms reflect the new opportunities of next- generation sequencing. Many terms related to cytogenetics exhibit a falling trend. A bibliometric analysis based on text mining effectively supports identification of trends over time. This scalable approach is complementary to analyses based on metadata or expert opinion. PMID:29295220

  9. Detection of interaction articles and experimental methods in biomedical literature.

    PubMed

    Schneider, Gerold; Clematide, Simon; Rinaldi, Fabio

    2011-10-03

    This article describes the approaches taken by the OntoGene group at the University of Zurich in dealing with two tasks of the BioCreative III competition: classification of articles which contain curatable protein-protein interactions (PPI-ACT) and extraction of experimental methods (PPI-IMT). Two main achievements are described in this paper: (a) a system for document classification which crucially relies on the results of an advanced pipeline of natural language processing tools; (b) a system which is capable of detecting all experimental methods mentioned in scientific literature, and listing them with a competitive ranking (AUC iP/R > 0.5). The results of the BioCreative III shared evaluation clearly demonstrate that significant progress has been achieved in the domain of biomedical text mining in the past few years. Our own contribution, together with the results of other participants, provides evidence that natural language processing techniques have become by now an integral part of advanced text mining approaches.

  10. Using uncertainty to link and rank evidence from biomedical literature for model curation

    PubMed Central

    Zerva, Chrysoula; Batista-Navarro, Riza; Day, Philip; Ananiadou, Sophia

    2017-01-01

    Abstract Motivation In recent years, there has been great progress in the field of automated curation of biomedical networks and models, aided by text mining methods that provide evidence from literature. Such methods must not only extract snippets of text that relate to model interactions, but also be able to contextualize the evidence and provide additional confidence scores for the interaction in question. Although various approaches calculating confidence scores have focused primarily on the quality of the extracted information, there has been little work on exploring the textual uncertainty conveyed by the author. Despite textual uncertainty being acknowledged in biomedical text mining as an attribute of text mined interactions (events), it is significantly understudied as a means of providing a confidence measure for interactions in pathways or other biomedical models. In this work, we focus on improving identification of textual uncertainty for events and explore how it can be used as an additional measure of confidence for biomedical models. Results We present a novel method for extracting uncertainty from the literature using a hybrid approach that combines rule induction and machine learning. Variations of this hybrid approach are then discussed, alongside their advantages and disadvantages. We use subjective logic theory to combine multiple uncertainty values extracted from different sources for the same interaction. Our approach achieves F-scores of 0.76 and 0.88 based on the BioNLP-ST and Genia-MK corpora, respectively, making considerable improvements over previously published work. Moreover, we evaluate our proposed system on pathways related to two different areas, namely leukemia and melanoma cancer research. Availability and implementation The leukemia pathway model used is available in Pathway Studio while the Ras model is available via PathwayCommons. Online demonstration of the uncertainty extraction system is available for research purposes at http://argo.nactem.ac.uk/test. The related code is available on https://github.com/c-zrv/uncertainty_components.git. Details on the above are available in the Supplementary Material. Contact sophia.ananiadou@manchester.ac.uk Supplementary information Supplementary data are available at Bioinformatics online. PMID:29036627

  11. Using uncertainty to link and rank evidence from biomedical literature for model curation.

    PubMed

    Zerva, Chrysoula; Batista-Navarro, Riza; Day, Philip; Ananiadou, Sophia

    2017-12-01

    In recent years, there has been great progress in the field of automated curation of biomedical networks and models, aided by text mining methods that provide evidence from literature. Such methods must not only extract snippets of text that relate to model interactions, but also be able to contextualize the evidence and provide additional confidence scores for the interaction in question. Although various approaches calculating confidence scores have focused primarily on the quality of the extracted information, there has been little work on exploring the textual uncertainty conveyed by the author. Despite textual uncertainty being acknowledged in biomedical text mining as an attribute of text mined interactions (events), it is significantly understudied as a means of providing a confidence measure for interactions in pathways or other biomedical models. In this work, we focus on improving identification of textual uncertainty for events and explore how it can be used as an additional measure of confidence for biomedical models. We present a novel method for extracting uncertainty from the literature using a hybrid approach that combines rule induction and machine learning. Variations of this hybrid approach are then discussed, alongside their advantages and disadvantages. We use subjective logic theory to combine multiple uncertainty values extracted from different sources for the same interaction. Our approach achieves F-scores of 0.76 and 0.88 based on the BioNLP-ST and Genia-MK corpora, respectively, making considerable improvements over previously published work. Moreover, we evaluate our proposed system on pathways related to two different areas, namely leukemia and melanoma cancer research. The leukemia pathway model used is available in Pathway Studio while the Ras model is available via PathwayCommons. Online demonstration of the uncertainty extraction system is available for research purposes at http://argo.nactem.ac.uk/test. The related code is available on https://github.com/c-zrv/uncertainty_components.git. Details on the above are available in the Supplementary Material. sophia.ananiadou@manchester.ac.uk. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press.

  12. Text mining resources for the life sciences.

    PubMed

    Przybyła, Piotr; Shardlow, Matthew; Aubin, Sophie; Bossy, Robert; Eckart de Castilho, Richard; Piperidis, Stelios; McNaught, John; Ananiadou, Sophia

    2016-01-01

    Text mining is a powerful technology for quickly distilling key information from vast quantities of biomedical literature. However, to harness this power the researcher must be well versed in the availability, suitability, adaptability, interoperability and comparative accuracy of current text mining resources. In this survey, we give an overview of the text mining resources that exist in the life sciences to help researchers, especially those employed in biocuration, to engage with text mining in their own work. We categorize the various resources under three sections: Content Discovery looks at where and how to find biomedical publications for text mining; Knowledge Encoding describes the formats used to represent the different levels of information associated with content that enable text mining, including those formats used to carry such information between processes; Tools and Services gives an overview of workflow management systems that can be used to rapidly configure and compare domain- and task-specific processes, via access to a wide range of pre-built tools. We also provide links to relevant repositories in each section to enable the reader to find resources relevant to their own area of interest. Throughout this work we give a special focus to resources that are interoperable-those that have the crucial ability to share information, enabling smooth integration and reusability. © The Author(s) 2016. Published by Oxford University Press.

  13. Text mining resources for the life sciences

    PubMed Central

    Shardlow, Matthew; Aubin, Sophie; Bossy, Robert; Eckart de Castilho, Richard; Piperidis, Stelios; McNaught, John; Ananiadou, Sophia

    2016-01-01

    Text mining is a powerful technology for quickly distilling key information from vast quantities of biomedical literature. However, to harness this power the researcher must be well versed in the availability, suitability, adaptability, interoperability and comparative accuracy of current text mining resources. In this survey, we give an overview of the text mining resources that exist in the life sciences to help researchers, especially those employed in biocuration, to engage with text mining in their own work. We categorize the various resources under three sections: Content Discovery looks at where and how to find biomedical publications for text mining; Knowledge Encoding describes the formats used to represent the different levels of information associated with content that enable text mining, including those formats used to carry such information between processes; Tools and Services gives an overview of workflow management systems that can be used to rapidly configure and compare domain- and task-specific processes, via access to a wide range of pre-built tools. We also provide links to relevant repositories in each section to enable the reader to find resources relevant to their own area of interest. Throughout this work we give a special focus to resources that are interoperable—those that have the crucial ability to share information, enabling smooth integration and reusability. PMID:27888231

  14. Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies.

    PubMed

    Cohen, Raphael; Elhadad, Michael; Elhadad, Noémie

    2013-01-16

    The increasing availability of Electronic Health Record (EHR) data and specifically free-text patient notes presents opportunities for phenotype extraction. Text-mining methods in particular can help disease modeling by mapping named-entities mentions to terminologies and clustering semantically related terms. EHR corpora, however, exhibit specific statistical and linguistic characteristics when compared with corpora in the biomedical literature domain. We focus on copy-and-paste redundancy: clinicians typically copy and paste information from previous notes when documenting a current patient encounter. Thus, within a longitudinal patient record, one expects to observe heavy redundancy. In this paper, we ask three research questions: (i) How can redundancy be quantified in large-scale text corpora? (ii) Conventional wisdom is that larger corpora yield better results in text mining. But how does the observed EHR redundancy affect text mining? Does such redundancy introduce a bias that distorts learned models? Or does the redundancy introduce benefits by highlighting stable and important subsets of the corpus? (iii) How can one mitigate the impact of redundancy on text mining? We analyze a large-scale EHR corpus and quantify redundancy both in terms of word and semantic concept repetition. We observe redundancy levels of about 30% and non-standard distribution of both words and concepts. We measure the impact of redundancy on two standard text-mining applications: collocation identification and topic modeling. We compare the results of these methods on synthetic data with controlled levels of redundancy and observe significant performance variation. Finally, we compare two mitigation strategies to avoid redundancy-induced bias: (i) a baseline strategy, keeping only the last note for each patient in the corpus; (ii) removing redundant notes with an efficient fingerprinting-based algorithm. (a)For text mining, preprocessing the EHR corpus with fingerprinting yields significantly better results. Before applying text-mining techniques, one must pay careful attention to the structure of the analyzed corpora. While the importance of data cleaning has been known for low-level text characteristics (e.g., encoding and spelling), high-level and difficult-to-quantify corpus characteristics, such as naturally occurring redundancy, can also hurt text mining. Fingerprinting enables text-mining techniques to leverage available data in the EHR corpus, while avoiding the bias introduced by redundancy.

  15. ParaBTM: A Parallel Processing Framework for Biomedical Text Mining on Supercomputers.

    PubMed

    Xing, Yuting; Wu, Chengkun; Yang, Xi; Wang, Wei; Zhu, En; Yin, Jianping

    2018-04-27

    A prevailing way of extracting valuable information from biomedical literature is to apply text mining methods on unstructured texts. However, the massive amount of literature that needs to be analyzed poses a big data challenge to the processing efficiency of text mining. In this paper, we address this challenge by introducing parallel processing on a supercomputer. We developed paraBTM, a runnable framework that enables parallel text mining on the Tianhe-2 supercomputer. It employs a low-cost yet effective load balancing strategy to maximize the efficiency of parallel processing. We evaluated the performance of paraBTM on several datasets, utilizing three types of named entity recognition tasks as demonstration. Results show that, in most cases, the processing efficiency can be greatly improved with parallel processing, and the proposed load balancing strategy is simple and effective. In addition, our framework can be readily applied to other tasks of biomedical text mining besides NER.

  16. Text-mining and information-retrieval services for molecular biology

    PubMed Central

    Krallinger, Martin; Valencia, Alfonso

    2005-01-01

    Text-mining in molecular biology - defined as the automatic extraction of information about genes, proteins and their functional relationships from text documents - has emerged as a hybrid discipline on the edges of the fields of information science, bioinformatics and computational linguistics. A range of text-mining applications have been developed recently that will improve access to knowledge for biologists and database annotators. PMID:15998455

  17. Managing biological networks by using text mining and computer-aided curation

    NASA Astrophysics Data System (ADS)

    Yu, Seok Jong; Cho, Yongseong; Lee, Min-Ho; Lim, Jongtae; Yoo, Jaesoo

    2015-11-01

    In order to understand a biological mechanism in a cell, a researcher should collect a huge number of protein interactions with experimental data from experiments and the literature. Text mining systems that extract biological interactions from papers have been used to construct biological networks for a few decades. Even though the text mining of literature is necessary to construct a biological network, few systems with a text mining tool are available for biologists who want to construct their own biological networks. We have developed a biological network construction system called BioKnowledge Viewer that can generate a biological interaction network by using a text mining tool and biological taggers. It also Boolean simulation software to provide a biological modeling system to simulate the model that is made with the text mining tool. A user can download PubMed articles and construct a biological network by using the Multi-level Knowledge Emergence Model (KMEM), MetaMap, and A Biomedical Named Entity Recognizer (ABNER) as a text mining tool. To evaluate the system, we constructed an aging-related biological network that consist 9,415 nodes (genes) by using manual curation. With network analysis, we found that several genes, including JNK, AP-1, and BCL-2, were highly related in aging biological network. We provide a semi-automatic curation environment so that users can obtain a graph database for managing text mining results that are generated in the server system and can navigate the network with BioKnowledge Viewer, which is freely available at http://bioknowledgeviewer.kisti.re.kr.

  18. An overview of the biocreative 2012 workshop track III: Interactive text mining task

    USDA-ARS?s Scientific Manuscript database

    An important question is how to make use of text mining to enhance the biocuration workflow. A number of groups have developed tools for text mining from a computer science/linguistics perspective and there are many initiatives to curate some aspect of biology from the literature. In some cases the ...

  19. Text Mining in Cancer Gene and Pathway Prioritization

    PubMed Central

    Luo, Yuan; Riedlinger, Gregory; Szolovits, Peter

    2014-01-01

    Prioritization of cancer implicated genes has received growing attention as an effective way to reduce wet lab cost by computational analysis that ranks candidate genes according to the likelihood that experimental verifications will succeed. A multitude of gene prioritization tools have been developed, each integrating different data sources covering gene sequences, differential expressions, function annotations, gene regulations, protein domains, protein interactions, and pathways. This review places existing gene prioritization tools against the backdrop of an integrative Omic hierarchy view toward cancer and focuses on the analysis of their text mining components. We explain the relatively slow progress of text mining in gene prioritization, identify several challenges to current text mining methods, and highlight a few directions where more effective text mining algorithms may improve the overall prioritization task and where prioritizing the pathways may be more desirable than prioritizing only genes. PMID:25392685

  20. Text mining in cancer gene and pathway prioritization.

    PubMed

    Luo, Yuan; Riedlinger, Gregory; Szolovits, Peter

    2014-01-01

    Prioritization of cancer implicated genes has received growing attention as an effective way to reduce wet lab cost by computational analysis that ranks candidate genes according to the likelihood that experimental verifications will succeed. A multitude of gene prioritization tools have been developed, each integrating different data sources covering gene sequences, differential expressions, function annotations, gene regulations, protein domains, protein interactions, and pathways. This review places existing gene prioritization tools against the backdrop of an integrative Omic hierarchy view toward cancer and focuses on the analysis of their text mining components. We explain the relatively slow progress of text mining in gene prioritization, identify several challenges to current text mining methods, and highlight a few directions where more effective text mining algorithms may improve the overall prioritization task and where prioritizing the pathways may be more desirable than prioritizing only genes.

  1. Generating a Spanish Affective Dictionary with Supervised Learning Techniques

    ERIC Educational Resources Information Center

    Bermudez-Gonzalez, Daniel; Miranda-Jiménez, Sabino; García-Moreno, Raúl-Ulises; Calderón-Nepamuceno, Dora

    2016-01-01

    Nowadays, machine learning techniques are being used in several Natural Language Processing (NLP) tasks such as Opinion Mining (OM). OM is used to analyse and determine the affective orientation of texts. Usually, OM approaches use affective dictionaries in order to conduct sentiment analysis. These lexicons are labeled manually with affective…

  2. Public reactions to e-cigarette regulations on Twitter: a text mining analysis.

    PubMed

    Lazard, Allison J; Wilcox, Gary B; Tuttle, Hannah M; Glowacki, Elizabeth M; Pikowski, Jessica

    2017-12-01

    In May 2016, the Food and Drug Administration (FDA) issued a final rule that deemed e-cigarettes to be within their regulatory authority as a tobacco product. News and opinions about the regulation were shared on social media platforms, such as Twitter, which can play an important role in shaping the public's attitudes. We analysed information shared on Twitter for insights into initial public reactions. A text mining approach was used to uncover important topics among reactions to the e-cigarette regulations on Twitter. SAS Text Miner V.12.1 software was used for descriptive text mining to uncover the primary topics from tweets collected from May 1 to May 17 2016 using NUVI software to gather the data. A total of nine topics were generated. These topics reveal initial reactions to whether the FDA's e-cigarette regulations will benefit or harm public health, how the regulations will impact the emerging e-cigarette market and efforts to share the news. The topics were dominated by negative or mixed reactions. In the days following the FDA's announcement of the new deeming regulations, the public reaction on Twitter was largely negative. Public health advocates should consider using social media outlets to better communicate the policy's intentions, reach and potential impact for public good to create a more balanced conversation. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2017. All rights reserved. No commercial use is permitted unless otherwise expressly granted.

  3. Using text-mining techniques in electronic patient records to identify ADRs from medicine use.

    PubMed

    Warrer, Pernille; Hansen, Ebba Holme; Juhl-Jensen, Lars; Aagaard, Lise

    2012-05-01

    This literature review included studies that use text-mining techniques in narrative documents stored in electronic patient records (EPRs) to investigate ADRs. We searched PubMed, Embase, Web of Science and International Pharmaceutical Abstracts without restrictions from origin until July 2011. We included empirically based studies on text mining of electronic patient records (EPRs) that focused on detecting ADRs, excluding those that investigated adverse events not related to medicine use. We extracted information on study populations, EPR data sources, frequencies and types of the identified ADRs, medicines associated with ADRs, text-mining algorithms used and their performance. Seven studies, all from the United States, were eligible for inclusion in the review. Studies were published from 2001, the majority between 2009 and 2010. Text-mining techniques varied over time from simple free text searching of outpatient visit notes and inpatient discharge summaries to more advanced techniques involving natural language processing (NLP) of inpatient discharge summaries. Performance appeared to increase with the use of NLP, although many ADRs were still missed. Due to differences in study design and populations, various types of ADRs were identified and thus we could not make comparisons across studies. The review underscores the feasibility and potential of text mining to investigate narrative documents in EPRs for ADRs. However, more empirical studies are needed to evaluate whether text mining of EPRs can be used systematically to collect new information about ADRs. © 2011 The Authors. British Journal of Clinical Pharmacology © 2011 The British Pharmacological Society.

  4. Using text-mining techniques in electronic patient records to identify ADRs from medicine use

    PubMed Central

    Warrer, Pernille; Hansen, Ebba Holme; Juhl-Jensen, Lars; Aagaard, Lise

    2012-01-01

    This literature review included studies that use text-mining techniques in narrative documents stored in electronic patient records (EPRs) to investigate ADRs. We searched PubMed, Embase, Web of Science and International Pharmaceutical Abstracts without restrictions from origin until July 2011. We included empirically based studies on text mining of electronic patient records (EPRs) that focused on detecting ADRs, excluding those that investigated adverse events not related to medicine use. We extracted information on study populations, EPR data sources, frequencies and types of the identified ADRs, medicines associated with ADRs, text-mining algorithms used and their performance. Seven studies, all from the United States, were eligible for inclusion in the review. Studies were published from 2001, the majority between 2009 and 2010. Text-mining techniques varied over time from simple free text searching of outpatient visit notes and inpatient discharge summaries to more advanced techniques involving natural language processing (NLP) of inpatient discharge summaries. Performance appeared to increase with the use of NLP, although many ADRs were still missed. Due to differences in study design and populations, various types of ADRs were identified and thus we could not make comparisons across studies. The review underscores the feasibility and potential of text mining to investigate narrative documents in EPRs for ADRs. However, more empirical studies are needed to evaluate whether text mining of EPRs can be used systematically to collect new information about ADRs. PMID:22122057

  5. ANDSystem: an Associative Network Discovery System for automated literature mining in the field of biology

    PubMed Central

    2015-01-01

    Background Sufficient knowledge of molecular and genetic interactions, which comprise the entire basis of the functioning of living systems, is one of the necessary requirements for successfully answering almost any research question in the field of biology and medicine. To date, more than 24 million scientific papers can be found in PubMed, with many of them containing descriptions of a wide range of biological processes. The analysis of such tremendous amounts of data requires the use of automated text-mining approaches. Although a handful of tools have recently been developed to meet this need, none of them provide error-free extraction of highly detailed information. Results The ANDSystem package was developed for the reconstruction and analysis of molecular genetic networks based on an automated text-mining technique. It provides a detailed description of the various types of interactions between genes, proteins, microRNA's, metabolites, cellular components, pathways and diseases, taking into account the specificity of cell lines and organisms. Although the accuracy of ANDSystem is comparable to other well known text-mining tools, such as Pathway Studio and STRING, it outperforms them in having the ability to identify an increased number of interaction types. Conclusion The use of ANDSystem, in combination with Pathway Studio and STRING, can improve the quality of the automated reconstruction of molecular and genetic networks. ANDSystem should provide a useful tool for researchers working in a number of different fields, including biology, biotechnology, pharmacology and medicine. PMID:25881313

  6. tmBioC: improving interoperability of text-mining tools with BioC.

    PubMed

    Khare, Ritu; Wei, Chih-Hsuan; Mao, Yuqing; Leaman, Robert; Lu, Zhiyong

    2014-01-01

    The lack of interoperability among biomedical text-mining tools is a major bottleneck in creating more complex applications. Despite the availability of numerous methods and techniques for various text-mining tasks, combining different tools requires substantial efforts and time owing to heterogeneity and variety in data formats. In response, BioC is a recent proposal that offers a minimalistic approach to tool interoperability by stipulating minimal changes to existing tools and applications. BioC is a family of XML formats that define how to present text documents and annotations, and also provides easy-to-use functions to read/write documents in the BioC format. In this study, we introduce our text-mining toolkit, which is designed to perform several challenging and significant tasks in the biomedical domain, and repackage the toolkit into BioC to enhance its interoperability. Our toolkit consists of six state-of-the-art tools for named-entity recognition, normalization and annotation (PubTator) of genes (GenNorm), diseases (DNorm), mutations (tmVar), species (SR4GN) and chemicals (tmChem). Although developed within the same group, each tool is designed to process input articles and output annotations in a different format. We modify these tools and enable them to read/write data in the proposed BioC format. We find that, using the BioC family of formats and functions, only minimal changes were required to build the newer versions of the tools. The resulting BioC wrapped toolkit, which we have named tmBioC, consists of our tools in BioC, an annotated full-text corpus in BioC, and a format detection and conversion tool. Furthermore, through participation in the 2013 BioCreative IV Interoperability Track, we empirically demonstrate that the tools in tmBioC can be more efficiently integrated with each other as well as with external tools: Our experimental results show that using BioC reduces >60% in lines of code for text-mining tool integration. The tmBioC toolkit is publicly available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/. Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/. Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US.

  7. Sampling and monitoring for the mine life cycle

    USGS Publications Warehouse

    McLemore, Virginia T.; Smith, Kathleen S.; Russell, Carol C.

    2014-01-01

    Sampling and Monitoring for the Mine Life Cycle provides an overview of sampling for environmental purposes and monitoring of environmentally relevant variables at mining sites. It focuses on environmental sampling and monitoring of surface water, and also considers groundwater, process water streams, rock, soil, and other media including air and biological organisms. The handbook includes an appendix of technical summaries written by subject-matter experts that describe field measurements, collection methods, and analytical techniques and procedures relevant to environmental sampling and monitoring.The sixth of a series of handbooks on technologies for management of metal mine and metallurgical process drainage, this handbook supplements and enhances current literature and provides an awareness of the critical components and complexities involved in environmental sampling and monitoring at the mine site. It differs from most information sources by providing an approach to address all types of mining influenced water and other sampling media throughout the mine life cycle.Sampling and Monitoring for the Mine Life Cycle is organized into a main text and six appendices that are an integral part of the handbook. Sidebars and illustrations are included to provide additional detail about important concepts, to present examples and brief case studies, and to suggest resources for further information. Extensive references are included.

  8. Web services-based text-mining demonstrates broad impacts for interoperability and process simplification.

    PubMed

    Wiegers, Thomas C; Davis, Allan Peter; Mattingly, Carolyn J

    2014-01-01

    The Critical Assessment of Information Extraction systems in Biology (BioCreAtIvE) challenge evaluation tasks collectively represent a community-wide effort to evaluate a variety of text-mining and information extraction systems applied to the biological domain. The BioCreative IV Workshop included five independent subject areas, including Track 3, which focused on named-entity recognition (NER) for the Comparative Toxicogenomics Database (CTD; http://ctdbase.org). Previously, CTD had organized document ranking and NER-related tasks for the BioCreative Workshop 2012; a key finding of that effort was that interoperability and integration complexity were major impediments to the direct application of the systems to CTD's text-mining pipeline. This underscored a prevailing problem with software integration efforts. Major interoperability-related issues included lack of process modularity, operating system incompatibility, tool configuration complexity and lack of standardization of high-level inter-process communications. One approach to potentially mitigate interoperability and general integration issues is the use of Web services to abstract implementation details; rather than integrating NER tools directly, HTTP-based calls from CTD's asynchronous, batch-oriented text-mining pipeline could be made to remote NER Web services for recognition of specific biological terms using BioC (an emerging family of XML formats) for inter-process communications. To test this concept, participating groups developed Representational State Transfer /BioC-compliant Web services tailored to CTD's NER requirements. Participants were provided with a comprehensive set of training materials. CTD evaluated results obtained from the remote Web service-based URLs against a test data set of 510 manually curated scientific articles. Twelve groups participated in the challenge. Recall, precision, balanced F-scores and response times were calculated. Top balanced F-scores for gene, chemical and disease NER were 61, 74 and 51%, respectively. Response times ranged from fractions-of-a-second to over a minute per article. We present a description of the challenge and summary of results, demonstrating how curation groups can effectively use interoperable NER technologies to simplify text-mining pipeline implementation. Database URL: http://ctdbase.org/ © The Author(s) 2014. Published by Oxford University Press.

  9. Web services-based text-mining demonstrates broad impacts for interoperability and process simplification

    PubMed Central

    Wiegers, Thomas C.; Davis, Allan Peter; Mattingly, Carolyn J.

    2014-01-01

    The Critical Assessment of Information Extraction systems in Biology (BioCreAtIvE) challenge evaluation tasks collectively represent a community-wide effort to evaluate a variety of text-mining and information extraction systems applied to the biological domain. The BioCreative IV Workshop included five independent subject areas, including Track 3, which focused on named-entity recognition (NER) for the Comparative Toxicogenomics Database (CTD; http://ctdbase.org). Previously, CTD had organized document ranking and NER-related tasks for the BioCreative Workshop 2012; a key finding of that effort was that interoperability and integration complexity were major impediments to the direct application of the systems to CTD's text-mining pipeline. This underscored a prevailing problem with software integration efforts. Major interoperability-related issues included lack of process modularity, operating system incompatibility, tool configuration complexity and lack of standardization of high-level inter-process communications. One approach to potentially mitigate interoperability and general integration issues is the use of Web services to abstract implementation details; rather than integrating NER tools directly, HTTP-based calls from CTD's asynchronous, batch-oriented text-mining pipeline could be made to remote NER Web services for recognition of specific biological terms using BioC (an emerging family of XML formats) for inter-process communications. To test this concept, participating groups developed Representational State Transfer /BioC-compliant Web services tailored to CTD's NER requirements. Participants were provided with a comprehensive set of training materials. CTD evaluated results obtained from the remote Web service-based URLs against a test data set of 510 manually curated scientific articles. Twelve groups participated in the challenge. Recall, precision, balanced F-scores and response times were calculated. Top balanced F-scores for gene, chemical and disease NER were 61, 74 and 51%, respectively. Response times ranged from fractions-of-a-second to over a minute per article. We present a description of the challenge and summary of results, demonstrating how curation groups can effectively use interoperable NER technologies to simplify text-mining pipeline implementation. Database URL: http://ctdbase.org/ PMID:24919658

  10. Text Mining of the Classical Medical Literature for Medicines That Show Potential in Diabetic Nephropathy

    PubMed Central

    Zhang, Lei; Li, Yin; Guo, Xinfeng; May, Brian H.; Xue, Charlie C. L.; Yang, Lihong; Liu, Xusheng

    2014-01-01

    Objectives. To apply modern text-mining methods to identify candidate herbs and formulae for the treatment of diabetic nephropathy. Methods. The method we developed includes three steps: (1) identification of candidate ancient terms; (2) systemic search and assessment of medical records written in classical Chinese; (3) preliminary evaluation of the effect and safety of candidates. Results. Ancient terms Xia Xiao, Shen Xiao, and Xiao Shen were determined as the most likely to correspond with diabetic nephropathy and used in text mining. A total of 80 Chinese formulae for treating conditions congruent with diabetic nephropathy recorded in medical books from Tang Dynasty to Qing Dynasty were collected. Sao si tang (also called Reeling Silk Decoction) was chosen to show the process of preliminary evaluation of the candidates. It had promising potential for development as new agent for the treatment of diabetic nephropathy. However, further investigations about the safety to patients with renal insufficiency are still needed. Conclusions. The methods developed in this study offer a targeted approach to identifying traditional herbs and/or formulae as candidates for further investigation in the search for new drugs for modern disease. However, more effort is still required to improve our techniques, especially with regard to compound formulae. PMID:24744808

  11. A preliminary approach to creating an overview of lactoferrin multi-functionality utilizing a text mining method.

    PubMed

    Shimazaki, Kei-ichi; Kushida, Tatsuya

    2010-06-01

    Lactoferrin is a multi-functional metal-binding glycoprotein that exhibits many biological functions of interest to many researchers from the fields of clinical medicine, dentistry, pharmacology, veterinary medicine, nutrition and milk science. To date, a number of academic reports concerning the biological activities of lactoferrin have been published and are easily accessible through public data repositories. However, as the literature is expanding daily, this presents challenges in understanding the larger picture of lactoferrin function and mechanisms. In order to overcome the "analysis paralysis" associated with lactoferrin information, we attempted to apply a text mining method to the accumulated lactoferrin literature. To this end, we used the information extraction system GENPAC (provided by Nalapro Technologies Inc., Tokyo). This information extraction system uses natural language processing and text mining technology. This system analyzes the sentences and titles from abstracts stored in the PubMed database, and can automatically extract binary relations that consist of interactions between genes/proteins, chemicals and diseases/functions. We expect that such information visualization analysis will be useful in determining novel relationships among a multitude of lactoferrin functions and mechanisms. We have demonstrated the utilization of this method to find pathways of lactoferrin participation in neovascularization, Helicobacter pylori attack on gastric mucosa, atopic dermatitis and lipid metabolism.

  12. Semantic Annotation of Complex Text Structures in Problem Reports

    NASA Technical Reports Server (NTRS)

    Malin, Jane T.; Throop, David R.; Fleming, Land D.

    2011-01-01

    Text analysis is important for effective information retrieval from databases where the critical information is embedded in text fields. Aerospace safety depends on effective retrieval of relevant and related problem reports for the purpose of trend analysis. The complex text syntax in problem descriptions has limited statistical text mining of problem reports. The presentation describes an intelligent tagging approach that applies syntactic and then semantic analysis to overcome this problem. The tags identify types of problems and equipment that are embedded in the text descriptions. The power of these tags is illustrated in a faceted searching and browsing interface for problem report trending that combines automatically generated tags with database code fields and temporal information.

  13. Protein-protein interaction predictions using text mining methods.

    PubMed

    Papanikolaou, Nikolas; Pavlopoulos, Georgios A; Theodosiou, Theodosios; Iliopoulos, Ioannis

    2015-03-01

    It is beyond any doubt that proteins and their interactions play an essential role in most complex biological processes. The understanding of their function individually, but also in the form of protein complexes is of a great importance. Nowadays, despite the plethora of various high-throughput experimental approaches for detecting protein-protein interactions, many computational methods aiming to predict new interactions have appeared and gained interest. In this review, we focus on text-mining based computational methodologies, aiming to extract information for proteins and their interactions from public repositories such as literature and various biological databases. We discuss their strengths, their weaknesses and how they complement existing experimental techniques by simultaneously commenting on the biological databases which hold such information and the benchmark datasets that can be used for evaluating new tools. Copyright © 2014 Elsevier Inc. All rights reserved.

  14. Web-Based Collaborative Writing in L2 Contexts: Methodological Insights from Text Mining

    ERIC Educational Resources Information Center

    Yim, Soobin; Warschauer, Mark

    2017-01-01

    The increasingly widespread use of social software (e.g., Wikis, Google Docs) in second language (L2) settings has brought a renewed attention to collaborative writing. Although the current methodological approaches to examining collaborative writing are valuable to understand L2 students' interactional patterns or perceived experiences, they can…

  15. Multi-Filter String Matching and Human-Centric Entity Matching for Information Extraction

    ERIC Educational Resources Information Center

    Sun, Chong

    2012-01-01

    More and more information is being generated in text documents, such as Web pages, emails and blogs. To effectively manage this unstructured information, one broadly used approach includes locating relevant content in documents, extracting structured information and integrating the extracted information for querying, mining or further analysis. In…

  16. Contextual Text Mining

    ERIC Educational Resources Information Center

    Mei, Qiaozhu

    2009-01-01

    With the dramatic growth of text information, there is an increasing need for powerful text mining systems that can automatically discover useful knowledge from text. Text is generally associated with all kinds of contextual information. Those contexts can be explicit, such as the time and the location where a blog article is written, and the…

  17. Two Different Approaches to Automated Mark Up of Emotions in Text

    NASA Astrophysics Data System (ADS)

    Francisco, Virginia; Hervás, Raqucl; Gervás, Pablo

    This paper presents two different approaches to automated marking up of texts with emotional labels. For the first approach a corpus of example texts previously annotated by human evaluators is mined for an initial assignment of emotional features to words. This results in a List of Emotional Words (LEW) which becomes a useful resource for later automated mark up. The mark up algorithm in this first approach mirrors closely the steps taken during feature extraction, employing for the actual assignment of emotional features a combination of the LEW resource and WordNet for knowledge-based expansion of words not occurring in LEW. The algorithm for automated mark up is tested against new text samples to test its coverage. The second approach mark up texts during their generation. We have a knowledge base which contains the necessary information for marking up the text. This information is related to actions and characters. The algorithm in this case employ the information of the knowledge database and decides the correct emotion for every sentence. The algorithm for automated mark up is tested against four different texts. The results of the two approaches are compared and discussed with respect to three main issues: relative adequacy of each one of the representations used, correctness and coverage of the proposed algorithms, and additional techniques and solutions that may be employed to improve the results.

  18. Chemical Topic Modeling: Exploring Molecular Data Sets Using a Common Text-Mining Approach.

    PubMed

    Schneider, Nadine; Fechner, Nikolas; Landrum, Gregory A; Stiefl, Nikolaus

    2017-08-28

    Big data is one of the key transformative factors which increasingly influences all aspects of modern life. Although this transformation brings vast opportunities it also generates novel challenges, not the least of which is organizing and searching this data deluge. The field of medicinal chemistry is not different: more and more data are being generated, for instance, by technologies such as DNA encoded libraries, peptide libraries, text mining of large literature corpora, and new in silico enumeration methods. Handling those huge sets of molecules effectively is quite challenging and requires compromises that often come at the expense of the interpretability of the results. In order to find an intuitive and meaningful approach to organizing large molecular data sets, we adopted a probabilistic framework called "topic modeling" from the text-mining field. Here we present the first chemistry-related implementation of this method, which allows large molecule sets to be assigned to "chemical topics" and investigating the relationships between those. In this first study, we thoroughly evaluate this novel method in different experiments and discuss both its disadvantages and advantages. We show very promising results in reproducing human-assigned concepts using the approach to identify and retrieve chemical series from sets of molecules. We have also created an intuitive visualization of the chemical topics output by the algorithm. This is a huge benefit compared to other unsupervised machine-learning methods, like clustering, which are commonly used to group sets of molecules. Finally, we applied the new method to the 1.6 million molecules of the ChEMBL22 data set to test its robustness and efficiency. In about 1 h we built a 100-topic model of this large data set in which we could identify interesting topics like "proteins", "DNA", or "steroids". Along with this publication we provide our data sets and an open-source implementation of the new method (CheTo) which will be part of an upcoming version of the open-source cheminformatics toolkit RDKit.

  19. Mining Patients' Narratives in Social Media for Pharmacovigilance: Adverse Effects and Misuse of Methylphenidate.

    PubMed

    Chen, Xiaoyi; Faviez, Carole; Schuck, Stéphane; Lillo-Le-Louët, Agnès; Texier, Nathalie; Dahamna, Badisse; Huot, Charles; Foulquié, Pierre; Pereira, Suzanne; Leroux, Vincent; Karapetiantz, Pierre; Guenegou-Arnoux, Armelle; Katsahian, Sandrine; Bousquet, Cédric; Burgun, Anita

    2018-01-01

    Background: The Food and Drug Administration (FDA) in the United States and the European Medicines Agency (EMA) have recognized social media as a new data source to strengthen their activities regarding drug safety. Objective: Our objective in the ADR-PRISM project was to provide text mining and visualization tools to explore a corpus of posts extracted from social media. We evaluated this approach on a corpus of 21 million posts from five patient forums, and conducted a qualitative analysis of the data available on methylphenidate in this corpus. Methods: We applied text mining methods based on named entity recognition and relation extraction in the corpus, followed by signal detection using proportional reporting ratio (PRR). We also used topic modeling based on the Correlated Topic Model to obtain the list of the matics in the corpus and classify the messages based on their topics. Results: We automatically identified 3443 posts about methylphenidate published between 2007 and 2016, among which 61 adverse drug reactions (ADR) were automatically detected. Two pharmacovigilance experts evaluated manually the quality of automatic identification, and a f-measure of 0.57 was reached. Patient's reports were mainly neuro-psychiatric effects. Applying PRR, 67% of the ADRs were signals, including most of the neuro-psychiatric symptoms but also palpitations. Topic modeling showed that the most represented topics were related to Childhood and Treatment initiation , but also Side effects . Cases of misuse were also identified in this corpus, including recreational use and abuse. Conclusion: Named entity recognition combined with signal detection and topic modeling have demonstrated their complementarity in mining social media data. An in-depth analysis focused on methylphenidate showed that this approach was able to detect potential signals and to provide better understanding of patients' behaviors regarding drugs, including misuse.

  20. Mines, Quarries and Landscape. Visuality and Transformation

    NASA Astrophysics Data System (ADS)

    Jimeno, Carlos López; Torrijos, Ignacio Díez; González, Carmen Mataix

    2016-06-01

    In this paper a review of two basic concepts is carried out: scenery and landscape integration, proposing a new concept: "visuality", alternative to the classical "visibility" used in landscape studies related to mining activity, which explores the qualitative aspects that define the visual relationships between observer and environment. In relation to landscape integration studies, some reflections on substantive issues are made which induce certain prejudices at the time of addressing the issue of mining operations landscape integration, and some guidance and integration strategies are formulated. In the second part of the text, a new approach to the landscape integration of mines and quarries is raised, closely linked to the concept of visuality which are based on a basic goal: the re-qualification of the place, and give innovative answers to re-qualify the place and show how to catch the opportunity in the deep transformation generated by the development of mining activities. As a conclusion, a case study is presented in the last section, the landscape integration study conducted on marble exploitations Coto Pinos (Alicante, Spain), considered the largest ornamental rock quarry in Europe.

  1. Annotating images by mining image search results.

    PubMed

    Wang, Xin-Jing; Zhang, Lei; Li, Xirong; Ma, Wei-Ying

    2008-11-01

    Although it has been studied for years by the computer vision and machine learning communities, image annotation is still far from practical. In this paper, we propose a novel attempt at model-free image annotation, which is a data-driven approach that annotates images by mining their search results. Some 2.4 million images with their surrounding text are collected from a few photo forums to support this approach. The entire process is formulated in a divide-and-conquer framework where a query keyword is provided along with the uncaptioned image to improve both the effectiveness and efficiency. This is helpful when the collected data set is not dense everywhere. In this sense, our approach contains three steps: 1) the search process to discover visually and semantically similar search results, 2) the mining process to identify salient terms from textual descriptions of the search results, and 3) the annotation rejection process to filter out noisy terms yielded by Step 2. To ensure real-time annotation, two key techniques are leveraged-one is to map the high-dimensional image visual features into hash codes, the other is to implement it as a distributed system, of which the search and mining processes are provided as Web services. As a typical result, the entire process finishes in less than 1 second. Since no training data set is required, our approach enables annotating with unlimited vocabulary and is highly scalable and robust to outliers. Experimental results on both real Web images and a benchmark image data set show the effectiveness and efficiency of the proposed algorithm. It is also worth noting that, although the entire approach is illustrated within the divide-and conquer framework, a query keyword is not crucial to our current implementation. We provide experimental results to prove this.

  2. Biocuration workflows and text mining: overview of the BioCreative 2012 Workshop Track II.

    PubMed

    Lu, Zhiyong; Hirschman, Lynette

    2012-01-01

    Manual curation of data from the biomedical literature is a rate-limiting factor for many expert curated databases. Despite the continuing advances in biomedical text mining and the pressing needs of biocurators for better tools, few existing text-mining tools have been successfully integrated into production literature curation systems such as those used by the expert curated databases. To close this gap and better understand all aspects of literature curation, we invited submissions of written descriptions of curation workflows from expert curated databases for the BioCreative 2012 Workshop Track II. We received seven qualified contributions, primarily from model organism databases. Based on these descriptions, we identified commonalities and differences across the workflows, the common ontologies and controlled vocabularies used and the current and desired uses of text mining for biocuration. Compared to a survey done in 2009, our 2012 results show that many more databases are now using text mining in parts of their curation workflows. In addition, the workshop participants identified text-mining aids for finding gene names and symbols (gene indexing), prioritization of documents for curation (document triage) and ontology concept assignment as those most desired by the biocurators. DATABASE URL: http://www.biocreative.org/tasks/bc-workshop-2012/workflow/.

  3. A New Pivoting and Iterative Text Detection Algorithm for Biomedical Images

    PubMed Central

    Xu, Songhua; Krauthammer, Michael

    2010-01-01

    There is interest to expand the reach of literature mining to include the analysis of biomedical images, which often contain a paper’s key findings. Examples include recent studies that use Optical Character Recognition (OCR) to extract image text, which is used to boost biomedical image retrieval and classification. Such studies rely on the robust identification of text elements in biomedical images, which is a non-trivial task. In this work, we introduce a new text detection algorithm for biomedical images based on iterative projection histograms. We study the effectiveness of our algorithm by evaluating the performance on a set of manually labeled random biomedical images, and compare the performance against other state-of-the-art text detection algorithms. In this paper, we demonstrate that a projection histogram-based text detection approach is well suited for text detection in biomedical images, with a performance of F score of .60. The approach performs better than comparable approaches for text detection. Further, we show that the iterative application of the algorithm is boosting overall detection performance. A C++ implementation of our algorithm is freely available through email request for academic use. PMID:20887803

  4. The Distribution of the Informative Intensity of the Text in Terms of its Structure (On Materials of the English Texts in the Mining Sphere)

    NASA Astrophysics Data System (ADS)

    Znikina, Ludmila; Rozhneva, Elena

    2017-11-01

    The article deals with the distribution of informative intensity of the English-language scientific text based on its structural features contributing to the process of formalization of the scientific text and the preservation of the adequacy of the text with derived semantic information in relation to the primary. Discourse analysis is built on specific compositional and meaningful examples of scientific texts taken from the mining field. It also analyzes the adequacy of the translation of foreign texts into another language, the relationships between elements of linguistic systems, the degree of a formal conformance, translation with the specific objectives and information needs of the recipient. Some key words and ideas are emphasized in the paragraphs of the English-language mining scientific texts. The article gives the characteristic features of the structure of paragraphs of technical text and examples of constructions in English scientific texts based on a mining theme with the aim to explain the possible ways of their adequate translation.

  5. Introduction to the JASIST Special Topic Issue on Web Retrieval and Mining: A Machine Learning Perspective.

    ERIC Educational Resources Information Center

    Chen, Hsinchun

    2003-01-01

    Discusses information retrieval techniques used on the World Wide Web. Topics include machine learning in information extraction; relevance feedback; information filtering and recommendation; text classification and text clustering; Web mining, based on data mining techniques; hyperlink structure; and Web size. (LRW)

  6. Application of text mining in the biomedical domain.

    PubMed

    Fleuren, Wilco W M; Alkema, Wynand

    2015-03-01

    In recent years the amount of experimental data that is produced in biomedical research and the number of papers that are being published in this field have grown rapidly. In order to keep up to date with developments in their field of interest and to interpret the outcome of experiments in light of all available literature, researchers turn more and more to the use of automated literature mining. As a consequence, text mining tools have evolved considerably in number and quality and nowadays can be used to address a variety of research questions ranging from de novo drug target discovery to enhanced biological interpretation of the results from high throughput experiments. In this paper we introduce the most important techniques that are used for a text mining and give an overview of the text mining tools that are currently being used and the type of problems they are typically applied for. Copyright © 2015 Elsevier Inc. All rights reserved.

  7. Optimizing the Information Presentation on Mining Potential by using Web Services Technology with Restful Protocol

    NASA Astrophysics Data System (ADS)

    Abdillah, T.; Dai, R.; Setiawan, E.

    2018-02-01

    This study aims to develop the application of Web Services technology with RestFul Protocol to optimize the information presentation on mining potential. This study used User Interface Design approach for the information accuracy and relevance as well as the Web Service for the reliability in presenting the information. The results show that: the information accuracy and relevance regarding mining potential can be seen from the achievement of User Interface implementation in the application that is based on the following rules: The consideration of the appropriate colours and objects, the easiness of using the navigation, and users’ interaction with the applications that employs symbols and languages understood by the users; the information accuracy and relevance related to mining potential can be observed by the information presented by using charts and Tool Tip Text to help the users understand the provided chart/figure; the reliability of the information presentation is evident by the results of Web Services testing in Figure 4.5.6. This study finds out that User Interface Design and Web Services approaches (for the access of different Platform apps) are able to optimize the presentation. The results of this study can be used as a reference for software developers and Provincial Government of Gorontalo.

  8. The Islamic State Battle Plan: Press Release Natural Language Processing

    DTIC Science & Technology

    2016-06-01

    Processing, text mining , corpus, generalized linear model, cascade, R Shiny, leaflet, data visualization 15. NUMBER OF PAGES 83 16. PRICE CODE...Terrorism and Responses to Terrorism TDM Term Document Matrix TF Term Frequency TF-IDF Term Frequency-Inverse Document Frequency tm text mining (R...package=leaflet. Feinerer I, Hornik K (2015) Text Mining Package “tm,” Version 0.6-2. (Jul 3) https://cran.r-project.org/web/packages/tm/tm.pdf

  9. Text mining patents for biomedical knowledge.

    PubMed

    Rodriguez-Esteban, Raul; Bundschus, Markus

    2016-06-01

    Biomedical text mining of scientific knowledge bases, such as Medline, has received much attention in recent years. Given that text mining is able to automatically extract biomedical facts that revolve around entities such as genes, proteins, and drugs, from unstructured text sources, it is seen as a major enabler to foster biomedical research and drug discovery. In contrast to the biomedical literature, research into the mining of biomedical patents has not reached the same level of maturity. Here, we review existing work and highlight the associated technical challenges that emerge from automatically extracting facts from patents. We conclude by outlining potential future directions in this domain that could help drive biomedical research and drug discovery. Copyright © 2016 Elsevier Ltd. All rights reserved.

  10. The ACODEA Framework: Developing Segmentation and Classification Schemes for Fully Automatic Analysis of Online Discussions

    ERIC Educational Resources Information Center

    Mu, Jin; Stegmann, Karsten; Mayfield, Elijah; Rose, Carolyn; Fischer, Frank

    2012-01-01

    Research related to online discussions frequently faces the problem of analyzing huge corpora. Natural Language Processing (NLP) technologies may allow automating this analysis. However, the state-of-the-art in machine learning and text mining approaches yields models that do not transfer well between corpora related to different topics. Also,…

  11. Enhancing a Core Journal Collection for Digital Libraries

    ERIC Educational Resources Information Center

    Kovacevic, Ana; Devedzic, Vladan; Pocajt, Viktor

    2010-01-01

    Purpose: This paper aims to address the problem of enhancing the selection of titles offered by a digital library, by analysing the differences in these titles when they are cited by local authors in their publications and when they are listed in the digital library offer. Design/methodology/approach: Text mining techniques were used to identify…

  12. Application of the EVEX resource to event extraction and network construction: Shared Task entry and result analysis

    PubMed Central

    2015-01-01

    Background Modern methods for mining biomolecular interactions from literature typically make predictions based solely on the immediate textual context, in effect a single sentence. No prior work has been published on extending this context to the information automatically gathered from the whole biomedical literature. Thus, our motivation for this study is to explore whether mutually supporting evidence, aggregated across several documents can be utilized to improve the performance of the state-of-the-art event extraction systems. In this paper, we describe our participation in the latest BioNLP Shared Task using the large-scale text mining resource EVEX. We participated in the Genia Event Extraction (GE) and Gene Regulation Network (GRN) tasks with two separate systems. In the GE task, we implemented a re-ranking approach to improve the precision of an existing event extraction system, incorporating features from the EVEX resource. In the GRN task, our system relied solely on the EVEX resource and utilized a rule-based conversion algorithm between the EVEX and GRN formats. Results In the GE task, our re-ranking approach led to a modest performance increase and resulted in the first rank of the official Shared Task results with 50.97% F-score. Additionally, in this paper we explore and evaluate the usage of distributed vector representations for this challenge. In the GRN task, we ranked fifth in the official results with a strict/relaxed SER score of 0.92/0.81 respectively. To try and improve upon these results, we have implemented a novel machine learning based conversion system and benchmarked its performance against the original rule-based system. Conclusions For the GRN task, we were able to produce a gene regulatory network from the EVEX data, warranting the use of such generic large-scale text mining data in network biology settings. A detailed performance and error analysis provides more insight into the relatively low recall rates. In the GE task we demonstrate that both the re-ranking approach and the word vectors can provide slight performance improvement. A manual evaluation of the re-ranking results pinpoints some of the challenges faced in applying large-scale text mining knowledge to event extraction. PMID:26551766

  13. Automatically classifying sentences in full-text biomedical articles into Introduction, Methods, Results and Discussion.

    PubMed

    Agarwal, Shashank; Yu, Hong

    2009-12-01

    Biomedical texts can be typically represented by four rhetorical categories: Introduction, Methods, Results and Discussion (IMRAD). Classifying sentences into these categories can benefit many other text-mining tasks. Although many studies have applied different approaches for automatically classifying sentences in MEDLINE abstracts into the IMRAD categories, few have explored the classification of sentences that appear in full-text biomedical articles. We first evaluated whether sentences in full-text biomedical articles could be reliably annotated into the IMRAD format and then explored different approaches for automatically classifying these sentences into the IMRAD categories. Our results show an overall annotation agreement of 82.14% with a Kappa score of 0.756. The best classification system is a multinomial naïve Bayes classifier trained on manually annotated data that achieved 91.95% accuracy and an average F-score of 91.55%, which is significantly higher than baseline systems. A web version of this system is available online at-http://wood.ims.uwm.edu/full_text_classifier/.

  14. The Voice of Chinese Health Consumers: A Text Mining Approach to Web-Based Physician Reviews

    PubMed Central

    Zhang, Kunpeng

    2016-01-01

    Background Many Web-based health care platforms allow patients to evaluate physicians by posting open-end textual reviews based on their experiences. These reviews are helpful resources for other patients to choose high-quality doctors, especially in countries like China where no doctor referral systems exist. Analyzing such a large amount of user-generated content to understand the voice of health consumers has attracted much attention from health care providers and health care researchers. Objective The aim of this paper is to automatically extract hidden topics from Web-based physician reviews using text-mining techniques to examine what Chinese patients have said about their doctors and whether these topics differ across various specialties. This knowledge will help health care consumers, providers, and researchers better understand this information. Methods We conducted two-fold analyses on the data collected from the “Good Doctor Online” platform, the largest online health community in China. First, we explored all reviews from 2006-2014 using descriptive statistics. Second, we applied the well-known topic extraction algorithm Latent Dirichlet Allocation to more than 500,000 textual reviews from over 75,000 Chinese doctors across four major specialty areas to understand what Chinese health consumers said online about their doctor visits. Results On the “Good Doctor Online” platform, 112,873 out of 314,624 doctors had been reviewed at least once by April 11, 2014. Among the 772,979 textual reviews, we chose to focus on four major specialty areas that received the most reviews: Internal Medicine, Surgery, Obstetrics/Gynecology and Pediatrics, and Chinese Traditional Medicine. Among the doctors who received reviews from those four medical specialties, two-thirds of them received more than two reviews and in a few extreme cases, some doctors received more than 500 reviews. Across the four major areas, the most popular topics reviewers found were the experience of finding doctors, doctors’ technical skills and bedside manner, general appreciation from patients, and description of various symptoms. Conclusions To the best of our knowledge, our work is the first study using an automated text-mining approach to analyze a large amount of unstructured textual data of Web-based physician reviews in China. Based on our analysis, we found that Chinese reviewers mainly concentrate on a few popular topics. This is consistent with the goal of Chinese online health platforms and demonstrates the health care focus in China’s health care system. Our text-mining approach reveals a new research area on how to use big data to help health care providers, health care administrators, and policy makers hear patient voices, target patient concerns, and improve the quality of care in this age of patient-centered care. Also, on the health care consumer side, our text mining technique helps patients make more informed decisions about which specialists to see without reading thousands of reviews, which is simply not feasible. In addition, our comparison analysis of Web-based physician reviews in China and the United States also indicates some cultural differences. PMID:27165558

  15. The Voice of Chinese Health Consumers: A Text Mining Approach to Web-Based Physician Reviews.

    PubMed

    Hao, Haijing; Zhang, Kunpeng

    2016-05-10

    Many Web-based health care platforms allow patients to evaluate physicians by posting open-end textual reviews based on their experiences. These reviews are helpful resources for other patients to choose high-quality doctors, especially in countries like China where no doctor referral systems exist. Analyzing such a large amount of user-generated content to understand the voice of health consumers has attracted much attention from health care providers and health care researchers. The aim of this paper is to automatically extract hidden topics from Web-based physician reviews using text-mining techniques to examine what Chinese patients have said about their doctors and whether these topics differ across various specialties. This knowledge will help health care consumers, providers, and researchers better understand this information. We conducted two-fold analyses on the data collected from the "Good Doctor Online" platform, the largest online health community in China. First, we explored all reviews from 2006-2014 using descriptive statistics. Second, we applied the well-known topic extraction algorithm Latent Dirichlet Allocation to more than 500,000 textual reviews from over 75,000 Chinese doctors across four major specialty areas to understand what Chinese health consumers said online about their doctor visits. On the "Good Doctor Online" platform, 112,873 out of 314,624 doctors had been reviewed at least once by April 11, 2014. Among the 772,979 textual reviews, we chose to focus on four major specialty areas that received the most reviews: Internal Medicine, Surgery, Obstetrics/Gynecology and Pediatrics, and Chinese Traditional Medicine. Among the doctors who received reviews from those four medical specialties, two-thirds of them received more than two reviews and in a few extreme cases, some doctors received more than 500 reviews. Across the four major areas, the most popular topics reviewers found were the experience of finding doctors, doctors' technical skills and bedside manner, general appreciation from patients, and description of various symptoms. To the best of our knowledge, our work is the first study using an automated text-mining approach to analyze a large amount of unstructured textual data of Web-based physician reviews in China. Based on our analysis, we found that Chinese reviewers mainly concentrate on a few popular topics. This is consistent with the goal of Chinese online health platforms and demonstrates the health care focus in China's health care system. Our text-mining approach reveals a new research area on how to use big data to help health care providers, health care administrators, and policy makers hear patient voices, target patient concerns, and improve the quality of care in this age of patient-centered care. Also, on the health care consumer side, our text mining technique helps patients make more informed decisions about which specialists to see without reading thousands of reviews, which is simply not feasible. In addition, our comparison analysis of Web-based physician reviews in China and the United States also indicates some cultural differences.

  16. Combining evidence, biomedical literature and statistical dependence: new insights for functional annotation of gene sets

    PubMed Central

    Aubry, Marc; Monnier, Annabelle; Chicault, Celine; de Tayrac, Marie; Galibert, Marie-Dominique; Burgun, Anita; Mosser, Jean

    2006-01-01

    Background Large-scale genomic studies based on transcriptome technologies provide clusters of genes that need to be functionally annotated. The Gene Ontology (GO) implements a controlled vocabulary organised into three hierarchies: cellular components, molecular functions and biological processes. This terminology allows a coherent and consistent description of the knowledge about gene functions. The GO terms related to genes come primarily from semi-automatic annotations made by trained biologists (annotation based on evidence) or text-mining of the published scientific literature (literature profiling). Results We report an original functional annotation method based on a combination of evidence and literature that overcomes the weaknesses and the limitations of each approach. It relies on the Gene Ontology Annotation database (GOA Human) and the PubGene biomedical literature index. We support these annotations with statistically associated GO terms and retrieve associative relations across the three GO hierarchies to emphasise the major pathways involved by a gene cluster. Both annotation methods and associative relations were quantitatively evaluated with a reference set of 7397 genes and a multi-cluster study of 14 clusters. We also validated the biological appropriateness of our hybrid method with the annotation of a single gene (cdc2) and that of a down-regulated cluster of 37 genes identified by a transcriptome study of an in vitro enterocyte differentiation model (CaCo-2 cells). Conclusion The combination of both approaches is more informative than either separate approach: literature mining can enrich an annotation based only on evidence. Text-mining of the literature can also find valuable associated MEDLINE references that confirm the relevance of the annotation. Eventually, GO terms networks can be built with associative relations in order to highlight cooperative and competitive pathways and their connected molecular functions. PMID:16674810

  17. Text and Structural Data Mining of Influenza Mentions in Web and Social Media

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Corley, Courtney D.; Cook, Diane; Mikler, Armin R.

    Text and structural data mining of Web and social media (WSM) provides a novel disease surveillance resource and can identify online communities for targeted public health communications (PHC) to assure wide dissemination of pertinent information. WSM that mention influenza are harvested over a 24-week period, 5-October-2008 to 21-March-2009. Link analysis reveals communities for targeted PHC. Text mining is shown to identify trends in flu posts that correlate to real-world influenza-like-illness patient report data. We also bring to bear a graph-based data mining technique to detect anomalies among flu blogs connected by publisher type, links, and user-tags.

  18. Text Mining for Precision Medicine: Bringing structure to EHRs and biomedical literature to understand genes and health

    PubMed Central

    Simmons, Michael; Singhal, Ayush; Lu, Zhiyong

    2018-01-01

    The key question of precision medicine is whether it is possible to find clinically actionable granularity in diagnosing disease and classifying patient risk. The advent of next generation sequencing and the widespread adoption of electronic health records (EHRs) have provided clinicians and researchers a wealth of data and made possible the precise characterization of individual patient genotypes and phenotypes. Unstructured text — found in biomedical publications and clinical notes — is an important component of genotype and phenotype knowledge. Publications in the biomedical literature provide essential information for interpreting genetic data. Likewise, clinical notes contain the richest source of phenotype information in EHRs. Text mining can render these texts computationally accessible and support information extraction and hypothesis generation. This chapter reviews the mechanics of text mining in precision medicine and discusses several specific use cases, including database curation for personalized cancer medicine, patient outcome prediction from EHR-derived cohorts, and pharmacogenomic research. Taken as a whole, these use cases demonstrate how text mining enables effective utilization of existing knowledge sources and thus promotes increased value for patients and healthcare systems. Text mining is an indispensable tool for translating genotype-phenotype data into effective clinical care that will undoubtedly play an important role in the eventual realization of precision medicine. PMID:27807747

  19. Text Mining for Precision Medicine: Bringing Structure to EHRs and Biomedical Literature to Understand Genes and Health.

    PubMed

    Simmons, Michael; Singhal, Ayush; Lu, Zhiyong

    2016-01-01

    The key question of precision medicine is whether it is possible to find clinically actionable granularity in diagnosing disease and classifying patient risk. The advent of next-generation sequencing and the widespread adoption of electronic health records (EHRs) have provided clinicians and researchers a wealth of data and made possible the precise characterization of individual patient genotypes and phenotypes. Unstructured text-found in biomedical publications and clinical notes-is an important component of genotype and phenotype knowledge. Publications in the biomedical literature provide essential information for interpreting genetic data. Likewise, clinical notes contain the richest source of phenotype information in EHRs. Text mining can render these texts computationally accessible and support information extraction and hypothesis generation. This chapter reviews the mechanics of text mining in precision medicine and discusses several specific use cases, including database curation for personalized cancer medicine, patient outcome prediction from EHR-derived cohorts, and pharmacogenomic research. Taken as a whole, these use cases demonstrate how text mining enables effective utilization of existing knowledge sources and thus promotes increased value for patients and healthcare systems. Text mining is an indispensable tool for translating genotype-phenotype data into effective clinical care that will undoubtedly play an important role in the eventual realization of precision medicine.

  20. Using Text Mining to Uncover Students' Technology-Related Problems in Live Video Streaming

    ERIC Educational Resources Information Center

    Abdous, M'hammed; He, Wu

    2011-01-01

    Because of their capacity to sift through large amounts of data, text mining and data mining are enabling higher education institutions to reveal valuable patterns in students' learning behaviours without having to resort to traditional survey methods. In an effort to uncover live video streaming (LVS) students' technology related-problems and to…

  1. The Feasibility of Using Large-Scale Text Mining to Detect Adverse Childhood Experiences in a VA-Treated Population.

    PubMed

    Hammond, Kenric W; Ben-Ari, Alon Y; Laundry, Ryan J; Boyko, Edward J; Samore, Matthew H

    2015-12-01

    Free text in electronic health records resists large-scale analysis. Text records facts of interest not found in encoded data, and text mining enables their retrieval and quantification. The U.S. Department of Veterans Affairs (VA) clinical data repository affords an opportunity to apply text-mining methodology to study clinical questions in large populations. To assess the feasibility of text mining, investigation of the relationship between exposure to adverse childhood experiences (ACEs) and recorded diagnoses was conducted among all VA-treated Gulf war veterans, utilizing all progress notes recorded from 2000-2011. Text processing extracted ACE exposures recorded among 44.7 million clinical notes belonging to 243,973 veterans. The relationship of ACE exposure to adult illnesses was analyzed using logistic regression. Bias considerations were assessed. ACE score was strongly associated with suicide attempts and serious mental disorders (ORs = 1.84 to 1.97), and less so with behaviorally mediated and somatic conditions (ORs = 1.02 to 1.36) per unit. Bias adjustments did not remove persistent associations between ACE score and most illnesses. Text mining to detect ACE exposure in a large population was feasible. Analysis of the relationship between ACE score and adult health conditions yielded patterns of association consistent with prior research. Copyright © 2015 International Society for Traumatic Stress Studies.

  2. Mining the Geophysical Research Abstracts Corpus: Mapping the impact of Free and Open Source Software on the EGU Divisions

    NASA Astrophysics Data System (ADS)

    Löwe, Peter; Klump, Jens; Robertson, Jesse

    2015-04-01

    Text mining is commonly employed as a tool in data science to investigate and chart emergent information from corpora of research abstracts, such as the Geophysical Research Abstracts (GRA) published by Copernicus. In this context current standards, such as persistent identifiers like DOI and ORCID, allow us to trace, cite and map links between journal publications, the underlying research data and scientific software. This network can be expressed as a directed graph which enables us to chart networks of cooperation and innovation, thematic foci and the locations of research communities in time and space. However, this approach of data science, focusing on the research process in a self-referential manner, rather than the topical work, is still in a developing stage. Scientific work presented at the EGU General Assembly is often the first step towards new approaches and innovative ideas to the geospatial community. It represents a rich, deep and heterogeneous source of geoscientific thought. This corpus is a significant data source for data science, which has not been analysed on this scale previously. In this work, the corpus of the Geophysical Research Abstracts is used for the first time as a data base for analyses of topical text mining. For this, we used a sturdy and customizable software framework, based on the work of Schmitt et al. [1]. For the analysis we used the High Performance Computing infrastructure of the German Research Centre for Geosciences GFZ in Potsdam, Germany. Here, we report on the first results from the analysis of the continuous spreading the of use of Free and Open Source Software Tools (FOSS) within the EGU communities, mapping the general increase of FOSS-themed GRA articles in the last decade and the developing spatial patterns of involved parties and FOSS topics. References: [1] Schmitt, L. M., Christianson, K.T, Gupta R..: Linguistic Computing with UNIX Tools, in Kao, A., Poteet S.R. (Eds.): Natural Language processing and Text Mining, Springer, 2007. doi:10.1007/978-1-84628-754-1_12.

  3. Text mining and its potential applications in systems biology.

    PubMed

    Ananiadou, Sophia; Kell, Douglas B; Tsujii, Jun-ichi

    2006-12-01

    With biomedical literature increasing at a rate of several thousand papers per week, it is impossible to keep abreast of all developments; therefore, automated means to manage the information overload are required. Text mining techniques, which involve the processes of information retrieval, information extraction and data mining, provide a means of solving this. By adding meaning to text, these techniques produce a more structured analysis of textual knowledge than simple word searches, and can provide powerful tools for the production and analysis of systems biology models.

  4. Mining adverse drug reactions from online healthcare forums using hidden Markov model.

    PubMed

    Sampathkumar, Hariprasad; Chen, Xue-wen; Luo, Bo

    2014-10-23

    Adverse Drug Reactions are one of the leading causes of injury or death among patients undergoing medical treatments. Not all Adverse Drug Reactions are identified before a drug is made available in the market. Current post-marketing drug surveillance methods, which are based purely on voluntary spontaneous reports, are unable to provide the early indications necessary to prevent the occurrence of such injuries or fatalities. The objective of this research is to extract reports of adverse drug side-effects from messages in online healthcare forums and use them as early indicators to assist in post-marketing drug surveillance. We treat the task of extracting adverse side-effects of drugs from healthcare forum messages as a sequence labeling problem and present a Hidden Markov Model(HMM) based Text Mining system that can be used to classify a message as containing drug side-effect information and then extract the adverse side-effect mentions from it. A manually annotated dataset from http://www.medications.com is used in the training and validation of the HMM based Text Mining system. A 10-fold cross-validation on the manually annotated dataset yielded on average an F-Score of 0.76 from the HMM Classifier, in comparison to 0.575 from the Baseline classifier. Without the Plain Text Filter component as a part of the Text Processing module, the F-Score of the HMM Classifier was reduced to 0.378 on average, while absence of the HTML Filter component was found to have no impact. Reducing the Drug names dictionary size by half, on average reduced the F-Score of the HMM Classifier to 0.359, while a similar reduction to the side-effects dictionary yielded an F-Score of 0.651 on average. Adverse side-effects mined from http://www.medications.com and http://www.steadyhealth.com were found to match the Adverse Drug Reactions on the Drug Package Labels of several drugs. In addition, some novel adverse side-effects, which can be potential Adverse Drug Reactions, were also identified. The results from the HMM based Text Miner are encouraging to pursue further enhancements to this approach. The mined novel side-effects can act as early indicators for health authorities to help focus their efforts in post-marketing drug surveillance.

  5. BioCreative V track 4: a shared task for the extraction of causal network information using the Biological Expression Language.

    PubMed

    Rinaldi, Fabio; Ellendorff, Tilia Renate; Madan, Sumit; Clematide, Simon; van der Lek, Adrian; Mevissen, Theo; Fluck, Juliane

    2016-01-01

    Automatic extraction of biological network information is one of the most desired and most complex tasks in biological and medical text mining. Track 4 at BioCreative V attempts to approach this complexity using fragments of large-scale manually curated biological networks, represented in Biological Expression Language (BEL), as training and test data. BEL is an advanced knowledge representation format which has been designed to be both human readable and machine processable. The specific goal of track 4 was to evaluate text mining systems capable of automatically constructing BEL statements from given evidence text, and of retrieving evidence text for given BEL statements. Given the complexity of the task, we designed an evaluation methodology which gives credit to partially correct statements. We identified various levels of information expressed by BEL statements, such as entities, functions, relations, and introduced an evaluation framework which rewards systems capable of delivering useful BEL fragments at each of these levels. The aim of this evaluation method is to help identify the characteristics of the systems which, if combined, would be most useful for achieving the overall goal of automatically constructing causal biological networks from text. © The Author(s) 2016. Published by Oxford University Press.

  6. Thematic clustering of text documents using an EM-based approach

    PubMed Central

    2012-01-01

    Clustering textual contents is an important step in mining useful information on the web or other text-based resources. The common task in text clustering is to handle text in a multi-dimensional space, and to partition documents into groups, where each group contains documents that are similar to each other. However, this strategy lacks a comprehensive view for humans in general since it cannot explain the main subject of each cluster. Utilizing semantic information can solve this problem, but it needs a well-defined ontology or pre-labeled gold standard set. In this paper, we present a thematic clustering algorithm for text documents. Given text, subject terms are extracted and used for clustering documents in a probabilistic framework. An EM approach is used to ensure documents are assigned to correct subjects, hence it converges to a locally optimal solution. The proposed method is distinctive because its results are sufficiently explanatory for human understanding as well as efficient for clustering performance. The experimental results show that the proposed method provides a competitive performance compared to other state-of-the-art approaches. We also show that the extracted themes from the MEDLINE® dataset represent the subjects of clusters reasonably well. PMID:23046528

  7. Modelling the sensory space of varietal wines: Mining of large, unstructured text data and visualisation of style patterns.

    PubMed

    Valente, Carlo C; Bauer, Florian F; Venter, Fritz; Watson, Bruce; Nieuwoudt, Hélène H

    2018-03-21

    The increasingly large volumes of publicly available sensory descriptions of wine raises the question whether this source of data can be mined to extract meaningful domain-specific information about the sensory properties of wine. We introduce a novel application of formal concept lattices, in combination with traditional statistical tests, to visualise the sensory attributes of a big data set of some 7,000 Chenin blanc and Sauvignon blanc wines. Complexity was identified as an important driver of style in hereto uncharacterised Chenin blanc, and the sensory cues for specific styles were identified. This is the first study to apply these methods for the purpose of identifying styles within varietal wines. More generally, our interactive data visualisation and mining driven approach opens up new investigations towards better understanding of the complex field of sensory science.

  8. VisualUrText: A Text Analytics Tool for Unstructured Textual Data

    NASA Astrophysics Data System (ADS)

    Zainol, Zuraini; Jaymes, Mohd T. H.; Nohuddin, Puteri N. E.

    2018-05-01

    The growing amount of unstructured text over Internet is tremendous. Text repositories come from Web 2.0, business intelligence and social networking applications. It is also believed that 80-90% of future growth data is available in the form of unstructured text databases that may potentially contain interesting patterns and trends. Text Mining is well known technique for discovering interesting patterns and trends which are non-trivial knowledge from massive unstructured text data. Text Mining covers multidisciplinary fields involving information retrieval (IR), text analysis, natural language processing (NLP), data mining, machine learning statistics and computational linguistics. This paper discusses the development of text analytics tool that is proficient in extracting, processing, analyzing the unstructured text data and visualizing cleaned text data into multiple forms such as Document Term Matrix (DTM), Frequency Graph, Network Analysis Graph, Word Cloud and Dendogram. This tool, VisualUrText, is developed to assist students and researchers for extracting interesting patterns and trends in document analyses.

  9. BioCreative Workshops for DOE Genome Sciences: Text Mining for Metagenomics

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wu, Cathy H.; Hirschman, Lynette

    The objective of this project was to host BioCreative workshops to define and develop text mining tasks to meet the needs of the Genome Sciences community, focusing on metadata information extraction in metagenomics. Following the successful introduction of metagenomics at the BioCreative IV workshop, members of the metagenomics community and BioCreative communities continued discussion to identify candidate topics for a BioCreative metagenomics track for BioCreative V. Of particular interest was the capture of environmental and isolation source information from text. The outcome was to form a “community of interest” around work on the interactive EXTRACT system, which supported interactive taggingmore » of environmental and species data. This experiment is included in the BioCreative V virtual issue of Database. In addition, there was broad participation by members of the metagenomics community in the panels held at BioCreative V, leading to valuable exchanges between the text mining developers and members of the metagenomics research community. These exchanges are reflected in a number of the overview and perspective pieces also being captured in the BioCreative V virtual issue. Overall, this conversation has exposed the metagenomics researchers to the possibilities of text mining, and educated the text mining developers to the specific needs of the metagenomics community.« less

  10. Benchmarking infrastructure for mutation text mining

    PubMed Central

    2014-01-01

    Background Experimental research on the automatic extraction of information about mutations from texts is greatly hindered by the lack of consensus evaluation infrastructure for the testing and benchmarking of mutation text mining systems. Results We propose a community-oriented annotation and benchmarking infrastructure to support development, testing, benchmarking, and comparison of mutation text mining systems. The design is based on semantic standards, where RDF is used to represent annotations, an OWL ontology provides an extensible schema for the data and SPARQL is used to compute various performance metrics, so that in many cases no programming is needed to analyze results from a text mining system. While large benchmark corpora for biological entity and relation extraction are focused mostly on genes, proteins, diseases, and species, our benchmarking infrastructure fills the gap for mutation information. The core infrastructure comprises (1) an ontology for modelling annotations, (2) SPARQL queries for computing performance metrics, and (3) a sizeable collection of manually curated documents, that can support mutation grounding and mutation impact extraction experiments. Conclusion We have developed the principal infrastructure for the benchmarking of mutation text mining tasks. The use of RDF and OWL as the representation for corpora ensures extensibility. The infrastructure is suitable for out-of-the-box use in several important scenarios and is ready, in its current state, for initial community adoption. PMID:24568600

  11. Benchmarking infrastructure for mutation text mining.

    PubMed

    Klein, Artjom; Riazanov, Alexandre; Hindle, Matthew M; Baker, Christopher Jo

    2014-02-25

    Experimental research on the automatic extraction of information about mutations from texts is greatly hindered by the lack of consensus evaluation infrastructure for the testing and benchmarking of mutation text mining systems. We propose a community-oriented annotation and benchmarking infrastructure to support development, testing, benchmarking, and comparison of mutation text mining systems. The design is based on semantic standards, where RDF is used to represent annotations, an OWL ontology provides an extensible schema for the data and SPARQL is used to compute various performance metrics, so that in many cases no programming is needed to analyze results from a text mining system. While large benchmark corpora for biological entity and relation extraction are focused mostly on genes, proteins, diseases, and species, our benchmarking infrastructure fills the gap for mutation information. The core infrastructure comprises (1) an ontology for modelling annotations, (2) SPARQL queries for computing performance metrics, and (3) a sizeable collection of manually curated documents, that can support mutation grounding and mutation impact extraction experiments. We have developed the principal infrastructure for the benchmarking of mutation text mining tasks. The use of RDF and OWL as the representation for corpora ensures extensibility. The infrastructure is suitable for out-of-the-box use in several important scenarios and is ready, in its current state, for initial community adoption.

  12. Evaluating a Bilingual Text-Mining System with a Taxonomy of Key Words and Hierarchical Visualization for Understanding Learner-Generated Text

    ERIC Educational Resources Information Center

    Kong, Siu Cheung; Li, Ping; Song, Yanjie

    2018-01-01

    This study evaluated a bilingual text-mining system, which incorporated a bilingual taxonomy of key words and provided hierarchical visualization, for understanding learner-generated text in the learning management systems through automatic identification and counting of matching key words. A class of 27 in-service teachers studied a course…

  13. Beyond accuracy: creating interoperable and scalable text-mining web services.

    PubMed

    Wei, Chih-Hsuan; Leaman, Robert; Lu, Zhiyong

    2016-06-15

    The biomedical literature is a knowledge-rich resource and an important foundation for future research. With over 24 million articles in PubMed and an increasing growth rate, research in automated text processing is becoming increasingly important. We report here our recently developed web-based text mining services for biomedical concept recognition and normalization. Unlike most text-mining software tools, our web services integrate several state-of-the-art entity tagging systems (DNorm, GNormPlus, SR4GN, tmChem and tmVar) and offer a batch-processing mode able to process arbitrary text input (e.g. scholarly publications, patents and medical records) in multiple formats (e.g. BioC). We support multiple standards to make our service interoperable and allow simpler integration with other text-processing pipelines. To maximize scalability, we have preprocessed all PubMed articles, and use a computer cluster for processing large requests of arbitrary text. Our text-mining web service is freely available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/#curl : Zhiyong.Lu@nih.gov. Published by Oxford University Press 2016. This work is written by US Government employees and is in the public domain in the US.

  14. 75 FR 51291 - National Science Board: Sunshine Act Meetings; Notice

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-08-19

    ...-Gathering Activities. [cir] COV Report Text-Mining. [cir] Design of Research Questions for External Input. [cir] SBE/CISE Text-Mining Projects. [cir] Using a Blog for Informal Input. Committee on Education and...

  15. ChemBrowser: a flexible framework for mining chemical documents.

    PubMed

    Wu, Xian; Zhang, Li; Chen, Ying; Rhodes, James; Griffin, Thomas D; Boyer, Stephen K; Alba, Alfredo; Cai, Keke

    2010-01-01

    The ability to extract chemical and biological entities and relations from text documents automatically has great value to biochemical research and development activities. The growing maturity of text mining and artificial intelligence technologies shows promise in enabling such automatic chemical entity extraction capabilities (called "Chemical Annotation" in this paper). Many techniques have been reported in the literature, ranging from dictionary and rule-based techniques to machine learning approaches. In practice, we found that no single technique works well in all cases. A combinatorial approach that allows one to quickly compose different annotation techniques together for a given situation is most effective. In this paper, we describe the key challenges we face in real-world chemical annotation scenarios. We then present a solution called ChemBrowser which has a flexible framework for chemical annotation. ChemBrowser includes a suite of customizable processing units that might be utilized in a chemical annotator, a high-level language that describes the composition of various processing units that would form a chemical annotator, and an execution engine that translates the composition language to an actual annotator that can generate annotation results for a given set of documents. We demonstrate the impact of this approach by tailoring an annotator for extracting chemical names from patent documents and show how this annotator can be easily modified with simple configuration alone.

  16. Assimilating Text-Mining & Bio-Informatics Tools to Analyze Cellulase structures

    NASA Astrophysics Data System (ADS)

    Satyasree, K. P. N. V., Dr; Lalitha Kumari, B., Dr; Jyotsna Devi, K. S. N. V.; Choudri, S. M. Roy; Pratap Joshi, K.

    2017-08-01

    Text-mining is one of the best potential way of automatically extracting information from the huge biological literature. To exploit its prospective, the knowledge encrypted in the text should be converted to some semantic representation such as entities and relations, which could be analyzed by machines. But large-scale practical systems for this purpose are rare. But text mining could be helpful for generating or validating predictions. Cellulases have abundant applications in various industries. Cellulose degrading enzymes are cellulases and the same producing bacteria - Bacillus subtilis & fungus Pseudomonas putida were isolated from top soil of Guntur Dt. A.P. India. Absolute cultures were conserved on potato dextrose agar medium for molecular studies. In this paper, we presented how well the text mining concepts can be used to analyze cellulase producing bacteria and fungi, their comparative structures are also studied with the aid of well-establised, high quality standard bioinformatic tools such as Bioedit, Swissport, Protparam, EMBOSSwin with which a complete data on Cellulases like structure, constituents of the enzyme has been obtained.

  17. Mining the SDSS SkyServer SQL queries log

    NASA Astrophysics Data System (ADS)

    Hirota, Vitor M.; Santos, Rafael; Raddick, Jordan; Thakar, Ani

    2016-05-01

    SkyServer, the Internet portal for the Sloan Digital Sky Survey (SDSS) astronomic catalog, provides a set of tools that allows data access for astronomers and scientific education. One of SkyServer data access interfaces allows users to enter ad-hoc SQL statements to query the catalog. SkyServer also presents some template queries that can be used as basis for more complex queries. This interface has logged over 330 million queries submitted since 2001. It is expected that analysis of this data can be used to investigate usage patterns, identify potential new classes of queries, find similar queries, etc. and to shed some light on how users interact with the Sloan Digital Sky Survey data and how scientists have adopted the new paradigm of e-Science, which could in turn lead to enhancements on the user interfaces and experience in general. In this paper we review some approaches to SQL query mining, apply the traditional techniques used in the literature and present lessons learned, namely, that the general text mining approach for feature extraction and clustering does not seem to be adequate for this type of data, and, most importantly, we find that this type of analysis can result in very different queries being clustered together.

  18. A knowledge engineering approach to recognizing and extracting sequences of nucleic acids from scientific literature.

    PubMed

    García-Remesal, Miguel; Maojo, Victor; Crespo, José

    2010-01-01

    In this paper we present a knowledge engineering approach to automatically recognize and extract genetic sequences from scientific articles. To carry out this task, we use a preliminary recognizer based on a finite state machine to extract all candidate DNA/RNA sequences. The latter are then fed into a knowledge-based system that automatically discards false positives and refines noisy and incorrectly merged sequences. We created the knowledge base by manually analyzing different manuscripts containing genetic sequences. Our approach was evaluated using a test set of 211 full-text articles in PDF format containing 3134 genetic sequences. For such set, we achieved 87.76% precision and 97.70% recall respectively. This method can facilitate different research tasks. These include text mining, information extraction, and information retrieval research dealing with large collections of documents containing genetic sequences.

  19. Detection and Evaluation of Cheating on College Exams Using Supervised Classification

    ERIC Educational Resources Information Center

    Cavalcanti, Elmano Ramalho; Pires, Carlos Eduardo; Cavalcanti, Elmano Pontes; Pires, Vládia Freire

    2012-01-01

    Text mining has been used for various purposes, such as document classification and extraction of domain-specific information from text. In this paper we present a study in which text mining methodology and algorithms were properly employed for academic dishonesty (cheating) detection and evaluation on open-ended college exams, based on document…

  20. What the papers say: Text mining for genomics and systems biology

    PubMed Central

    2010-01-01

    Keeping up with the rapidly growing literature has become virtually impossible for most scientists. This can have dire consequences. First, we may waste research time and resources on reinventing the wheel simply because we can no longer maintain a reliable grasp on the published literature. Second, and perhaps more detrimental, judicious (or serendipitous) combination of knowledge from different scientific disciplines, which would require following disparate and distinct research literatures, is rapidly becoming impossible for even the most ardent readers of research publications. Text mining -- the automated extraction of information from (electronically) published sources -- could potentially fulfil an important role -- but only if we know how to harness its strengths and overcome its weaknesses. As we do not expect that the rate at which scientific results are published will decrease, text mining tools are now becoming essential in order to cope with, and derive maximum benefit from, this information explosion. In genomics, this is particularly pressing as more and more rare disease-causing variants are found and need to be understood. Not being conversant with this technology may put scientists and biomedical regulators at a severe disadvantage. In this review, we introduce the basic concepts underlying modern text mining and its applications in genomics and systems biology. We hope that this review will serve three purposes: (i) to provide a timely and useful overview of the current status of this field, including a survey of present challenges; (ii) to enable researchers to decide how and when to apply text mining tools in their own research; and (iii) to highlight how the research communities in genomics and systems biology can help to make text mining from biomedical abstracts and texts more straightforward. PMID:21106487

  1. PubMedPortable: A Framework for Supporting the Development of Text Mining Applications.

    PubMed

    Döring, Kersten; Grüning, Björn A; Telukunta, Kiran K; Thomas, Philippe; Günther, Stefan

    2016-01-01

    Information extraction from biomedical literature is continuously growing in scope and importance. Many tools exist that perform named entity recognition, e.g. of proteins, chemical compounds, and diseases. Furthermore, several approaches deal with the extraction of relations between identified entities. The BioCreative community supports these developments with yearly open challenges, which led to a standardised XML text annotation format called BioC. PubMed provides access to the largest open biomedical literature repository, but there is no unified way of connecting its data to natural language processing tools. Therefore, an appropriate data environment is needed as a basis to combine different software solutions and to develop customised text mining applications. PubMedPortable builds a relational database and a full text index on PubMed citations. It can be applied either to the complete PubMed data set or an arbitrary subset of downloaded PubMed XML files. The software provides the infrastructure to combine stand-alone applications by exporting different data formats, e.g. BioC. The presented workflows show how to use PubMedPortable to retrieve, store, and analyse a disease-specific data set. The provided use cases are well documented in the PubMedPortable wiki. The open-source software library is small, easy to use, and scalable to the user's system requirements. It is freely available for Linux on the web at https://github.com/KerstenDoering/PubMedPortable and for other operating systems as a virtual container. The approach was tested extensively and applied successfully in several projects.

  2. PubMedPortable: A Framework for Supporting the Development of Text Mining Applications

    PubMed Central

    Döring, Kersten; Grüning, Björn A.; Telukunta, Kiran K.; Thomas, Philippe; Günther, Stefan

    2016-01-01

    Information extraction from biomedical literature is continuously growing in scope and importance. Many tools exist that perform named entity recognition, e.g. of proteins, chemical compounds, and diseases. Furthermore, several approaches deal with the extraction of relations between identified entities. The BioCreative community supports these developments with yearly open challenges, which led to a standardised XML text annotation format called BioC. PubMed provides access to the largest open biomedical literature repository, but there is no unified way of connecting its data to natural language processing tools. Therefore, an appropriate data environment is needed as a basis to combine different software solutions and to develop customised text mining applications. PubMedPortable builds a relational database and a full text index on PubMed citations. It can be applied either to the complete PubMed data set or an arbitrary subset of downloaded PubMed XML files. The software provides the infrastructure to combine stand-alone applications by exporting different data formats, e.g. BioC. The presented workflows show how to use PubMedPortable to retrieve, store, and analyse a disease-specific data set. The provided use cases are well documented in the PubMedPortable wiki. The open-source software library is small, easy to use, and scalable to the user’s system requirements. It is freely available for Linux on the web at https://github.com/KerstenDoering/PubMedPortable and for other operating systems as a virtual container. The approach was tested extensively and applied successfully in several projects. PMID:27706202

  3. Data Mining.

    ERIC Educational Resources Information Center

    Benoit, Gerald

    2002-01-01

    Discusses data mining (DM) and knowledge discovery in databases (KDD), taking the view that KDD is the larger view of the entire process, with DM emphasizing the cleaning, warehousing, mining, and visualization of knowledge discovery in databases. Highlights include algorithms; users; the Internet; text mining; and information extraction.…

  4. Ask and Ye Shall Receive? Automated Text Mining of Michigan Capital Facility Finance Bond Election Proposals to Identify Which Topics Are Associated with Bond Passage and Voter Turnout

    ERIC Educational Resources Information Center

    Bowers, Alex J.; Chen, Jingjing

    2015-01-01

    The purpose of this study is to bring together recent innovations in the research literature around school district capital facility finance, municipal bond elections, statistical models of conditional time-varying outcomes, and data mining algorithms for automated text mining of election ballot proposals to examine the factors that influence the…

  5. A survey on annotation tools for the biomedical literature.

    PubMed

    Neves, Mariana; Leser, Ulf

    2014-03-01

    New approaches to biomedical text mining crucially depend on the existence of comprehensive annotated corpora. Such corpora, commonly called gold standards, are important for learning patterns or models during the training phase, for evaluating and comparing the performance of algorithms and also for better understanding the information sought for by means of examples. Gold standards depend on human understanding and manual annotation of natural language text. This process is very time-consuming and expensive because it requires high intellectual effort from domain experts. Accordingly, the lack of gold standards is considered as one of the main bottlenecks for developing novel text mining methods. This situation led the development of tools that support humans in annotating texts. Such tools should be intuitive to use, should support a range of different input formats, should include visualization of annotated texts and should generate an easy-to-parse output format. Today, a range of tools which implement some of these functionalities are available. In this survey, we present a comprehensive survey of tools for supporting annotation of biomedical texts. Altogether, we considered almost 30 tools, 13 of which were selected for an in-depth comparison. The comparison was performed using predefined criteria and was accompanied by hands-on experiences whenever possible. Our survey shows that current tools can support many of the tasks in biomedical text annotation in a satisfying manner, but also that no tool can be considered as a true comprehensive solution.

  6. Impact of translation on named-entity recognition in radiology texts

    PubMed Central

    Pedro, Vasco

    2017-01-01

    Abstract Radiology reports describe the results of radiography procedures and have the potential of being a useful source of information which can bring benefits to health care systems around the world. One way to automatically extract information from the reports is by using Text Mining tools. The problem is that these tools are mostly developed for English and reports are usually written in the native language of the radiologist, which is not necessarily English. This creates an obstacle to the sharing of Radiology information between different communities. This work explores the solution of translating the reports to English before applying the Text Mining tools, probing the question of what translation approach should be used. We created MRRAD (Multilingual Radiology Research Articles Dataset), a parallel corpus of Portuguese research articles related to Radiology and a number of alternative translations (human, automatic and semi-automatic) to English. This is a novel corpus which can be used to move forward the research on this topic. Using MRRAD we studied which kind of automatic or semi-automatic translation approach is more effective on the Named-entity recognition task of finding RadLex terms in the English version of the articles. Considering the terms extracted from human translations as our gold standard, we calculated how similar to this standard were the terms extracted using other translations. We found that a completely automatic translation approach using Google leads to F-scores (between 0.861 and 0.868, depending on the extraction approach) similar to the ones obtained through a more expensive semi-automatic translation approach using Unbabel (between 0.862 and 0.870). To better understand the results we also performed a qualitative analysis of the type of errors found in the automatic and semi-automatic translations. Database URL: https://github.com/lasigeBioTM/MRRAD PMID:29220455

  7. Using Open Web APIs in Teaching Web Mining

    ERIC Educational Resources Information Center

    Chen, Hsinchun; Li, Xin; Chau, M.; Ho, Yi-Jen; Tseng, Chunju

    2009-01-01

    With the advent of the World Wide Web, many business applications that utilize data mining and text mining techniques to extract useful business information on the Web have evolved from Web searching to Web mining. It is important for students to acquire knowledge and hands-on experience in Web mining during their education in information systems…

  8. The contribution of the vaccine adverse event text mining system to the classification of possible Guillain-Barré syndrome reports.

    PubMed

    Botsis, T; Woo, E J; Ball, R

    2013-01-01

    We previously demonstrated that a general purpose text mining system, the Vaccine adverse event Text Mining (VaeTM) system, could be used to automatically classify reports of an-aphylaxis for post-marketing safety surveillance of vaccines. To evaluate the ability of VaeTM to classify reports to the Vaccine Adverse Event Reporting System (VAERS) of possible Guillain-Barré Syndrome (GBS). We used VaeTM to extract the key diagnostic features from the text of reports in VAERS. Then, we applied the Brighton Collaboration (BC) case definition for GBS, and an information retrieval strategy (i.e. the vector space model) to quantify the specific information that is included in the key features extracted by VaeTM and compared it with the encoded information that is already stored in VAERS as Medical Dictionary for Regulatory Activities (MedDRA) Preferred Terms (PTs). We also evaluated the contribution of the primary (diagnosis and cause of death) and secondary (second level diagnosis and symptoms) diagnostic VaeTM-based features to the total VaeTM-based information. MedDRA captured more information and better supported the classification of reports for GBS than VaeTM (AUC: 0.904 vs. 0.777); the lower performance of VaeTM is likely due to the lack of extraction by VaeTM of specific laboratory results that are included in the BC criteria for GBS. On the other hand, the VaeTM-based classification exhibited greater specificity than the MedDRA-based approach (94.96% vs. 87.65%). Most of the VaeTM-based information was contained in the secondary diagnostic features. For GBS, clinical signs and symptoms alone are not sufficient to match MedDRA coding for purposes of case classification, but are preferred if specificity is the priority.

  9. Are Female Applicants Disadvantaged in National Institutes of Health Peer Review? Combining Algorithmic Text Mining and Qualitative Methods to Detect Evaluative Differences in R01 Reviewers' Critiques.

    PubMed

    Magua, Wairimu; Zhu, Xiaojin; Bhattacharya, Anupama; Filut, Amarette; Potvien, Aaron; Leatherberry, Renee; Lee, You-Geon; Jens, Madeline; Malikireddy, Dastagiri; Carnes, Molly; Kaatz, Anna

    2017-05-01

    Women are less successful than men in renewing R01 grants from the National Institutes of Health. Continuing to probe text mining as a tool to identify gender bias in peer review, we used algorithmic text mining and qualitative analysis to examine a sample of critiques from men's and women's R01 renewal applications previously analyzed by counting and comparing word categories. We analyzed 241 critiques from 79 Summary Statements for 51 R01 renewals awarded to 45 investigators (64% male, 89% white, 80% PhD) at the University of Wisconsin-Madison between 2010 and 2014. We used latent Dirichlet allocation to discover evaluative "topics" (i.e., words that co-occur with high probability). We then qualitatively examined the context in which evaluative words occurred for male and female investigators. We also examined sex differences in assigned scores controlling for investigator productivity. Text analysis results showed that male investigators were described as "leaders" and "pioneers" in their "fields," with "highly innovative" and "highly significant research." By comparison, female investigators were characterized as having "expertise" and working in "excellent" environments. Applications from men received significantly better priority, approach, and significance scores, which could not be accounted for by differences in productivity. Results confirm our previous analyses suggesting that gender stereotypes operate in R01 grant peer review. Reviewers may more easily view male than female investigators as scientific leaders with significant and innovative research, and score their applications more competitively. Such implicit bias may contribute to sex differences in award rates for R01 renewals.

  10. Are Female Applicants Disadvantaged in National Institutes of Health Peer Review? Combining Algorithmic Text Mining and Qualitative Methods to Detect Evaluative Differences in R01 Reviewers' Critiques

    PubMed Central

    Magua, Wairimu; Zhu, Xiaojin; Bhattacharya, Anupama; Filut, Amarette; Potvien, Aaron; Leatherberry, Renee; Lee, You-Geon; Jens, Madeline; Malikireddy, Dastagiri; Carnes, Molly

    2017-01-01

    Abstract Background: Women are less successful than men in renewing R01 grants from the National Institutes of Health. Continuing to probe text mining as a tool to identify gender bias in peer review, we used algorithmic text mining and qualitative analysis to examine a sample of critiques from men's and women's R01 renewal applications previously analyzed by counting and comparing word categories. Methods: We analyzed 241 critiques from 79 Summary Statements for 51 R01 renewals awarded to 45 investigators (64% male, 89% white, 80% PhD) at the University of Wisconsin-Madison between 2010 and 2014. We used latent Dirichlet allocation to discover evaluative “topics” (i.e., words that co-occur with high probability). We then qualitatively examined the context in which evaluative words occurred for male and female investigators. We also examined sex differences in assigned scores controlling for investigator productivity. Results: Text analysis results showed that male investigators were described as “leaders” and “pioneers” in their “fields,” with “highly innovative” and “highly significant research.” By comparison, female investigators were characterized as having “expertise” and working in “excellent” environments. Applications from men received significantly better priority, approach, and significance scores, which could not be accounted for by differences in productivity. Conclusions: Results confirm our previous analyses suggesting that gender stereotypes operate in R01 grant peer review. Reviewers may more easily view male than female investigators as scientific leaders with significant and innovative research, and score their applications more competitively. Such implicit bias may contribute to sex differences in award rates for R01 renewals. PMID:28281870

  11. Constructing Model of Relationship among Behaviors and Injuries to Products Based on Large Scale Text Data on Injuries

    NASA Astrophysics Data System (ADS)

    Nomori, Koji; Kitamura, Koji; Motomura, Yoichi; Nishida, Yoshifumi; Yamanaka, Tatsuhiro; Komatsubara, Akinori

    In Japan, childhood injury prevention is urgent issue. Safety measures through creating knowledge of injury data are essential for preventing childhood injuries. Especially the injury prevention approach by product modification is very important. The risk assessment is one of the most fundamental methods to design safety products. The conventional risk assessment has been carried out subjectively because product makers have poor data on injuries. This paper deals with evidence-based risk assessment, in which artificial intelligence technologies are strongly needed. This paper describes a new method of foreseeing usage of products, which is the first step of the evidence-based risk assessment, and presents a retrieval system of injury data. The system enables a product designer to foresee how children use a product and which types of injuries occur due to the product in daily environment. The developed system consists of large scale injury data, text mining technology and probabilistic modeling technology. Large scale text data on childhood injuries was collected from medical institutions by an injury surveillance system. Types of behaviors to a product were derived from the injury text data using text mining technology. The relationship among products, types of behaviors, types of injuries and characteristics of children was modeled by Bayesian Network. The fundamental functions of the developed system and examples of new findings obtained by the system are reported in this paper.

  12. The Role of Work Experiences in College Student Leadership Development: Evidence from a National Dataset and a Text Mining Approach to Examining Beliefs about Leadership

    ERIC Educational Resources Information Center

    Lewis, Jonathan S.

    2017-01-01

    Paid employment is one of the most common extracurricular activities among full-time undergraduates, and an array of studies has attempted to measure its impact. Methodological concerns with the extant literature, however, make it difficult to draw reliable conclusions. Furthermore, the research on working college students has little to say about…

  13. Citation-related reliability analysis for a pilot sample of underground coal mines.

    PubMed

    Kinilakodi, Harisha; Grayson, R Larry

    2011-05-01

    The scrutiny of underground coal mine safety was heightened because of the disasters that occurred in 2006-2007, and more recently in 2010. In the aftermath of the 2006 incidents, the U.S. Congress passed the Mine Improvement and New Emergency Response Act of 2006 (MINER Act), which strengthened the existing regulations and mandated new laws to address various issues related to emergency preparedness and response, escape from an emergency situation, and protection of miners. The National Mining Association-sponsored Mine Safety Technology and Training Commission study highlighted the role of risk management in identifying and controlling major hazards, which are elements that could come together and cause a mine disaster. In 2007 MSHA revised its approach to the "Pattern of Violations" (POV) process in order to target unsafe mines and then force them to remediate conditions in their mines. The POV approach has certain limitations that make it difficult for it to be enforced. One very understandable way to focus on removing threats from major-hazard conditions is to use citation-related reliability analysis. The citation reliability approach, which focuses on the probability of not getting a citation on a given inspector day, is considered an analogue to the maintenance reliability approach, which many mine operators understand and use. In this study, the citation reliability approach was applied to a stratified random sample of 31 underground coal mines to examine its potential for broader application. The results clearly show the best-performing and worst-performing mines for compliance with mine safety standards, and they highlight differences among different mine sizes. Copyright © 2010 Elsevier Ltd. All rights reserved.

  14. Text Mining of Journal Articles for Sleep Disorder Terminologies.

    PubMed

    Lam, Calvin; Lai, Fu-Chih; Wang, Chia-Hui; Lai, Mei-Hsin; Hsu, Nanly; Chung, Min-Huey

    2016-01-01

    Research on publication trends in journal articles on sleep disorders (SDs) and the associated methodologies by using text mining has been limited. The present study involved text mining for terms to determine the publication trends in sleep-related journal articles published during 2000-2013 and to identify associations between SD and methodology terms as well as conducting statistical analyses of the text mining findings. SD and methodology terms were extracted from 3,720 sleep-related journal articles in the PubMed database by using MetaMap. The extracted data set was analyzed using hierarchical cluster analyses and adjusted logistic regression models to investigate publication trends and associations between SD and methodology terms. MetaMap had a text mining precision, recall, and false positive rate of 0.70, 0.77, and 11.51%, respectively. The most common SD term was breathing-related sleep disorder, whereas narcolepsy was the least common. Cluster analyses showed similar methodology clusters for each SD term, except narcolepsy. The logistic regression models showed an increasing prevalence of insomnia, parasomnia, and other sleep disorders but a decreasing prevalence of breathing-related sleep disorder during 2000-2013. Different SD terms were positively associated with different methodology terms regarding research design terms, measure terms, and analysis terms. Insomnia-, parasomnia-, and other sleep disorder-related articles showed an increasing publication trend, whereas those related to breathing-related sleep disorder showed a decreasing trend. Furthermore, experimental studies more commonly focused on hypersomnia and other SDs and less commonly on insomnia, breathing-related sleep disorder, narcolepsy, and parasomnia. Thus, text mining may facilitate the exploration of the publication trends in SDs and the associated methodologies.

  15. Information Retrieval and Text Mining Technologies for Chemistry.

    PubMed

    Krallinger, Martin; Rabal, Obdulia; Lourenço, Anália; Oyarzabal, Julen; Valencia, Alfonso

    2017-06-28

    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.

  16. Text mining applications in psychiatry: a systematic literature review.

    PubMed

    Abbe, Adeline; Grouin, Cyril; Zweigenbaum, Pierre; Falissard, Bruno

    2016-06-01

    The expansion of biomedical literature is creating the need for efficient tools to keep pace with increasing volumes of information. Text mining (TM) approaches are becoming essential to facilitate the automated extraction of useful biomedical information from unstructured text. We reviewed the applications of TM in psychiatry, and explored its advantages and limitations. A systematic review of the literature was carried out using the CINAHL, Medline, EMBASE, PsycINFO and Cochrane databases. In this review, 1103 papers were screened, and 38 were included as applications of TM in psychiatric research. Using TM and content analysis, we identified four major areas of application: (1) Psychopathology (i.e. observational studies focusing on mental illnesses) (2) the Patient perspective (i.e. patients' thoughts and opinions), (3) Medical records (i.e. safety issues, quality of care and description of treatments), and (4) Medical literature (i.e. identification of new scientific information in the literature). The information sources were qualitative studies, Internet postings, medical records and biomedical literature. Our work demonstrates that TM can contribute to complex research tasks in psychiatry. We discuss the benefits, limits, and further applications of this tool in the future. Copyright © 2015 John Wiley & Sons, Ltd. Copyright © 2015 John Wiley & Sons, Ltd.

  17. [Text mining, a method for computer-assisted analysis of scientific texts, demonstrated by an analysis of author networks].

    PubMed

    Hahn, P; Dullweber, F; Unglaub, F; Spies, C K

    2014-06-01

    Searching for relevant publications is becoming more difficult with the increasing number of scientific articles. Text mining as a specific form of computer-based data analysis may be helpful in this context. Highlighting relations between authors and finding relevant publications concerning a specific subject using text analysis programs are illustrated graphically by 2 performed examples. © Georg Thieme Verlag KG Stuttgart · New York.

  18. Extracting biomedical events from pairs of text entities

    PubMed Central

    2015-01-01

    Background Huge amounts of electronic biomedical documents, such as molecular biology reports or genomic papers are generated daily. Nowadays, these documents are mainly available in the form of unstructured free texts, which require heavy processing for their registration into organized databases. This organization is instrumental for information retrieval, enabling to answer the advanced queries of researchers and practitioners in biology, medicine, and related fields. Hence, the massive data flow calls for efficient automatic methods of text-mining that extract high-level information, such as biomedical events, from biomedical text. The usual computational tools of Natural Language Processing cannot be readily applied to extract these biomedical events, due to the peculiarities of the domain. Indeed, biomedical documents contain highly domain-specific jargon and syntax. These documents also describe distinctive dependencies, making text-mining in molecular biology a specific discipline. Results We address biomedical event extraction as the classification of pairs of text entities into the classes corresponding to event types. The candidate pairs of text entities are recursively provided to a multiclass classifier relying on Support Vector Machines. This recursive process extracts events involving other events as arguments. Compared to joint models based on Markov Random Fields, our model simplifies inference and hence requires shorter training and prediction times along with lower memory capacity. Compared to usual pipeline approaches, our model passes over a complex intermediate problem, while making a more extensive usage of sophisticated joint features between text entities. Our method focuses on the core event extraction of the Genia task of BioNLP challenges yielding the best result reported so far on the 2013 edition. PMID:26201478

  19. A Cloud-based Approach to Medical NLP

    PubMed Central

    Chard, Kyle; Russell, Michael; Lussier, Yves A.; Mendonça, Eneida A; Silverstein, Jonathan C.

    2011-01-01

    Natural Language Processing (NLP) enables access to deep content embedded in medical texts. To date, NLP has not fulfilled its promise of enabling robust clinical encoding, clinical use, quality improvement, and research. We submit that this is in part due to poor accessibility, scalability, and flexibility of NLP systems. We describe here an approach and system which leverages cloud-based approaches such as virtual machines and Representational State Transfer (REST) to extract, process, synthesize, mine, compare/contrast, explore, and manage medical text data in a flexibly secure and scalable architecture. Available architectures in which our Smntx (pronounced as semantics) system can be deployed include: virtual machines in a HIPAA-protected hospital environment, brought up to run analysis over bulk data and destroyed in a local cloud; a commercial cloud for a large complex multi-institutional trial; and within other architectures such as caGrid, i2b2, or NHIN. PMID:22195072

  20. A cloud-based approach to medical NLP.

    PubMed

    Chard, Kyle; Russell, Michael; Lussier, Yves A; Mendonça, Eneida A; Silverstein, Jonathan C

    2011-01-01

    Natural Language Processing (NLP) enables access to deep content embedded in medical texts. To date, NLP has not fulfilled its promise of enabling robust clinical encoding, clinical use, quality improvement, and research. We submit that this is in part due to poor accessibility, scalability, and flexibility of NLP systems. We describe here an approach and system which leverages cloud-based approaches such as virtual machines and Representational State Transfer (REST) to extract, process, synthesize, mine, compare/contrast, explore, and manage medical text data in a flexibly secure and scalable architecture. Available architectures in which our Smntx (pronounced as semantics) system can be deployed include: virtual machines in a HIPAA-protected hospital environment, brought up to run analysis over bulk data and destroyed in a local cloud; a commercial cloud for a large complex multi-institutional trial; and within other architectures such as caGrid, i2b2, or NHIN.

  1. VIOLENT FRAMES IN ACTION

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sanfilippo, Antonio P.; McGrath, Liam R.; Whitney, Paul D.

    2011-11-17

    We present a computational approach to radical rhetoric that leverages the co-expression of rhetoric and action features in discourse to identify violent intent. The approach combines text mining and machine learning techniques with insights from Frame Analysis and theories that explain the emergence of violence in terms of moral disengagement, the violation of sacred values and social isolation in order to build computational models that identify messages from terrorist sources and estimate their proximity to an attack. We discuss a specific application of this approach to a body of documents from and about radical and terrorist groups in the Middlemore » East and present the results achieved.« less

  2. Mining of Business-Oriented Conversations at a Call Center

    NASA Astrophysics Data System (ADS)

    Takeuchi, Hironori; Nasukawa, Tetsuya; Watanabe, Hideo

    Recently it has become feasible to transcribe textual records from telephone conversations at call centers by using automatic speech recognition. In this research, we extended a text mining system for call summary records and constructed a conversation mining system for the business-oriented conversations at the call center. To acquire useful business insights from the conversational data through the text mining system, it is critical to identify appropriate textual segments and expressions as the viewpoints to focus on. In the analysis of call summary data using a text mining system, some experts defined the viewpoints for the analysis by looking at some sample records and by preparing the dictionaries based on frequent keywords in the sample dataset. However with conversations it is difficult to identify such viewpoints manually and in advance because the target data consists of complete transcripts that are often lengthy and redundant. In this research, we defined a model of the business-oriented conversations and proposed a mining method to identify segments that have impacts on the outcomes of the conversations and can then extract useful expressions in each of these identified segments. In the experiment, we processed the real datasets from a car rental service center and constructed a mining system. With this system, we show the effectiveness of the method based on the defined conversation model.

  3. Recent progress in automatically extracting information from the pharmacogenomic literature

    PubMed Central

    Garten, Yael; Coulet, Adrien; Altman, Russ B

    2011-01-01

    The biomedical literature holds our understanding of pharmacogenomics, but it is dispersed across many journals. In order to integrate our knowledge, connect important facts across publications and generate new hypotheses we must organize and encode the contents of the literature. By creating databases of structured pharmocogenomic knowledge, we can make the value of the literature much greater than the sum of the individual reports. We can, for example, generate candidate gene lists or interpret surprising hits in genome-wide association studies. Text mining automatically adds structure to the unstructured knowledge embedded in millions of publications, and recent years have seen a surge in work on biomedical text mining, some specific to pharmacogenomics literature. These methods enable extraction of specific types of information and can also provide answers to general, systemic queries. In this article, we describe the main tasks of text mining in the context of pharmacogenomics, summarize recent applications and anticipate the next phase of text mining applications. PMID:21047206

  4. Proceedings: Fourth Workshop on Mining Scientific Datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kamath, C

    Commercial applications of data mining in areas such as e-commerce, market-basket analysis, text-mining, and web-mining have taken on a central focus in the JCDD community. However, there is a significant amount of innovative data mining work taking place in the context of scientific and engineering applications that is not well represented in the mainstream KDD conferences. For example, scientific data mining techniques are being developed and applied to diverse fields such as remote sensing, physics, chemistry, biology, astronomy, structural mechanics, computational fluid dynamics etc. In these areas, data mining frequently complements and enhances existing analysis methods based on statistics, exploratorymore » data analysis, and domain-specific approaches. On the surface, it may appear that data from one scientific field, say genomics, is very different from another field, such as physics. However, despite their diversity, there is much that is common across the mining of scientific and engineering data. For example, techniques used to identify objects in images are very similar, regardless of whether the images came from a remote sensing application, a physics experiment, an astronomy observation, or a medical study. Further, with data mining being applied to new types of data, such as mesh data from scientific simulations, there is the opportunity to apply and extend data mining to new scientific domains. This one-day workshop brings together data miners analyzing science data and scientists from diverse fields to share their experiences, learn how techniques developed in one field can be applied in another, and better understand some of the newer techniques being developed in the KDD community. This is the fourth workshop on the topic of Mining Scientific Data sets; for information on earlier workshops, see http://www.ahpcrc.org/conferences/. This workshop continues the tradition of addressing challenging problems in a field where the diversity of applications is matched only by the opportunities that await a practitioner.« less

  5. Application of text mining for customer evaluations in commercial banking

    NASA Astrophysics Data System (ADS)

    Tan, Jing; Du, Xiaojiang; Hao, Pengpeng; Wang, Yanbo J.

    2015-07-01

    Nowadays customer attrition is increasingly serious in commercial banks. To combat this problem roundly, mining customer evaluation texts is as important as mining customer structured data. In order to extract hidden information from customer evaluations, Textual Feature Selection, Classification and Association Rule Mining are necessary techniques. This paper presents all three techniques by using Chinese Word Segmentation, C5.0 and Apriori, and a set of experiments were run based on a collection of real textual data that includes 823 customer evaluations taken from a Chinese commercial bank. Results, consequent solutions, some advice for the commercial bank are given in this paper.

  6. Extracting semantically enriched events from biomedical literature

    PubMed Central

    2012-01-01

    Background Research into event-based text mining from the biomedical literature has been growing in popularity to facilitate the development of advanced biomedical text mining systems. Such technology permits advanced search, which goes beyond document or sentence-based retrieval. However, existing event-based systems typically ignore additional information within the textual context of events that can determine, amongst other things, whether an event represents a fact, hypothesis, experimental result or analysis of results, whether it describes new or previously reported knowledge, and whether it is speculated or negated. We refer to such contextual information as meta-knowledge. The automatic recognition of such information can permit the training of systems allowing finer-grained searching of events according to the meta-knowledge that is associated with them. Results Based on a corpus of 1,000 MEDLINE abstracts, fully manually annotated with both events and associated meta-knowledge, we have constructed a machine learning-based system that automatically assigns meta-knowledge information to events. This system has been integrated into EventMine, a state-of-the-art event extraction system, in order to create a more advanced system (EventMine-MK) that not only extracts events from text automatically, but also assigns five different types of meta-knowledge to these events. The meta-knowledge assignment module of EventMine-MK performs with macro-averaged F-scores in the range of 57-87% on the BioNLP’09 Shared Task corpus. EventMine-MK has been evaluated on the BioNLP’09 Shared Task subtask of detecting negated and speculated events. Our results show that EventMine-MK can outperform other state-of-the-art systems that participated in this task. Conclusions We have constructed the first practical system that extracts both events and associated, detailed meta-knowledge information from biomedical literature. The automatically assigned meta-knowledge information can be used to refine search systems, in order to provide an extra search layer beyond entities and assertions, dealing with phenomena such as rhetorical intent, speculations, contradictions and negations. This finer grained search functionality can assist in several important tasks, e.g., database curation (by locating new experimental knowledge) and pathway enrichment (by providing information for inference). To allow easy integration into text mining systems, EventMine-MK is provided as a UIMA component that can be used in the interoperable text mining infrastructure, U-Compare. PMID:22621266

  7. Extracting semantically enriched events from biomedical literature.

    PubMed

    Miwa, Makoto; Thompson, Paul; McNaught, John; Kell, Douglas B; Ananiadou, Sophia

    2012-05-23

    Research into event-based text mining from the biomedical literature has been growing in popularity to facilitate the development of advanced biomedical text mining systems. Such technology permits advanced search, which goes beyond document or sentence-based retrieval. However, existing event-based systems typically ignore additional information within the textual context of events that can determine, amongst other things, whether an event represents a fact, hypothesis, experimental result or analysis of results, whether it describes new or previously reported knowledge, and whether it is speculated or negated. We refer to such contextual information as meta-knowledge. The automatic recognition of such information can permit the training of systems allowing finer-grained searching of events according to the meta-knowledge that is associated with them. Based on a corpus of 1,000 MEDLINE abstracts, fully manually annotated with both events and associated meta-knowledge, we have constructed a machine learning-based system that automatically assigns meta-knowledge information to events. This system has been integrated into EventMine, a state-of-the-art event extraction system, in order to create a more advanced system (EventMine-MK) that not only extracts events from text automatically, but also assigns five different types of meta-knowledge to these events. The meta-knowledge assignment module of EventMine-MK performs with macro-averaged F-scores in the range of 57-87% on the BioNLP'09 Shared Task corpus. EventMine-MK has been evaluated on the BioNLP'09 Shared Task subtask of detecting negated and speculated events. Our results show that EventMine-MK can outperform other state-of-the-art systems that participated in this task. We have constructed the first practical system that extracts both events and associated, detailed meta-knowledge information from biomedical literature. The automatically assigned meta-knowledge information can be used to refine search systems, in order to provide an extra search layer beyond entities and assertions, dealing with phenomena such as rhetorical intent, speculations, contradictions and negations. This finer grained search functionality can assist in several important tasks, e.g., database curation (by locating new experimental knowledge) and pathway enrichment (by providing information for inference). To allow easy integration into text mining systems, EventMine-MK is provided as a UIMA component that can be used in the interoperable text mining infrastructure, U-Compare.

  8. Approach for Text Classification Based on the Similarity Measurement between Normal Cloud Models

    PubMed Central

    Dai, Jin; Liu, Xin

    2014-01-01

    The similarity between objects is the core research area of data mining. In order to reduce the interference of the uncertainty of nature language, a similarity measurement between normal cloud models is adopted to text classification research. On this basis, a novel text classifier based on cloud concept jumping up (CCJU-TC) is proposed. It can efficiently accomplish conversion between qualitative concept and quantitative data. Through the conversion from text set to text information table based on VSM model, the text qualitative concept, which is extraction from the same category, is jumping up as a whole category concept. According to the cloud similarity between the test text and each category concept, the test text is assigned to the most similar category. By the comparison among different text classifiers in different feature selection set, it fully proves that not only does CCJU-TC have a strong ability to adapt to the different text features, but also the classification performance is also better than the traditional classifiers. PMID:24711737

  9. Individual Profiling Using Text Analysis

    DTIC Science & Technology

    2016-04-15

    Mining a Text for Errors. . . . on Knowledge discovery in data mining , pages 624–628, 2005. [12] Michal Kosinski, David Stillwell, and Thore Graepel...AFRL-AFOSR-UK-TR-2016-0011 Individual Profiling using Text Analysis 140333 Mark Stevenson UNIVERSITY OF SHEFFIELD, DEPARTMENT OF PSYCHOLOGY Final...REPORT TYPE      Final 3.  DATES COVERED (From - To)      15 Sep 2014 to 14 Sep 2015 4.  TITLE AND SUBTITLE Individual Profiling using Text Analysis

  10. Mining the pharmacogenomics literature—a survey of the state of the art

    PubMed Central

    Cohen, K. Bretonnel; Garten, Yael; Shah, Nigam H.

    2012-01-01

    This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research. PMID:22833496

  11. Mining the pharmacogenomics literature--a survey of the state of the art.

    PubMed

    Hahn, Udo; Cohen, K Bretonnel; Garten, Yael; Shah, Nigam H

    2012-07-01

    This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research.

  12. Using Text Mining to Characterize Online Discussion Facilitation

    ERIC Educational Resources Information Center

    Ming, Norma; Baumer, Eric

    2011-01-01

    Facilitating class discussions effectively is a critical yet challenging component of instruction, particularly in online environments where student and faculty interaction is limited. Our goals in this research were to identify facilitation strategies that encourage productive discussion, and to explore text mining techniques that can help…

  13. Design approaches in quarrying and pit-mining reclamation

    USGS Publications Warehouse

    Arbogast, Belinda F.

    1999-01-01

    Reclaimed mine sites have been evaluated so that the public, industry, and land planners may recognize there are innovative designs available for consideration and use. People tend to see cropland, range, and road cuts as a necessary part of their everyday life, not as disturbed areas despite their high visibility. Mining also generates a disturbed landscape, unfortunately one that many consider waste until reclaimed by human beings. The development of mining provides an economic base and use of a natural resource to improve the quality of human life. Equally important is a sensitivity to the geologic origin and natural pattern of the land. Wisely shaping out environment requires a design plan and product that responds to a site's physiography, ecology, function, artistic form, and publication perception. An examination of selected sites for their landscape design suggested nine approaches for mining reclamation. The oldest design approach around is nature itself. Humans may sometimes do more damage going to an area in the attempt to repair it. Given enough geologic time, a small-site area, and stable adjacent ecosystems, disturbed areas recover without mankind's input. Visual screens and buffer zones conceal the facility in a camouflage approach. Typically, earth berms, fences, and plantings are used to disguise the mining facility. Restoration targets social or economic benefits by reusing the site for public amenities, most often in urban centers with large populations. A mitigation approach attempts to protect the environment and return mined areas to use with scientific input. The reuse of cement, building rubble, macadam meets only about 10% of the demand from aggregate. Recognizing the limited supply of mineral resources and encouraging recycling efforts are steps are steps in a renewable resource approach. An educative design approach effectively communicates mining information through outreach, land stewardship, and community service. Mine sites used for art show a celebration of beauty and experience -- abstract geology. The last design approach combines art and science in a human-nature ecosystem termed integration. With environmental concerns, an operating or reclaimed mine site can no longer be considered isolated from its surroundings. Site analysis of mine works needs to go beyond site-specific information and relate to the regional context of the greater landscape. Understanding design approach can turn undesirable features (mines and pits) into something perceived as desirable by the public.

  14. 40 CFR 372.23 - SIC and NAICS codes to which this Part applies.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... facilities primarily engaged in reproducing text, drawings, plans, maps, or other copy, by blueprinting...)); 212324Kaolin and Ball Clay Mining Limited to facilities operating without a mine or quarry and that are...)); 212393Other Chemical and Fertilizer Mineral Mining Limited to facilities operating without a mine or quarry...

  15. PubstractHelper: A Web-based Text-Mining Tool for Marking Sentences in Abstracts from PubMed Using Multiple User-Defined Keywords.

    PubMed

    Chen, Chou-Cheng; Ho, Chung-Liang

    2014-01-01

    While a huge amount of information about biological literature can be obtained by searching the PubMed database, reading through all the titles and abstracts resulting from such a search for useful information is inefficient. Text mining makes it possible to increase this efficiency. Some websites use text mining to gather information from the PubMed database; however, they are database-oriented, using pre-defined search keywords while lacking a query interface for user-defined search inputs. We present the PubMed Abstract Reading Helper (PubstractHelper) website which combines text mining and reading assistance for an efficient PubMed search. PubstractHelper can accept a maximum of ten groups of keywords, within each group containing up to ten keywords. The principle behind the text-mining function of PubstractHelper is that keywords contained in the same sentence are likely to be related. PubstractHelper highlights sentences with co-occurring keywords in different colors. The user can download the PMID and the abstracts with color markings to be reviewed later. The PubstractHelper website can help users to identify relevant publications based on the presence of related keywords, which should be a handy tool for their research. http://bio.yungyun.com.tw/ATM/PubstractHelper.aspx and http://holab.med.ncku.edu.tw/ATM/PubstractHelper.aspx.

  16. Text Classification for Organizational Researchers

    PubMed Central

    Kobayashi, Vladimer B.; Mol, Stefan T.; Berkers, Hannah A.; Kismihók, Gábor; Den Hartog, Deanne N.

    2017-01-01

    Organizations are increasingly interested in classifying texts or parts thereof into categories, as this enables more effective use of their information. Manual procedures for text classification work well for up to a few hundred documents. However, when the number of documents is larger, manual procedures become laborious, time-consuming, and potentially unreliable. Techniques from text mining facilitate the automatic assignment of text strings to categories, making classification expedient, fast, and reliable, which creates potential for its application in organizational research. The purpose of this article is to familiarize organizational researchers with text mining techniques from machine learning and statistics. We describe the text classification process in several roughly sequential steps, namely training data preparation, preprocessing, transformation, application of classification techniques, and validation, and provide concrete recommendations at each step. To help researchers develop their own text classifiers, the R code associated with each step is presented in a tutorial. The tutorial draws from our own work on job vacancy mining. We end the article by discussing how researchers can validate a text classification model and the associated output. PMID:29881249

  17. A method for integrating and ranking the evidence for biochemical pathways by mining reactions from text

    PubMed Central

    Miwa, Makoto; Ohta, Tomoko; Rak, Rafal; Rowley, Andrew; Kell, Douglas B.; Pyysalo, Sampo; Ananiadou, Sophia

    2013-01-01

    Motivation: To create, verify and maintain pathway models, curators must discover and assess knowledge distributed over the vast body of biological literature. Methods supporting these tasks must understand both the pathway model representations and the natural language in the literature. These methods should identify and order documents by relevance to any given pathway reaction. No existing system has addressed all aspects of this challenge. Method: We present novel methods for associating pathway model reactions with relevant publications. Our approach extracts the reactions directly from the models and then turns them into queries for three text mining-based MEDLINE literature search systems. These queries are executed, and the resulting documents are combined and ranked according to their relevance to the reactions of interest. We manually annotate document-reaction pairs with the relevance of the document to the reaction and use this annotation to study several ranking methods, using various heuristic and machine-learning approaches. Results: Our evaluation shows that the annotated document-reaction pairs can be used to create a rule-based document ranking system, and that machine learning can be used to rank documents by their relevance to pathway reactions. We find that a Support Vector Machine-based system outperforms several baselines and matches the performance of the rule-based system. The success of the query extraction and ranking methods are used to update our existing pathway search system, PathText. Availability: An online demonstration of PathText 2 and the annotated corpus are available for research purposes at http://www.nactem.ac.uk/pathtext2/. Contact: makoto.miwa@manchester.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23813008

  18. Graphics-based intelligent search and abstracting using Data Modeling

    NASA Astrophysics Data System (ADS)

    Jaenisch, Holger M.; Handley, James W.; Case, Carl T.; Songy, Claude G.

    2002-11-01

    This paper presents an autonomous text and context-mining algorithm that converts text documents into point clouds for visual search cues. This algorithm is applied to the task of data-mining a scriptural database comprised of the Old and New Testaments from the Bible and the Book of Mormon, Doctrine and Covenants, and the Pearl of Great Price. Results are generated which graphically show the scripture that represents the average concept of the database and the mining of the documents down to the verse level.

  19. Pathway enrichment based on text mining and its validation on carotenoid and vitamin A metabolism.

    PubMed

    Waagmeester, Andra; Pezik, Piotr; Coort, Susan; Tourniaire, Franck; Evelo, Chris; Rebholz-Schuhmann, Dietrich

    2009-10-01

    Carotenoid metabolism is relevant to the prevention of various diseases. Although the main actors in this metabolic pathway are known, our understanding of the pathway is still incomplete. The information on the carotenoids is scattered in the large and growing body of scientific literature. We designed a text-mining work flow to enrich existing pathways. It has been validated on the vitamin A pathway, which is a well-studied part of the carotenoid metabolism. In this study we used the vitamin A metabolism pathway as it has been described by an expert team on carotenoid metabolism from the European network of excellence in Nutrigenomics (NuGO). This work flow uses an initial set of publications cited in a review paper (1,191 publications), enlarges this corpus with Medline abstracts (13,579 documents), and then extracts the key terminology from all relevant publications. Domain experts validated the intermediate and final results of our text-mining work flow. With our approach we were able to enrich the pathway representing vitamin A metabolism. We found 37 new and relevant terms from a total of 89,086 terms, which have been qualified for inclusion in the analyzed pathway. These 37 terms have been assessed manually and as a result 13 new terms were then added as entities to the pathway. Another 14 entities belonged to other pathways, which could form the link of these pathways with the vitamin A pathway. The remaining 10 terms were classified as biomarkers or nutrients. Automatic literature analysis improves the enrichment of pathways with entities already described in the scientific literature.

  20. The MicroRNA Interaction Network of Lipid Diseases

    PubMed Central

    Kandhro, Abdul H.; Shoombuatong, Watshara; Nantasenamat, Chanin; Prachayasittikul, Virapong; Nuchnoi, Pornlada

    2017-01-01

    Background: Dyslipidemia is one of the major forms of lipid disorder, characterized by increased triglycerides (TGs), increased low-density lipoprotein-cholesterol (LDL-C), and decreased high-density lipoprotein-cholesterol (HDL-C) levels in blood. Recently, MicroRNAs (miRNAs) have been reported to involve in various biological processes; their potential usage being a biomarkers and in diagnosis of various diseases. Computational approaches including text mining have been used recently to analyze abstracts from the public databases to observe the relationships/associations between the biological molecules, miRNAs, and disease phenotypes. Materials and Methods: In the present study, significance of text mined extracted pair associations (miRNA-lipid disease) were estimated by one-sided Fisher's exact test. The top 20 significant miRNA-disease associations were visualized on Cytoscape. The CyTargetLinker plug-in tool on Cytoscape was used to extend the network and predicts new miRNA target genes. The Biological Networks Gene Ontology (BiNGO) plug-in tool on Cytoscape was used to retrieve gene ontology (GO) annotations for the targeted genes. Results: We retrieved 227 miRNA-lipid disease associations including 148 miRNAs. The top 20 significant miRNAs analysis on CyTargetLinker provides defined, predicted and validated gene targets, further targeted genes analyzed by BiNGO showed targeted genes were significantly associated with lipid, cholesterol, apolipoprotein, and fatty acids GO terms. Conclusion: We are the first to provide a reliable miRNA-lipid disease association network based on text mining. This could help future experimental studies that aim to validate predicted gene targets. PMID:29018475

  1. Data mining learning bootstrap through semantic thumbnail analysis

    NASA Astrophysics Data System (ADS)

    Battiato, Sebastiano; Farinella, Giovanni Maria; Giuffrida, Giovanni; Tribulato, Giuseppe

    2007-01-01

    The rapid increase of technological innovations in the mobile phone industry induces the research community to develop new and advanced systems to optimize services offered by mobile phones operators (telcos) to maximize their effectiveness and improve their business. Data mining algorithms can run over data produced by mobile phones usage (e.g. image, video, text and logs files) to discover user's preferences and predict the most likely (to be purchased) offer for each individual customer. One of the main challenges is the reduction of the learning time and cost of these automatic tasks. In this paper we discuss an experiment where a commercial offer is composed by a small picture augmented with a short text describing the offer itself. Each customer's purchase is properly logged with all relevant information. Upon arrival of new items we need to learn who the best customers (prospects) for each item are, that is, the ones most likely to be interested in purchasing that specific item. Such learning activity is time consuming and, in our specific case, is not applicable given the large number of new items arriving every day. Basically, given the current customer base we are not able to learn on all new items. Thus, we need somehow to select among those new items to identify the best candidates. We do so by using a joint analysis between visual features and text to estimate how good each new item could be, that is, whether or not is worth to learn on it. Preliminary results show the effectiveness of the proposed approach to improve classical data mining techniques.

  2. Text mining electronic hospital records to automatically classify admissions against disease: Measuring the impact of linking data sources.

    PubMed

    Kocbek, Simon; Cavedon, Lawrence; Martinez, David; Bain, Christopher; Manus, Chris Mac; Haffari, Gholamreza; Zukerman, Ingrid; Verspoor, Karin

    2016-12-01

    Text and data mining play an important role in obtaining insights from Health and Hospital Information Systems. This paper presents a text mining system for detecting admissions marked as positive for several diseases: Lung Cancer, Breast Cancer, Colon Cancer, Secondary Malignant Neoplasm of Respiratory and Digestive Organs, Multiple Myeloma and Malignant Plasma Cell Neoplasms, Pneumonia, and Pulmonary Embolism. We specifically examine the effect of linking multiple data sources on text classification performance. Support Vector Machine classifiers are built for eight data source combinations, and evaluated using the metrics of Precision, Recall and F-Score. Sub-sampling techniques are used to address unbalanced datasets of medical records. We use radiology reports as an initial data source and add other sources, such as pathology reports and patient and hospital admission data, in order to assess the research question regarding the impact of the value of multiple data sources. Statistical significance is measured using the Wilcoxon signed-rank test. A second set of experiments explores aspects of the system in greater depth, focusing on Lung Cancer. We explore the impact of feature selection; analyse the learning curve; examine the effect of restricting admissions to only those containing reports from all data sources; and examine the impact of reducing the sub-sampling. These experiments provide better understanding of how to best apply text classification in the context of imbalanced data of variable completeness. Radiology questions plus patient and hospital admission data contribute valuable information for detecting most of the diseases, significantly improving performance when added to radiology reports alone or to the combination of radiology and pathology reports. Overall, linking data sources significantly improved classification performance for all the diseases examined. However, there is no single approach that suits all scenarios; the choice of the most effective combination of data sources depends on the specific disease to be classified. Copyright © 2016 Elsevier Inc. All rights reserved.

  3. Unsupervised discovery of information structure in biomedical documents.

    PubMed

    Kiela, Douwe; Guo, Yufan; Stenius, Ulla; Korhonen, Anna

    2015-04-01

    Information structure (IS) analysis is a text mining technique, which classifies text in biomedical articles into categories that capture different types of information, such as objectives, methods, results and conclusions of research. It is a highly useful technique that can support a range of Biomedical Text Mining tasks and can help readers of biomedical literature find information of interest faster, accelerating the highly time-consuming process of literature review. Several approaches to IS analysis have been presented in the past, with promising results in real-world biomedical tasks. However, all existing approaches, even weakly supervised ones, require several hundreds of hand-annotated training sentences specific to the domain in question. Because biomedicine is subject to considerable domain variation, such annotations are expensive to obtain. This makes the application of IS analysis across biomedical domains difficult. In this article, we investigate an unsupervised approach to IS analysis and evaluate the performance of several unsupervised methods on a large corpus of biomedical abstracts collected from PubMed. Our best unsupervised algorithm (multilevel-weighted graph clustering algorithm) performs very well on the task, obtaining over 0.70 F scores for most IS categories when applied to well-known IS schemes. This level of performance is close to that of lightly supervised IS methods and has proven sufficient to aid a range of practical tasks. Thus, using an unsupervised approach, IS could be applied to support a wide range of tasks across sub-domains of biomedicine. We also demonstrate that unsupervised learning brings novel insights into IS of biomedical literature and discovers information categories that are not present in any of the existing IS schemes. The annotated corpus and software are available at http://www.cl.cam.ac.uk/∼dk427/bio14info.html. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  4. Evaluation of a rule-based method for epidemiological document classification towards the automation of systematic reviews.

    PubMed

    Karystianis, George; Thayer, Kristina; Wolfe, Mary; Tsafnat, Guy

    2017-06-01

    Most data extraction efforts in epidemiology are focused on obtaining targeted information from clinical trials. In contrast, limited research has been conducted on the identification of information from observational studies, a major source for human evidence in many fields, including environmental health. The recognition of key epidemiological information (e.g., exposures) through text mining techniques can assist in the automation of systematic reviews and other evidence summaries. We designed and applied a knowledge-driven, rule-based approach to identify targeted information (study design, participant population, exposure, outcome, confounding factors, and the country where the study was conducted) from abstracts of epidemiological studies included in several systematic reviews of environmental health exposures. The rules were based on common syntactical patterns observed in text and are thus not specific to any systematic review. To validate the general applicability of our approach, we compared the data extracted using our approach versus hand curation for 35 epidemiological study abstracts manually selected for inclusion in two systematic reviews. The returned F-score, precision, and recall ranged from 70% to 98%, 81% to 100%, and 54% to 97%, respectively. The highest precision was observed for exposure, outcome and population (100%) while recall was best for exposure and study design with 97% and 89%, respectively. The lowest recall was observed for the population (54%), which also had the lowest F-score (70%). The generated performance of our text-mining approach demonstrated encouraging results for the identification of targeted information from observational epidemiological study abstracts related to environmental exposures. We have demonstrated that rules based on generic syntactic patterns in one corpus can be applied to other observational study design by simple interchanging the dictionaries aiming to identify certain characteristics (i.e., outcomes, exposures). At the document level, the recognised information can assist in the selection and categorization of studies included in a systematic review. Copyright © 2017 Elsevier Inc. All rights reserved.

  5. Science and Technology Text Mining: Analytical Chemistry

    DTIC Science & Technology

    2001-01-01

    Raynauds Syndrome papers and having a central theme(s) "b" and sub-themes "c." One interesting discovery was that dietary eicosapentaenoic acid (theme...disease (theme "c" from literature BC). There was no mention of eicosapentaenoic acid in the Raynaud’s Syndrome literature, but the acid was linked to the...theoretical approaches and analyses. The main types of COMPOUNDS (ANALYTES) examined include, in order of emphasis, AMINO ACIDS , ORGANIC COMPOUNDS

  6. The Labour Welfare Fund Laws (Amendment) Act, 1987 (No. 15 of 1987), 22 May 1987.

    PubMed

    1987-01-01

    This Act authorizes funds constituted under the Mica Mines Labour Welfare Fund Act, 1946, the Limestone and Dolomite Mines Labour Welfare Fund Act, 1972, the Iron Ore Mines, Manganese Ore Mines and Chrome Mines Labour Welfare Fund Act, 1976, and the Beedi Workers Welfare Fund Act, 1976, to be applied for the provision of family welfare, including family planning education and services. full text

  7. Automated Text Data Mining Analysis of Five Decades of Educational Leadership Research Literature: Probabilistic Topic Modeling of "EAQ" Articles From 1965 to 2014

    ERIC Educational Resources Information Center

    Wang, Yinying; Bowers, Alex J.; Fikis, David J.

    2017-01-01

    Purpose: The purpose of this study is to describe the underlying topics and the topic evolution in the 50-year history of educational leadership research literature. Method: We used automated text data mining with probabilistic latent topic models to examine the full text of the entire publication history of all 1,539 articles published in…

  8. Using natural language processing techniques to inform research on nanotechnology.

    PubMed

    Lewinski, Nastassja A; McInnes, Bridget T

    2015-01-01

    Literature in the field of nanotechnology is exponentially increasing with more and more engineered nanomaterials being created, characterized, and tested for performance and safety. With the deluge of published data, there is a need for natural language processing approaches to semi-automate the cataloguing of engineered nanomaterials and their associated physico-chemical properties, performance, exposure scenarios, and biological effects. In this paper, we review the different informatics methods that have been applied to patent mining, nanomaterial/device characterization, nanomedicine, and environmental risk assessment. Nine natural language processing (NLP)-based tools were identified: NanoPort, NanoMapper, TechPerceptor, a Text Mining Framework, a Nanodevice Analyzer, a Clinical Trial Document Classifier, Nanotoxicity Searcher, NanoSifter, and NEIMiner. We conclude with recommendations for sharing NLP-related tools through online repositories to broaden participation in nanoinformatics.

  9. A framework for semisupervised feature generation and its applications in biomedical literature mining.

    PubMed

    Li, Yanpeng; Hu, Xiaohua; Lin, Hongfei; Yang, Zhihao

    2011-01-01

    Feature representation is essential to machine learning and text mining. In this paper, we present a feature coupling generalization (FCG) framework for generating new features from unlabeled data. It selects two special types of features, i.e., example-distinguishing features (EDFs) and class-distinguishing features (CDFs) from original feature set, and then generalizes EDFs into higher-level features based on their coupling degrees with CDFs in unlabeled data. The advantage is: EDFs with extreme sparsity in labeled data can be enriched by their co-occurrences with CDFs in unlabeled data so that the performance of these low-frequency features can be greatly boosted and new information from unlabeled can be incorporated. We apply this approach to three tasks in biomedical literature mining: gene named entity recognition (NER), protein-protein interaction extraction (PPIE), and text classification (TC) for gene ontology (GO) annotation. New features are generated from over 20 GB unlabeled PubMed abstracts. The experimental results on BioCreative 2, AIMED corpus, and TREC 2005 Genomics Track show that 1) FCG can utilize well the sparse features ignored by supervised learning. 2) It improves the performance of supervised baselines by 7.8 percent, 5.0 percent, and 5.8 percent, respectively, in the tree tasks. 3) Our methods achieve 89.1, 64.5 F-score, and 60.1 normalized utility on the three benchmark data sets.

  10. Data mining of mental health issues of non-bone marrow donor siblings.

    PubMed

    Takita, Morihito; Tanaka, Yuji; Kodama, Yuko; Murashige, Naoko; Hatanaka, Nobuyo; Kishi, Yukiko; Matsumura, Tomoko; Ohsawa, Yukio; Kami, Masahiro

    2011-07-20

    Allogenic hematopoietic stem cell transplantation is a curative treatment for patients with advanced hematologic malignancies. However, the long-term mental health issues of siblings who were not selected as donors (non-donor siblings, NDS) in the transplantation have not been well assessed. Data mining is useful in discovering new findings from a large, multidisciplinary data set and the Scenario Map analysis is a novel approach which allows extracting keywords linking different conditions/events from text data of interviews even when the keywords appeared infrequently. The aim of this study is to assess mental health issues on NDSs and to find helpful keywords for the clinical follow-up using a Scenario Map analysis. A 47-year-old woman whose younger sister had undergone allogenic hematopoietic stem cell transplantation 20 years earlier was interviewed as a NDS. The text data from the interview transcriptions was analyzed using Scenario Mapping. Four clusters of words and six keywords were identified. Upon review of the word clusters and keywords, both the subject and researchers noticed that the subject has had mental health issues since the disease onset to date with being a NDS. The issues have been alleviated by her family. This single subject study suggested the advantages of data mining in clinical follow-up for mental health issues of patients and/or their families.

  11. Uncovering text mining: A survey of current work on web-based epidemic intelligence

    PubMed Central

    Collier, Nigel

    2012-01-01

    Real world pandemics such as SARS 2002 as well as popular fiction like the movie Contagion graphically depict the health threat of a global pandemic and the key role of epidemic intelligence (EI). While EI relies heavily on established indicator sources a new class of methods based on event alerting from unstructured digital Internet media is rapidly becoming acknowledged within the public health community. At the heart of automated information gathering systems is a technology called text mining. My contribution here is to provide an overview of the role that text mining technology plays in detecting epidemics and to synthesise my existing research on the BioCaster project. PMID:22783909

  12. Finding novel relationships with integrated gene-gene association network analysis of Synechocystis sp. PCC 6803 using species-independent text-mining.

    PubMed

    Kreula, Sanna M; Kaewphan, Suwisa; Ginter, Filip; Jones, Patrik R

    2018-01-01

    The increasing move towards open access full-text scientific literature enhances our ability to utilize advanced text-mining methods to construct information-rich networks that no human will be able to grasp simply from 'reading the literature'. The utility of text-mining for well-studied species is obvious though the utility for less studied species, or those with no prior track-record at all, is not clear. Here we present a concept for how advanced text-mining can be used to create information-rich networks even for less well studied species and apply it to generate an open-access gene-gene association network resource for Synechocystis sp. PCC 6803, a representative model organism for cyanobacteria and first case-study for the methodology. By merging the text-mining network with networks generated from species-specific experimental data, network integration was used to enhance the accuracy of predicting novel interactions that are biologically relevant. A rule-based algorithm (filter) was constructed in order to automate the search for novel candidate genes with a high degree of likely association to known target genes by (1) ignoring established relationships from the existing literature, as they are already 'known', and (2) demanding multiple independent evidences for every novel and potentially relevant relationship. Using selected case studies, we demonstrate the utility of the network resource and filter to ( i ) discover novel candidate associations between different genes or proteins in the network, and ( ii ) rapidly evaluate the potential role of any one particular gene or protein. The full network is provided as an open-source resource.

  13. Motif-Based Text Mining of Microbial Metagenome Redundancy Profiling Data for Disease Classification.

    PubMed

    Wang, Yin; Li, Rudong; Zhou, Yuhua; Ling, Zongxin; Guo, Xiaokui; Xie, Lu; Liu, Lei

    2016-01-01

    Text data of 16S rRNA are informative for classifications of microbiota-associated diseases. However, the raw text data need to be systematically processed so that features for classification can be defined/extracted; moreover, the high-dimension feature spaces generated by the text data also pose an additional difficulty. Here we present a Phylogenetic Tree-Based Motif Finding algorithm (PMF) to analyze 16S rRNA text data. By integrating phylogenetic rules and other statistical indexes for classification, we can effectively reduce the dimension of the large feature spaces generated by the text datasets. Using the retrieved motifs in combination with common classification methods, we can discriminate different samples of both pneumonia and dental caries better than other existing methods. We extend the phylogenetic approaches to perform supervised learning on microbiota text data to discriminate the pathological states for pneumonia and dental caries. The results have shown that PMF may enhance the efficiency and reliability in analyzing high-dimension text data.

  14. Brain model of text animation as a data mining strategy.

    PubMed

    Astakhova, Tamara; Astakhov, Vadim

    2009-01-01

    Imagination is the critical point in developing of realistic intelligence (AI) systems. One way to approach imagination would be simulation of its properties and operations. We developed two models "Brain Network Hierarchy of Languages," and "Semantical Holographic Calculus" and simulation system ScriptWriter that emulate the process of imagination through an automatic animation of English texts. The purpose of this paper is to demonstrate the model and present "ScriptWriter" system http://nvo.sdsc.edu/NVO/JCSG/get_SRB_mime_file2.cgi//home/tamara.sdsc/test/demo.zip?F=/home/tamara.sdsc/test/demo.zip&M=application/x-gtar for simulation of the imagination.

  15. 30 CFR 900.2 - Objectives.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... texts of State and Federal cooperative agreements for regulation of mining on Federal lands. The... Resources OFFICE OF SURFACE MINING RECLAMATION AND ENFORCEMENT, DEPARTMENT OF THE INTERIOR PROGRAMS FOR THE CONDUCT OF SURFACE MINING OPERATIONS WITHIN EACH STATE INTRODUCTION § 900.2 Objectives. The objective of...

  16. 76 FR 40649 - Indiana Regulatory Program

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-07-11

    ... at 312 IAC 25-6-30 Surface mining; explosives; general requirements. The full text of the program... DEPARTMENT OF THE INTERIOR Office of Surface Mining Reclamation and Enforcement 30 CFR Part 914... Mining Reclamation and Enforcement, Interior. ACTION: Proposed rule; public comment period on proposed...

  17. Complementing the Numbers: A Text Mining Analysis of College Course Withdrawals

    ERIC Educational Resources Information Center

    Michalski, Greg V.

    2011-01-01

    Excessive college course withdrawals are costly to the student and the institution in terms of time to degree completion, available classroom space, and other resources. Although generally well quantified, detailed analysis of the reasons given by students for course withdrawal is less common. To address this, a text mining analysis was performed…

  18. Adverse Event extraction from Structured Product Labels using the Event-based Text-mining of Health Electronic Records (ETHER)system.

    PubMed

    Pandey, Abhishek; Kreimeyer, Kory; Foster, Matthew; Botsis, Taxiarchis; Dang, Oanh; Ly, Thomas; Wang, Wei; Forshee, Richard

    2018-01-01

    Structured Product Labels follow an XML-based document markup standard approved by the Health Level Seven organization and adopted by the US Food and Drug Administration as a mechanism for exchanging medical products information. Their current organization makes their secondary use rather challenging. We used the Side Effect Resource database and DailyMed to generate a comparison dataset of 1159 Structured Product Labels. We processed the Adverse Reaction section of these Structured Product Labels with the Event-based Text-mining of Health Electronic Records system and evaluated its ability to extract and encode Adverse Event terms to Medical Dictionary for Regulatory Activities Preferred Terms. A small sample of 100 labels was then selected for further analysis. Of the 100 labels, Event-based Text-mining of Health Electronic Records achieved a precision and recall of 81 percent and 92 percent, respectively. This study demonstrated Event-based Text-mining of Health Electronic Record's ability to extract and encode Adverse Event terms from Structured Product Labels which may potentially support multiple pharmacoepidemiological tasks.

  19. A Framework for Text Mining in Scientometric Study: A Case Study in Biomedicine Publications

    NASA Astrophysics Data System (ADS)

    Silalahi, V. M. M.; Hardiyati, R.; Nadhiroh, I. M.; Handayani, T.; Rahmaida, R.; Amelia, M.

    2018-04-01

    The data of Indonesians research publications in the domain of biomedicine has been collected to be text mined for the purpose of a scientometric study. The goal is to build a predictive model that provides a classification of research publications on the potency for downstreaming. The model is based on the drug development processes adapted from the literatures. An effort is described to build the conceptual model and the development of a corpus on the research publications in the domain of Indonesian biomedicine. Then an investigation is conducted relating to the problems associated with building a corpus and validating the model. Based on our experience, a framework is proposed to manage the scientometric study based on text mining. Our method shows the effectiveness of conducting a scientometric study based on text mining in order to get a valid classification model. This valid model is mainly supported by the iterative and close interactions with the domain experts starting from identifying the issues, building a conceptual model, to the labelling, validation and results interpretation.

  20. Data Processing and Text Mining Technologies on Electronic Medical Records: A Review

    PubMed Central

    Sun, Wencheng; Li, Yangyang; Liu, Fang; Fang, Shengqun; Wang, Guoyan

    2018-01-01

    Currently, medical institutes generally use EMR to record patient's condition, including diagnostic information, procedures performed, and treatment results. EMR has been recognized as a valuable resource for large-scale analysis. However, EMR has the characteristics of diversity, incompleteness, redundancy, and privacy, which make it difficult to carry out data mining and analysis directly. Therefore, it is necessary to preprocess the source data in order to improve data quality and improve the data mining results. Different types of data require different processing technologies. Most structured data commonly needs classic preprocessing technologies, including data cleansing, data integration, data transformation, and data reduction. For semistructured or unstructured data, such as medical text, containing more health information, it requires more complex and challenging processing methods. The task of information extraction for medical texts mainly includes NER (named-entity recognition) and RE (relation extraction). This paper focuses on the process of EMR processing and emphatically analyzes the key techniques. In addition, we make an in-depth study on the applications developed based on text mining together with the open challenges and research issues for future work. PMID:29849998

  1. An Approach to Realizing Process Control for Underground Mining Operations of Mobile Machines

    PubMed Central

    Song, Zhen; Schunnesson, Håkan; Rinne, Mikael; Sturgul, John

    2015-01-01

    The excavation and production in underground mines are complicated processes which consist of many different operations. The process of underground mining is considerably constrained by the geometry and geology of the mine. The various mining operations are normally performed in series at each working face. The delay of a single operation will lead to a domino effect, thus delay the starting time for the next process and the completion time of the entire process. This paper presents a new approach to the process control for underground mining operations, e.g. drilling, bolting, mucking. This approach can estimate the working time and its probability for each operation more efficiently and objectively by improving the existing PERT (Program Evaluation and Review Technique) and CPM (Critical Path Method). If the delay of the critical operation (which is on a critical path) inevitably affects the productivity of mined ore, the approach can rapidly assign mucking machines new jobs to increase this amount at a maximum level by using a new mucking algorithm under external constraints. PMID:26062092

  2. An Approach to Realizing Process Control for Underground Mining Operations of Mobile Machines.

    PubMed

    Song, Zhen; Schunnesson, Håkan; Rinne, Mikael; Sturgul, John

    2015-01-01

    The excavation and production in underground mines are complicated processes which consist of many different operations. The process of underground mining is considerably constrained by the geometry and geology of the mine. The various mining operations are normally performed in series at each working face. The delay of a single operation will lead to a domino effect, thus delay the starting time for the next process and the completion time of the entire process. This paper presents a new approach to the process control for underground mining operations, e.g. drilling, bolting, mucking. This approach can estimate the working time and its probability for each operation more efficiently and objectively by improving the existing PERT (Program Evaluation and Review Technique) and CPM (Critical Path Method). If the delay of the critical operation (which is on a critical path) inevitably affects the productivity of mined ore, the approach can rapidly assign mucking machines new jobs to increase this amount at a maximum level by using a new mucking algorithm under external constraints.

  3. A random set scoring model for prioritization of disease candidate genes using protein complexes and data-mining of GeneRIF, OMIM and PubMed records.

    PubMed

    Jiang, Li; Edwards, Stefan M; Thomsen, Bo; Workman, Christopher T; Guldbrandtsen, Bernt; Sørensen, Peter

    2014-09-24

    Prioritizing genetic variants is a challenge because disease susceptibility loci are often located in genes of unknown function or the relationship with the corresponding phenotype is unclear. A global data-mining exercise on the biomedical literature can establish the phenotypic profile of genes with respect to their connection to disease phenotypes. The importance of protein-protein interaction networks in the genetic heterogeneity of common diseases or complex traits is becoming increasingly recognized. Thus, the development of a network-based approach combined with phenotypic profiling would be useful for disease gene prioritization. We developed a random-set scoring model and implemented it to quantify phenotype relevance in a network-based disease gene-prioritization approach. We validated our approach based on different gene phenotypic profiles, which were generated from PubMed abstracts, OMIM, and GeneRIF records. We also investigated the validity of several vocabulary filters and different likelihood thresholds for predicted protein-protein interactions in terms of their effect on the network-based gene-prioritization approach, which relies on text-mining of the phenotype data. Our method demonstrated good precision and sensitivity compared with those of two alternative complex-based prioritization approaches. We then conducted a global ranking of all human genes according to their relevance to a range of human diseases. The resulting accurate ranking of known causal genes supported the reliability of our approach. Moreover, these data suggest many promising novel candidate genes for human disorders that have a complex mode of inheritance. We have implemented and validated a network-based approach to prioritize genes for human diseases based on their phenotypic profile. We have devised a powerful and transparent tool to identify and rank candidate genes. Our global gene prioritization provides a unique resource for the biological interpretation of data from genome-wide association studies, and will help in the understanding of how the associated genetic variants influence disease or quantitative phenotypes.

  4. Screening Electronic Health Record-Related Patient Safety Reports Using Machine Learning.

    PubMed

    Marella, William M; Sparnon, Erin; Finley, Edward

    2017-03-01

    The objective of this study was to develop a semiautomated approach to screening cases that describe hazards associated with the electronic health record (EHR) from a mandatory, population-based patient safety reporting system. Potentially relevant cases were identified through a query of the Pennsylvania Patient Safety Reporting System. A random sample of cases were manually screened for relevance and divided into training, testing, and validation data sets to develop a machine learning model. This model was used to automate screening of remaining potentially relevant cases. Of the 4 algorithms tested, a naive Bayes kernel performed best, with an area under the receiver operating characteristic curve of 0.927 ± 0.023, accuracy of 0.855 ± 0.033, and F score of 0.877 ± 0.027. The machine learning model and text mining approach described here are useful tools for identifying and analyzing adverse event and near-miss reports. Although reporting systems are beginning to incorporate structured fields on health information technology and the EHR, these methods can identify related events that reporters classify in other ways. These methods can facilitate analysis of legacy safety reports by retrieving health information technology-related and EHR-related events from databases without fields and controlled values focused on this subject and distinguishing them from reports in which the EHR is mentioned only in passing. Machine learning and text mining are useful additions to the patient safety toolkit and can be used to semiautomate screening and analysis of unstructured text in safety reports from frontline staff.

  5. Analysis of the human diseasome using phenotype similarity between common, genetic, and infectious diseases

    NASA Astrophysics Data System (ADS)

    Hoehndorf, Robert; Schofield, Paul N.; Gkoutos, Georgios V.

    2015-06-01

    Phenotypes are the observable characteristics of an organism arising from its response to the environment. Phenotypes associated with engineered and natural genetic variation are widely recorded using phenotype ontologies in model organisms, as are signs and symptoms of human Mendelian diseases in databases such as OMIM and Orphanet. Exploiting these resources, several computational methods have been developed for integration and analysis of phenotype data to identify the genetic etiology of diseases or suggest plausible interventions. A similar resource would be highly useful not only for rare and Mendelian diseases, but also for common, complex and infectious diseases. We apply a semantic text-mining approach to identify the phenotypes (signs and symptoms) associated with over 6,000 diseases. We evaluate our text-mined phenotypes by demonstrating that they can correctly identify known disease-associated genes in mice and humans with high accuracy. Using a phenotypic similarity measure, we generate a human disease network in which diseases that have similar signs and symptoms cluster together, and we use this network to identify closely related diseases based on common etiological, anatomical as well as physiological underpinnings.

  6. Text mining and natural language processing approaches for automatic categorization of lay requests to web-based expert forums.

    PubMed

    Himmel, Wolfgang; Reincke, Ulrich; Michelmann, Hans Wilhelm

    2009-07-22

    Both healthy and sick people increasingly use electronic media to obtain medical information and advice. For example, Internet users may send requests to Web-based expert forums, or so-called "ask the doctor" services. To automatically classify lay requests to an Internet medical expert forum using a combination of different text-mining strategies. We first manually classified a sample of 988 requests directed to a involuntary childlessness forum on the German website "Rund ums Baby" ("Everything about Babies") into one or more of 38 categories belonging to two dimensions ("subject matter" and "expectations"). After creating start and synonym lists, we calculated the average Cramer's V statistic for the association of each word with each category. We also used principle component analysis and singular value decomposition as further text-mining strategies. With these measures we trained regression models and determined, on the basis of best regression models, for any request the probability of belonging to each of the 38 different categories, with a cutoff of 50%. Recall and precision of a test sample were calculated as a measure of quality for the automatic classification. According to the manual classification of 988 documents, 102 (10%) documents fell into the category "in vitro fertilization (IVF)," 81 (8%) into the category "ovulation," 79 (8%) into "cycle," and 57 (6%) into "semen analysis." These were the four most frequent categories in the subject matter dimension (consisting of 32 categories). The expectation dimension comprised six categories; we classified 533 documents (54%) as "general information" and 351 (36%) as a wish for "treatment recommendations." The generation of indicator variables based on the chi-square analysis and Cramer's V proved to be the best approach for automatic classification in about half of the categories. In combination with the two other approaches, 100% precision and 100% recall were realized in 18 (47%) out of the 38 categories in the test sample. For 35 (92%) categories, precision and recall were better than 80%. For some categories, the input variables (ie, "words") also included variables from other categories, most often with a negative sign. For example, absence of words predictive for "menstruation" was a strong indicator for the category "pregnancy test." Our approach suggests a way of automatically classifying and analyzing unstructured information in Internet expert forums. The technique can perform a preliminary categorization of new requests and help Internet medical experts to better handle the mass of information and to give professional feedback.

  7. An overview of the BioCreative 2012 Workshop Track III: interactive text mining task

    PubMed Central

    Arighi, Cecilia N.; Carterette, Ben; Cohen, K. Bretonnel; Krallinger, Martin; Wilbur, W. John; Fey, Petra; Dodson, Robert; Cooper, Laurel; Van Slyke, Ceri E.; Dahdul, Wasila; Mabee, Paula; Li, Donghui; Harris, Bethany; Gillespie, Marc; Jimenez, Silvia; Roberts, Phoebe; Matthews, Lisa; Becker, Kevin; Drabkin, Harold; Bello, Susan; Licata, Luana; Chatr-aryamontri, Andrew; Schaeffer, Mary L.; Park, Julie; Haendel, Melissa; Van Auken, Kimberly; Li, Yuling; Chan, Juancarlos; Muller, Hans-Michael; Cui, Hong; Balhoff, James P.; Chi-Yang Wu, Johnny; Lu, Zhiyong; Wei, Chih-Hsuan; Tudor, Catalina O.; Raja, Kalpana; Subramani, Suresh; Natarajan, Jeyakumar; Cejuela, Juan Miguel; Dubey, Pratibha; Wu, Cathy

    2013-01-01

    In many databases, biocuration primarily involves literature curation, which usually involves retrieving relevant articles, extracting information that will translate into annotations and identifying new incoming literature. As the volume of biological literature increases, the use of text mining to assist in biocuration becomes increasingly relevant. A number of groups have developed tools for text mining from a computer science/linguistics perspective, and there are many initiatives to curate some aspect of biology from the literature. Some biocuration efforts already make use of a text mining tool, but there have not been many broad-based systematic efforts to study which aspects of a text mining tool contribute to its usefulness for a curation task. Here, we report on an effort to bring together text mining tool developers and database biocurators to test the utility and usability of tools. Six text mining systems presenting diverse biocuration tasks participated in a formal evaluation, and appropriate biocurators were recruited for testing. The performance results from this evaluation indicate that some of the systems were able to improve efficiency of curation by speeding up the curation task significantly (∼1.7- to 2.5-fold) over manual curation. In addition, some of the systems were able to improve annotation accuracy when compared with the performance on the manually curated set. In terms of inter-annotator agreement, the factors that contributed to significant differences for some of the systems included the expertise of the biocurator on the given curation task, the inherent difficulty of the curation and attention to annotation guidelines. After the task, annotators were asked to complete a survey to help identify strengths and weaknesses of the various systems. The analysis of this survey highlights how important task completion is to the biocurators’ overall experience of a system, regardless of the system’s high score on design, learnability and usability. In addition, strategies to refine the annotation guidelines and systems documentation, to adapt the tools to the needs and query types the end user might have and to evaluate performance in terms of efficiency, user interface, result export and traditional evaluation metrics have been analyzed during this task. This analysis will help to plan for a more intense study in BioCreative IV. PMID:23327936

  8. An overview of the BioCreative 2012 Workshop Track III: interactive text mining task.

    PubMed

    Arighi, Cecilia N; Carterette, Ben; Cohen, K Bretonnel; Krallinger, Martin; Wilbur, W John; Fey, Petra; Dodson, Robert; Cooper, Laurel; Van Slyke, Ceri E; Dahdul, Wasila; Mabee, Paula; Li, Donghui; Harris, Bethany; Gillespie, Marc; Jimenez, Silvia; Roberts, Phoebe; Matthews, Lisa; Becker, Kevin; Drabkin, Harold; Bello, Susan; Licata, Luana; Chatr-aryamontri, Andrew; Schaeffer, Mary L; Park, Julie; Haendel, Melissa; Van Auken, Kimberly; Li, Yuling; Chan, Juancarlos; Muller, Hans-Michael; Cui, Hong; Balhoff, James P; Chi-Yang Wu, Johnny; Lu, Zhiyong; Wei, Chih-Hsuan; Tudor, Catalina O; Raja, Kalpana; Subramani, Suresh; Natarajan, Jeyakumar; Cejuela, Juan Miguel; Dubey, Pratibha; Wu, Cathy

    2013-01-01

    In many databases, biocuration primarily involves literature curation, which usually involves retrieving relevant articles, extracting information that will translate into annotations and identifying new incoming literature. As the volume of biological literature increases, the use of text mining to assist in biocuration becomes increasingly relevant. A number of groups have developed tools for text mining from a computer science/linguistics perspective, and there are many initiatives to curate some aspect of biology from the literature. Some biocuration efforts already make use of a text mining tool, but there have not been many broad-based systematic efforts to study which aspects of a text mining tool contribute to its usefulness for a curation task. Here, we report on an effort to bring together text mining tool developers and database biocurators to test the utility and usability of tools. Six text mining systems presenting diverse biocuration tasks participated in a formal evaluation, and appropriate biocurators were recruited for testing. The performance results from this evaluation indicate that some of the systems were able to improve efficiency of curation by speeding up the curation task significantly (∼1.7- to 2.5-fold) over manual curation. In addition, some of the systems were able to improve annotation accuracy when compared with the performance on the manually curated set. In terms of inter-annotator agreement, the factors that contributed to significant differences for some of the systems included the expertise of the biocurator on the given curation task, the inherent difficulty of the curation and attention to annotation guidelines. After the task, annotators were asked to complete a survey to help identify strengths and weaknesses of the various systems. The analysis of this survey highlights how important task completion is to the biocurators' overall experience of a system, regardless of the system's high score on design, learnability and usability. In addition, strategies to refine the annotation guidelines and systems documentation, to adapt the tools to the needs and query types the end user might have and to evaluate performance in terms of efficiency, user interface, result export and traditional evaluation metrics have been analyzed during this task. This analysis will help to plan for a more intense study in BioCreative IV.

  9. 76 FR 12849 - Kentucky Regulatory Program

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-03-09

    ... (underground mining). The text of the Kentucky regulations can be found in the administrative record and online... DEPARTMENT OF THE INTERIOR Office of Surface Mining Reclamation and Enforcement 30 CFR Part 917 [KY-252-FOR; OSM-2009-0011] Kentucky Regulatory Program AGENCY: Office of Surface Mining Reclamation...

  10. Research on Classification of Chinese Text Data Based on SVM

    NASA Astrophysics Data System (ADS)

    Lin, Yuan; Yu, Hongzhi; Wan, Fucheng; Xu, Tao

    2017-09-01

    Data Mining has important application value in today’s industry and academia. Text classification is a very important technology in data mining. At present, there are many mature algorithms for text classification. KNN, NB, AB, SVM, decision tree and other classification methods all show good classification performance. Support Vector Machine’ (SVM) classification method is a good classifier in machine learning research. This paper will study the classification effect based on the SVM method in the Chinese text data, and use the support vector machine method in the chinese text to achieve the classify chinese text, and to able to combination of academia and practical application.

  11. StemTextSearch: Stem cell gene database with evidence from abstracts.

    PubMed

    Chen, Chou-Cheng; Ho, Chung-Liang

    2017-05-01

    Previous studies have used many methods to find biomarkers in stem cells, including text mining, experimental data and image storage. However, no text-mining methods have yet been developed which can identify whether a gene plays a positive or negative role in stem cells. StemTextSearch identifies the role of a gene in stem cells by using a text-mining method to find combinations of gene regulation, stem-cell regulation and cell processes in the same sentences of biomedical abstracts. The dataset includes 5797 genes, with 1534 genes having positive roles in stem cells, 1335 genes having negative roles, 1654 genes with both positive and negative roles, and 1274 with an uncertain role. The precision of gene role in StemTextSearch is 0.66, and the recall is 0.78. StemTextSearch is a web-based engine with queries that specify (i) gene, (ii) category of stem cell, (iii) gene role, (iv) gene regulation, (v) cell process, (vi) stem-cell regulation, and (vii) species. StemTextSearch is available through http://bio.yungyun.com.tw/StemTextSearch.aspx. Copyright © 2017. Published by Elsevier Inc.

  12. The structural and content aspects of abstracts versus bodies of full text journal articles are different

    PubMed Central

    2010-01-01

    Background An increase in work on the full text of journal articles and the growth of PubMedCentral have the opportunity to create a major paradigm shift in how biomedical text mining is done. However, until now there has been no comprehensive characterization of how the bodies of full text journal articles differ from the abstracts that until now have been the subject of most biomedical text mining research. Results We examined the structural and linguistic aspects of abstracts and bodies of full text articles, the performance of text mining tools on both, and the distribution of a variety of semantic classes of named entities between them. We found marked structural differences, with longer sentences in the article bodies and much heavier use of parenthesized material in the bodies than in the abstracts. We found content differences with respect to linguistic features. Three out of four of the linguistic features that we examined were statistically significantly differently distributed between the two genres. We also found content differences with respect to the distribution of semantic features. There were significantly different densities per thousand words for three out of four semantic classes, and clear differences in the extent to which they appeared in the two genres. With respect to the performance of text mining tools, we found that a mutation finder performed equally well in both genres, but that a wide variety of gene mention systems performed much worse on article bodies than they did on abstracts. POS tagging was also more accurate in abstracts than in article bodies. Conclusions Aspects of structure and content differ markedly between article abstracts and article bodies. A number of these differences may pose problems as the text mining field moves more into the area of processing full-text articles. However, these differences also present a number of opportunities for the extraction of data types, particularly that found in parenthesized text, that is present in article bodies but not in article abstracts. PMID:20920264

  13. Text-mining analysis of mHealth research.

    PubMed

    Ozaydin, Bunyamin; Zengul, Ferhat; Oner, Nurettin; Delen, Dursun

    2017-01-01

    In recent years, because of the advancements in communication and networking technologies, mobile technologies have been developing at an unprecedented rate. mHealth, the use of mobile technologies in medicine, and the related research has also surged parallel to these technological advancements. Although there have been several attempts to review mHealth research through manual processes such as systematic reviews, the sheer magnitude of the number of studies published in recent years makes this task very challenging. The most recent developments in machine learning and text mining offer some potential solutions to address this challenge by allowing analyses of large volumes of texts through semi-automated processes. The objective of this study is to analyze the evolution of mHealth research by utilizing text-mining and natural language processing (NLP) analyses. The study sample included abstracts of 5,644 mHealth research articles, which were gathered from five academic search engines by using search terms such as mobile health, and mHealth. The analysis used the Text Explorer module of JMP Pro 13 and an iterative semi-automated process involving tokenizing, phrasing, and terming. After developing the document term matrix (DTM) analyses such as single value decomposition (SVD), topic, and hierarchical document clustering were performed, along with the topic-informed document clustering approach. The results were presented in the form of word-clouds and trend analyses. There were several major findings regarding research clusters and trends. First, our results confirmed time-dependent nature of terminology use in mHealth research. For example, in earlier versus recent years the use of terminology changed from "mobile phone" to "smartphone" and from "applications" to "apps". Second, ten clusters for mHealth research were identified including (I) Clinical Research on Lifestyle Management, (II) Community Health, (III) Literature Review, (IV) Medical Interventions, (V) Research Design, (VI) Infrastructure, (VII) Applications, (VIII) Research and Innovation in Health Technologies, (IX) Sensor-based Devices and Measurement Algorithms, (X) Survey-based Research. Third, the trend analyses indicated the infrastructure cluster as the highest percentage researched area until 2014. The Research and Innovation in Health Technologies cluster experienced the largest increase in numbers of publications in recent years, especially after 2014. This study is unique because it is the only known study utilizing text-mining analyses to reveal the streams and trends for mHealth research. The fast growth in mobile technologies is expected to lead to higher numbers of studies focusing on mHealth and its implications for various healthcare outcomes. Findings of this study can be utilized by researchers in identifying areas for future studies.

  14. Text-mining analysis of mHealth research

    PubMed Central

    Zengul, Ferhat; Oner, Nurettin; Delen, Dursun

    2017-01-01

    In recent years, because of the advancements in communication and networking technologies, mobile technologies have been developing at an unprecedented rate. mHealth, the use of mobile technologies in medicine, and the related research has also surged parallel to these technological advancements. Although there have been several attempts to review mHealth research through manual processes such as systematic reviews, the sheer magnitude of the number of studies published in recent years makes this task very challenging. The most recent developments in machine learning and text mining offer some potential solutions to address this challenge by allowing analyses of large volumes of texts through semi-automated processes. The objective of this study is to analyze the evolution of mHealth research by utilizing text-mining and natural language processing (NLP) analyses. The study sample included abstracts of 5,644 mHealth research articles, which were gathered from five academic search engines by using search terms such as mobile health, and mHealth. The analysis used the Text Explorer module of JMP Pro 13 and an iterative semi-automated process involving tokenizing, phrasing, and terming. After developing the document term matrix (DTM) analyses such as single value decomposition (SVD), topic, and hierarchical document clustering were performed, along with the topic-informed document clustering approach. The results were presented in the form of word-clouds and trend analyses. There were several major findings regarding research clusters and trends. First, our results confirmed time-dependent nature of terminology use in mHealth research. For example, in earlier versus recent years the use of terminology changed from “mobile phone” to “smartphone” and from “applications” to “apps”. Second, ten clusters for mHealth research were identified including (I) Clinical Research on Lifestyle Management, (II) Community Health, (III) Literature Review, (IV) Medical Interventions, (V) Research Design, (VI) Infrastructure, (VII) Applications, (VIII) Research and Innovation in Health Technologies, (IX) Sensor-based Devices and Measurement Algorithms, (X) Survey-based Research. Third, the trend analyses indicated the infrastructure cluster as the highest percentage researched area until 2014. The Research and Innovation in Health Technologies cluster experienced the largest increase in numbers of publications in recent years, especially after 2014. This study is unique because it is the only known study utilizing text-mining analyses to reveal the streams and trends for mHealth research. The fast growth in mobile technologies is expected to lead to higher numbers of studies focusing on mHealth and its implications for various healthcare outcomes. Findings of this study can be utilized by researchers in identifying areas for future studies. PMID:29430456

  15. National Conference on Mining-Influenced Waters: Approaches for Characterization, Source Control and Treatment

    EPA Science Inventory

    The conference goal was to provide a forum for the exchange of scientific information on current and emerging approaches to assessing characterization, monitoring, source control, treatment and/or remediation on mining-influenced waters. The conference was aimed at mining remedi...

  16. Matisse: A Visual Analytics System for Exploring Emotion Trends in Social Media Text Streams

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Steed, Chad A; Drouhard, Margaret MEG G; Beaver, Justin M

    Dynamically mining textual information streams to gain real-time situational awareness is especially challenging with social media systems where throughput and velocity properties push the limits of a static analytical approach. In this paper, we describe an interactive visual analytics system, called Matisse, that aids with the discovery and investigation of trends in streaming text. Matisse addresses the challenges inherent to text stream mining through the following technical contributions: (1) robust stream data management, (2) automated sentiment/emotion analytics, (3) interactive coordinated visualizations, and (4) a flexible drill-down interaction scheme that accesses multiple levels of detail. In addition to positive/negative sentiment prediction,more » Matisse provides fine-grained emotion classification based on Valence, Arousal, and Dominance dimensions and a novel machine learning process. Information from the sentiment/emotion analytics are fused with raw data and summary information to feed temporal, geospatial, term frequency, and scatterplot visualizations using a multi-scale, coordinated interaction model. After describing these techniques, we conclude with a practical case study focused on analyzing the Twitter sample stream during the week of the 2013 Boston Marathon bombings. The case study demonstrates the effectiveness of Matisse at providing guided situational awareness of significant trends in social media streams by orchestrating computational power and human cognition.« less

  17. Biblio-MetReS for user-friendly mining of genes and biological processes in scientific documents.

    PubMed

    Usie, Anabel; Karathia, Hiren; Teixidó, Ivan; Alves, Rui; Solsona, Francesc

    2014-01-01

    One way to initiate the reconstruction of molecular circuits is by using automated text-mining techniques. Developing more efficient methods for such reconstruction is a topic of active research, and those methods are typically included by bioinformaticians in pipelines used to mine and curate large literature datasets. Nevertheless, experimental biologists have a limited number of available user-friendly tools that use text-mining for network reconstruction and require no programming skills to use. One of these tools is Biblio-MetReS. Originally, this tool permitted an on-the-fly analysis of documents contained in a number of web-based literature databases to identify co-occurrence of proteins/genes. This approach ensured results that were always up-to-date with the latest live version of the databases. However, this 'up-to-dateness' came at the cost of large execution times. Here we report an evolution of the application Biblio-MetReS that permits constructing co-occurrence networks for genes, GO processes, Pathways, or any combination of the three types of entities and graphically represent those entities. We show that the performance of Biblio-MetReS in identifying gene co-occurrence is as least as good as that of other comparable applications (STRING and iHOP). In addition, we also show that the identification of GO processes is on par to that reported in the latest BioCreAtIvE challenge. Finally, we also report the implementation of a new strategy that combines on-the-fly analysis of new documents with preprocessed information from documents that were encountered in previous analyses. This combination simultaneously decreases program run time and maintains 'up-to-dateness' of the results. http://metres.udl.cat/index.php/downloads, metres.cmb@gmail.com.

  18. A New Pivoting and Iterative Text Detection Algorithm for Biomedical Images

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Xu, Songhua; Krauthammer, Prof. Michael

    2010-01-01

    There is interest to expand the reach of literature mining to include the analysis of biomedical images, which often contain a paper's key findings. Examples include recent studies that use Optical Character Recognition (OCR) to extract image text, which is used to boost biomedical image retrieval and classification. Such studies rely on the robust identification of text elements in biomedical images, which is a non-trivial task. In this work, we introduce a new text detection algorithm for biomedical images based on iterative projection histograms. We study the effectiveness of our algorithm by evaluating the performance on a set of manuallymore » labeled random biomedical images, and compare the performance against other state-of-the-art text detection algorithms. We demonstrate that our projection histogram-based text detection approach is well suited for text detection in biomedical images, and that the iterative application of the algorithm boosts performance to an F score of .60. We provide a C++ implementation of our algorithm freely available for academic use.« less

  19. Implementation of a Flexible Tool for Automated Literature-Mining and Knowledgebase Development (DevToxMine)

    EPA Science Inventory

    Deriving novel relationships from the scientific literature is an important adjunct to datamining activities for complex datasets in genomics and high-throughput screening activities. Automated text-mining algorithms can be used to extract relevant content from the literature and...

  20. Examining Mobile Learning Trends 2003-2008: A Categorical Meta-Trend Analysis Using Text Mining Techniques

    ERIC Educational Resources Information Center

    Hung, Jui-Long; Zhang, Ke

    2012-01-01

    This study investigated the longitudinal trends of academic articles in Mobile Learning (ML) using text mining techniques. One hundred and nineteen (119) refereed journal articles and proceedings papers from the SCI/SSCI database were retrieved and analyzed. The taxonomies of ML publications were grouped into twelve clusters (topics) and four…

  1. Trends of E-Learning Research from 2000 to 2008: Use of Text Mining and Bibliometrics

    ERIC Educational Resources Information Center

    Hung, Jui-long

    2012-01-01

    This study investigated the longitudinal trends of e-learning research using text mining techniques. Six hundred and eighty-nine (689) refereed journal articles and proceedings were retrieved from the Science Citation Index/Social Science Citation Index database in the period from 2000 to 2008. All e-learning publications were grouped into two…

  2. Finding novel relationships with integrated gene-gene association network analysis of Synechocystis sp. PCC 6803 using species-independent text-mining

    PubMed Central

    Kreula, Sanna M.; Kaewphan, Suwisa; Ginter, Filip

    2018-01-01

    The increasing move towards open access full-text scientific literature enhances our ability to utilize advanced text-mining methods to construct information-rich networks that no human will be able to grasp simply from ‘reading the literature’. The utility of text-mining for well-studied species is obvious though the utility for less studied species, or those with no prior track-record at all, is not clear. Here we present a concept for how advanced text-mining can be used to create information-rich networks even for less well studied species and apply it to generate an open-access gene-gene association network resource for Synechocystis sp. PCC 6803, a representative model organism for cyanobacteria and first case-study for the methodology. By merging the text-mining network with networks generated from species-specific experimental data, network integration was used to enhance the accuracy of predicting novel interactions that are biologically relevant. A rule-based algorithm (filter) was constructed in order to automate the search for novel candidate genes with a high degree of likely association to known target genes by (1) ignoring established relationships from the existing literature, as they are already ‘known’, and (2) demanding multiple independent evidences for every novel and potentially relevant relationship. Using selected case studies, we demonstrate the utility of the network resource and filter to (i) discover novel candidate associations between different genes or proteins in the network, and (ii) rapidly evaluate the potential role of any one particular gene or protein. The full network is provided as an open-source resource. PMID:29844966

  3. Selecting Proper Plant Species for Mine Reclamation Using Fuzzy AHP Approach (Case Study: Chadormaloo Iron Mine of Iran)

    NASA Astrophysics Data System (ADS)

    Ebrahimabadi, Arash

    2016-12-01

    This paper describes an effective approach to select suitable plant species for reclamation of mined lands in Chadormaloo iron mine which is located in central part of Iran, near the city of Bafgh in Yazd province. After mine's total reserves are excavated, the mine requires to be permanently closed and reclaimed. Mine reclamation and post-mining land-use are the main issues in the phase of mine closure. In general, among various scenarios for mine reclamation process, i.e. planting, agriculture, forestry, residency, tourist attraction, etc., planting is the oldest and commonly-used technology for the reclamation of lands damaged by mining activities. Planting and vegetation play a major role in restoring productivity, ecosystem stability and biological diversity to degraded areas, therefore the main goal of this research work is to choose proper and suitable plants compatible with the conditions of Chadormaloo mined area, providing consistent conditions for future use. To ensure the sustainability of the reclaimed landscape, the most suitable plant species adapted to the mine conditions are selected. Plant species selection is a Multi Criteria Decision Making (MCDM) problem. In this paper, a fuzzy MCDM technique, namely Fuzzy Analytic Hierarchy Process (FAHP) is developed to assist chadormaloo iron mine managers and designers in the process of plant type selection for reclamation of the mine under fuzzy environment where the vagueness and uncertainty are taken into account with linguistic variables parameterized by triangular fuzzy numbers. The results achieved from using FAHP approach demonstrate that the most proper plant species are ranked as Artemisia sieberi, Salsola yazdiana, Halophytes types, and Zygophyllum, respectively for reclamation of Chadormaloo iron mine.

  4. Discriminative and informative features for biomolecular text mining with ensemble feature selection.

    PubMed

    Van Landeghem, Sofie; Abeel, Thomas; Saeys, Yvan; Van de Peer, Yves

    2010-09-15

    In the field of biomolecular text mining, black box behavior of machine learning systems currently limits understanding of the true nature of the predictions. However, feature selection (FS) is capable of identifying the most relevant features in any supervised learning setting, providing insight into the specific properties of the classification algorithm. This allows us to build more accurate classifiers while at the same time bridging the gap between the black box behavior and the end-user who has to interpret the results. We show that our FS methodology successfully discards a large fraction of machine-generated features, improving classification performance of state-of-the-art text mining algorithms. Furthermore, we illustrate how FS can be applied to gain understanding in the predictions of a framework for biomolecular event extraction from text. We include numerous examples of highly discriminative features that model either biological reality or common linguistic constructs. Finally, we discuss a number of insights from our FS analyses that will provide the opportunity to considerably improve upon current text mining tools. The FS algorithms and classifiers are available in Java-ML (http://java-ml.sf.net). The datasets are publicly available from the BioNLP'09 Shared Task web site (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/).

  5. Text Mining for Protein Docking

    PubMed Central

    Badal, Varsha D.; Kundrotas, Petras J.; Vakser, Ilya A.

    2015-01-01

    The rapidly growing amount of publicly available information from biomedical research is readily accessible on the Internet, providing a powerful resource for predictive biomolecular modeling. The accumulated data on experimentally determined structures transformed structure prediction of proteins and protein complexes. Instead of exploring the enormous search space, predictive tools can simply proceed to the solution based on similarity to the existing, previously determined structures. A similar major paradigm shift is emerging due to the rapidly expanding amount of information, other than experimentally determined structures, which still can be used as constraints in biomolecular structure prediction. Automated text mining has been widely used in recreating protein interaction networks, as well as in detecting small ligand binding sites on protein structures. Combining and expanding these two well-developed areas of research, we applied the text mining to structural modeling of protein-protein complexes (protein docking). Protein docking can be significantly improved when constraints on the docking mode are available. We developed a procedure that retrieves published abstracts on a specific protein-protein interaction and extracts information relevant to docking. The procedure was assessed on protein complexes from Dockground (http://dockground.compbio.ku.edu). The results show that correct information on binding residues can be extracted for about half of the complexes. The amount of irrelevant information was reduced by conceptual analysis of a subset of the retrieved abstracts, based on the bag-of-words (features) approach. Support Vector Machine models were trained and validated on the subset. The remaining abstracts were filtered by the best-performing models, which decreased the irrelevant information for ~ 25% complexes in the dataset. The extracted constraints were incorporated in the docking protocol and tested on the Dockground unbound benchmark set, significantly increasing the docking success rate. PMID:26650466

  6. Reproducibility of studies on text mining for citation screening in systematic reviews: Evaluation and checklist.

    PubMed

    Olorisade, Babatunde Kazeem; Brereton, Pearl; Andras, Peter

    2017-09-01

    Independent validation of published scientific results through study replication is a pre-condition for accepting the validity of such results. In computation research, full replication is often unrealistic for independent results validation, therefore, study reproduction has been justified as the minimum acceptable standard to evaluate the validity of scientific claims. The application of text mining techniques to citation screening in the context of systematic literature reviews is a relatively young and growing computational field with high relevance for software engineering, medical research and other fields. However, there is little work so far on reproduction studies in the field. In this paper, we investigate the reproducibility of studies in this area based on information contained in published articles and we propose reporting guidelines that could improve reproducibility. The study was approached in two ways. Initially we attempted to reproduce results from six studies, which were based on the same raw dataset. Then, based on this experience, we identified steps considered essential to successful reproduction of text mining experiments and characterized them to measure how reproducible is a study given the information provided on these steps. 33 articles were systematically assessed for reproducibility using this approach. Our work revealed that it is currently difficult if not impossible to independently reproduce the results published in any of the studies investigated. The lack of information about the datasets used limits reproducibility of about 80% of the studies assessed. Also, information about the machine learning algorithms is inadequate in about 27% of the papers. On the plus side, the third party software tools used are mostly free and available. The reproducibility potential of most of the studies can be significantly improved if more attention is paid to information provided on the datasets used, how they were partitioned and utilized, and how any randomization was controlled. We introduce a checklist of information that needs to be provided in order to ensure that a published study can be reproduced. Copyright © 2017 Elsevier Inc. All rights reserved.

  7. Generative Topic Modeling in Image Data Mining and Bioinformatics Studies

    ERIC Educational Resources Information Center

    Chen, Xin

    2012-01-01

    Probabilistic topic models have been developed for applications in various domains such as text mining, information retrieval and computer vision and bioinformatics domain. In this thesis, we focus on developing novel probabilistic topic models for image mining and bioinformatics studies. Specifically, a probabilistic topic-connection (PTC) model…

  8. 78 FR 40496 - Notice of availability of the Final Environmental Impact Statement for the Proposed Hollister...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-07-05

    ... silver mining operation. Most of the infrastructure to support a mining operation was authorized and.... The Proposed Action consists of underground mining, constructing a new production shaft, improving.... Public comments resulted in the addition of clarifying text, but did not significantly change the...

  9. Literature mining, gene-set enrichment and pathway analysis for target identification in Behçet's disease.

    PubMed

    Wilson, Paul; Larminie, Christopher; Smith, Rona

    2016-01-01

    To use literature mining to catalogue Behçet's associated genes, and advanced computational methods to improve the understanding of the pathways and signalling mechanisms that lead to the typical clinical characteristics of Behçet's patients. To extend this technique to identify potential treatment targets for further experimental validation. Text mining methods combined with gene enrichment tools, pathway analysis and causal analysis algorithms. This approach identified 247 human genes associated with Behçet's disease and the resulting disease map, comprising 644 nodes and 19220 edges, captured important details of the relationships between these genes and their associated pathways, as described in diverse data repositories. Pathway analysis has identified how Behçet's associated genes are likely to participate in innate and adaptive immune responses. Causal analysis algorithms have identified a number of potential therapeutic strategies for further investigation. Computational methods have captured pertinent features of the prominent disease characteristics presented in Behçet's disease and have highlighted NOD2, ICOS and IL18 signalling as potential therapeutic strategies.

  10. Using natural language processing techniques to inform research on nanotechnology

    PubMed Central

    Lewinski, Nastassja A

    2015-01-01

    Summary Literature in the field of nanotechnology is exponentially increasing with more and more engineered nanomaterials being created, characterized, and tested for performance and safety. With the deluge of published data, there is a need for natural language processing approaches to semi-automate the cataloguing of engineered nanomaterials and their associated physico-chemical properties, performance, exposure scenarios, and biological effects. In this paper, we review the different informatics methods that have been applied to patent mining, nanomaterial/device characterization, nanomedicine, and environmental risk assessment. Nine natural language processing (NLP)-based tools were identified: NanoPort, NanoMapper, TechPerceptor, a Text Mining Framework, a Nanodevice Analyzer, a Clinical Trial Document Classifier, Nanotoxicity Searcher, NanoSifter, and NEIMiner. We conclude with recommendations for sharing NLP-related tools through online repositories to broaden participation in nanoinformatics. PMID:26199848

  11. The Contribution of the Vaccine Adverse Event Text Mining System to the Classification of Possible Guillain-Barré Syndrome Reports

    PubMed Central

    Botsis, T.; Woo, E. J.; Ball, R.

    2013-01-01

    Background We previously demonstrated that a general purpose text mining system, the Vaccine adverse event Text Mining (VaeTM) system, could be used to automatically classify reports of an-aphylaxis for post-marketing safety surveillance of vaccines. Objective To evaluate the ability of VaeTM to classify reports to the Vaccine Adverse Event Reporting System (VAERS) of possible Guillain-Barré Syndrome (GBS). Methods We used VaeTM to extract the key diagnostic features from the text of reports in VAERS. Then, we applied the Brighton Collaboration (BC) case definition for GBS, and an information retrieval strategy (i.e. the vector space model) to quantify the specific information that is included in the key features extracted by VaeTM and compared it with the encoded information that is already stored in VAERS as Medical Dictionary for Regulatory Activities (MedDRA) Preferred Terms (PTs). We also evaluated the contribution of the primary (diagnosis and cause of death) and secondary (second level diagnosis and symptoms) diagnostic VaeTM-based features to the total VaeTM-based information. Results MedDRA captured more information and better supported the classification of reports for GBS than VaeTM (AUC: 0.904 vs. 0.777); the lower performance of VaeTM is likely due to the lack of extraction by VaeTM of specific laboratory results that are included in the BC criteria for GBS. On the other hand, the VaeTM-based classification exhibited greater specificity than the MedDRA-based approach (94.96% vs. 87.65%). Most of the VaeTM-based information was contained in the secondary diagnostic features. Conclusion For GBS, clinical signs and symptoms alone are not sufficient to match MedDRA coding for purposes of case classification, but are preferred if specificity is the priority. PMID:23650490

  12. Analysis of Nature of Science Included in Recent Popular Writing Using Text Mining Techniques

    ERIC Educational Resources Information Center

    Jiang, Feng; McComas, William F.

    2014-01-01

    This study examined the inclusion of nature of science (NOS) in popular science writing to determine whether it could serve supplementary resource for teaching NOS and to evaluate the accuracy of text mining and classification as a viable research tool in science education research. Four groups of documents published from 2001 to 2010 were…

  13. The Determination of Children's Knowledge of Global Lunar Patterns from Online Essays Using Text Mining Analysis

    ERIC Educational Resources Information Center

    Cheon, Jongpil; Lee, Sangno; Smith, Walter; Song, Jaeki; Kim, Yongjin

    2013-01-01

    The purpose of this study was to use text mining analysis of early adolescents' online essays to determine their knowledge of global lunar patterns. Australian and American students in grades five to seven wrote about global lunar patterns they had discovered by sharing observations with each other via the Internet. These essays were analyzed for…

  14. Impact of Text-Mining and Imitating Strategies on Lexical Richness, Lexical Diversity and General Success in Second Language Writing

    ERIC Educational Resources Information Center

    Çepni, Sevcan Bayraktar; Demirel, Elif Tokdemir

    2016-01-01

    This study aimed to find out the impact of "text mining and imitating" strategies on lexical richness, lexical diversity and general success of students in their compositions in second language writing. The participants were 98 students studying their first year in Karadeniz Technical University in English Language and Literature…

  15. Science and Technology Text Mining: Text Mining of the Journal Cortex

    DTIC Science & Technology

    2004-01-01

    Amnesia Retrograde Amnesia GENERAL Semantic Memory Episodic Memory Working Memory TEST Serial Position Curve...in Cortex can be reasonably divided into four categories (papers in each category in parenthesis): Semantic Memory (151); Handedness (145); Amnesia ... Semantic Memory (151) is divided into Verbal/ Numerical (76) and Visual/ Spatial (75). Amnesia (119) is divided into Amnesia Symptoms (50) and

  16. WHAT INNOVATIVE APPROACHES CAN BE DEVELOPED FOR MINING SITES?

    EPA Science Inventory

    Mining is essential to maintain our way of life. However, based upon industry's reporting in the most recent Toxic Release Inventory (TRI), the primary sources of heavy metal releases to the environment are mining and mining related activities. The hard rock mining industry rel...

  17. Full-text automated detection of surgical site infections secondary to neurosurgery in Rennes, France.

    PubMed

    Campillo-Gimenez, Boris; Garcelon, Nicolas; Jarno, Pascal; Chapplain, Jean Marc; Cuggia, Marc

    2013-01-01

    The surveillance of Surgical Site Infections (SSI) contributes to the management of risk in French hospitals. Manual identification of infections is costly, time-consuming and limits the promotion of preventive procedures by the dedicated teams. The introduction of alternative methods using automated detection strategies is promising to improve this surveillance. The present study describes an automated detection strategy for SSI in neurosurgery, based on textual analysis of medical reports stored in a clinical data warehouse. The method consists firstly, of enrichment and concept extraction from full-text reports using NOMINDEX, and secondly, text similarity measurement using a vector space model. The text detection was compared to the conventional strategy based on self-declaration and to the automated detection using the diagnosis-related group database. The text-mining approach showed the best detection accuracy, with recall and precision equal to 92% and 40% respectively, and confirmed the interest of reusing full-text medical reports to perform automated detection of SSI.

  18. Cinnamon gulch revisited: Another look at separating natural and mining-impacted contributions to instream metal load

    USGS Publications Warehouse

    Runkel, Robert L.; Verplanck, Philip; Kimball, Briant; Walton-Day, Katie

    2018-01-01

    Baseline, premining data for streams draining abandoned mine lands is virtually non existent, and indirect methods for estimating premining conditions are needed to establish realistic, cost effective cleanup goals. One such indirect method is the proximal analog approach, in which premining conditions are estimated using data from nearby mineralized areas that are unaffected by mining. In this paper, we combine the proximal analog approach with a quantitative mass balance framework using data from a spatially-detailed synoptic sampling campaign. The combined approach is applied to Cinnamon Gulch, a headwater stream with numerous draining adits. Synoptic sampling results indicate that three of the top five metal sources are affected by mining activities, and stream segments draining these sources account for a large percentage of overall metal loading within the study reach. These initial calculations overestimate the effects of mining, as the affected stream segments were likely acidic and metal rich prior to mining. Premining loads and concentrations were therefore determined through a replacement approach in which the chemistry of each mining-affected stream segment is revised based on proximal analog concentrations. The revised loading profiles indicate that 15–17% of the Al, Cd, Cu, Mn, Ni, and Zn loads are attributable to mining, whereas the mining contribution for Pb is 40%. Premining concentrations of Al, Cd, Cu, Mn, and Zn are estimated to be in excess of aquatic life standards over the length of the study reach.

  19. Texas lignite mining: Groundwater and slope stability control in the nineties and beyond

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lawrence J.

    As lignite mining in Texas approaches and exceeds depths of 200 feet below ground level, rising costs demand that innovative mining approaches be used in order to maintain the economic viability of lignite mining. Groundwater and slope stability problems multiply at these depths, resulting in increasing focus on how to control these costs. Dewatering costs are consistently rising for the lignite industry, as deeper mining encounters more and larger saturated sand bodies. These sands require dewatering in order to improve slope stability. Planning and analysis become more important as the number of wells grows beyond what can be managed withmore » a simple {open_quotes}cookie-cutter{close_quotes} approach. Slope stability plays an increasing role in mining concerns as deeper lignite is recovered. Slope stability causes several problems, including loss of lignite, increased rehandle, and hazards to personnel and equipment. Traditional lignite mine planning involved a fairly {open_quotes}generic{close_quotes} pit design with one design highwall angle, one design spoil angle, and little geotechnical evaluation of the deposit. This {open_quotes}one mine-one design{close_quotes} approach, while cost-effective in the past, is now being replaced by a more critical analysis of the design requirements of each area. Geotechnical evaluation plays an increasing role in the planning and operational aspects of lignite mining. Laboratory core sample test results can be used for slope stability modeling, in order to obtain more accurate design and operational information.« less

  20. Supervised Learning Based Hypothesis Generation from Biomedical Literature.

    PubMed

    Sang, Shengtian; Yang, Zhihao; Li, Zongyao; Lin, Hongfei

    2015-01-01

    Nowadays, the amount of biomedical literatures is growing at an explosive speed, and there is much useful knowledge undiscovered in this literature. Researchers can form biomedical hypotheses through mining these works. In this paper, we propose a supervised learning based approach to generate hypotheses from biomedical literature. This approach splits the traditional processing of hypothesis generation with classic ABC model into AB model and BC model which are constructed with supervised learning method. Compared with the concept cooccurrence and grammar engineering-based approaches like SemRep, machine learning based models usually can achieve better performance in information extraction (IE) from texts. Then through combining the two models, the approach reconstructs the ABC model and generates biomedical hypotheses from literature. The experimental results on the three classic Swanson hypotheses show that our approach outperforms SemRep system.

  1. The Functional Genomics Network in the evolution of biological text mining over the past decade.

    PubMed

    Blaschke, Christian; Valencia, Alfonso

    2013-03-25

    Different programs of The European Science Foundation (ESF) have contributed significantly to connect researchers in Europe and beyond through several initiatives. This support was particularly relevant for the development of the areas related with extracting information from papers (text-mining) because it supported the field in its early phases long before it was recognized by the community. We review the historical development of text mining research and how it was introduced in bioinformatics. Specific applications in (functional) genomics are described like it's integration in genome annotation pipelines and the support to the analysis of high-throughput genomics experimental data, and we highlight the activities of evaluation of methods and benchmarking for which the ESF programme support was instrumental. Copyright © 2013 Elsevier B.V. All rights reserved.

  2. Chapter 2: The forestry reclamation approach

    Treesearch

    Jim Burger; Don Graves; Patrick Angel; Vic Davis; Carl Zipper

    2017-01-01

    The Forestry Reclamation Approach (FRA) is a method for reclaiming coal-mined land to forest under the federal Surface Mining Control and Reclamation Act of 1977 (SMCRA). The FRA is based on knowledge gained from both scientific research and experience (Fig. 2-1). The FRA can achieve cost-effective regulatory compliance for mine operators while creating productive...

  3. Supporting the annotation of chronic obstructive pulmonary disease (COPD) phenotypes with text mining workflows.

    PubMed

    Fu, Xiao; Batista-Navarro, Riza; Rak, Rafal; Ananiadou, Sophia

    2015-01-01

    Chronic obstructive pulmonary disease (COPD) is a life-threatening lung disorder whose recent prevalence has led to an increasing burden on public healthcare. Phenotypic information in electronic clinical records is essential in providing suitable personalised treatment to patients with COPD. However, as phenotypes are often "hidden" within free text in clinical records, clinicians could benefit from text mining systems that facilitate their prompt recognition. This paper reports on a semi-automatic methodology for producing a corpus that can ultimately support the development of text mining tools that, in turn, will expedite the process of identifying groups of COPD patients. A corpus of 30 full-text papers was formed based on selection criteria informed by the expertise of COPD specialists. We developed an annotation scheme that is aimed at producing fine-grained, expressive and computable COPD annotations without burdening our curators with a highly complicated task. This was implemented in the Argo platform by means of a semi-automatic annotation workflow that integrates several text mining tools, including a graphical user interface for marking up documents. When evaluated using gold standard (i.e., manually validated) annotations, the semi-automatic workflow was shown to obtain a micro-averaged F-score of 45.70% (with relaxed matching). Utilising the gold standard data to train new concept recognisers, we demonstrated that our corpus, although still a work in progress, can foster the development of significantly better performing COPD phenotype extractors. We describe in this work the means by which we aim to eventually support the process of COPD phenotype curation, i.e., by the application of various text mining tools integrated into an annotation workflow. Although the corpus being described is still under development, our results thus far are encouraging and show great potential in stimulating the development of further automatic COPD phenotype extractors.

  4. Classification of Traffic Related Short Texts to Analyse Road Problems in Urban Areas

    NASA Astrophysics Data System (ADS)

    Saldana-Perez, A. M. M.; Moreno-Ibarra, M.; Tores-Ruiz, M.

    2017-09-01

    The Volunteer Geographic Information (VGI) can be used to understand the urban dynamics. In the classification of traffic related short texts to analyze road problems in urban areas, a VGI data analysis is done over a social media's publications, in order to classify traffic events at big cities that modify the movement of vehicles and people through the roads, such as car accidents, traffic and closures. The classification of traffic events described in short texts is done by applying a supervised machine learning algorithm. In the approach users are considered as sensors which describe their surroundings and provide their geographic position at the social network. The posts are treated by a text mining process and classified into five groups. Finally, the classified events are grouped in a data corpus and geo-visualized in the study area, to detect the places with more vehicular problems.

  5. 78 FR 64397 - Mississippi Regulatory Program

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-10-29

    ... text of the program amendment available at www.regulations.gov . A. Mississippi Surface Coal Mining... DEPARTMENT OF THE INTERIOR Office of Surface Mining Reclamation and Enforcement 30 CFR Part 924...; S2D2SSS08011000SX066A00033F13XS501520] Mississippi Regulatory Program AGENCY: Office of Surface Mining Reclamation and Enforcement...

  6. A methodological toolkit for field assessments of artisanally mined alluvial diamond deposits

    USGS Publications Warehouse

    Chirico, Peter G.; Malpeli, Katherine C.

    2014-01-01

    This toolkit provides a standardized checklist of critical issues relevant to artisanal mining-related field research. An integrated sociophysical geographic approach to collecting data at artisanal mine sites is outlined. The implementation and results of a multistakeholder approach to data collection, carried out in the assessment of Guinea’s artisanally mined diamond deposits, also are summarized. This toolkit, based on recent and successful field campaigns in West Africa, has been developed as a reference document to assist other government agencies or organizations in collecting the data necessary for artisanal diamond mining or similar natural resource assessments.

  7. Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases.

    PubMed

    Gobeill, Julien; Pasche, Emilie; Vishnyakova, Dina; Ruch, Patrick

    2013-01-01

    The available curated data lag behind current biological knowledge contained in the literature. Text mining can assist biologists and curators to locate and access this knowledge, for instance by characterizing the functional profile of publications. Gene Ontology (GO) category assignment in free text already supports various applications, such as powering ontology-based search engines, finding curation-relevant articles (triage) or helping the curator to identify and encode functions. Popular text mining tools for GO classification are based on so called thesaurus-based--or dictionary-based--approaches, which exploit similarities between the input text and GO terms themselves. But their effectiveness remains limited owing to the complex nature of GO terms, which rarely occur in text. In contrast, machine learning approaches exploit similarities between the input text and already curated instances contained in a knowledge base to infer a functional profile. GO Annotations (GOA) and MEDLINE make possible to exploit a growing amount of curated abstracts (97 000 in November 2012) for populating this knowledge base. Our study compares a state-of-the-art thesaurus-based system with a machine learning system (based on a k-Nearest Neighbours algorithm) for the task of proposing a functional profile for unseen MEDLINE abstracts, and shows how resources and performances have evolved. Systems are evaluated on their ability to propose for a given abstract the GO terms (2.8 on average) used for curation in GOA. We show that since 2006, although a massive effort was put into adding synonyms in GO (+300%), our thesaurus-based system effectiveness is rather constant, reaching from 0.28 to 0.31 for Recall at 20 (R20). In contrast, thanks to its knowledge base growth, our machine learning system has steadily improved, reaching from 0.38 in 2006 to 0.56 for R20 in 2012. Integrated in semi-automatic workflows or in fully automatic pipelines, such systems are more and more efficient to provide assistance to biologists. DATABASE URL: http://eagl.unige.ch/GOCat/

  8. Simulation of patient flow in multiple healthcare units using process and data mining techniques for model identification.

    PubMed

    Kovalchuk, Sergey V; Funkner, Anastasia A; Metsker, Oleg G; Yakovlev, Aleksey N

    2018-06-01

    An approach to building a hybrid simulation of patient flow is introduced with a combination of data-driven methods for automation of model identification. The approach is described with a conceptual framework and basic methods for combination of different techniques. The implementation of the proposed approach for simulation of the acute coronary syndrome (ACS) was developed and used in an experimental study. A combination of data, text, process mining techniques, and machine learning approaches for the analysis of electronic health records (EHRs) with discrete-event simulation (DES) and queueing theory for the simulation of patient flow was proposed. The performed analysis of EHRs for ACS patients enabled identification of several classes of clinical pathways (CPs) which were used to implement a more realistic simulation of the patient flow. The developed solution was implemented using Python libraries (SimPy, SciPy, and others). The proposed approach enables more a realistic and detailed simulation of the patient flow within a group of related departments. An experimental study shows an improved simulation of patient length of stay for ACS patient flow obtained from EHRs in Almazov National Medical Research Centre in Saint Petersburg, Russia. The proposed approach, methods, and solutions provide a conceptual, methodological, and programming framework for the implementation of a simulation of complex and diverse scenarios within a flow of patients for different purposes: decision making, training, management optimization, and others. Copyright © 2018 Elsevier Inc. All rights reserved.

  9. Phylogeny-guided (meta)genome mining approach for the targeted discovery of new microbial natural products.

    PubMed

    Kang, Hahk-Soo

    2017-02-01

    Genomics-based methods are now commonplace in natural products research. A phylogeny-guided mining approach provides a means to quickly screen a large number of microbial genomes or metagenomes in search of new biosynthetic gene clusters of interest. In this approach, biosynthetic genes serve as molecular markers, and phylogenetic trees built with known and unknown marker gene sequences are used to quickly prioritize biosynthetic gene clusters for their metabolites characterization. An increase in the use of this approach has been observed for the last couple of years along with the emergence of low cost sequencing technologies. The aim of this review is to discuss the basic concept of a phylogeny-guided mining approach, and also to provide examples in which this approach was successfully applied to discover new natural products from microbial genomes and metagenomes. I believe that the phylogeny-guided mining approach will continue to play an important role in genomics-based natural products research.

  10. Abandoned Mine Lands: Site Information

    EPA Pesticide Factsheets

    A catalogue of mining sites proposed for and listed on the NPL as well as mining sites being cleaned up using the Superfund Alternative Approach. Also mine sites not on the NPL but that have had removal or emergency response cleanup actions.

  11. Application of Ferulic Acid for Alzheimer’s Disease: Combination of Text Mining and Experimental Validation

    PubMed Central

    Meng, Guilin; Meng, Xiulin; Ma, Xiaoye; Zhang, Gengping; Hu, Xiaolin; Jin, Aiping; Liu, Xueyuan

    2018-01-01

    Alzheimer’s disease (AD) is an increasing concern in human health. Despite significant research, highly effective drugs to treat AD are lacking. The present study describes the text mining process to identify drug candidates from a traditional Chinese medicine (TCM) database, along with associated protein target mechanisms. We carried out text mining to identify literatures that referenced both AD and TCM and focused on identifying compounds and protein targets of interest. After targeting one potential TCM candidate, corresponding protein-protein interaction (PPI) networks were assembled in STRING to decipher the most possible mechanism of action. This was followed by validation using Western blot and co-immunoprecipitation in an AD cell model. The text mining strategy using a vast amount of AD-related literature and the TCM database identified curcumin, whose major component was ferulic acid (FA). This was used as a key candidate compound for further study. Using the top calculated interaction score in STRING, BACE1 and MMP2 were implicated in the activity of FA in AD. Exposure of SHSY5Y-APP cells to FA resulted in the decrease in expression levels of BACE-1 and APP, while the expression of MMP-2 and MMP-9 increased in a dose-dependent manner. This suggests that FA induced BACE1 and MMP2 pathways maybe novel potential mechanisms involved in AD. The text mining of literature and TCM database related to AD suggested FA as a promising TCM ingredient for the treatment of AD. Potential mechanisms interconnected and integrated with Aβ aggregation inhibition and extracellular matrix remodeling underlying the activity of FA were identified using in vitro studies. PMID:29896095

  12. Application of Ferulic Acid for Alzheimer's Disease: Combination of Text Mining and Experimental Validation.

    PubMed

    Meng, Guilin; Meng, Xiulin; Ma, Xiaoye; Zhang, Gengping; Hu, Xiaolin; Jin, Aiping; Zhao, Yanxin; Liu, Xueyuan

    2018-01-01

    Alzheimer's disease (AD) is an increasing concern in human health. Despite significant research, highly effective drugs to treat AD are lacking. The present study describes the text mining process to identify drug candidates from a traditional Chinese medicine (TCM) database, along with associated protein target mechanisms. We carried out text mining to identify literatures that referenced both AD and TCM and focused on identifying compounds and protein targets of interest. After targeting one potential TCM candidate, corresponding protein-protein interaction (PPI) networks were assembled in STRING to decipher the most possible mechanism of action. This was followed by validation using Western blot and co-immunoprecipitation in an AD cell model. The text mining strategy using a vast amount of AD-related literature and the TCM database identified curcumin, whose major component was ferulic acid (FA). This was used as a key candidate compound for further study. Using the top calculated interaction score in STRING, BACE1 and MMP2 were implicated in the activity of FA in AD. Exposure of SHSY5Y-APP cells to FA resulted in the decrease in expression levels of BACE-1 and APP, while the expression of MMP-2 and MMP-9 increased in a dose-dependent manner. This suggests that FA induced BACE1 and MMP2 pathways maybe novel potential mechanisms involved in AD. The text mining of literature and TCM database related to AD suggested FA as a promising TCM ingredient for the treatment of AD. Potential mechanisms interconnected and integrated with Aβ aggregation inhibition and extracellular matrix remodeling underlying the activity of FA were identified using in vitro studies.

  13. Text Mining Effectively Scores and Ranks the Literature for Improving Chemical-Gene-Disease Curation at the Comparative Toxicogenomics Database

    PubMed Central

    Johnson, Robin J.; Lay, Jean M.; Lennon-Hopkins, Kelley; Saraceni-Richards, Cynthia; Sciaky, Daniela; Murphy, Cynthia Grondin; Mattingly, Carolyn J.

    2013-01-01

    The Comparative Toxicogenomics Database (CTD; http://ctdbase.org/) is a public resource that curates interactions between environmental chemicals and gene products, and their relationships to diseases, as a means of understanding the effects of environmental chemicals on human health. CTD provides a triad of core information in the form of chemical-gene, chemical-disease, and gene-disease interactions that are manually curated from scientific articles. To increase the efficiency, productivity, and data coverage of manual curation, we have leveraged text mining to help rank and prioritize the triaged literature. Here, we describe our text-mining process that computes and assigns each article a document relevancy score (DRS), wherein a high DRS suggests that an article is more likely to be relevant for curation at CTD. We evaluated our process by first text mining a corpus of 14,904 articles triaged for seven heavy metals (cadmium, cobalt, copper, lead, manganese, mercury, and nickel). Based upon initial analysis, a representative subset corpus of 3,583 articles was then selected from the 14,094 articles and sent to five CTD biocurators for review. The resulting curation of these 3,583 articles was analyzed for a variety of parameters, including article relevancy, novel data content, interaction yield rate, mean average precision, and biological and toxicological interpretability. We show that for all measured parameters, the DRS is an effective indicator for scoring and improving the ranking of literature for the curation of chemical-gene-disease information at CTD. Here, we demonstrate how fully incorporating text mining-based DRS scoring into our curation pipeline enhances manual curation by prioritizing more relevant articles, thereby increasing data content, productivity, and efficiency. PMID:23613709

  14. A New Framework for Textual Information Mining over Parse Trees. CRESST Report 805

    ERIC Educational Resources Information Center

    Mousavi, Hamid; Kerr, Deirdre; Iseli, Markus R.

    2011-01-01

    Textual information mining is a challenging problem that has resulted in the creation of many different rule-based linguistic query languages. However, these languages generally are not optimized for the purpose of text mining. In other words, they usually consider queries as individuals and only return raw results for each query. Moreover they…

  15. Data Mining: A Hybrid Methodology for Complex and Dynamic Research

    ERIC Educational Resources Information Center

    Lang, Susan; Baehr, Craig

    2012-01-01

    This article provides an overview of the ways in which data and text mining have potential as research methodologies in composition studies. It introduces data mining in the context of the field of composition studies and discusses ways in which this methodology can complement and extend our existing research practices by blending the best of what…

  16. Stochastic production phase design for an open pit mining complex with multiple processing streams

    NASA Astrophysics Data System (ADS)

    Asad, Mohammad Waqar Ali; Dimitrakopoulos, Roussos; van Eldert, Jeroen

    2014-08-01

    In a mining complex, the mine is a source of supply of valuable material (ore) to a number of processes that convert the raw ore to a saleable product or a metal concentrate for production of the refined metal. In this context, expected variation in metal content throughout the extent of the orebody defines the inherent uncertainty in the supply of ore, which impacts the subsequent ore and metal production targets. Traditional optimization methods for designing production phases and ultimate pit limit of an open pit mine not only ignore the uncertainty in metal content, but, in addition, commonly assume that the mine delivers ore to a single processing facility. A stochastic network flow approach is proposed that jointly integrates uncertainty in supply of ore and multiple ore destinations into the development of production phase design and ultimate pit limit. An application at a copper mine demonstrates the intricacies of the new approach. The case study shows a 14% higher discounted cash flow when compared to the traditional approach.

  17. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Raymond, David W.; Gaither, Katherine N.; Polsky, Yarom

    Sandia National Laboratories (Sandia) has a long history in developing compact, mobile, very high-speed drilling systems and this technology could be applied to increasing the rate at which boreholes are drilled during a mine accident response. The present study reviews current technical approaches, primarily based on technology developed under other programs, analyzes mine rescue specific requirements to develop a conceptual mine rescue drilling approach, and finally, proposes development of a phased mine rescue drilling system (MRDS) that accomplishes (1) development of rapid drilling MRDS equipment; (2) structuring improved web communication through the Mine Safety & Health Administration (MSHA) web site;more » (3) development of an improved protocol for employment of existing drilling technology in emergencies; (4) deployment of advanced technologies to complement mine rescue drilling operations during emergency events; and (5) preliminary discussion of potential future technology development of specialized MRDS equipment. This phased approach allows for rapid fielding of a basic system for improved rescue drilling, with the ability to improve the system over time at a reasonable cost.« less

  18. Text mining and medicine: usefulness in respiratory diseases.

    PubMed

    Piedra, David; Ferrer, Antoni; Gea, Joaquim

    2014-03-01

    It is increasingly common to have medical information in electronic format. This includes scientific articles as well as clinical management reviews, and even records from health institutions with patient data. However, traditional instruments, both individual and institutional, are of little use for selecting the most appropriate information in each case, either in the clinical or research field. So-called text or data «mining» enables this huge amount of information to be managed, extracting it from various sources using processing systems (filtration and curation), integrating it and permitting the generation of new knowledge. This review aims to provide an overview of text and data mining, and of the potential usefulness of this bioinformatic technique in the exercise of care in respiratory medicine and in research in the same field. Copyright © 2013 SEPAR. Published by Elsevier Espana. All rights reserved.

  19. Overview of the gene ontology task at BioCreative IV.

    PubMed

    Mao, Yuqing; Van Auken, Kimberly; Li, Donghui; Arighi, Cecilia N; McQuilton, Peter; Hayman, G Thomas; Tweedie, Susan; Schaeffer, Mary L; Laulederkind, Stanley J F; Wang, Shur-Jen; Gobeill, Julien; Ruch, Patrick; Luu, Anh Tuan; Kim, Jung-Jae; Chiang, Jung-Hsien; Chen, Yu-De; Yang, Chia-Jung; Liu, Hongfang; Zhu, Dongqing; Li, Yanpeng; Yu, Hong; Emadzadeh, Ehsan; Gonzalez, Graciela; Chen, Jian-Ming; Dai, Hong-Jie; Lu, Zhiyong

    2014-01-01

    Gene ontology (GO) annotation is a common task among model organism databases (MODs) for capturing gene function data from journal articles. It is a time-consuming and labor-intensive task, and is thus often considered as one of the bottlenecks in literature curation. There is a growing need for semiautomated or fully automated GO curation techniques that will help database curators to rapidly and accurately identify gene function information in full-length articles. Despite multiple attempts in the past, few studies have proven to be useful with regard to assisting real-world GO curation. The shortage of sentence-level training data and opportunities for interaction between text-mining developers and GO curators has limited the advances in algorithm development and corresponding use in practical circumstances. To this end, we organized a text-mining challenge task for literature-based GO annotation in BioCreative IV. More specifically, we developed two subtasks: (i) to automatically locate text passages that contain GO-relevant information (a text retrieval task) and (ii) to automatically identify relevant GO terms for the genes in a given article (a concept-recognition task). With the support from five MODs, we provided teams with >4000 unique text passages that served as the basis for each GO annotation in our task data. Such evidence text information has long been recognized as critical for text-mining algorithm development but was never made available because of the high cost of curation. In total, seven teams participated in the challenge task. From the team results, we conclude that the state of the art in automatically mining GO terms from literature has improved over the past decade while much progress is still needed for computer-assisted GO curation. Future work should focus on addressing remaining technical challenges for improved performance of automatic GO concept recognition and incorporating practical benefits of text-mining tools into real-world GO annotation. http://www.biocreative.org/tasks/biocreative-iv/track-4-GO/. Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US.

  20. VALUING ACID MINE DRAINAGE REMEDIATION OF IMPAIRED WATERWAYS IN WEST VIRGINIA: A HEDONIC MODELING APPROACH

    EPA Science Inventory

    States with active and abandoned mines face large private and public costs to remediate damage to streams and rivers from acid mine drainage (AMD), the metal rich runoff flowing primarily from abandoned mines and surface deposits of mine waste. AMD can lower stream and river pH ...

  1. PPInterFinder--a mining tool for extracting causal relations on human proteins from literature.

    PubMed

    Raja, Kalpana; Subramani, Suresh; Natarajan, Jeyakumar

    2013-01-01

    One of the most common and challenging problem in biomedical text mining is to mine protein-protein interactions (PPIs) from MEDLINE abstracts and full-text research articles because PPIs play a major role in understanding the various biological processes and the impact of proteins in diseases. We implemented, PPInterFinder--a web-based text mining tool to extract human PPIs from biomedical literature. PPInterFinder uses relation keyword co-occurrences with protein names to extract information on PPIs from MEDLINE abstracts and consists of three phases. First, it identifies the relation keyword using a parser with Tregex and a relation keyword dictionary. Next, it automatically identifies the candidate PPI pairs with a set of rules related to PPI recognition. Finally, it extracts the relations by matching the sentence with a set of 11 specific patterns based on the syntactic nature of PPI pair. We find that PPInterFinder is capable of predicting PPIs with the accuracy of 66.05% on AIMED corpus and outperforms most of the existing systems. DATABASE URL: http://www.biomining-bu.in/ppinterfinder/

  2. PPInterFinder—a mining tool for extracting causal relations on human proteins from literature

    PubMed Central

    Raja, Kalpana; Subramani, Suresh; Natarajan, Jeyakumar

    2013-01-01

    One of the most common and challenging problem in biomedical text mining is to mine protein–protein interactions (PPIs) from MEDLINE abstracts and full-text research articles because PPIs play a major role in understanding the various biological processes and the impact of proteins in diseases. We implemented, PPInterFinder—a web-based text mining tool to extract human PPIs from biomedical literature. PPInterFinder uses relation keyword co-occurrences with protein names to extract information on PPIs from MEDLINE abstracts and consists of three phases. First, it identifies the relation keyword using a parser with Tregex and a relation keyword dictionary. Next, it automatically identifies the candidate PPI pairs with a set of rules related to PPI recognition. Finally, it extracts the relations by matching the sentence with a set of 11 specific patterns based on the syntactic nature of PPI pair. We find that PPInterFinder is capable of predicting PPIs with the accuracy of 66.05% on AIMED corpus and outperforms most of the existing systems. Database URL: http://www.biomining-bu.in/ppinterfinder/ PMID:23325628

  3. Mining Clinicians' Electronic Documentation to Identify Heart Failure Patients with Ineffective Self-Management: A Pilot Text-Mining Study.

    PubMed

    Topaz, Maxim; Radhakrishnan, Kavita; Lei, Victor; Zhou, Li

    2016-01-01

    Effective self-management can decrease up to 50% of heart failure hospitalizations. Unfortunately, self-management by patients with heart failure remains poor. This pilot study aimed to explore the use of text-mining to identify heart failure patients with ineffective self-management. We first built a comprehensive self-management vocabulary based on the literature and clinical notes review. We then randomly selected 545 heart failure patients treated within Partners Healthcare hospitals (Boston, MA, USA) and conducted a regular expression search with the compiled vocabulary within 43,107 interdisciplinary clinical notes of these patients. We found that 38.2% (n = 208) patients had documentation of ineffective heart failure self-management in the domains of poor diet adherence (28.4%), missed medical encounters (26.4%) poor medication adherence (20.2%) and non-specified self-management issues (e.g., "compliance issues", 34.6%). We showed the feasibility of using text-mining to identify patients with ineffective self-management. More natural language processing algorithms are needed to help busy clinicians identify these patients.

  4. Integration of Artificial Market Simulation and Text Mining for Market Analysis

    NASA Astrophysics Data System (ADS)

    Izumi, Kiyoshi; Matsui, Hiroki; Matsuo, Yutaka

    We constructed an evaluation system of the self-impact in a financial market using an artificial market and text-mining technology. Economic trends were first extracted from text data circulating in the real world. Then, the trends were inputted into the market simulation. Our simulation revealed that an operation by intervention could reduce over 70% of rate fluctuation in 1995. By the simulation results, the system was able to help for its user to find the exchange policy which can stabilize the yen-dollar rate.

  5. Integrated Text Mining and Chemoinformatics Analysis Associates Diet to Health Benefit at Molecular Level

    PubMed Central

    Jensen, Kasper; Panagiotou, Gianni; Kouskoumvekaki, Irene

    2014-01-01

    Awareness that disease susceptibility is not only dependent on genetic make up, but can be affected by lifestyle decisions, has brought more attention to the role of diet. However, food is often treated as a black box, or the focus is limited to few, well-studied compounds, such as polyphenols, lipids and nutrients. In this work, we applied text mining and Naïve Bayes classification to assemble the knowledge space of food-phytochemical and food-disease associations, where we distinguish between disease prevention/amelioration and disease progression. We subsequently searched for frequently occurring phytochemical-disease pairs and we identified 20,654 phytochemicals from 16,102 plants associated to 1,592 human disease phenotypes. We selected colon cancer as a case study and analyzed our results in three directions; i) one stop legacy knowledge-shop for the effect of food on disease, ii) discovery of novel bioactive compounds with drug-like properties, and iii) discovery of novel health benefits from foods. This works represents a systematized approach to the association of food with health effect, and provides the phytochemical layer of information for nutritional systems biology research. PMID:24453957

  6. Concept recognition for extracting protein interaction relations from biomedical text

    PubMed Central

    Baumgartner, William A; Lu, Zhiyong; Johnson, Helen L; Caporaso, J Gregory; Paquette, Jesse; Lindemann, Anna; White, Elizabeth K; Medvedeva, Olga; Cohen, K Bretonnel; Hunter, Lawrence

    2008-01-01

    Background: Reliable information extraction applications have been a long sought goal of the biomedical text mining community, a goal that if reached would provide valuable tools to benchside biologists in their increasingly difficult task of assimilating the knowledge contained in the biomedical literature. We present an integrated approach to concept recognition in biomedical text. Concept recognition provides key information that has been largely missing from previous biomedical information extraction efforts, namely direct links to well defined knowledge resources that explicitly cement the concept's semantics. The BioCreative II tasks discussed in this special issue have provided a unique opportunity to demonstrate the effectiveness of concept recognition in the field of biomedical language processing. Results: Through the modular construction of a protein interaction relation extraction system, we present several use cases of concept recognition in biomedical text, and relate these use cases to potential uses by the benchside biologist. Conclusion: Current information extraction technologies are approaching performance standards at which concept recognition can begin to deliver high quality data to the benchside biologist. Our system is available as part of the BioCreative Meta-Server project and on the internet . PMID:18834500

  7. Disambiguating the species of biomedical named entities using natural language parsers

    PubMed Central

    Wang, Xinglong; Tsujii, Jun'ichi; Ananiadou, Sophia

    2010-01-01

    Motivation: Text mining technologies have been shown to reduce the laborious work involved in organizing the vast amount of information hidden in the literature. One challenge in text mining is linking ambiguous word forms to unambiguous biological concepts. This article reports on a comprehensive study on resolving the ambiguity in mentions of biomedical named entities with respect to model organisms and presents an array of approaches, with focus on methods utilizing natural language parsers. Results: We build a corpus for organism disambiguation where every occurrence of protein/gene entity is manually tagged with a species ID, and evaluate a number of methods on it. Promising results are obtained by training a machine learning model on syntactic parse trees, which is then used to decide whether an entity belongs to the model organism denoted by a neighbouring species-indicating word (e.g. yeast). The parser-based approaches are also compared with a supervised classification method and results indicate that the former are a more favorable choice when domain portability is of concern. The best overall performance is obtained by combining the strengths of syntactic features and supervised classification. Availability: The corpus and demo are available at http://www.nactem.ac.uk/deca_details/start.cgi, and the software is freely available as U-Compare components (Kano et al., 2009): NaCTeM Species Word Detector and NaCTeM Species Disambiguator. U-Compare is available at http://-compare.org/ Contact: xinglong.wang@manchester.ac.uk PMID:20053840

  8. BioC implementations in Go, Perl, Python and Ruby

    PubMed Central

    Liu, Wanli; Islamaj Doğan, Rezarta; Kwon, Dongseop; Marques, Hernani; Rinaldi, Fabio; Wilbur, W. John; Comeau, Donald C.

    2014-01-01

    As part of a communitywide effort for evaluating text mining and information extraction systems applied to the biomedical domain, BioC is focused on the goal of interoperability, currently a major barrier to wide-scale adoption of text mining tools. BioC is a simple XML format, specified by DTD, for exchanging data for biomedical natural language processing. With initial implementations in C++ and Java, BioC provides libraries of code for reading and writing BioC text documents and annotations. We extend BioC to Perl, Python, Go and Ruby. We used SWIG to extend the C++ implementation for Perl and one Python implementation. A second Python implementation and the Ruby implementation use native data structures and libraries. BioC is also implemented in the Google language Go. BioC modules are functional in all of these languages, which can facilitate text mining tasks. BioC implementations are freely available through the BioC site: http://bioc.sourceforge.net. Database URL: http://bioc.sourceforge.net/ PMID:24961236

  9. MegaMiner: A Tool for Lead Identification Through Text Mining Using Chemoinformatics Tools and Cloud Computing Environment.

    PubMed

    Karthikeyan, Muthukumarasamy; Pandit, Yogesh; Pandit, Deepak; Vyas, Renu

    2015-01-01

    Virtual screening is an indispensable tool to cope with the massive amount of data being tossed by the high throughput omics technologies. With the objective of enhancing the automation capability of virtual screening process a robust portal termed MegaMiner has been built using the cloud computing platform wherein the user submits a text query and directly accesses the proposed lead molecules along with their drug-like, lead-like and docking scores. Textual chemical structural data representation is fraught with ambiguity in the absence of a global identifier. We have used a combination of statistical models, chemical dictionary and regular expression for building a disease specific dictionary. To demonstrate the effectiveness of this approach, a case study on malaria has been carried out in the present work. MegaMiner offered superior results compared to other text mining search engines, as established by F score analysis. A single query term 'malaria' in the portlet led to retrieval of related PubMed records, protein classes, drug classes and 8000 scaffolds which were internally processed and filtered to suggest new molecules as potential anti-malarials. The results obtained were validated by docking the virtual molecules into relevant protein targets. It is hoped that MegaMiner will serve as an indispensable tool for not only identifying hidden relationships between various biological and chemical entities but also for building better corpus and ontologies.

  10. [Application of text mining approach to pre-education prior to clinical practice].

    PubMed

    Koinuma, Masayoshi; Koike, Katsuya; Nakamura, Hitoshi

    2008-06-01

    We developed a new survey analysis technique to understand students' actual aims for effective pretraining prior to clinical practice. We asked third-year undergraduate students to write fixed-style complete and free sentences on "preparation of drug dispensing." Then, we converted their sentence data in to text style and performed Japanese-language morphologic analysis on the data using language analysis software. We classified key words, which were created on the basis of the word class information of the Japanese language morphologic analysis, into categories based on causes and characteristics. In addition to this, we classified the characteristics into six categories consisting of those concepts including "knowledge," "skill and attitude," "image," etc. with the KJ method technique. The results showed that the awareness of students of "preparation of drug dispensing" tended to be approximately three-fold more frequent in "skill and attitude," "risk," etc. than in "knowledge." Regarding the characteristics in the category of the "image," words like "hard," "challenging," "responsibility," "life," etc. frequently occurred. The results of corresponding analysis showed that the characteristics of the words "knowledge" and "skills and attitude" were independent. As the result of developing a cause-and-effect diagram, it was demonstrated that the phase "hanging tough" described most of the various factors. We thus could understand students' actual feelings by applying text-mining as a new survey analysis technique.

  11. PathText: a text mining integrator for biological pathway visualizations

    PubMed Central

    Kemper, Brian; Matsuzaki, Takuya; Matsuoka, Yukiko; Tsuruoka, Yoshimasa; Kitano, Hiroaki; Ananiadou, Sophia; Tsujii, Jun'ichi

    2010-01-01

    Motivation: Metabolic and signaling pathways are an increasingly important part of organizing knowledge in systems biology. They serve to integrate collective interpretations of facts scattered throughout literature. Biologists construct a pathway by reading a large number of articles and interpreting them as a consistent network, but most of the models constructed currently lack direct links to those articles. Biologists who want to check the original articles have to spend substantial amounts of time to collect relevant articles and identify the sections relevant to the pathway. Furthermore, with the scientific literature expanding by several thousand papers per week, keeping a model relevant requires a continuous curation effort. In this article, we present a system designed to integrate a pathway visualizer, text mining systems and annotation tools into a seamless environment. This will enable biologists to freely move between parts of a pathway and relevant sections of articles, as well as identify relevant papers from large text bases. The system, PathText, is developed by Systems Biology Institute, Okinawa Institute of Science and Technology, National Centre for Text Mining (University of Manchester) and the University of Tokyo, and is being used by groups of biologists from these locations. Contact: brian@monrovian.com. PMID:20529930

  12. miRTex: A Text Mining System for miRNA-Gene Relation Extraction

    PubMed Central

    Li, Gang; Ross, Karen E.; Arighi, Cecilia N.; Peng, Yifan; Wu, Cathy H.; Vijay-Shanker, K.

    2015-01-01

    MicroRNAs (miRNAs) regulate a wide range of cellular and developmental processes through gene expression suppression or mRNA degradation. Experimentally validated miRNA gene targets are often reported in the literature. In this paper, we describe miRTex, a text mining system that extracts miRNA-target relations, as well as miRNA-gene and gene-miRNA regulation relations. The system achieves good precision and recall when evaluated on a literature corpus of 150 abstracts with F-scores close to 0.90 on the three different types of relations. We conducted full-scale text mining using miRTex to process all the Medline abstracts and all the full-length articles in the PubMed Central Open Access Subset. The results for all the Medline abstracts are stored in a database for interactive query and file download via the website at http://proteininformationresource.org/mirtex. Using miRTex, we identified genes potentially regulated by miRNAs in Triple Negative Breast Cancer, as well as miRNA-gene relations that, in conjunction with kinase-substrate relations, regulate the response to abiotic stress in Arabidopsis thaliana. These two use cases demonstrate the usefulness of miRTex text mining in the analysis of miRNA-regulated biological processes. PMID:26407127

  13. LimTox: a web tool for applied text mining of adverse event and toxicity associations of compounds, drugs and genes

    PubMed Central

    Cañada, Andres; Rabal, Obdulia; Oyarzabal, Julen; Valencia, Alfonso

    2017-01-01

    Abstract A considerable effort has been devoted to retrieve systematically information for genes and proteins as well as relationships between them. Despite the importance of chemical compounds and drugs as a central bio-entity in pharmacological and biological research, only a limited number of freely available chemical text-mining/search engine technologies are currently accessible. Here we present LimTox (Literature Mining for Toxicology), a web-based online biomedical search tool with special focus on adverse hepatobiliary reactions. It integrates a range of text mining, named entity recognition and information extraction components. LimTox relies on machine-learning, rule-based, pattern-based and term lookup strategies. This system processes scientific abstracts, a set of full text articles and medical agency assessment reports. Although the main focus of LimTox is on adverse liver events, it enables also basic searches for other organ level toxicity associations (nephrotoxicity, cardiotoxicity, thyrotoxicity and phospholipidosis). This tool supports specialized search queries for: chemical compounds/drugs, genes (with additional emphasis on key enzymes in drug metabolism, namely P450 cytochromes—CYPs) and biochemical liver markers. The LimTox website is free and open to all users and there is no login requirement. LimTox can be accessed at: http://limtox.bioinfo.cnio.es PMID:28531339

  14. The risk of collapse in abandoned mine sites: the issue of data uncertainty

    NASA Astrophysics Data System (ADS)

    Longoni, Laura; Papini, Monica; Brambilla, Davide; Arosio, Diego; Zanzi, Luigi

    2016-04-01

    Ground collapses over abandoned underground mines constitute a new environmental risk in the world. The high risk associated with subsurface voids, together with lack of knowledge of the geometric and geomechanical features of mining areas, makes abandoned underground mines one of the current challenges for countries with a long mining history. In this study, a stability analysis of Montevecchia marl mine is performed in order to validate a general approach that takes into account the poor local information and the variability of the input data. The collapse risk was evaluated through a numerical approach that, starting with some simplifying assumptions, is able to provide an overview of the collapse probability. The final results is an easy-accessible-transparent summary graph that shows the collapse probability. This approach may be useful for public administrators called upon to manage this environmental risk. The approach tries to simplify this complex problem in order to achieve a roughly risk assessment, but, since it relies on just a small amount of information, any final user should be aware that a comprehensive and detailed risk scenario can be generated only through more exhaustive investigations.

  15. From IHE Audit Trails to XES Event Logs Facilitating Process Mining.

    PubMed

    Paster, Ferdinand; Helm, Emmanuel

    2015-01-01

    Recently Business Intelligence approaches like process mining are applied to the healthcare domain. The goal of process mining is to gain process knowledge, compliance and room for improvement by investigating recorded event data. Previous approaches focused on process discovery by event data from various specific systems. IHE, as a globally recognized basis for healthcare information systems, defines in its ATNA profile how real-world events must be recorded in centralized event logs. The following approach presents how audit trails collected by the means of ATNA can be transformed to enable process mining. Using the standardized audit trails provides the ability to apply these methods to all IHE based information systems.

  16. Design risk assessment for burst-prone mines: Application in a Canadian mine

    NASA Astrophysics Data System (ADS)

    Cheung, David J.

    A proactive stance towards improving the effectiveness and consistency of risk assessments has been adopted recently by mining companies and industry. The next 10-20 years forecasts that ore deposits accessible using shallow mining techniques will diminish. The industry continues to strive for success in "deeper" mining projects in order to keep up with the continuing demand for raw materials. Although the returns are quite profitable, many projects have been sidelined due to high uncertainty and technical risk in the mining of the mineral deposit. Several hardrock mines have faced rockbursting and seismicity problems. Within those reported, mines in countries like South Africa, Australia and Canada have documented cases of severe rockburst conditions attributed to the mining depth. Severe rockburst conditions known as "burst-prone" can be effectively managed with design. Adopting a more robust design can ameliorate the exposure of workers and equipment to adverse conditions and minimize the economic consequences, which can hinder the bottom line of an operation. This thesis presents a methodology created for assessing the design risk in burst-prone mines. The methodology includes an evaluation of relative risk ratings for scenarios with options of risk reduction through several design principles. With rockbursts being a hazard of seismic events, the methodology is based on research in the area of mining seismicity factoring in rockmass failure mechanisms, which results from a combination of mining induced stress, geological structures, rockmass properties and mining influences. The methodology was applied to case studies at Craig Mine of Xstrata Nickel in Sudbury, Ontario, which is known to contain seismically active fault zones. A customized risk assessment was created and applied to rockburst case studies, evaluating the seismic vulnerability and consequence for each case. Application of the methodology to Craig Mine demonstrates that changes in the design can reduce both exposure risk (personnel and equipment), and economical risk (revenue and costs). Fatal and catastrophic consequences can be averted through robust planning and design. Two customized approaches were developed to conduct risk assessment of case studies at Craig Mine. Firstly, the Brownfield Approach utilizes the seismic database to determine the seismic hazard from a rating system that evaluates frequency-magnitude, event size, and event-blast relation. Secondly, the Greenfield Approach utilizes the seismic database, focusing on larger magnitude events, rocktype, and geological structure. The customized Greenfield Approach can also be applied in the evaluation of design risk in deep mines with the same setting and condition as Craig Mine. Other mines with different settings and conditions can apply the principles in the methodology to evaluate design alternatives and risk reduction strategies for burst-prone mines.

  17. Text Mining and Natural Language Processing Approaches for Automatic Categorization of Lay Requests to Web-Based Expert Forums

    PubMed Central

    Reincke, Ulrich; Michelmann, Hans Wilhelm

    2009-01-01

    Background Both healthy and sick people increasingly use electronic media to obtain medical information and advice. For example, Internet users may send requests to Web-based expert forums, or so-called “ask the doctor” services. Objective To automatically classify lay requests to an Internet medical expert forum using a combination of different text-mining strategies. Methods We first manually classified a sample of 988 requests directed to a involuntary childlessness forum on the German website “Rund ums Baby” (“Everything about Babies”) into one or more of 38 categories belonging to two dimensions (“subject matter” and “expectations”). After creating start and synonym lists, we calculated the average Cramer’s V statistic for the association of each word with each category. We also used principle component analysis and singular value decomposition as further text-mining strategies. With these measures we trained regression models and determined, on the basis of best regression models, for any request the probability of belonging to each of the 38 different categories, with a cutoff of 50%. Recall and precision of a test sample were calculated as a measure of quality for the automatic classification. Results According to the manual classification of 988 documents, 102 (10%) documents fell into the category “in vitro fertilization (IVF),” 81 (8%) into the category “ovulation,” 79 (8%) into “cycle,” and 57 (6%) into “semen analysis.” These were the four most frequent categories in the subject matter dimension (consisting of 32 categories). The expectation dimension comprised six categories; we classified 533 documents (54%) as “general information” and 351 (36%) as a wish for “treatment recommendations.” The generation of indicator variables based on the chi-square analysis and Cramer’s V proved to be the best approach for automatic classification in about half of the categories. In combination with the two other approaches, 100% precision and 100% recall were realized in 18 (47%) out of the 38 categories in the test sample. For 35 (92%) categories, precision and recall were better than 80%. For some categories, the input variables (ie, “words”) also included variables from other categories, most often with a negative sign. For example, absence of words predictive for “menstruation” was a strong indicator for the category “pregnancy test.” Conclusions Our approach suggests a way of automatically classifying and analyzing unstructured information in Internet expert forums. The technique can perform a preliminary categorization of new requests and help Internet medical experts to better handle the mass of information and to give professional feedback. PMID:19632978

  18. A new pivoting and iterative text detection algorithm for biomedical images.

    PubMed

    Xu, Songhua; Krauthammer, Michael

    2010-12-01

    There is interest to expand the reach of literature mining to include the analysis of biomedical images, which often contain a paper's key findings. Examples include recent studies that use Optical Character Recognition (OCR) to extract image text, which is used to boost biomedical image retrieval and classification. Such studies rely on the robust identification of text elements in biomedical images, which is a non-trivial task. In this work, we introduce a new text detection algorithm for biomedical images based on iterative projection histograms. We study the effectiveness of our algorithm by evaluating the performance on a set of manually labeled random biomedical images, and compare the performance against other state-of-the-art text detection algorithms. We demonstrate that our projection histogram-based text detection approach is well suited for text detection in biomedical images, and that the iterative application of the algorithm boosts performance to an F score of .60. We provide a C++ implementation of our algorithm freely available for academic use. Copyright © 2010 Elsevier Inc. All rights reserved.

  19. Image Information Mining Utilizing Hierarchical Segmentation

    NASA Technical Reports Server (NTRS)

    Tilton, James C.; Marchisio, Giovanni; Koperski, Krzysztof; Datcu, Mihai

    2002-01-01

    The Hierarchical Segmentation (HSEG) algorithm is an approach for producing high quality, hierarchically related image segmentations. The VisiMine image information mining system utilizes clustering and segmentation algorithms for reducing visual information in multispectral images to a manageable size. The project discussed herein seeks to enhance the VisiMine system through incorporating hierarchical segmentations from HSEG into the VisiMine system.

  20. A probabilistic approach for mine burial prediction

    NASA Astrophysics Data System (ADS)

    Barbu, Costin; Valent, Philip; Richardson, Michael; Abelev, Andrei; Plant, Nathaniel

    2004-09-01

    Predicting the degree of burial of mines in soft sediments is one of the main concerns of Naval Mine CounterMeasures (MCM) operations. This is a difficult problem to solve due to uncertainties and variability of the sediment parameters (i.e., density and shear strength) and of the mine state at contact with the seafloor (i.e., vertical and horizontal velocity, angular rotation rate, and pitch angle at the mudline). A stochastic approach is proposed in this paper to better incorporate the dynamic nature of free-falling cylindrical mines in the modeling of impact burial. The orientation, trajectory and velocity of cylindrical mines, after about 4 meters free-fall in the water column, are very strongly influenced by boundary layer effects causing quite chaotic behavior. The model's convolution of the uncertainty through its nonlinearity is addressed by employing Monte Carlo simulations. Finally a risk analysis based on the probability of encountering an undetectable mine is performed.

  1. Flooded Underground Coal Mines: A Significant Source of Inexpensive Geothermal Energy

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Watzlaf, G.R.; Ackman, T.E.

    2007-04-01

    Many mining regions in the United States contain extensive areas of flooded underground mines. The water within these mines represents a significant and widespread opportunity for extracting low-grade, geothermal energy. Based on current energy prices, geothermal heat pump systems using mine water could reduce the annual costs for heating to over 70 percent compared to conventional heating methods (natural gas or heating oil). These same systems could reduce annual cooling costs by up to 50 percent over standard air conditioning in many areas of the country. (Formatted full-text version is released by permission of publisher)

  2. Numerical Modeling Tools for the Prediction of Solution Migration Applicable to Mining Site

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Martell, M.; Vaughn, P.

    1999-01-06

    Mining has always had an important influence on cultures and traditions of communities around the globe and throughout history. Today, because mining legislation places heavy emphasis on environmental protection, there is great interest in having a comprehensive understanding of ancient mining and mining sites. Multi-disciplinary approaches (i.e., Pb isotopes as tracers) are being used to explore the distribution of metals in natural environments. Another successful approach is to model solution migration numerically. A proven method to simulate solution migration in natural rock salt has been applied to project through time for 10,000 years the system performance and solution concentrations surroundingmore » a proposed nuclear waste repository. This capability is readily adaptable to simulate solution migration around mining.« less

  3. Mining Quality Phrases from Massive Text Corpora

    PubMed Central

    Liu, Jialu; Shang, Jingbo; Wang, Chi; Ren, Xiang; Han, Jiawei

    2015-01-01

    Text data are ubiquitous and play an essential role in big data applications. However, text data are mostly unstructured. Transforming unstructured text into structured units (e.g., semantically meaningful phrases) will substantially reduce semantic ambiguity and enhance the power and efficiency at manipulating such data using database technology. Thus mining quality phrases is a critical research problem in the field of databases. In this paper, we propose a new framework that extracts quality phrases from text corpora integrated with phrasal segmentation. The framework requires only limited training but the quality of phrases so generated is close to human judgment. Moreover, the method is scalable: both computation time and required space grow linearly as corpus size increases. Our experiments on large text corpora demonstrate the quality and efficiency of the new method. PMID:26705375

  4. Ensemble Learning Method for Hidden Markov Models

    DTIC Science & Technology

    2014-12-01

    Ensemble HMM landmine detector Mine signatures vary according to the mine type, mine size , and burial depth. Similarly, clutter signatures vary with soil ...approaches for the di erent K groups depending on their size and homogeneity. In particular, we investigate the maximum likelihood (ML), the minimum...propose using and optimizing various training approaches for the different K groups depending on their size and homogeneity. In particular, we

  5. Field-scale study of the influence of differing remediation strategies on trace metal geochemistry in metal mine tailings from the Irish Midlands.

    PubMed

    Perkins, William T; Bird, Graham; Jacobs, Suzanne R; Devoy, Cora

    2016-03-01

    Mine tailings represent a globally significant source of potentially harmful elements (PHEs) to the environment. The management of large volumes of mine tailings represents a major challenge to the mining industry and environmental managers. This field-scale study evaluates the impact of two highly contrasting remediation approaches to the management and stabilisation of mine tailings. The geochemistry of the tailings, overlying amendment layers and vegetation are examined in the light of the different management approaches. Pseudo-total As, Cd and Pb concentrations and solid-state partitioning (speciation), determined via sequential extraction, were established for two Tailings Management Facilities (TMFs) in Ireland subjected to the following: (1) a 'walk-away' approach (Silvermines) and (2) application of an amendment layer (Galmoy). PHE concentrations in roots and herbage of grasses growing on the TMFs were also determined. Results identify very different PHE concentration profiles with depth through the TMFs and the impact of remediation approach on concentrations and their potential bioavailability in the rooting zone of grass species. Data also highlight the importance of choice of grass species in remediation approaches and the benefits of relatively shallow-rooting Agrostis capillaris and Festuca rubra varieties. In addition, data from the Galmoy TMF indicate the importance of regional soil geochemistry for interpreting the influence of the PHE geochemistry of capping and amendment layers applied to mine tailings.

  6. The Spatial Assessment of the Current Seismic Hazard State for Hard Rock Underground Mines

    NASA Astrophysics Data System (ADS)

    Wesseloo, Johan

    2018-06-01

    Mining-induced seismic hazard assessment is an important component in the management of safety and financial risk in mines. As the seismic hazard is a response to the mining activity, it is non-stationary and variable both in space and time. This paper presents an approach for implementing a probabilistic seismic hazard assessment to assess the current hazard state of a mine. Each of the components of the probabilistic seismic hazard assessment is considered within the context of hard rock underground mines. The focus of this paper is the assessment of the in-mine hazard distribution and does not consider the hazard to nearby public or structures. A rating system and methodologies to present hazard maps, for the purpose of communicating to different stakeholders in the mine, i.e. mine managers, technical personnel and the work force, are developed. The approach allows one to update the assessment with relative ease and within short time periods as new data become available, enabling the monitoring of the spatial and temporal change in the seismic hazard.

  7. Systematic drug repositioning through mining adverse event data in ClinicalTrials.gov.

    PubMed

    Su, Eric Wen; Sanger, Todd M

    2017-01-01

    Drug repositioning (i.e., drug repurposing) is the process of discovering new uses for marketed drugs. Historically, such discoveries were serendipitous. However, the rapid growth in electronic clinical data and text mining tools makes it feasible to systematically identify drugs with the potential to be repurposed. Described here is a novel method of drug repositioning by mining ClinicalTrials.gov. The text mining tools I2E (Linguamatics) and PolyAnalyst (Megaputer) were utilized. An I2E query extracts "Serious Adverse Events" (SAE) data from randomized trials in ClinicalTrials.gov. Through a statistical algorithm, a PolyAnalyst workflow ranks the drugs where the treatment arm has fewer predefined SAEs than the control arm, indicating that potentially the drug is reducing the level of SAE. Hypotheses could then be generated for the new use of these drugs based on the predefined SAE that is indicative of disease (for example, cancer).

  8. Exploratory analysis of textual data from the Mother and Child Handbook using the text-mining method: Relationships with maternal traits and post-partum depression.

    PubMed

    Matsuda, Yoshio; Manaka, Tomoko; Kobayashi, Makiko; Sato, Shuhei; Ohwada, Michitaka

    2016-06-01

    The aim of the present study was to examine the possibility of screening apprehensive pregnant women and mothers at risk for post-partum depression from an analysis of the textual data in the Mother and Child Handbook by using the text-mining method. Uncomplicated pregnant women (n = 58) were divided into two groups according to State-Trait Anxiety Inventory grade (high trait [group I, n = 21] and low trait [group II, n = 37]) or Edinburgh Postnatal Depression Scale score (high score [group III, n = 15] and low score [group IV, n = 43]). An exploratory analysis of the textual data from the Maternal and Child Handbook was conducted using the text-mining method with the Word Miner software program. A comparison of the 'structure elements' was made between the two groups. The number of structure elements extracted by separated words from text data was 20 004 and the number of structure elements with a threshold of 2 or more as an initial value was 1168. Fifteen key words related to maternal anxiety, and six key words related to post-partum depression were extracted. The text-mining method is useful for the exploratory analysis of textual data obtained from pregnant woman, and this screening method has been suggested to be useful for apprehensive pregnant women and mothers at risk for post-partum depression. © 2016 Japan Society of Obstetrics and Gynecology.

  9. An Integrated Suite of Text and Data Mining Tools - Phase II

    DTIC Science & Technology

    2005-08-30

    Riverside, CA, USA Mazda Motor Corp, Jpn Univ of Darmstadt, Darmstadt, Ger Navy Center for Applied Research in Artificial Intelligence Univ of...with Georgia Tech Research Corporation developed a desktop text-mining software tool named TechOASIS (known commercially as VantagePoint). By the...of this dataset and groups the Corporate Source items that co-occur with the found items. He decides he is only interested in the institutions

  10. Approaches to Post-Mining Land Reclamation in Polish Open-Cast Lignite Mining

    NASA Astrophysics Data System (ADS)

    Kasztelewicz, Zbigniew

    2014-06-01

    The paper presents the situation regarding the reclamation of post-mining land in the case of particular lignite mines in Poland until 2012 against the background of the whole opencast mining. It discusses the process of land purchase for mining operations and its sales after reclamation. It presents the achievements of mines in the reclamation and regeneration of post-mining land as a result of which-after development processes carried out according to European standards-it now serves the inhabitants as a recreational area that increases the attractiveness of the regions.

  11. THE FUTURE OF COMPUTER-BASED TOXICITY PREDICTION: MECHANISM-BASED MODELS VS. INFORMATION MINING APPROACHES

    EPA Science Inventory


    The Future of Computer-Based Toxicity Prediction:
    Mechanism-Based Models vs. Information Mining Approaches

    When we speak of computer-based toxicity prediction, we are generally referring to a broad array of approaches which rely primarily upon chemical structure ...

  12. The structure and infrastructure of the global nanotechnology literature

    NASA Astrophysics Data System (ADS)

    Kostoff, Ronald N.; Stump, Jesse A.; Johnson, Dustin; Murday, James S.; Lau, Clifford G. Y.; Tolles, William M.

    2006-08-01

    Text mining is the extraction of useful information from large volumes of text. A text mining analysis of the global open nanotechnology literature was performed. Records from the Science Citation Index (SCI)/Social SCI were analyzed to provide the infrastructure of the global nanotechnology literature (prolific authors/journals/institutions/countries, most cited authors/papers/journals) and the thematic structure (taxonomy) of the global nanotechnology literature, from a science perspective. Records from the Engineering Compendex (EC) were analyzed to provide a taxonomy from a technology perspective. The Far Eastern countries have expanded nanotechnology publication output dramatically in the past decade.

  13. PubMed-EX: a web browser extension to enhance PubMed search with text mining features.

    PubMed

    Tsai, Richard Tzong-Han; Dai, Hong-Jie; Lai, Po-Ting; Huang, Chi-Hsin

    2009-11-15

    PubMed-EX is a browser extension that marks up PubMed search results with additional text-mining information. PubMed-EX's page mark-up, which includes section categorization and gene/disease and relation mark-up, can help researchers to quickly focus on key terms and provide additional information on them. All text processing is performed server-side, freeing up user resources. PubMed-EX is freely available at http://bws.iis.sinica.edu.tw/PubMed-EX and http://iisr.cse.yzu.edu.tw:8000/PubMed-EX/.

  14. Compositional mining of multiple object API protocols through state abstraction.

    PubMed

    Dai, Ziying; Mao, Xiaoguang; Lei, Yan; Qi, Yuhua; Wang, Rui; Gu, Bin

    2013-01-01

    API protocols specify correct sequences of method invocations. Despite their usefulness, API protocols are often unavailable in practice because writing them is cumbersome and error prone. Multiple object API protocols are more expressive than single object API protocols. However, the huge number of objects of typical object-oriented programs poses a major challenge to the automatic mining of multiple object API protocols: besides maintaining scalability, it is important to capture various object interactions. Current approaches utilize various heuristics to focus on small sets of methods. In this paper, we present a general, scalable, multiple object API protocols mining approach that can capture all object interactions. Our approach uses abstract field values to label object states during the mining process. We first mine single object typestates as finite state automata whose transitions are annotated with states of interacting objects before and after the execution of the corresponding method and then construct multiple object API protocols by composing these annotated single object typestates. We implement our approach for Java and evaluate it through a series of experiments.

  15. Compositional Mining of Multiple Object API Protocols through State Abstraction

    PubMed Central

    Mao, Xiaoguang; Qi, Yuhua; Wang, Rui; Gu, Bin

    2013-01-01

    API protocols specify correct sequences of method invocations. Despite their usefulness, API protocols are often unavailable in practice because writing them is cumbersome and error prone. Multiple object API protocols are more expressive than single object API protocols. However, the huge number of objects of typical object-oriented programs poses a major challenge to the automatic mining of multiple object API protocols: besides maintaining scalability, it is important to capture various object interactions. Current approaches utilize various heuristics to focus on small sets of methods. In this paper, we present a general, scalable, multiple object API protocols mining approach that can capture all object interactions. Our approach uses abstract field values to label object states during the mining process. We first mine single object typestates as finite state automata whose transitions are annotated with states of interacting objects before and after the execution of the corresponding method and then construct multiple object API protocols by composing these annotated single object typestates. We implement our approach for Java and evaluate it through a series of experiments. PMID:23844378

  16. Linguistic feature analysis for protein interaction extraction

    PubMed Central

    2009-01-01

    Background The rapid growth of the amount of publicly available reports on biomedical experimental results has recently caused a boost of text mining approaches for protein interaction extraction. Most approaches rely implicitly or explicitly on linguistic, i.e., lexical and syntactic, data extracted from text. However, only few attempts have been made to evaluate the contribution of the different feature types. In this work, we contribute to this evaluation by studying the relative importance of deep syntactic features, i.e., grammatical relations, shallow syntactic features (part-of-speech information) and lexical features. For this purpose, we use a recently proposed approach that uses support vector machines with structured kernels. Results Our results reveal that the contribution of the different feature types varies for the different data sets on which the experiments were conducted. The smaller the training corpus compared to the test data, the more important the role of grammatical relations becomes. Moreover, deep syntactic information based classifiers prove to be more robust on heterogeneous texts where no or only limited common vocabulary is shared. Conclusion Our findings suggest that grammatical relations play an important role in the interaction extraction task. Moreover, the net advantage of adding lexical and shallow syntactic features is small related to the number of added features. This implies that efficient classifiers can be built by using only a small fraction of the features that are typically being used in recent approaches. PMID:19909518

  17. Learning in the context of distribution drift

    DTIC Science & Technology

    2017-05-09

    published in the leading data mining journal, Data Mining and Knowledge Discovery (Webb et. al., 2016)1. We have shown that the previous qualitative...learner Low-bias learner Aggregated classifier Figure 7: Architecture for learning fr m streaming data in th co text of variable or unknown...Learning limited dependence Bayesian classifiers, in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD

  18. Enhancements for a Dynamic Data Warehousing and Mining System for Large-scale HSCB Data

    DTIC Science & Technology

    2016-07-20

    Intelligent Automation Incorporated Enhancements for a Dynamic Data Warehousing and Mining ...Page | 2 Intelligent Automation Incorporated Monthly Report No. 4 Enhancements for a Dynamic Data Warehousing and Mining System Large-Scale HSCB...including Top Videos, Top Users, Top Words, and Top Languages, and also applied NER to the text associated with YouTube posts. We have also developed UI for

  19. Enhancements for a Dynamic Data Warehousing and Mining System for Large-Scale HSCB Data

    DTIC Science & Technology

    2016-07-20

    Intelligent Automation Incorporated Enhancements for a Dynamic Data Warehousing and Mining ...Page | 2 Intelligent Automation Incorporated Monthly Report No. 4 Enhancements for a Dynamic Data Warehousing and Mining System Large-Scale HSCB...including Top Videos, Top Users, Top Words, and Top Languages, and also applied NER to the text associated with YouTube posts. We have also developed UI for

  20. APPLYING DATA MINING APPROACHES TO FURTHER ...

    EPA Pesticide Factsheets

    This dataset will be used to illustrate various data mining techniques to biologically profile the chemical space. This dataset will be used to illustrate various data mining techniques to biologically profile the chemical space.

  1. Comparisons and Selections of Features and Classifiers for Short Text Classification

    NASA Astrophysics Data System (ADS)

    Wang, Ye; Zhou, Zhi; Jin, Shan; Liu, Debin; Lu, Mi

    2017-10-01

    Short text is considerably different from traditional long text documents due to its shortness and conciseness, which somehow hinders the applications of conventional machine learning and data mining algorithms in short text classification. According to traditional artificial intelligence methods, we divide short text classification into three steps, namely preprocessing, feature selection and classifier comparison. In this paper, we have illustrated step-by-step how we approach our goals. Specifically, in feature selection, we compared the performance and robustness of the four methods of one-hot encoding, tf-idf weighting, word2vec and paragraph2vec, and in the classification part, we deliberately chose and compared Naive Bayes, Logistic Regression, Support Vector Machine, K-nearest Neighbor and Decision Tree as our classifiers. Then, we compared and analysed the classifiers horizontally with each other and vertically with feature selections. Regarding the datasets, we crawled more than 400,000 short text files from Shanghai and Shenzhen Stock Exchanges and manually labeled them into two classes, the big and the small. There are eight labels in the big class, and 59 labels in the small class.

  2. Empirical advances with text mining of electronic health records.

    PubMed

    Delespierre, T; Denormandie, P; Bar-Hen, A; Josseran, L

    2017-08-22

    Korian is a private group specializing in medical accommodations for elderly and dependent people. A professional data warehouse (DWH) established in 2010 hosts all of the residents' data. Inside this information system (IS), clinical narratives (CNs) were used only by medical staff as a residents' care linking tool. The objective of this study was to show that, through qualitative and quantitative textual analysis of a relatively small physiotherapy and well-defined CN sample, it was possible to build a physiotherapy corpus and, through this process, generate a new body of knowledge by adding relevant information to describe the residents' care and lives. Meaningful words were extracted through Standard Query Language (SQL) with the LIKE function and wildcards to perform pattern matching, followed by text mining and a word cloud using R® packages. Another step involved principal components and multiple correspondence analyses, plus clustering on the same residents' sample as well as on other health data using a health model measuring the residents' care level needs. By combining these techniques, physiotherapy treatments could be characterized by a list of constructed keywords, and the residents' health characteristics were built. Feeding defects or health outlier groups could be detected, physiotherapy residents' data and their health data were matched, and differences in health situations showed qualitative and quantitative differences in physiotherapy narratives. This textual experiment using a textual process in two stages showed that text mining and data mining techniques provide convenient tools to improve residents' health and quality of care by adding new, simple, useable data to the electronic health record (EHR). When used with a normalized physiotherapy problem list, text mining through information extraction (IE), named entity recognition (NER) and data mining (DM) can provide a real advantage to describe health care, adding new medical material and helping to integrate the EHR system into the health staff work environment.

  3. Database citation in full text biomedical articles.

    PubMed

    Kafkas, Şenay; Kim, Jee-Hyub; McEntyre, Johanna R

    2013-01-01

    Molecular biology and literature databases represent essential infrastructure for life science research. Effective integration of these data resources requires that there are structured cross-references at the level of individual articles and biological records. Here, we describe the current patterns of how database entries are cited in research articles, based on analysis of the full text Open Access articles available from Europe PMC. Focusing on citation of entries in the European Nucleotide Archive (ENA), UniProt and Protein Data Bank, Europe (PDBe), we demonstrate that text mining doubles the number of structured annotations of database record citations supplied in journal articles by publishers. Many thousands of new literature-database relationships are found by text mining, since these relationships are also not present in the set of articles cited by database records. We recommend that structured annotation of database records in articles is extended to other databases, such as ArrayExpress and Pfam, entries from which are also cited widely in the literature. The very high precision and high-throughput of this text-mining pipeline makes this activity possible both accurately and at low cost, which will allow the development of new integrated data services.

  4. Database Citation in Full Text Biomedical Articles

    PubMed Central

    Kafkas, Şenay; Kim, Jee-Hyub; McEntyre, Johanna R.

    2013-01-01

    Molecular biology and literature databases represent essential infrastructure for life science research. Effective integration of these data resources requires that there are structured cross-references at the level of individual articles and biological records. Here, we describe the current patterns of how database entries are cited in research articles, based on analysis of the full text Open Access articles available from Europe PMC. Focusing on citation of entries in the European Nucleotide Archive (ENA), UniProt and Protein Data Bank, Europe (PDBe), we demonstrate that text mining doubles the number of structured annotations of database record citations supplied in journal articles by publishers. Many thousands of new literature-database relationships are found by text mining, since these relationships are also not present in the set of articles cited by database records. We recommend that structured annotation of database records in articles is extended to other databases, such as ArrayExpress and Pfam, entries from which are also cited widely in the literature. The very high precision and high-throughput of this text-mining pipeline makes this activity possible both accurately and at low cost, which will allow the development of new integrated data services. PMID:23734176

  5. What Online Communities Can Tell Us About Electronic Cigarettes and Hookah Use: A Study Using Text Mining and Visualization Techniques.

    PubMed

    Chen, Annie T; Zhu, Shu-Hong; Conway, Mike

    2015-09-29

    The rise in popularity of electronic cigarettes (e-cigarettes) and hookah over recent years has been accompanied by some confusion and uncertainty regarding the development of an appropriate regulatory response towards these emerging products. Mining online discussion content can lead to insights into people's experiences, which can in turn further our knowledge of how to address potential health implications. In this work, we take a novel approach to understanding the use and appeal of these emerging products by applying text mining techniques to compare consumer experiences across discussion forums. This study examined content from the websites Vapor Talk, Hookah Forum, and Reddit to understand people's experiences with different tobacco products. Our investigation involves three parts. First, we identified contextual factors that inform our understanding of tobacco use behaviors, such as setting, time, social relationships, and sensory experience, and compared the forums to identify the ones where content on these factors is most common. Second, we compared how the tobacco use experience differs with combustible cigarettes and e-cigarettes. Third, we investigated differences between e-cigarette and hookah use. In the first part of our study, we employed a lexicon-based extraction approach to estimate prevalence of contextual factors, and then we generated a heat map based on these estimates to compare the forums. In the second and third parts of the study, we employed a text mining technique called topic modeling to identify important topics and then developed a visualization, Topic Bars, to compare topic coverage across forums. In the first part of the study, we identified two forums, Vapor Talk Health & Safety and the Stopsmoking subreddit, where discussion concerning contextual factors was particularly common. The second part showed that the discussion in Vapor Talk Health & Safety focused on symptoms and comparisons of combustible cigarettes and e-cigarettes, and the Stopsmoking subreddit focused on psychological aspects of quitting. Last, we examined the discussion content on Vapor Talk and Hookah Forum. Prominent topics included equipment, technique, experiential elements of use, and the buying and selling of equipment. This study has three main contributions. Discussion forums differ in the extent to which their content may help us understand behaviors with potential health implications. Identifying dimensions of interest and using a heat map visualization to compare across forums can be helpful for identifying forums with the greatest density of health information. Additionally, our work has shown that the quitting experience can potentially be very different depending on whether or not e-cigarettes are used. Finally, e-cigarette and hookah forums are similar in that members represent a "hobbyist culture" that actively engages in information exchange. These differences have important implications for both tobacco regulation and smoking cessation intervention design.

  6. HPIminer: A text mining system for building and visualizing human protein interaction networks and pathways.

    PubMed

    Subramani, Suresh; Kalpana, Raja; Monickaraj, Pankaj Moses; Natarajan, Jeyakumar

    2015-04-01

    The knowledge on protein-protein interactions (PPI) and their related pathways are equally important to understand the biological functions of the living cell. Such information on human proteins is highly desirable to understand the mechanism of several diseases such as cancer, diabetes, and Alzheimer's disease. Because much of that information is buried in biomedical literature, an automated text mining system for visualizing human PPI and pathways is highly desirable. In this paper, we present HPIminer, a text mining system for visualizing human protein interactions and pathways from biomedical literature. HPIminer extracts human PPI information and PPI pairs from biomedical literature, and visualize their associated interactions, networks and pathways using two curated databases HPRD and KEGG. To our knowledge, HPIminer is the first system to build interaction networks from literature as well as curated databases. Further, the new interactions mined only from literature and not reported earlier in databases are highlighted as new. A comparative study with other similar tools shows that the resultant network is more informative and provides additional information on interacting proteins and their associated networks. Copyright © 2015 Elsevier Inc. All rights reserved.

  7. Total Factor Productivity Growth, Technical Progress & Efficiency Change in Vietnam Coal Industry - Nonparametric Approach

    NASA Astrophysics Data System (ADS)

    Phuong, Vu Hung

    2018-03-01

    This research applies Data Envelopment Analysis (DEA) approach to analyze Total Factor Productivity (TFP) and efficiency changes in Vietnam coal mining industry from 2007 to 2013. The TFP of Vietnam coal mining companies decreased due to slow technological progress and unimproved efficiency. The decadence of technical efficiency in many enterprises proved that the coal mining industry has a large potential to increase productivity through technical efficiency improvement. Enhancing human resource training, technology and research & development investment could help the industry to improve efficiency and productivity in Vietnam coal mining industry.

  8. Nano-enabled drug delivery systems for brain cancer and Alzheimer's disease: research patterns and opportunities.

    PubMed

    Ma, Jing; Porter, Alan L; Aminabhavi, Tejraj M; Zhu, Donghua

    2015-10-01

    "Tech mining" applies bibliometric and text analytic methods to scientific literature of a target field. In this study, we compare the evolution of nano-enabled drug delivery (NEDD) systems for two different applications - viz., brain cancer (BC) and Alzheimer's disease (AD) - using this approach. In this process, we derive research intelligence from papers indexed in MEDLINE. Review by domain specialists helps understand the macro-level disease problems and pathologies to identify commonalities and differences between BC and AD. Results provide a fresh perspective on the developmental pathways for NEDD approaches that have been used in the treatment of BC and AD. Results also point toward finding future solutions to drug delivery issues that are critical to medical practitioners and pharmaceutical scientists addressing the brain. Drug delivery to brain cells has been very challenging due to the presence of the blood-brain barrier (BBB). Suitable and effective nano-enabled drug delivery (NEDD) system is urgently needed. In this study, the authors utilized "tech-mining" tools to describe and compare various choices of delivery system available for the diagnosis, as well as treatment, of brain cancer and Alzheimer's disease. Copyright © 2015 Elsevier Inc. All rights reserved.

  9. Terminologies for text-mining; an experiment in the lipoprotein metabolism domain

    PubMed Central

    Alexopoulou, Dimitra; Wächter, Thomas; Pickersgill, Laura; Eyre, Cecilia; Schroeder, Michael

    2008-01-01

    Background The engineering of ontologies, especially with a view to a text-mining use, is still a new research field. There does not yet exist a well-defined theory and technology for ontology construction. Many of the ontology design steps remain manual and are based on personal experience and intuition. However, there exist a few efforts on automatic construction of ontologies in the form of extracted lists of terms and relations between them. Results We share experience acquired during the manual development of a lipoprotein metabolism ontology (LMO) to be used for text-mining. We compare the manually created ontology terms with the automatically derived terminology from four different automatic term recognition (ATR) methods. The top 50 predicted terms contain up to 89% relevant terms. For the top 1000 terms the best method still generates 51% relevant terms. In a corpus of 3066 documents 53% of LMO terms are contained and 38% can be generated with one of the methods. Conclusions Given high precision, automatic methods can help decrease development time and provide significant support for the identification of domain-specific vocabulary. The coverage of the domain vocabulary depends strongly on the underlying documents. Ontology development for text mining should be performed in a semi-automatic way; taking ATR results as input and following the guidelines we described. Availability The TFIDF term recognition is available as Web Service, described at PMID:18460175

  10. LimTox: a web tool for applied text mining of adverse event and toxicity associations of compounds, drugs and genes.

    PubMed

    Cañada, Andres; Capella-Gutierrez, Salvador; Rabal, Obdulia; Oyarzabal, Julen; Valencia, Alfonso; Krallinger, Martin

    2017-07-03

    A considerable effort has been devoted to retrieve systematically information for genes and proteins as well as relationships between them. Despite the importance of chemical compounds and drugs as a central bio-entity in pharmacological and biological research, only a limited number of freely available chemical text-mining/search engine technologies are currently accessible. Here we present LimTox (Literature Mining for Toxicology), a web-based online biomedical search tool with special focus on adverse hepatobiliary reactions. It integrates a range of text mining, named entity recognition and information extraction components. LimTox relies on machine-learning, rule-based, pattern-based and term lookup strategies. This system processes scientific abstracts, a set of full text articles and medical agency assessment reports. Although the main focus of LimTox is on adverse liver events, it enables also basic searches for other organ level toxicity associations (nephrotoxicity, cardiotoxicity, thyrotoxicity and phospholipidosis). This tool supports specialized search queries for: chemical compounds/drugs, genes (with additional emphasis on key enzymes in drug metabolism, namely P450 cytochromes-CYPs) and biochemical liver markers. The LimTox website is free and open to all users and there is no login requirement. LimTox can be accessed at: http://limtox.bioinfo.cnio.es. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  11. Text Mining for Drugs and Chemical Compounds: Methods, Tools and Applications.

    PubMed

    Vazquez, Miguel; Krallinger, Martin; Leitner, Florian; Valencia, Alfonso

    2011-06-01

    Providing prior knowledge about biological properties of chemicals, such as kinetic values, protein targets, or toxic effects, can facilitate many aspects of drug development. Chemical information is rapidly accumulating in all sorts of free text documents like patents, industry reports, or scientific articles, which has motivated the development of specifically tailored text mining applications. Despite the potential gains, chemical text mining still faces significant challenges. One of the most salient is the recognition of chemical entities mentioned in text. To help practitioners contribute to this area, a good portion of this review is devoted to this issue, and presents the basic concepts and principles underlying the main strategies. The technical details are introduced and accompanied by relevant bibliographic references. Other tasks discussed are retrieving relevant articles, identifying relationships between chemicals and other entities, or determining the chemical structures of chemicals mentioned in text. This review also introduces a number of published applications that can be used to build pipelines in topics like drug side effects, toxicity, and protein-disease-compound network analysis. We conclude the review with an outlook on how we expect the field to evolve, discussing its possibilities and its current limitations. Copyright © 2011 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  12. Mining free-text medical records for companion animal enteric syndrome surveillance.

    PubMed

    Anholt, R M; Berezowski, J; Jamal, I; Ribble, C; Stephen, C

    2014-03-01

    Large amounts of animal health care data are present in veterinary electronic medical records (EMR) and they present an opportunity for companion animal disease surveillance. Veterinary patient records are largely in free-text without clinical coding or fixed vocabulary. Text-mining, a computer and information technology application, is needed to identify cases of interest and to add structure to the otherwise unstructured data. In this study EMR's were extracted from veterinary management programs of 12 participating veterinary practices and stored in a data warehouse. Using commercially available text-mining software (WordStat™), we developed a categorization dictionary that could be used to automatically classify and extract enteric syndrome cases from the warehoused electronic medical records. The diagnostic accuracy of the text-miner for retrieving cases of enteric syndrome was measured against human reviewers who independently categorized a random sample of 2500 cases as enteric syndrome positive or negative. Compared to the reviewers, the text-miner retrieved cases with enteric signs with a sensitivity of 87.6% (95%CI, 80.4-92.9%) and a specificity of 99.3% (95%CI, 98.9-99.6%). Automatic and accurate detection of enteric syndrome cases provides an opportunity for community surveillance of enteric pathogens in companion animals. Copyright © 2014 Elsevier B.V. All rights reserved.

  13. Argo: an integrative, interactive, text mining-based workbench supporting curation

    PubMed Central

    Rak, Rafal; Rowley, Andrew; Black, William; Ananiadou, Sophia

    2012-01-01

    Curation of biomedical literature is often supported by the automatic analysis of textual content that generally involves a sequence of individual processing components. Text mining (TM) has been used to enhance the process of manual biocuration, but has been focused on specific databases and tasks rather than an environment integrating TM tools into the curation pipeline, catering for a variety of tasks, types of information and applications. Processing components usually come from different sources and often lack interoperability. The well established Unstructured Information Management Architecture is a framework that addresses interoperability by defining common data structures and interfaces. However, most of the efforts are targeted towards software developers and are not suitable for curators, or are otherwise inconvenient to use on a higher level of abstraction. To overcome these issues we introduce Argo, an interoperable, integrative, interactive and collaborative system for text analysis with a convenient graphic user interface to ease the development of processing workflows and boost productivity in labour-intensive manual curation. Robust, scalable text analytics follow a modular approach, adopting component modules for distinct levels of text analysis. The user interface is available entirely through a web browser that saves the user from going through often complicated and platform-dependent installation procedures. Argo comes with a predefined set of processing components commonly used in text analysis, while giving the users the ability to deposit their own components. The system accommodates various areas and levels of user expertise, from TM and computational linguistics to ontology-based curation. One of the key functionalities of Argo is its ability to seamlessly incorporate user-interactive components, such as manual annotation editors, into otherwise completely automatic pipelines. As a use case, we demonstrate the functionality of an in-built manual annotation editor that is well suited for in-text corpus annotation tasks. Database URL: http://www.nactem.ac.uk/Argo PMID:22434844

  14. Data mining in radiology

    PubMed Central

    Kharat, Amit T; Singh, Amarjit; Kulkarni, Vilas M; Shah, Digish

    2014-01-01

    Data mining facilitates the study of radiology data in various dimensions. It converts large patient image and text datasets into useful information that helps in improving patient care and provides informative reports. Data mining technology analyzes data within the Radiology Information System and Hospital Information System using specialized software which assesses relationships and agreement in available information. By using similar data analysis tools, radiologists can make informed decisions and predict the future outcome of a particular imaging finding. Data, information and knowledge are the components of data mining. Classes, Clusters, Associations, Sequential patterns, Classification, Prediction and Decision tree are the various types of data mining. Data mining has the potential to make delivery of health care affordable and ensure that the best imaging practices are followed. It is a tool for academic research. Data mining is considered to be ethically neutral, however concerns regarding privacy and legality exists which need to be addressed to ensure success of data mining. PMID:25024513

  15. Analyzing asset management data using data and text mining.

    DOT National Transportation Integrated Search

    2014-07-01

    Predictive models using text from a sample competitively bid California highway projects have been used to predict a construction : projects likely level of cost overrun. A text description of the project and the text of the five largest project line...

  16. BioC implementations in Go, Perl, Python and Ruby.

    PubMed

    Liu, Wanli; Islamaj Doğan, Rezarta; Kwon, Dongseop; Marques, Hernani; Rinaldi, Fabio; Wilbur, W John; Comeau, Donald C

    2014-01-01

    As part of a communitywide effort for evaluating text mining and information extraction systems applied to the biomedical domain, BioC is focused on the goal of interoperability, currently a major barrier to wide-scale adoption of text mining tools. BioC is a simple XML format, specified by DTD, for exchanging data for biomedical natural language processing. With initial implementations in C++ and Java, BioC provides libraries of code for reading and writing BioC text documents and annotations. We extend BioC to Perl, Python, Go and Ruby. We used SWIG to extend the C++ implementation for Perl and one Python implementation. A second Python implementation and the Ruby implementation use native data structures and libraries. BioC is also implemented in the Google language Go. BioC modules are functional in all of these languages, which can facilitate text mining tasks. BioC implementations are freely available through the BioC site: http://bioc.sourceforge.net. Database URL: http://bioc.sourceforge.net/ Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US.

  17. Textpresso site-specific recombinases: A text-mining server for the recombinase literature including Cre mice and conditional alleles.

    PubMed

    Urbanski, William M; Condie, Brian G

    2009-12-01

    Textpresso Site Specific Recombinases (http://ssrc.genetics.uga.edu/) is a text-mining web server for searching a database of more than 9,000 full-text publications. The papers and abstracts in this database represent a wide range of topics related to site-specific recombinase (SSR) research tools. Included in the database are most of the papers that report the characterization or use of mouse strains that express Cre recombinase as well as papers that describe or analyze mouse lines that carry conditional (floxed) alleles or SSR-activated transgenes/knockins. The database also includes reports describing SSR-based cloning methods such as the Gateway or the Creator systems, papers reporting the development or use of SSR-based tools in systems such as Drosophila, bacteria, parasites, stem cells, yeast, plants, zebrafish, and Xenopus as well as publications that describe the biochemistry, genetics, or molecular structure of the SSRs themselves. Textpresso Site Specific Recombinases is the only comprehensive text-mining resource available for the literature describing the biology and technical applications of SSRs. (c) 2009 Wiley-Liss, Inc.

  18. Text mining for metabolic pathways, signaling cascades, and protein networks.

    PubMed

    Hoffmann, Robert; Krallinger, Martin; Andres, Eduardo; Tamames, Javier; Blaschke, Christian; Valencia, Alfonso

    2005-05-10

    The complexity of the information stored in databases and publications on metabolic and signaling pathways, the high throughput of experimental data, and the growing number of publications make it imperative to provide systems to help the researcher navigate through these interrelated information resources. Text-mining methods have started to play a key role in the creation and maintenance of links between the information stored in biological databases and its original sources in the literature. These links will be extremely useful for database updating and curation, especially if a number of technical problems can be solved satisfactorily, including the identification of protein and gene names (entities in general) and the characterization of their types of interactions. The first generation of openly accessible text-mining systems, such as iHOP (Information Hyperlinked over Proteins), provides additional functions to facilitate the reconstruction of protein interaction networks, combine database and text information, and support the scientist in the formulation of novel hypotheses. The next challenge is the generation of comprehensive information regarding the general function of signaling pathways and protein interaction networks.

  19. TOY SAFETY SURVEILLANCE FROM ONLINE REVIEWS

    PubMed Central

    Winkler, Matt; Abrahams, Alan S.; Gruss, Richard; Ehsani, Johnathan P.

    2016-01-01

    Toy-related injuries account for a significant number of childhood injuries and the prevention of these injuries remains a goal for regulatory agencies and manufacturers. Text-mining is an increasingly prevalent method for uncovering the significance of words using big data. This research sets out to determine the effectiveness of text-mining in uncovering potentially dangerous children’s toys. We develop a danger word list, also known as a ‘smoke word’ list, from injury and recall text narratives. We then use the smoke word lists to score over one million Amazon reviews, with the top scores denoting potential safety concerns. We compare the smoke word list to conventional sentiment analysis techniques, in terms of both word overlap and effectiveness. We find that smoke word lists are highly distinct from conventional sentiment dictionaries and provide a statistically significant method for identifying safety concerns in children’s toy reviews. Our findings indicate that text-mining is, in fact, an effective method for the surveillance of safety concerns in children’s toys and could be a gateway to effective prevention of toy-product-related injuries. PMID:27942092

  20. Cataloging the biomedical world of pain through semi-automated curation of molecular interactions

    PubMed Central

    Jamieson, Daniel G.; Roberts, Phoebe M.; Robertson, David L.; Sidders, Ben; Nenadic, Goran

    2013-01-01

    The vast collection of biomedical literature and its continued expansion has presented a number of challenges to researchers who require structured findings to stay abreast of and analyze molecular mechanisms relevant to their domain of interest. By structuring literature content into topic-specific machine-readable databases, the aggregate data from multiple articles can be used to infer trends that can be compared and contrasted with similar findings from topic-independent resources. Our study presents a generalized procedure for semi-automatically creating a custom topic-specific molecular interaction database through the use of text mining to assist manual curation. We apply the procedure to capture molecular events that underlie ‘pain’, a complex phenomenon with a large societal burden and unmet medical need. We describe how existing text mining solutions are used to build a pain-specific corpus, extract molecular events from it, add context to the extracted events and assess their relevance. The pain-specific corpus contains 765 692 documents from Medline and PubMed Central, from which we extracted 356 499 unique normalized molecular events, with 261 438 single protein events and 93 271 molecular interactions supplied by BioContext. Event chains are annotated with negation, speculation, anatomy, Gene Ontology terms, mutations, pain and disease relevance, which collectively provide detailed insight into how that event chain is associated with pain. The extracted relations are visualized in a wiki platform (wiki-pain.org) that enables efficient manual curation and exploration of the molecular mechanisms that underlie pain. Curation of 1500 grouped event chains ranked by pain relevance revealed 613 accurately extracted unique molecular interactions that in the future can be used to study the underlying mechanisms involved in pain. Our approach demonstrates that combining existing text mining tools with domain-specific terms and wiki-based visualization can facilitate rapid curation of molecular interactions to create a custom database. Database URL: ••• PMID:23707966

  1. Cataloging the biomedical world of pain through semi-automated curation of molecular interactions.

    PubMed

    Jamieson, Daniel G; Roberts, Phoebe M; Robertson, David L; Sidders, Ben; Nenadic, Goran

    2013-01-01

    The vast collection of biomedical literature and its continued expansion has presented a number of challenges to researchers who require structured findings to stay abreast of and analyze molecular mechanisms relevant to their domain of interest. By structuring literature content into topic-specific machine-readable databases, the aggregate data from multiple articles can be used to infer trends that can be compared and contrasted with similar findings from topic-independent resources. Our study presents a generalized procedure for semi-automatically creating a custom topic-specific molecular interaction database through the use of text mining to assist manual curation. We apply the procedure to capture molecular events that underlie 'pain', a complex phenomenon with a large societal burden and unmet medical need. We describe how existing text mining solutions are used to build a pain-specific corpus, extract molecular events from it, add context to the extracted events and assess their relevance. The pain-specific corpus contains 765 692 documents from Medline and PubMed Central, from which we extracted 356 499 unique normalized molecular events, with 261 438 single protein events and 93 271 molecular interactions supplied by BioContext. Event chains are annotated with negation, speculation, anatomy, Gene Ontology terms, mutations, pain and disease relevance, which collectively provide detailed insight into how that event chain is associated with pain. The extracted relations are visualized in a wiki platform (wiki-pain.org) that enables efficient manual curation and exploration of the molecular mechanisms that underlie pain. Curation of 1500 grouped event chains ranked by pain relevance revealed 613 accurately extracted unique molecular interactions that in the future can be used to study the underlying mechanisms involved in pain. Our approach demonstrates that combining existing text mining tools with domain-specific terms and wiki-based visualization can facilitate rapid curation of molecular interactions to create a custom database. Database URL: •••

  2. Writers Identification Based on Multiple Windows Features Mining

    NASA Astrophysics Data System (ADS)

    Fadhil, Murad Saadi; Alkawaz, Mohammed Hazim; Rehman, Amjad; Saba, Tanzila

    2016-03-01

    Now a days, writer identification is at high demand to identify the original writer of the script at high accuracy. The one of the main challenge in writer identification is how to extract the discriminative features of different authors' scripts to classify precisely. In this paper, the adaptive division method on the offline Latin script has been implemented using several variant window sizes. Fragments of binarized text a set of features are extracted and classified into clusters in the form of groups or classes. Finally, the proposed approach in this paper has been tested on various parameters in terms of text division and window sizes. It is observed that selection of the right window size yields a well positioned window division. The proposed approach is tested on IAM standard dataset (IAM, Institut für Informatik und angewandte Mathematik, University of Bern, Bern, Switzerland) that is a constraint free script database. Finally, achieved results are compared with several techniques reported in the literature.

  3. Theoretical approaches to creation of robotic coal mines based on the synthesis of simulation technologies

    NASA Astrophysics Data System (ADS)

    Fryanov, V. N.; Pavlova, L. D.; Temlyantsev, M. V.

    2017-09-01

    Methodological approaches to theoretical substantiation of the structure and parameters of robotic coal mines are outlined. The results of mathematical and numerical modeling revealed the features of manifestation of geomechanical and gas dynamic processes in the conditions of robotic mines. Technological solutions for the design and manufacture of technical means for robotic mine are adopted using the method of economic and mathematical modeling and in accordance with the current regulatory documents. For a comparative performance evaluation of technological schemes of traditional and robotic mines, methods of cognitive modeling and matrix search for subsystem elements in the synthesis of a complex geotechnological system are applied. It is substantiated that the process of technical re-equipment of a traditional mine with a phased transition to a robotic mine will reduce unit costs by almost 1.5 times with a significant social effect due to a reduction in the number of personnel engaged in hazardous work.

  4. Environmental considerations related to mining of nonfuel minerals

    USGS Publications Warehouse

    Seal, Robert R.; Piatak, Nadine M.; Kimball, Bryn E.; Hammarstrom, Jane M.; Schulz, Klaus J.; DeYoung,, John H.; Seal, Robert R.; Bradley, Dwight C.

    2017-12-19

    Throughout most of human history, environmental stewardship during mining has not been a priority partly because of the lack of applicable laws and regulations and partly because of ignorance about the effects that mining can have on the environment. In the United States, the National Environmental Policy Act of 1969, in conjunction with related laws, codified a more modern approach to mining, including the responsibility for environmental stewardship, and provided a framework for incorporating environmental protection into mine planning. Today, similar frameworks are in place in the other developed countries of the world, and international mining companies generally follow similar procedures wherever they work in the world. The regulatory guidance has fostered an international effort among all stakeholders to identify best practices for environmental stewardship.The modern approach to mining using best practices involves the following: (a) establishment of a pre-mining baseline from which to monitor environmental effects during mining and help establish geologically reasonable closure goals; (b) identification of environmental risks related to mining through standardized approaches; and (c) formulation of an environmental closure plan before the start of mining. A key aspect of identifying the environmental risks and mitigating those risks is understanding how the risks vary from one deposit type to another—a concept that forms the basis for geoenvironmental mineral-deposit models.Accompanying the quest for best practices is the goal of making mining sustainable into the future. Sustainable mine development is generally considered to be development that meets the needs of the present generation without compromising the ability of future generations to meet their own needs. The concept extends beyond the availability of nonrenewable mineral commodities and includes the environmental and social effects of mine development.Global population growth, meanwhile, has decreased the percentage of inhabitable land available to support society’s material needs. Presently, the land area available to supply the mineral resources, energy resources, water, food, shelter, and waste disposal needs of all Earth’s inhabitants is estimated to be 135 square meters per person. Continued global population growth will only increase the challenges of sustainable mining.Current trends in mining are also expected to lead to new environmental challenges in the future, among which are mine-waste management issues related to mining larger deposits for lower ore grade; water-management issues related to both the mining of larger deposits and the changes in precipitation brought about by climate change; and greenhouse gas issues related to reducing the carbon footprint of larger, more energy-intensive mining operations.

  5. Development and application of the Safe Performance Index as a risk-based methodology for identifying major hazard-related safety issues in underground coal mines

    NASA Astrophysics Data System (ADS)

    Kinilakodi, Harisha

    The underground coal mining industry has been under constant watch due to the high risk involved in its activities, and scrutiny increased because of the disasters that occurred in 2006-07. In the aftermath of the incidents, the U.S. Congress passed the Mine Improvement and New Emergency Response Act of 2006 (MINER Act), which strengthened the existing regulations and mandated new laws to address the various issues related to a safe working environment in the mines. Risk analysis in any form should be done on a regular basis to tackle the possibility of unwanted major hazard-related events such as explosions, outbursts, airbursts, inundations, spontaneous combustion, and roof fall instabilities. One of the responses by the Mine Safety and Health Administration (MSHA) in 2007 involved a new pattern of violations (POV) process to target mines with a poor safety performance, specifically to improve their safety. However, the 2010 disaster (worst in 40 years) gave an impression that the collective effort of the industry, federal/state agencies, and researchers to achieve the goal of zero fatalities and serious injuries has gone awry. The Safe Performance Index (SPI) methodology developed in this research is a straight-forward, effective, transparent, and reproducible approach that can help in identifying and addressing some of the existing issues while targeting (poor safety performance) mines which need help. It combines three injury and three citation measures that are scaled to have an equal mean (5.0) in a balanced way with proportionate weighting factors (0.05, 0.15, 0.30) and overall normalizing factor (15) into a mine safety performance evaluation tool. It can be used to assess the relative safety-related risk of mines, including by mine-size category. Using 2008 and 2009 data, comparisons were made of SPI-associated, normalized safety performance measures across mine-size categories, with emphasis on small-mine safety performance as compared to large- and medium-sized mines. The accident rates (NDL IR, NFDL IR, SM/100) of very small and small mines in 2008 and 2009 were less than those of medium and large mines. The data indicates a heavy occurrence of very severe injuries in a number of very small and small mines. In another application which is a part of this research, the six normalized safety measures and the SPI are used to evaluate the risk that existed at mines in the two years preceding the occurrence of a fatality. This mine safety performance tracking method could have been helpful to the companies, state agency, or MSHA in recognizing and addressing emerging problems with actions that may have been able to prevent high-risk conditions, the fatality, and/or other serious injuries. The approach would have given scrutiny to the risk of mines that encompassed 74% of the fatalities during 2007-2010. In order to assess the SPI as a comparable risk measurement tool, a traditional risk approach is also developed using data embracing frequency and severity in the final equation to analyze the relative risk for all underground coal mines for the years 2007--2010. Then, the SPI is compared with this traditional risk analysis method to demonstrate that the results attained by either method provide the relative safety-related risk of underground coal mines regarding injuries and citations for violations of regulations. The comparison reveals that the SPI does emulate a traditional approach to risk analysis. A correlation coefficient of --0.89 or more was observed between the results of these two methodologies and either can be used to assist companies, the Mine Safety and Health Administration (MSHA), or state agencies in target-ing mines with high risk for serious injuries and elevated citations for remediation of their injury and/or violation experience. The SPI, however, provides a more understandable approach for mine operators to apply using measures compatible with MSHA's enforcement tools. These methodologies form an all-encompassing approach that can be used to assist companies, the MSHA, or state agencies in targeting mines with high risk for serious injuries and elevated citations. Once targeted as high risk, mines can then pursue appropriate intervention to remediate their violation and/or injury experience. This research may help in plugging the gap in the safety system and better pursue the goal of zero fatalities and serious injuries in the underground coal mines.

  6. ASCOT: a text mining-based web-service for efficient search and assisted creation of clinical trials

    PubMed Central

    2012-01-01

    Clinical trials are mandatory protocols describing medical research on humans and among the most valuable sources of medical practice evidence. Searching for trials relevant to some query is laborious due to the immense number of existing protocols. Apart from search, writing new trials includes composing detailed eligibility criteria, which might be time-consuming, especially for new researchers. In this paper we present ASCOT, an efficient search application customised for clinical trials. ASCOT uses text mining and data mining methods to enrich clinical trials with metadata, that in turn serve as effective tools to narrow down search. In addition, ASCOT integrates a component for recommending eligibility criteria based on a set of selected protocols. PMID:22595088

  7. ASCOT: a text mining-based web-service for efficient search and assisted creation of clinical trials.

    PubMed

    Korkontzelos, Ioannis; Mu, Tingting; Ananiadou, Sophia

    2012-04-30

    Clinical trials are mandatory protocols describing medical research on humans and among the most valuable sources of medical practice evidence. Searching for trials relevant to some query is laborious due to the immense number of existing protocols. Apart from search, writing new trials includes composing detailed eligibility criteria, which might be time-consuming, especially for new researchers. In this paper we present ASCOT, an efficient search application customised for clinical trials. ASCOT uses text mining and data mining methods to enrich clinical trials with metadata, that in turn serve as effective tools to narrow down search. In addition, ASCOT integrates a component for recommending eligibility criteria based on a set of selected protocols.

  8. Studies on medicinal herbs for cognitive enhancement based on the text mining of Dongeuibogam and preliminary evaluation of its effects.

    PubMed

    Pak, Malk Eun; Kim, Yu Ri; Kim, Ha Neui; Ahn, Sung Min; Shin, Hwa Kyoung; Baek, Jin Ung; Choi, Byung Tae

    2016-02-17

    In literature on Korean medicine, Dongeuibogam (Treasured Mirror of Eastern Medicine), published in 1613, represents the overall results of the traditional medicines of North-East Asia based on prior medicinal literature of this region. We utilized this medicinal literature by text mining to establish a list of candidate herbs for cognitive enhancement in the elderly and then performed an evaluation of their effects. Text mining was performed for selection of candidate herbs. Cell viability was determined in HT22 hippocampal cells and immunohistochemistry and behavioral analysis was performed in a kainic acid (KA) mice model in order to observe alterations of hippocampal cells and cognition. Twenty four herbs for cognitive enhancement in the elderly were selected by text mining of Dongeuibogam. In HT22 cells, pretreatment with 3 candidate herbs resulted in significantly reduced glutamate-induced cell death. Panax ginseng was the most neuroprotective herb against glutamate-induced cell death. In the hippocampus of a KA mice model, pretreatment with 11 candidate herbs resulted in suppression of caspase-3 expression. Treatment with 7 candidate herbs resulted in significantly enhanced expression levels of phosphorylated cAMP response element binding protein. Number of proliferated cells indicated by BrdU labeling was increased by treatment with 10 candidate herbs. Schisandra chinensis was the most effective herb against cell death and proliferation of progenitor cells and Rehmannia glutinosa in neuroprotection in the hippocampus of a KA mice model. In a KA mice model, we confirmed improved spatial and short memory by treatment with the 3 most effective candidate herbs and these recovered functions were involved in a higher number of newly formed neurons from progenitor cells in the hippocampus. These established herbs and their combinations identified by text-mining technique and evaluation for effectiveness may have value in further experimental and clinical applications for cognitive enhancement in the elderly. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.

  9. Mining the Text: 34 Text Features that Can Ease or Obstruct Text Comprehension and Use

    ERIC Educational Resources Information Center

    White, Sheida

    2012-01-01

    This article presents 34 characteristics of texts and tasks ("text features") that can make continuous (prose), noncontinuous (document), and quantitative texts easier or more difficult for adolescents and adults to comprehend and use. The text features were identified by examining the assessment tasks and associated texts in the national…

  10. Application of data mining approaches to drug delivery.

    PubMed

    Ekins, Sean; Shimada, Jun; Chang, Cheng

    2006-11-30

    Computational approaches play a key role in all areas of the pharmaceutical industry from data mining, experimental and clinical data capture to pharmacoeconomics and adverse events monitoring. They will likely continue to be indispensable assets along with a growing library of software applications. This is primarily due to the increasingly massive amount of biology, chemistry and clinical data, which is now entering the public domain mainly as a result of NIH and commercially funded projects. We are therefore in need of new methods for mining this mountain of data in order to enable new hypothesis generation. The computational approaches include, but are not limited to, database compilation, quantitative structure activity relationships (QSAR), pharmacophores, network visualization models, decision trees, machine learning algorithms and multidimensional data visualization software that could be used to improve drug delivery after mining public and/or proprietary data. We will discuss some areas of unmet needs in the area of data mining for drug delivery that can be addressed with new software tools or databases of relevance to future pharmaceutical projects.

  11. Dietary patterns analysis using data mining method. An application to data from the CYKIDS study.

    PubMed

    Lazarou, Chrystalleni; Karaolis, Minas; Matalas, Antonia-Leda; Panagiotakos, Demosthenes B

    2012-11-01

    Data mining is a computational method that permits the extraction of patterns from large databases. We applied the data mining approach in data from 1140 children (9-13 years), in order to derive dietary habits related to children's obesity status. Rules emerged via data mining approach revealed the detrimental influence of the increased consumption of soft dinks, delicatessen meat, sweets, fried and junk food. For example, frequent (3-5 times/week) consumption of all these foods increases the risk for being obese by 75%, whereas in children who have a similar dietary pattern, but eat >2 times/week fish and seafood the risk for obesity is reduced by 33%. In conclusion patterns revealed from data mining technique refer to specific groups of children and demonstrate the effect on the risk associated with obesity status when a single dietary habit might be modified. Thus, a more individualized approach when translating public health messages could be achieved. Copyright © 2011 Elsevier Ireland Ltd. All rights reserved.

  12. A review on computational systems biology of pathogen–host interactions

    PubMed Central

    Durmuş, Saliha; Çakır, Tunahan; Özgür, Arzucan; Guthke, Reinhard

    2015-01-01

    Pathogens manipulate the cellular mechanisms of host organisms via pathogen–host interactions (PHIs) in order to take advantage of the capabilities of host cells, leading to infections. The crucial role of these interspecies molecular interactions in initiating and sustaining infections necessitates a thorough understanding of the corresponding mechanisms. Unlike the traditional approach of considering the host or pathogen separately, a systems-level approach, considering the PHI system as a whole is indispensable to elucidate the mechanisms of infection. Following the technological advances in the post-genomic era, PHI data have been produced in large-scale within the last decade. Systems biology-based methods for the inference and analysis of PHI regulatory, metabolic, and protein–protein networks to shed light on infection mechanisms are gaining increasing demand thanks to the availability of omics data. The knowledge derived from the PHIs may largely contribute to the identification of new and more efficient therapeutics to prevent or cure infections. There are recent efforts for the detailed documentation of these experimentally verified PHI data through Web-based databases. Despite these advances in data archiving, there are still large amounts of PHI data in the biomedical literature yet to be discovered, and novel text mining methods are in development to unearth such hidden data. Here, we review a collection of recent studies on computational systems biology of PHIs with a special focus on the methods for the inference and analysis of PHI networks, covering also the Web-based databases and text-mining efforts to unravel the data hidden in the literature. PMID:25914674

  13. 75 FR 8316 - Office of Postsecondary Education; Overview Information; Erma Byrd Scholarship Program; Notice...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-02-24

    ... Transmittal of Applications: March 26, 2010. Full Text of Announcement I. Funding Opportunity Description... related to industrial health and safety: Mining and mineral engineering, industrial engineering... technology/technician, hazardous materials information systems technology/technician, mining technology...

  14. Regional and temporal variability of the isotope composition (O, S) of atmospheric sulphate in the region of Freiberg, Germany, and consequences for dissolved sulphate in groundwater and river water.

    PubMed

    Tichomirowa, Marion; Heidel, Claudia

    2012-01-01

    The isotope composition of dissolved sulphate and strontium in atmospheric deposition, groundwater, mine water and river water in the region of Freiberg was investigated to better understand the fate of these components in the regional and global water cycle. Most of the isotope variations of dissolved sulphates in atmospheric deposition from three locations sampled bi- or tri-monthly can be explained by fractionation processes leading to lower [Formula: see text] (of about 2-3‰) and higher [Formula: see text] (of about 8-10‰) values in summer compared with the winter period. These samples showed a negative correlation between [Formula: see text] and [Formula: see text] values and a weak positive correlation between [Formula: see text] and [Formula: see text] values. They reflect the sulphate formed by aqueous oxidation from long-range transport in clouds. However, these isotope variations were superimposed by changes of the dominating atmospheric sulphate source. At two of the sampling points, large variations of mean annual [Formula: see text] values from atmospheric bulk deposition were recorded. From 2008 to 2009, the mean annual [Formula: see text] value increased by about 5‰; and decreased by about 4‰ from 2009 to 2010. A change in the dominating sulphate source or oxidation pathways of SO(2) in the atmosphere is proposed to cause these shifts. No changes were found in corresponding [Formula: see text] values. Groundwater, river water and some mine waters (where groundwater was the dominating sulphate source) also showed temporal shifts in their [Formula: see text] values corresponding to those of bulk atmospheric deposition, albeit to a lower degree. The mean transit time of atmospheric sulphur through the soil into the groundwater and river water was less than a year and therefore much shorter than previously suggested. Mining activities of about 800 years in the Freiberg region may have led to large subsurface areas with an enhanced groundwater flow along fractures and mined-refilled ore lodes which may shorten transit times of sulphate from precipitation through groundwater into river water.

  15. GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains

    PubMed Central

    Lu, Zhiyong

    2015-01-01

    The automatic recognition of gene names and their associated database identifiers from biomedical text has been widely studied in recent years, as these tasks play an important role in many downstream text-mining applications. Despite significant previous research, only a small number of tools are publicly available and these tools are typically restricted to detecting only mention level gene names or only document level gene identifiers. In this work, we report GNormPlus: an end-to-end and open source system that handles both gene mention and identifier detection. We created a new corpus of 694 PubMed articles to support our development of GNormPlus, containing manual annotations for not only gene names and their identifiers, but also closely related concepts useful for gene name disambiguation, such as gene families and protein domains. GNormPlus integrates several advanced text-mining techniques, including SimConcept for resolving composite gene names. As a result, GNormPlus compares favorably to other state-of-the-art methods when evaluated on two widely used public benchmarking datasets, achieving 86.7% F1-score on the BioCreative II Gene Normalization task dataset and 50.1% F1-score on the BioCreative III Gene Normalization task dataset. The GNormPlus source code and its annotated corpus are freely available, and the results of applying GNormPlus to the entire PubMed are freely accessible through our web-based tool PubTator. PMID:26380306

  16. A simulation-based approach for estimating premining water quality: Red Mountain Creek, Colorado

    USGS Publications Warehouse

    Runkel, Robert L.; Kimball, Briant A; Walton-Day, Katherine; Verplanck, Philip L.

    2007-01-01

    Regulatory agencies are often charged with the task of setting site-specific numeric water quality standards for impaired streams. This task is particularly difficult for streams draining highly mineralized watersheds with past mining activity. Baseline water quality data obtained prior to mining are often non-existent and application of generic water quality standards developed for unmineralized watersheds is suspect given the geology of most watersheds affected by mining. Various approaches have been used to estimate premining conditions, but none of the existing approaches rigorously consider the physical and geochemical processes that ultimately determine instream water quality. An approach based on simulation modeling is therefore proposed herein. The approach utilizes synoptic data that provide spatially-detailed profiles of concentration, streamflow, and constituent load along the study reach. This field data set is used to calibrate a reactive stream transport model that considers the suite of physical and geochemical processes that affect constituent concentrations during instream transport. A key input to the model is the quality and quantity of waters entering the study reach. This input is based on chemical analyses available from synoptic sampling and observed increases in streamflow along the study reach. Given the calibrated model, additional simulations are conducted to estimate premining conditions. In these simulations, the chemistry of mining-affected sources is replaced with the chemistry of waters that are thought to be unaffected by mining (proximal, premining analogues). The resultant simulations provide estimates of premining water quality that reflect both the reduced loads that were present prior to mining and the processes that affect these loads as they are transported downstream. This simulation-based approach is demonstrated using data from Red Mountain Creek, Colorado, a small stream draining a heavily-mined watershed. Model application to the premining problem for Red Mountain Creek is based on limited field reconnaissance and chemical analyses; additional field work and analyses may be needed to develop definitive, quantitative estimates of premining water quality.

  17. A novel web informatics approach for automated surveillance of cancer mortality trends✩

    PubMed Central

    Tourassi, Georgia; Yoon, Hong-Jun; Xu, Songhua

    2016-01-01

    Cancer surveillance data are collected every year in the United States via the National Program of Cancer Registries (NPCR) and the Surveillance, Epidemiology and End Results (SEER) Program of the National Cancer Institute (NCI). General trends are closely monitored to measure the nation's progress against cancer. The objective of this study was to apply a novel web informatics approach for enabling fully automated monitoring of cancer mortality trends. The approach involves automated collection and text mining of online obituaries to derive the age distribution, geospatial, and temporal trends of cancer deaths in the US. Using breast and lung cancer as examples, we mined 23,850 cancer-related and 413,024 general online obituaries spanning the timeframe 2008–2012. There was high correlation between the web-derived mortality trends and the official surveillance statistics reported by NCI with respect to the age distribution (ρ = 0.981 for breast; ρ = 0.994 for lung), the geospatial distribution (ρ = 0.939 for breast; ρ = 0.881 for lung), and the annual rates of cancer deaths (ρ = 0.661 for breast; ρ = 0.839 for lung). Additional experiments investigated the effect of sample size on the consistency of the web-based findings. Overall, our study findings support web informatics as a promising, cost-effective way to dynamically monitor spatiotemporal cancer mortality trends. PMID:27044930

  18. A novel web informatics approach for automated surveillance of cancer mortality trends

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tourassi, Georgia; Yoon, Hong -Jun; Xu, Songhua

    Cancer surveillance data are collected every year in the United States via the National Program of Cancer Registries (NPCR) and the Surveillance, Epidemiology and End Results (SEER) Program of the National Cancer Institute (NCI). General trends are closely monitored to measure the nation’s progress against cancer. The objective of this study was to apply a novel web informatics approach for enabling fully automated monitoring of cancer mortality trends. The approach involves automated collection and text mining of online obituaries to derive the age distribution, geospatial, and temporal trends of cancer deaths in the US. Using breast and lung cancer asmore » examples, we mined 23,850 cancer-related and 413,024 general online obituaries spanning the timeframe 2008–2012. There was high correlation between the web-derived mortality trends and the official surveillance statistics reported by NCI with respect to the age distribution (ρ = 0.981 for breast; ρ = 0.994 for lung), the geospatial distribution (ρ = 0.939 for breast; ρ = 0.881 for lung), and the annual rates of cancer deaths (ρ = 0.661 for breast; ρ = 0.839 for lung). Additional experiments investigated the effect of sample size on the consistency of the web-based findings. Altogether, our study findings support web informatics as a promising, cost-effective way to dynamically monitor spatiotemporal cancer mortality trends.« less

  19. A novel web informatics approach for automated surveillance of cancer mortality trends

    DOE PAGES

    Tourassi, Georgia; Yoon, Hong -Jun; Xu, Songhua

    2016-04-01

    Cancer surveillance data are collected every year in the United States via the National Program of Cancer Registries (NPCR) and the Surveillance, Epidemiology and End Results (SEER) Program of the National Cancer Institute (NCI). General trends are closely monitored to measure the nation’s progress against cancer. The objective of this study was to apply a novel web informatics approach for enabling fully automated monitoring of cancer mortality trends. The approach involves automated collection and text mining of online obituaries to derive the age distribution, geospatial, and temporal trends of cancer deaths in the US. Using breast and lung cancer asmore » examples, we mined 23,850 cancer-related and 413,024 general online obituaries spanning the timeframe 2008–2012. There was high correlation between the web-derived mortality trends and the official surveillance statistics reported by NCI with respect to the age distribution (ρ = 0.981 for breast; ρ = 0.994 for lung), the geospatial distribution (ρ = 0.939 for breast; ρ = 0.881 for lung), and the annual rates of cancer deaths (ρ = 0.661 for breast; ρ = 0.839 for lung). Additional experiments investigated the effect of sample size on the consistency of the web-based findings. Altogether, our study findings support web informatics as a promising, cost-effective way to dynamically monitor spatiotemporal cancer mortality trends.« less

  20. Data Visualization in Information Retrieval and Data Mining (SIG VIS).

    ERIC Educational Resources Information Center

    Efthimiadis, Efthimis

    2000-01-01

    Presents abstracts that discuss using data visualization for information retrieval and data mining, including immersive information space and spatial metaphors; spatial data using multi-dimensional matrices with maps; TREC (Text Retrieval Conference) experiments; users' information needs in cartographic information retrieval; and users' relevance…

  1. Macromolecule mass spectrometry: citation mining of user documents.

    PubMed

    Kostoff, Ronald N; Bedford, Clifford D; del Río, J Antonio; Cortes, Héctor D; Karypis, George

    2004-03-01

    Identifying research users, applications, and impact is important for research performers, managers, evaluators, and sponsors. Identification of the user audience and the research impact is complex and time consuming due to the many indirect pathways through which fundamental research can impact applications. This paper identified the literature pathways through which two highly-cited papers of 2002 Chemistry Nobel Laureates Fenn and Tanaka impacted research, technology development, and applications. Citation Mining, an integration of citation bibliometrics and text mining, was applied to the >1600 first generation Science Citation Index (SCI) citing papers to Fenn's 1989 Science paper on Electrospray Ionization for Mass Spectrometry, and to the >400 first generation SCI citing papers to Tanaka's 1988 Rapid Communications in Mass Spectrometry paper on Laser Ionization Time-of-Flight Mass Spectrometry. Bibliometrics was performed on the citing papers to profile the user characteristics. Text mining was performed on the citing papers to identify the technical areas impacted by the research, and the relationships among these technical areas.

  2. Enhancements for a Dynamic Data Warehousing and Mining System for Large-Scale Human Social Cultural Behavioral (HSBC) Data

    DTIC Science & Technology

    2016-09-26

    Intelligent Automation Incorporated Enhancements for a Dynamic Data Warehousing and Mining ...Enhancements for a Dynamic Data Warehousing and Mining System for N00014-16-P-3014 Large-Scale Human Social Cultural Behavioral (HSBC) Data 5b. GRANT NUMBER...Representative Media Gallery View. We perform Scraawl’s NER algorithm to the text associated with YouTube post, which classifies the named entities into

  3. Detecting Malicious Tweets in Twitter Using Runtime Monitoring With Hidden Information

    DTIC Science & Technology

    2016-06-01

    text mining using Twitter streaming API and python [Online]. Available: http://adilmoujahid.com/posts/2014/07/twitter-analytics/ [22] M. Singh, B...sites with 645,750,000 registered users [3] and has open source public tweets for data mining . 2. Malicious Users and Tweets In the modern world...want to data mine in Twitter, and presents the natural language assertions and corresponding rule patterns. It then describes the steps performed using

  4. Numerical linear algebra in data mining

    NASA Astrophysics Data System (ADS)

    Eldén, Lars

    Ideas and algorithms from numerical linear algebra are important in several areas of data mining. We give an overview of linear algebra methods in text mining (information retrieval), pattern recognition (classification of handwritten digits), and PageRank computations for web search engines. The emphasis is on rank reduction as a method of extracting information from a data matrix, low-rank approximation of matrices using the singular value decomposition and clustering, and on eigenvalue methods for network analysis.

  5. Identifying Understudied Nuclear Reactions by Text-mining the EXFOR Experimental Nuclear Reaction Library

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hirdt, J.A.; Brown, D.A., E-mail: dbrown@bnl.gov

    The EXFOR library contains the largest collection of experimental nuclear reaction data available as well as the data's bibliographic information and experimental details. We text-mined the REACTION and MONITOR fields of the ENTRYs in the EXFOR library in order to identify understudied reactions and quantities. Using the results of the text-mining, we created an undirected graph from the EXFOR datasets with each graph node representing a single reaction and quantity and graph links representing the various types of connections between these reactions and quantities. This graph is an abstract representation of the connections in EXFOR, similar to graphs of socialmore » networks, authorship networks, etc. We use various graph theoretical tools to identify important yet understudied reactions and quantities in EXFOR. Although we identified a few cross sections relevant for shielding applications and isotope production, mostly we identified charged particle fluence monitor cross sections. As a side effect of this work, we learn that our abstract graph is typical of other real-world graphs.« less

  6. Knowledge acquisition, semantic text mining, and security risks in health and biomedical informatics

    PubMed Central

    Huang, Jingshan; Dou, Dejing; Dang, Jiangbo; Pardue, J Harold; Qin, Xiao; Huan, Jun; Gerthoffer, William T; Tan, Ming

    2012-01-01

    Computational techniques have been adopted in medical and biological systems for a long time. There is no doubt that the development and application of computational methods will render great help in better understanding biomedical and biological functions. Large amounts of datasets have been produced by biomedical and biological experiments and simulations. In order for researchers to gain knowledge from original data, nontrivial transformation is necessary, which is regarded as a critical link in the chain of knowledge acquisition, sharing, and reuse. Challenges that have been encountered include: how to efficiently and effectively represent human knowledge in formal computing models, how to take advantage of semantic text mining techniques rather than traditional syntactic text mining, and how to handle security issues during the knowledge sharing and reuse. This paper summarizes the state-of-the-art in these research directions. We aim to provide readers with an introduction of major computing themes to be applied to the medical and biological research. PMID:22371823

  7. Identifying Understudied Nuclear Reactions by Text-mining the EXFOR Experimental Nuclear Reaction Library

    NASA Astrophysics Data System (ADS)

    Hirdt, J. A.; Brown, D. A.

    2016-01-01

    The EXFOR library contains the largest collection of experimental nuclear reaction data available as well as the data's bibliographic information and experimental details. We text-mined the REACTION and MONITOR fields of the ENTRYs in the EXFOR library in order to identify understudied reactions and quantities. Using the results of the text-mining, we created an undirected graph from the EXFOR datasets with each graph node representing a single reaction and quantity and graph links representing the various types of connections between these reactions and quantities. This graph is an abstract representation of the connections in EXFOR, similar to graphs of social networks, authorship networks, etc. We use various graph theoretical tools to identify important yet understudied reactions and quantities in EXFOR. Although we identified a few cross sections relevant for shielding applications and isotope production, mostly we identified charged particle fluence monitor cross sections. As a side effect of this work, we learn that our abstract graph is typical of other real-world graphs.

  8. DeepText2GO: Improving large-scale protein function prediction with deep semantic text representation.

    PubMed

    You, Ronghui; Huang, Xiaodi; Zhu, Shanfeng

    2018-06-06

    As of April 2018, UniProtKB has collected more than 115 million protein sequences. Less than 0.15% of these proteins, however, have been associated with experimental GO annotations. As such, the use of automatic protein function prediction (AFP) to reduce this huge gap becomes increasingly important. The previous studies conclude that sequence homology based methods are highly effective in AFP. In addition, mining motif, domain, and functional information from protein sequences has been found very helpful for AFP. Other than sequences, alternative information sources such as text, however, may be useful for AFP as well. Instead of using BOW (bag of words) representation in traditional text-based AFP, we propose a new method called DeepText2GO that relies on deep semantic text representation, together with different kinds of available protein information such as sequence homology, families, domains, and motifs, to improve large-scale AFP. Furthermore, DeepText2GO integrates text-based methods with sequence-based ones by means of a consensus approach. Extensive experiments on the benchmark dataset extracted from UniProt/SwissProt have demonstrated that DeepText2GO significantly outperformed both text-based and sequence-based methods, validating its superiority. Copyright © 2018 Elsevier Inc. All rights reserved.

  9. Large-Scale Event Extraction from Literature with Multi-Level Gene Normalization

    PubMed Central

    Wei, Chih-Hsuan; Hakala, Kai; Pyysalo, Sampo; Ananiadou, Sophia; Kao, Hung-Yu; Lu, Zhiyong; Salakoski, Tapio; Van de Peer, Yves; Ginter, Filip

    2013-01-01

    Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique gene and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (http://www.evexdb.org/api/v001/). Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from http://evexdb.org/download/, under the Creative Commons – Attribution – Share Alike (CC BY-SA) license. PMID:23613707

  10. Text mining facilitates database curation - extraction of mutation-disease associations from Bio-medical literature.

    PubMed

    Ravikumar, Komandur Elayavilli; Wagholikar, Kavishwar B; Li, Dingcheng; Kocher, Jean-Pierre; Liu, Hongfang

    2015-06-06

    Advances in the next generation sequencing technology has accelerated the pace of individualized medicine (IM), which aims to incorporate genetic/genomic information into medicine. One immediate need in interpreting sequencing data is the assembly of information about genetic variants and their corresponding associations with other entities (e.g., diseases or medications). Even with dedicated effort to capture such information in biological databases, much of this information remains 'locked' in the unstructured text of biomedical publications. There is a substantial lag between the publication and the subsequent abstraction of such information into databases. Multiple text mining systems have been developed, but most of them focus on the sentence level association extraction with performance evaluation based on gold standard text annotations specifically prepared for text mining systems. We developed and evaluated a text mining system, MutD, which extracts protein mutation-disease associations from MEDLINE abstracts by incorporating discourse level analysis, using a benchmark data set extracted from curated database records. MutD achieves an F-measure of 64.3% for reconstructing protein mutation disease associations in curated database records. Discourse level analysis component of MutD contributed to a gain of more than 10% in F-measure when compared against the sentence level association extraction. Our error analysis indicates that 23 of the 64 precision errors are true associations that were not captured by database curators and 68 of the 113 recall errors are caused by the absence of associated disease entities in the abstract. After adjusting for the defects in the curated database, the revised F-measure of MutD in association detection reaches 81.5%. Our quantitative analysis reveals that MutD can effectively extract protein mutation disease associations when benchmarking based on curated database records. The analysis also demonstrates that incorporating discourse level analysis significantly improved the performance of extracting the protein-mutation-disease association. Future work includes the extension of MutD for full text articles.

  11. Combined mining: discovering informative knowledge in complex data.

    PubMed

    Cao, Longbing; Zhang, Huaifeng; Zhao, Yanchang; Luo, Dan; Zhang, Chengqi

    2011-06-01

    Enterprise data mining applications often involve complex data such as multiple large heterogeneous data sources, user preferences, and business impact. In such situations, a single method or one-step mining is often limited in discovering informative knowledge. It would also be very time and space consuming, if not impossible, to join relevant large data sources for mining patterns consisting of multiple aspects of information. It is crucial to develop effective approaches for mining patterns combining necessary information from multiple relevant business lines, catering for real business settings and decision-making actions rather than just providing a single line of patterns. The recent years have seen increasing efforts on mining more informative patterns, e.g., integrating frequent pattern mining with classifications to generate frequent pattern-based classifiers. Rather than presenting a specific algorithm, this paper builds on our existing works and proposes combined mining as a general approach to mining for informative patterns combining components from either multiple data sets or multiple features or by multiple methods on demand. We summarize general frameworks, paradigms, and basic processes for multifeature combined mining, multisource combined mining, and multimethod combined mining. Novel types of combined patterns, such as incremental cluster patterns, can result from such frameworks, which cannot be directly produced by the existing methods. A set of real-world case studies has been conducted to test the frameworks, with some of them briefed in this paper. They identify combined patterns for informing government debt prevention and improving government service objectives, which show the flexibility and instantiation capability of combined mining in discovering informative knowledge in complex data.

  12. A semantic model for multimodal data mining in healthcare information systems.

    PubMed

    Iakovidis, Dimitris; Smailis, Christos

    2012-01-01

    Electronic health records (EHRs) are representative examples of multimodal/multisource data collections; including measurements, images and free texts. The diversity of such information sources and the increasing amounts of medical data produced by healthcare institutes annually, pose significant challenges in data mining. In this paper we present a novel semantic model that describes knowledge extracted from the lowest-level of a data mining process, where information is represented by multiple features i.e. measurements or numerical descriptors extracted from measurements, images, texts or other medical data, forming multidimensional feature spaces. Knowledge collected by manual annotation or extracted by unsupervised data mining from one or more feature spaces is modeled through generalized qualitative spatial semantics. This model enables a unified representation of knowledge across multimodal data repositories. It contributes to bridging the semantic gap, by enabling direct links between low-level features and higher-level concepts e.g. describing body parts, anatomies and pathological findings. The proposed model has been developed in web ontology language based on description logics (OWL-DL) and can be applied to a variety of data mining tasks in medical informatics. It utility is demonstrated for automatic annotation of medical data.

  13. Exploratory analysis of textual data from the Mother and Child Handbook using a text mining method (II): Monthly changes in the words recorded by mothers.

    PubMed

    Tagawa, Miki; Matsuda, Yoshio; Manaka, Tomoko; Kobayashi, Makiko; Ohwada, Michitaka; Matsubara, Shigeki

    2017-01-01

    The aim of the study was to examine the possibility of converting subjective textual data written in the free column space of the Mother and Child Handbook (MCH) into objective information using text mining and to compare any monthly changes in the words written by the mothers. Pregnant women without complications (n = 60) were divided into two groups according to State-Trait Anxiety Inventory grade: low trait anxiety (group I, n = 39) and high trait anxiety (group II, n = 21). Exploratory analysis of the textual data from the MCH was conducted by text mining using the Word Miner software program. Using 1203 structural elements extracted after processing, a comparison of monthly changes in the words used in the mothers' comments was made between the two groups. The data was mainly analyzed by a correspondence analysis. The structural elements in groups I and II were divided into seven and six clusters, respectively, by cluster analysis. Correspondence analysis revealed clear monthly changes in the words used in the mothers' comments as the pregnancy progressed in group I, whereas the association was not clear in group II. The text mining method was useful for exploratory analysis of the textual data obtained from pregnant women, and the monthly change in the words used in the mothers' comments as pregnancy progressed differed according to their degree of unease. © 2016 Japan Society of Obstetrics and Gynecology.

  14. Functional Genome Mining for Metabolites Encoded by Large Gene Clusters through Heterologous Expression of a Whole-Genome Bacterial Artificial Chromosome Library in Streptomyces spp.

    PubMed Central

    Xu, Min; Wang, Yemin; Zhao, Zhilong; Gao, Guixi; Huang, Sheng-Xiong; Kang, Qianjin; He, Xinyi; Lin, Shuangjun; Pang, Xiuhua; Deng, Zixin

    2016-01-01

    ABSTRACT Genome sequencing projects in the last decade revealed numerous cryptic biosynthetic pathways for unknown secondary metabolites in microbes, revitalizing drug discovery from microbial metabolites by approaches called genome mining. In this work, we developed a heterologous expression and functional screening approach for genome mining from genomic bacterial artificial chromosome (BAC) libraries in Streptomyces spp. We demonstrate mining from a strain of Streptomyces rochei, which is known to produce streptothricins and borrelidin, by expressing its BAC library in the surrogate host Streptomyces lividans SBT5, and screening for antimicrobial activity. In addition to the successful capture of the streptothricin and borrelidin biosynthetic gene clusters, we discovered two novel linear lipopeptides and their corresponding biosynthetic gene cluster, as well as a novel cryptic gene cluster for an unknown antibiotic from S. rochei. This high-throughput functional genome mining approach can be easily applied to other streptomycetes, and it is very suitable for the large-scale screening of genomic BAC libraries for bioactive natural products and the corresponding biosynthetic pathways. IMPORTANCE Microbial genomes encode numerous cryptic biosynthetic gene clusters for unknown small metabolites with potential biological activities. Several genome mining approaches have been developed to activate and bring these cryptic metabolites to biological tests for future drug discovery. Previous sequence-guided procedures relied on bioinformatic analysis to predict potentially interesting biosynthetic gene clusters. In this study, we describe an efficient approach based on heterologous expression and functional screening of a whole-genome library for the mining of bioactive metabolites from Streptomyces. The usefulness of this function-driven approach was demonstrated by the capture of four large biosynthetic gene clusters for metabolites of various chemical types, including streptothricins, borrelidin, two novel lipopeptides, and one unknown antibiotic from Streptomyces rochei Sal35. The transfer, expression, and screening of the library were all performed in a high-throughput way, so that this approach is scalable and adaptable to industrial automation for next-generation antibiotic discovery. PMID:27451447

  15. Application of LANDSAT data to monitor land reclamation progress in Belmont County, Ohio

    NASA Technical Reports Server (NTRS)

    Bloemer, H. H. L.; Brumfield, J. O.; Campbell, W. J.; Witt, R. G.; Bly, B. G.

    1981-01-01

    Strip and contour mining techniques are reviewed as well as some studies conducted to determine the applicability of LANDSAT and associated digital image processing techniques to the surficial problems associated with mining operations. A nontraditional unsupervised classification approach to multispectral data is considered which renders increased classification separability in land cover analysis of surface mined areas. The approach also reduces the dimensionality of the data and requires only minimal analytical skills in digital data processing.

  16. Comparing digital data processing techniques for surface mine and reclamation monitoring

    NASA Technical Reports Server (NTRS)

    Witt, R. G.; Bly, B. G.; Campbell, W. J.; Bloemer, H. H. L.; Brumfield, J. O.

    1982-01-01

    The results of three techniques used for processing Landsat digital data are compared for their utility in delineating areas of surface mining and subsequent reclamation. An unsupervised clustering algorithm (ISOCLS), a maximum-likelihood classifier (CLASFY), and a hybrid approach utilizing canonical analysis (ISOCLS/KLTRANS/ISOCLS) were compared by means of a detailed accuracy assessment with aerial photography at NASA's Goddard Space Flight Center. Results show that the hybrid approach was superior to the traditional techniques in distinguishing strip mined and reclaimed areas.

  17. Mining Health-Related Issues in Consumer Product Reviews by Using Scalable Text Analytics

    PubMed Central

    Torii, Manabu; Tilak, Sameer S.; Doan, Son; Zisook, Daniel S.; Fan, Jung-wei

    2016-01-01

    In an era when most of our life activities are digitized and recorded, opportunities abound to gain insights about population health. Online product reviews present a unique data source that is currently underexplored. Health-related information, although scarce, can be systematically mined in online product reviews. Leveraging natural language processing and machine learning tools, we were able to mine 1.3 million grocery product reviews for health-related information. The objectives of the study were as follows: (1) conduct quantitative and qualitative analysis on the types of health issues found in consumer product reviews; (2) develop a machine learning classifier to detect reviews that contain health-related issues; and (3) gain insights about the task characteristics and challenges for text analytics to guide future research. PMID:27375358

  18. Mining Health-Related Issues in Consumer Product Reviews by Using Scalable Text Analytics.

    PubMed

    Torii, Manabu; Tilak, Sameer S; Doan, Son; Zisook, Daniel S; Fan, Jung-Wei

    2016-01-01

    In an era when most of our life activities are digitized and recorded, opportunities abound to gain insights about population health. Online product reviews present a unique data source that is currently underexplored. Health-related information, although scarce, can be systematically mined in online product reviews. Leveraging natural language processing and machine learning tools, we were able to mine 1.3 million grocery product reviews for health-related information. The objectives of the study were as follows: (1) conduct quantitative and qualitative analysis on the types of health issues found in consumer product reviews; (2) develop a machine learning classifier to detect reviews that contain health-related issues; and (3) gain insights about the task characteristics and challenges for text analytics to guide future research.

  19. Weighted mining of massive collections of [Formula: see text]-values by convex optimization.

    PubMed

    Dobriban, Edgar

    2018-06-01

    Researchers in data-rich disciplines-think of computational genomics and observational cosmology-often wish to mine large bodies of [Formula: see text]-values looking for significant effects, while controlling the false discovery rate or family-wise error rate. Increasingly, researchers also wish to prioritize certain hypotheses, for example, those thought to have larger effect sizes, by upweighting, and to impose constraints on the underlying mining, such as monotonicity along a certain sequence. We introduce Princessp , a principled method for performing weighted multiple testing by constrained convex optimization. Our method elegantly allows one to prioritize certain hypotheses through upweighting and to discount others through downweighting, while constraining the underlying weights involved in the mining process. When the [Formula: see text]-values derive from monotone likelihood ratio families such as the Gaussian means model, the new method allows exact solution of an important optimal weighting problem previously thought to be non-convex and computationally infeasible. Our method scales to massive data set sizes. We illustrate the applications of Princessp on a series of standard genomics data sets and offer comparisons with several previous 'standard' methods. Princessp offers both ease of operation and the ability to scale to extremely large problem sizes. The method is available as open-source software from github.com/dobriban/pvalue_weighting_matlab (accessed 11 October 2017).

  20. Co-evolutionary data mining for fuzzy rules: automatic fitness function creation phase space, and experiments

    NASA Astrophysics Data System (ADS)

    Smith, James F., III; Blank, Joseph A.

    2003-03-01

    An approach is being explored that involves embedding a fuzzy logic based resource manager in an electronic game environment. Game agents can function under their own autonomous logic or human control. This approach automates the data mining problem. The game automatically creates a cleansed database reflecting the domain expert's knowledge, it calls a data mining function, a genetic algorithm, for data mining of the data base as required and allows easy evaluation of the information extracted. The co-evolutionary fitness functions, chromosomes and stopping criteria for ending the game are discussed. Genetic algorithm and genetic program based data mining procedures are discussed that automatically discover new fuzzy rules and strategies. The strategy tree concept and its relationship to co-evolutionary data mining are examined as well as the associated phase space representation of fuzzy concepts. The overlap of fuzzy concepts in phase space reduces the effective strategies available to adversaries. Co-evolutionary data mining alters the geometric properties of the overlap region known as the admissible region of phase space significantly enhancing the performance of the resource manager. Procedures for validation of the information data mined are discussed and significant experimental results provided.

  1. 30 CFR 900.15 - Federal lands program cooperative agreements.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    .... 900.15 Section 900.15 Mineral Resources OFFICE OF SURFACE MINING RECLAMATION AND ENFORCEMENT, DEPARTMENT OF THE INTERIOR PROGRAMS FOR THE CONDUCT OF SURFACE MINING OPERATIONS WITHIN EACH STATE INTRODUCTION § 900.15 Federal lands program cooperative agreements. The full text of any State and Federal...

  2. 30 CFR 900.12 - State regulatory programs.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ....12 Mineral Resources OFFICE OF SURFACE MINING RECLAMATION AND ENFORCEMENT, DEPARTMENT OF THE INTERIOR PROGRAMS FOR THE CONDUCT OF SURFACE MINING OPERATIONS WITHIN EACH STATE INTRODUCTION § 900.12 State... to be codified under the applicable part number assigned to the State. The full text will not appear...

  3. Literature Mining Methods for Toxicology and Construction of ...

    EPA Pesticide Factsheets

    Webinar Presentation on text-mining methodologies in use at NCCT and how they can be used to assist with the OECD Retinoid project. Presentation to 1st Workshop/Scientific Expert Group meeting on the OECD Retinoid Project - April 26, 2016 –Brussels, Presented remotely via web.

  4. Untangling Topic Threads in Chat-Based Communication: A Case Study

    DTIC Science & Technology

    2011-08-01

    learning techniques such as clustering are very popular for analyzing text for topic identification (Anjewierden,, Kollöffel and Hulshof 2007; Adams...Anjewierden, A., Kollöffel, B., and Hulshof , C. (2007). Towards educational data mining: Using data mining methods for automated chat analysis to

  5. GROUNDWATER IMPACTED BY ACID MINE DRAINAGE

    EPA Science Inventory

    The generation and release of acidic, metal-rich water from mine wastes continues to be an intractable environmental problem. Although the effects of acid mine drainage (AMD) are most evident in surface waters, there is an obvious need for developing cost-effective approaches fo...

  6. TOXICITY APPROACHES TO ASSESSING MINING IMPACTS AND MINE WASTE TREATMENT EFFECTIVENESS

    EPA Science Inventory

    The USEPA Office of Research and Development's National Exposure Research Laboratory and National Risk Management Research Laboratory have been evaluating the impact of mining sites on receiving streams and the effectiveness of waste treatment technologies in removing toxicity fo...

  7. Computer-Aided Analysis of Patents for Product Technology Maturity Forecasting

    NASA Astrophysics Data System (ADS)

    Liang, Yanhong; Gan, Dequan; Guo, Yingchun; Zhang, Peng

    Product technology maturity foresting is vital for any enterprises to hold the chance for innovation and keep competitive for a long term. The Theory of Invention Problem Solving (TRIZ) is acknowledged both as a systematic methodology for innovation and a powerful tool for technology forecasting. Based on TRIZ, the state -of-the-art on the technology maturity of product and the limits of application are discussed. With the application of text mining and patent analysis technologies, this paper proposes a computer-aided approach for product technology maturity forecasting. It can overcome the shortcomings of the current methods.

  8. Detection of antipersonnel (AP) mines using mechatronics approach

    NASA Astrophysics Data System (ADS)

    Shahri, Ali M.; Naghdy, Fazel

    1998-09-01

    At present there are approximately 110 million land-mines scattered around the world in 64 countries. The clearance of these mines takes place manually. Unfortunately, on average for every 5000 mines cleared one mine clearer is killed. A Mine Detector Arm (MDA) using mechatronics approach is under development in this work. The robot arm imitates manual hand- prodding technique for mine detection. It inserts a bayonet into the soil and models the dynamics of the manipulator and environment parameters, such as stiffness variation in the soil to control the impact caused by contacting a stiff object. An explicit impact control scheme is applied as the main control scheme, while two different intelligent control methods are designed to deal with uncertainties and varying environmental parameters. Firstly, a neuro-fuzzy adaptive gain controller (NFAGC) is designed to adapt the force gain control according to the estimated environment stiffness. Then, an adaptive neuro-fuzzy plus PID controller is employed to switch from a conventional PID controller to neuro-fuzzy impact control (NFIC), when an impact is detected. The developed control schemes are validated through computer simulation and experimental work.

  9. A systems biology approach to the global analysis of transcription factors in colorectal cancer.

    PubMed

    Pradhan, Meeta P; Prasad, Nagendra K A; Palakal, Mathew J

    2012-08-01

    Biological entities do not perform in isolation, and often, it is the nature and degree of interactions among numerous biological entities which ultimately determines any final outcome. Hence, experimental data on any single biological entity can be of limited value when considered only in isolation. To address this, we propose that augmenting individual entity data with the literature will not only better define the entity's own significance but also uncover relationships with novel biological entities.To test this notion, we developed a comprehensive text mining and computational methodology that focused on discovering new targets of one class of molecular entities, transcription factors (TF), within one particular disease, colorectal cancer (CRC). We used 39 molecular entities known to be associated with CRC along with six colorectal cancer terms as the bait list, or list of search terms, for mining the biomedical literature to identify CRC-specific genes and proteins. Using the literature-mined data, we constructed a global TF interaction network for CRC. We then developed a multi-level, multi-parametric methodology to identify TFs to CRC. The small bait list, when augmented with literature-mined data, identified a large number of biological entities associated with CRC. The relative importance of these TF and their associated modules was identified using functional and topological features. Additional validation of these highly-ranked TF using the literature strengthened our findings. Some of the novel TF that we identified were: SLUG, RUNX1, IRF1, HIF1A, ATF-2, ABL1, ELK-1 and GATA-1. Some of these TFs are associated with functional modules in known pathways of CRC, including the Beta-catenin/development, immune response, transcription, and DNA damage pathways. Our methodology of using text mining data and a multi-level, multi-parameter scoring technique was able to identify both known and novel TF that have roles in CRC. Starting with just one TF (SMAD3) in the bait list, the literature mining process identified an additional 116 CRC-associated TFs. Our network-based analysis showed that these TFs all belonged to any of 13 major functional groups that are known to play important roles in CRC. Among these identified TFs, we obtained a novel six-node module consisting of ATF2-P53-JNK1-ELK1-EPHB2-HIF1A, from which the novel JNK1-ELK1 association could potentially be a significant marker for CRC.

  10. Mining Social Media and Web Searches For Disease Detection

    PubMed Central

    Yang, Y. Tony; Horneffer, Michael; DiLisio, Nicole

    2013-01-01

    Web-based social media is increasingly being used across different settings in the health care industry. The increased frequency in the use of the Internet via computer or mobile devices provides an opportunity for social media to be the medium through which people can be provided with valuable health information quickly and directly. While traditional methods of detection relied predominately on hierarchical or bureaucratic lines of communication, these often failed to yield timely and accurate epidemiological intelligence. New web-based platforms promise increased opportunities for a more timely and accurate spreading of information and analysis. This article aims to provide an overview and discussion of the availability of timely and accurate information. It is especially useful for the rapid identification of an outbreak of an infectious disease that is necessary to promptly and effectively develop public health responses. These web-based platforms include search queries, data mining of web and social media, process and analysis of blogs containing epidemic key words, text mining, and geographical information system data analyses. These new sources of analysis and information are intended to complement traditional sources of epidemic intelligence. Despite the attractiveness of these new approaches, further study is needed to determine the accuracy of blogger statements, as increases in public participation may not necessarily mean the information provided is more accurate. PMID:25170475

  11. Mining social media and web searches for disease detection.

    PubMed

    Yang, Y Tony; Horneffer, Michael; DiLisio, Nicole

    2013-04-28

    Web-based social media is increasingly being used across different settings in the health care industry. The increased frequency in the use of the Internet via computer or mobile devices provides an opportunity for social media to be the medium through which people can be provided with valuable health information quickly and directly. While traditional methods of detection relied predominately on hierarchical or bureaucratic lines of communication, these often failed to yield timely and accurate epidemiological intelligence. New web-based platforms promise increased opportunities for a more timely and accurate spreading of information and analysis. This article aims to provide an overview and discussion of the availability of timely and accurate information. It is especially useful for the rapid identification of an outbreak of an infectious disease that is necessary to promptly and effectively develop public health responses. These web-based platforms include search queries, data mining of web and social media, process and analysis of blogs containing epidemic key words, text mining, and geographical information system data analyses. These new sources of analysis and information are intended to complement traditional sources of epidemic intelligence. Despite the attractiveness of these new approaches, further study is needed to determine the accuracy of blogger statements, as increases in public participation may not necessarily mean the information provided is more accurate.

  12. Comparison between BIDE, PrefixSpan, and TRuleGrowth for Mining of Indonesian Text

    NASA Astrophysics Data System (ADS)

    Sa'adillah Maylawati, Dian; Irfan, Mohamad; Budiawan Zulfikar, Wildan

    2017-01-01

    Mining proscess for Indonesian language still be an interesting research. Multiple of words representation was claimed can keep the meaning of text better than bag of words. In this paper, we compare several sequential pattern algortihm, among others BIDE (BIDirectional Extention), PrefixSpan, and TRuleGrowth. All of those algorithm produce frequent word sequence to keep the meaning of text. However, the experiment result, with 14.006 of Indonesian tweet from Twitter, shows that BIDE can produce more efficient frequent word sequence than PrefixSpan and TRuleGrowth without missing the meaning of text. Then, the average of time process of PrefixSpan is faster than BIDE and TRuleGrowth. In the other hand, PrefixSpan and TRuleGrowth is more efficient in using memory than BIDE.

  13. Online discourse on fibromyalgia: text-mining to identify clinical distinction and patient concerns.

    PubMed

    Park, Jungsik; Ryu, Young Uk

    2014-10-07

    The purpose of this study was to evaluate the possibility of using text-mining to identify clinical distinctions and patient concerns in online memoires posted by patients with fibromyalgia (FM). A total of 399 memoirs were collected from an FM group website. The unstructured data of memoirs associated with FM were collected through a crawling process and converted into structured data with a concordance, parts of speech tagging, and word frequency. We also conducted a lexical analysis and phrase pattern identification. After examining the data, a set of FM-related keywords were obtained and phrase net relationships were set through a web-based visualization tool. The clinical distinction of FM was verified. Pain is the biggest issue to the FM patients. The pains were affecting body parts including 'muscles,' 'leg,' 'neck,' 'back,' 'joints,' and 'shoulders' with accompanying symptoms such as 'spasms,' 'stiffness,' and 'aching,' and were described as 'sever,' 'chronic,' and 'constant.' This study also demonstrated that it was possible to understand the interests and concerns of FM patients through text-mining. FM patients wanted to escape from the pain and symptoms, so they were interested in medical treatment and help. Also, they seemed to have interest in their work and occupation, and hope to continue to live life through the relationships with the people around them. This research shows the potential for extracting keywords to confirm the clinical distinction of a certain disease, and text-mining can help objectively understand the concerns of patients by generalizing their large number of subjective illness experiences. However, it is believed that there are limitations to the processes and methods for organizing and classifying large amounts of text, so these limits have to be considered when analyzing the results. The development of research methodology to overcome these limitations is greatly needed.

  14. Unsupervised text mining for assessing and augmenting GWAS results.

    PubMed

    Ailem, Melissa; Role, François; Nadif, Mohamed; Demenais, Florence

    2016-04-01

    Text mining can assist in the analysis and interpretation of large-scale biomedical data, helping biologists to quickly and cheaply gain confirmation of hypothesized relationships between biological entities. We set this question in the context of genome-wide association studies (GWAS), an actively emerging field that contributed to identify many genes associated with multifactorial diseases. These studies allow to identify groups of genes associated with the same phenotype, but provide no information about the relationships between these genes. Therefore, our objective is to leverage unsupervised text mining techniques using text-based cosine similarity comparisons and clustering applied to candidate and random gene vectors, in order to augment the GWAS results. We propose a generic framework which we used to characterize the relationships between 10 genes reported associated with asthma by a previous GWAS. The results of this experiment showed that the similarities between these 10 genes were significantly stronger than would be expected by chance (one-sided p-value<0.01). The clustering of observed and randomly selected gene also allowed to generate hypotheses about potential functional relationships between these genes and thus contributed to the discovery of new candidate genes for asthma. Copyright © 2016 Elsevier Inc. All rights reserved.

  15. Text mining for search term development in systematic reviewing: A discussion of some methods and challenges.

    PubMed

    Stansfield, Claire; O'Mara-Eves, Alison; Thomas, James

    2017-09-01

    Using text mining to aid the development of database search strings for topics described by diverse terminology has potential benefits for systematic reviews; however, methods and tools for accomplishing this are poorly covered in the research methods literature. We briefly review the literature on applications of text mining for search term development for systematic reviewing. We found that the tools can be used in 5 overarching ways: improving the precision of searches; identifying search terms to improve search sensitivity; aiding the translation of search strategies across databases; searching and screening within an integrated system; and developing objectively derived search strategies. Using a case study and selected examples, we then reflect on the utility of certain technologies (term frequency-inverse document frequency and Termine, term frequency, and clustering) in improving the precision and sensitivity of searches. Challenges in using these tools are discussed. The utility of these tools is influenced by the different capabilities of the tools, the way the tools are used, and the text that is analysed. Increased awareness of how the tools perform facilitates the further development of methods for their use in systematic reviews. Copyright © 2017 John Wiley & Sons, Ltd.

  16. U-Compare: share and compare text mining tools with UIMA.

    PubMed

    Kano, Yoshinobu; Baumgartner, William A; McCrohon, Luke; Ananiadou, Sophia; Cohen, K Bretonnel; Hunter, Lawrence; Tsujii, Jun'ichi

    2009-08-01

    Due to the increasing number of text mining resources (tools and corpora) available to biologists, interoperability issues between these resources are becoming significant obstacles to using them effectively. UIMA, the Unstructured Information Management Architecture, is an open framework designed to aid in the construction of more interoperable tools. U-Compare is built on top of the UIMA framework, and provides both a concrete framework for out-of-the-box text mining and a sophisticated evaluation platform allowing users to run specific tools on any target text, generating both detailed statistics and instance-based visualizations of outputs. U-Compare is a joint project, providing the world's largest, and still growing, collection of UIMA-compatible resources. These resources, originally developed by different groups for a variety of domains, include many famous tools and corpora. U-Compare can be launched straight from the web, without needing to be manually installed. All U-Compare components are provided ready-to-use and can be combined easily via a drag-and-drop interface without any programming. External UIMA components can also simply be mixed with U-Compare components, without distinguishing between locally and remotely deployed resources. http://u-compare.org/

  17. A data mining based approach to predict spatiotemporal changes in satellite images

    NASA Astrophysics Data System (ADS)

    Boulila, W.; Farah, I. R.; Ettabaa, K. Saheb; Solaiman, B.; Ghézala, H. Ben

    2011-06-01

    The interpretation of remotely sensed images in a spatiotemporal context is becoming a valuable research topic. However, the constant growth of data volume in remote sensing imaging makes reaching conclusions based on collected data a challenging task. Recently, data mining appears to be a promising research field leading to several interesting discoveries in various areas such as marketing, surveillance, fraud detection and scientific discovery. By integrating data mining and image interpretation techniques, accurate and relevant information (i.e. functional relation between observed parcels and a set of informational contents) can be automatically elicited. This study presents a new approach to predict spatiotemporal changes in satellite image databases. The proposed method exploits fuzzy sets and data mining concepts to build predictions and decisions for several remote sensing fields. It takes into account imperfections related to the spatiotemporal mining process in order to provide more accurate and reliable information about land cover changes in satellite images. The proposed approach is validated using SPOT images representing the Saint-Denis region, capital of Reunion Island. Results show good performances of the proposed framework in predicting change for the urban zone.

  18. The impact of mining activities on Mongolia's protected areas: a status report with policy recommendations.

    PubMed

    Farrington, John D

    2005-07-01

    Mongolia's protected areas cover 20.5 million ha or 13.1% of its national territory. Existing and proposed protected areas, however, are threatened by mining. Mining impacts on Mongolia's protected areas are diverse and include licensed and unlicensed mineral activities in protected areas, buffer zone disturbance, and prevention of the establishment of proposed protected areas. Review of United States, Canadian, and Australian policies revealed 9 basic approaches to resolving conflicts between protected areas and mining. Four approaches suitable for Mongolia are granting land trades and special dispensations in exchange for mineral licenses in protected areas; granting protected status to all lapsed mineral licenses in protected areas; voluntary forfeiting of mineral licenses in protected areas in exchange for positive corporate publicity; and prohibiting all new mineral activities in existing and proposed protected areas. Mining is Mongolia's most important industry, however, and the long-term benefits of preserving Mongolia's natural heritage must be considered and weighed against the economic benefits and costs of mining activities.

  19. Mines Systems Safety Improvement Using an Integrated Event Tree and Fault Tree Analysis

    NASA Astrophysics Data System (ADS)

    Kumar, Ranjan; Ghosh, Achyuta Krishna

    2017-04-01

    Mines systems such as ventilation system, strata support system, flame proof safety equipment, are exposed to dynamic operational conditions such as stress, humidity, dust, temperature, etc., and safety improvement of such systems can be done preferably during planning and design stage. However, the existing safety analysis methods do not handle the accident initiation and progression of mine systems explicitly. To bridge this gap, this paper presents an integrated Event Tree (ET) and Fault Tree (FT) approach for safety analysis and improvement of mine systems design. This approach includes ET and FT modeling coupled with redundancy allocation technique. In this method, a concept of top hazard probability is introduced for identifying system failure probability and redundancy is allocated to the system either at component or system level. A case study on mine methane explosion safety with two initiating events is performed. The results demonstrate that the presented method can reveal the accident scenarios and improve the safety of complex mine systems simultaneously.

  20. Effective application of improved profit-mining algorithm for the interday trading model.

    PubMed

    Hsieh, Yu-Lung; Yang, Don-Lin; Wu, Jungpin

    2014-01-01

    Many real world applications of association rule mining from large databases help users make better decisions. However, they do not work well in financial markets at this time. In addition to a high profit, an investor also looks for a low risk trading with a better rate of winning. The traditional approach of using minimum confidence and support thresholds needs to be changed. Based on an interday model of trading, we proposed effective profit-mining algorithms which provide investors with profit rules including information about profit, risk, and winning rate. Since profit-mining in the financial market is still in its infant stage, it is important to detail the inner working of mining algorithms and illustrate the best way to apply them. In this paper we go into details of our improved profit-mining algorithm and showcase effective applications with experiments using real world trading data. The results show that our approach is practical and effective with good performance for various datasets.

  1. Effective Application of Improved Profit-Mining Algorithm for the Interday Trading Model

    PubMed Central

    Wu, Jungpin

    2014-01-01

    Many real world applications of association rule mining from large databases help users make better decisions. However, they do not work well in financial markets at this time. In addition to a high profit, an investor also looks for a low risk trading with a better rate of winning. The traditional approach of using minimum confidence and support thresholds needs to be changed. Based on an interday model of trading, we proposed effective profit-mining algorithms which provide investors with profit rules including information about profit, risk, and winning rate. Since profit-mining in the financial market is still in its infant stage, it is important to detail the inner working of mining algorithms and illustrate the best way to apply them. In this paper we go into details of our improved profit-mining algorithm and showcase effective applications with experiments using real world trading data. The results show that our approach is practical and effective with good performance for various datasets. PMID:24688442

  2. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kargupta, H.; Stafford, B.; Hamzaoglu, I.

    This paper describes an experimental parallel/distributed data mining system PADMA (PArallel Data Mining Agents) that uses software agents for local data accessing and analysis and a web based interface for interactive data visualization. It also presents the results of applying PADMA for detecting patterns in unstructured texts of postmortem reports and laboratory test data for Hepatitis C patients.

  3. 29 CFR 570.33 - Prohibited occupations for minors 14 and 15 years of age.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... shall apply to all occupations other than the following: (a) Manufacturing, mining, or processing... revised text is set forth as follows: § 570.33 Occupations that are prohibited to minors 14 and 15 years... age: (a) Manufacturing, mining, or processing occupations, including occupations requiring the...

  4. 76 FR 45300 - Notice of Issuance of Materials License SUA-1597 and Record of Decision for Uranerz Energy...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-07-28

    ... considered but eliminated from detailed analysis include conventional uranium mining and milling, conventional mining and heap leach processing, alternative site location, alternate lixiviants, and alternate...'s Agencywide Document Access and Management System (ADAMS), which provides text and image files of...

  5. 30 CFR 732.17 - State program amendments.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... Mineral Resources OFFICE OF SURFACE MINING RECLAMATION AND ENFORCEMENT, DEPARTMENT OF THE INTERIOR... the number or size of coal exploration or surface coal mining and reclamation operations in the State... amendment(s) is being reviewed by the Director and will include the following: (i) The text or a summary of...

  6. 30 CFR 745.11 - Application and agreement.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ....11 Mineral Resources OFFICE OF SURFACE MINING RECLAMATION AND ENFORCEMENT, DEPARTMENT OF THE INTERIOR... approval under part 731 of this chapter, and has or may have within the State surface coal mining and... the full text of the terms of the proposed cooperative agreement as submitted or as subsequently...

  7. 29 CFR 570.118 - Sixteen-year minimum.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... for employment in manufacturing or mining occupations. Furthermore, this age minimum is applicable to... convenience of the user, the revised text is set forth as follows: § 570.118 Sixteen-year minimum. The Act sets a 16-year-age minimum for employment in manufacturing or mining occupations, although under FLSA...

  8. 29 CFR 570.122 - General.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... other than the following: (1) Manufacturing, (2) Mining, (3) An occupation found by the Secretary to be..., the revised text is set forth as follows: § 570.122 General. (a) Specific exemptions from the child... sixteen years in any occupation other than manufacturing, mining, or an occupation found by the Secretary...

  9. 29 CFR 570.119 - Fourteen-year minimum.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... occupations other than manufacturing and mining, the Secretary is authorized to issue regulations or orders... Subpart C of this part. 29-30 [Reserved] (a) Manufacturing, mining, or processing occupations; (b... of the user, the revised text is set forth as follows: § 570.119 Fourteen-year minimum. With respect...

  10. EVALUATION OF A TWO-STAGE PASSIVE TREATMENT APPROACH FOR MINING INFLUENCE WATERS

    EPA Science Inventory

    A two-stage passive treatment approach was assessed at bench-scale using two Colorado Mining Influenced Waters (MIWs). The first-stage was a limestone drain with the purpose of removing iron and aluminum and mitigating the potential effects of mineral acidity. The second stage w...

  11. Revealing Learner Interests through Topic Mining from Question-Answering Data

    ERIC Educational Resources Information Center

    Dun, Yijie; Wang, Na; Wang, Min; Hao, Tianyong

    2017-01-01

    In a question-answering system, learner generated content including asked and answered questions is a meaningful resource to capture learning interests. This paper proposes an approach based on question topic mining for revealing learners' concerned topics in real community question-answering systems. The authors' approach firstly preprocesses all…

  12. OSCAR4: a flexible architecture for chemical text-mining.

    PubMed

    Jessop, David M; Adams, Sam E; Willighagen, Egon L; Hawizy, Lezan; Murray-Rust, Peter

    2011-10-14

    The Open-Source Chemistry Analysis Routines (OSCAR) software, a toolkit for the recognition of named entities and data in chemistry publications, has been developed since 2002. Recent work has resulted in the separation of the core OSCAR functionality and its release as the OSCAR4 library. This library features a modular API (based on reduction of surface coupling) that permits client programmers to easily incorporate it into external applications. OSCAR4 offers a domain-independent architecture upon which chemistry specific text-mining tools can be built, and its development and usage are discussed.

  13. The Forestry Reclamation Approach: guide to successful reforestation of mined lands

    Treesearch

    Mary Beth Adams

    2017-01-01

    Appalachian forests are among the most productive and diverse in the world. The land underlying them is also rich in coal, and surface mines operated on more than 2.4 million acres in the region from 1977, when the federal Surface Mining Control and Reclamation Act was passed, through 2015. Many efforts to reclaim mined lands most often resulted in the establishment of...

  14. MinePath: Mining for Phenotype Differential Sub-paths in Molecular Pathways

    PubMed Central

    Koumakis, Lefteris; Kartsaki, Evgenia; Chatzimina, Maria; Zervakis, Michalis; Vassou, Despoina; Marias, Kostas; Moustakis, Vassilis; Potamias, George

    2016-01-01

    Pathway analysis methodologies couple traditional gene expression analysis with knowledge encoded in established molecular pathway networks, offering a promising approach towards the biological interpretation of phenotype differentiating genes. Early pathway analysis methodologies, named as gene set analysis (GSA), view pathways just as plain lists of genes without taking into account either the underlying pathway network topology or the involved gene regulatory relations. These approaches, even if they achieve computational efficiency and simplicity, consider pathways that involve the same genes as equivalent in terms of their gene enrichment characteristics. Most recent pathway analysis approaches take into account the underlying gene regulatory relations by examining their consistency with gene expression profiles and computing a score for each profile. Even with this approach, assessing and scoring single-relations limits the ability to reveal key gene regulation mechanisms hidden in longer pathway sub-paths. We introduce MinePath, a pathway analysis methodology that addresses and overcomes the aforementioned problems. MinePath facilitates the decomposition of pathways into their constituent sub-paths. Decomposition leads to the transformation of single-relations to complex regulation sub-paths. Regulation sub-paths are then matched with gene expression sample profiles in order to evaluate their functional status and to assess phenotype differential power. Assessment of differential power supports the identification of the most discriminant profiles. In addition, MinePath assess the significance of the pathways as a whole, ranking them by their p-values. Comparison results with state-of-the-art pathway analysis systems are indicative for the soundness and reliability of the MinePath approach. In contrast with many pathway analysis tools, MinePath is a web-based system (www.minepath.org) offering dynamic and rich pathway visualization functionality, with the unique characteristic to color regulatory relations between genes and reveal their phenotype inclination. This unique characteristic makes MinePath a valuable tool for in silico molecular biology experimentation as it serves the biomedical researchers’ exploratory needs to reveal and interpret the regulatory mechanisms that underlie and putatively govern the expression of target phenotypes. PMID:27832067

  15. MinePath: Mining for Phenotype Differential Sub-paths in Molecular Pathways.

    PubMed

    Koumakis, Lefteris; Kanterakis, Alexandros; Kartsaki, Evgenia; Chatzimina, Maria; Zervakis, Michalis; Tsiknakis, Manolis; Vassou, Despoina; Kafetzopoulos, Dimitris; Marias, Kostas; Moustakis, Vassilis; Potamias, George

    2016-11-01

    Pathway analysis methodologies couple traditional gene expression analysis with knowledge encoded in established molecular pathway networks, offering a promising approach towards the biological interpretation of phenotype differentiating genes. Early pathway analysis methodologies, named as gene set analysis (GSA), view pathways just as plain lists of genes without taking into account either the underlying pathway network topology or the involved gene regulatory relations. These approaches, even if they achieve computational efficiency and simplicity, consider pathways that involve the same genes as equivalent in terms of their gene enrichment characteristics. Most recent pathway analysis approaches take into account the underlying gene regulatory relations by examining their consistency with gene expression profiles and computing a score for each profile. Even with this approach, assessing and scoring single-relations limits the ability to reveal key gene regulation mechanisms hidden in longer pathway sub-paths. We introduce MinePath, a pathway analysis methodology that addresses and overcomes the aforementioned problems. MinePath facilitates the decomposition of pathways into their constituent sub-paths. Decomposition leads to the transformation of single-relations to complex regulation sub-paths. Regulation sub-paths are then matched with gene expression sample profiles in order to evaluate their functional status and to assess phenotype differential power. Assessment of differential power supports the identification of the most discriminant profiles. In addition, MinePath assess the significance of the pathways as a whole, ranking them by their p-values. Comparison results with state-of-the-art pathway analysis systems are indicative for the soundness and reliability of the MinePath approach. In contrast with many pathway analysis tools, MinePath is a web-based system (www.minepath.org) offering dynamic and rich pathway visualization functionality, with the unique characteristic to color regulatory relations between genes and reveal their phenotype inclination. This unique characteristic makes MinePath a valuable tool for in silico molecular biology experimentation as it serves the biomedical researchers' exploratory needs to reveal and interpret the regulatory mechanisms that underlie and putatively govern the expression of target phenotypes.

  16. A novel approach for incremental uncertainty rule generation from databases with missing values handling: application to dynamic medical databases.

    PubMed

    Konias, Sokratis; Chouvarda, Ioanna; Vlahavas, Ioannis; Maglaveras, Nicos

    2005-09-01

    Current approaches for mining association rules usually assume that the mining is performed in a static database, where the problem of missing attribute values does not practically exist. However, these assumptions are not preserved in some medical databases, like in a home care system. In this paper, a novel uncertainty rule algorithm is illustrated, namely URG-2 (Uncertainty Rule Generator), which addresses the problem of mining dynamic databases containing missing values. This algorithm requires only one pass from the initial dataset in order to generate the item set, while new metrics corresponding to the notion of Support and Confidence are used. URG-2 was evaluated over two medical databases, introducing randomly multiple missing values for each record's attribute (rate: 5-20% by 5% increments) in the initial dataset. Compared with the classical approach (records with missing values are ignored), the proposed algorithm was more robust in mining rules from datasets containing missing values. In all cases, the difference in preserving the initial rules ranged between 30% and 60% in favour of URG-2. Moreover, due to its incremental nature, URG-2 saved over 90% of the time required for thorough re-mining. Thus, the proposed algorithm can offer a preferable solution for mining in dynamic relational databases.

  17. Integrated approach of environmental impact and risk assessment of Rosia Montana Mining Area, Romania.

    PubMed

    Stefănescu, Lucrina; Robu, Brînduşa Mihaela; Ozunu, Alexandru

    2013-11-01

    The environmental impact assessment of mining sites represents nowadays a large interest topic in Romania. Historical pollution in the Rosia Montana mining area of Romania caused extensive damage to environmental media. This paper has two goals: to investigate the environmental pollution induced by mining activities in the Rosia Montana area and to quantify the environmental impacts and associated risks by means of an integrated approach. Thus, a new method was developed and applied for quantifying the impact of mining activities, taking account of the quality of environmental media in the mining area, and used as case study in the present paper. The associated risks are a function of the environmental impacts and the probability of their occurrence. The results show that the environmental impacts and quantified risks, based on quality indicators to characterize the environmental quality, are of a higher order, and thus measures for pollution remediation and control need to be considered in the investigated area. The conclusion drawn is that an integrated approach for the assessment of environmental impact and associated risks is a valuable and more objective method, and is an important tool that can be applied in the decision-making process for national authorities in the prioritization of emergency action.

  18. Comparative Analysis of Document level Text Classification Algorithms using R

    NASA Astrophysics Data System (ADS)

    Syamala, Maganti; Nalini, N. J., Dr; Maguluri, Lakshamanaphaneendra; Ragupathy, R., Dr.

    2017-08-01

    From the past few decades there has been tremendous volumes of data available in Internet either in structured or unstructured form. Also, there is an exponential growth of information on Internet, so there is an emergent need of text classifiers. Text mining is an interdisciplinary field which draws attention on information retrieval, data mining, machine learning, statistics and computational linguistics. And to handle this situation, a wide range of supervised learning algorithms has been introduced. Among all these K-Nearest Neighbor(KNN) is efficient and simplest classifier in text classification family. But KNN suffers from imbalanced class distribution and noisy term features. So, to cope up with this challenge we use document based centroid dimensionality reduction(CentroidDR) using R Programming. By combining these two text classification techniques, KNN and Centroid classifiers, we propose a scalable and effective flat classifier, called MCenKNN which works well substantially better than CenKNN.

  19. Two modelling approaches to water-quality simulation in a flooded iron-ore mine (Saizerais, Lorraine, France): a semi-distributed chemical reactor model and a physically based distributed reactive transport pipe network model.

    PubMed

    Hamm, V; Collon-Drouaillet, P; Fabriol, R

    2008-02-19

    The flooding of abandoned mines in the Lorraine Iron Basin (LIB) over the past 25 years has degraded the quality of the groundwater tapped for drinking water. High concentrations of dissolved sulphate have made the water unsuitable for human consumption. This problematic issue has led to the development of numerical tools to support water-resource management in mining contexts. Here we examine two modelling approaches using different numerical tools that we tested on the Saizerais flooded iron-ore mine (Lorraine, France). A first approach considers the Saizerais Mine as a network of two chemical reactors (NCR). The second approach is based on a physically distributed pipe network model (PNM) built with EPANET 2 software. This approach considers the mine as a network of pipes defined by their geometric and chemical parameters. Each reactor in the NCR model includes a detailed chemical model built to simulate quality evolution in the flooded mine water. However, in order to obtain a robust PNM, we simplified the detailed chemical model into a specific sulphate dissolution-precipitation model that is included as sulphate source/sink in both a NCR model and a pipe network model. Both the NCR model and the PNM, based on different numerical techniques, give good post-calibration agreement between the simulated and measured sulphate concentrations in the drinking-water well and overflow drift. The NCR model incorporating the detailed chemical model is useful when a detailed chemical behaviour at the overflow is needed. The PNM incorporating the simplified sulphate dissolution-precipitation model provides better information of the physics controlling the effect of flow and low flow zones, and the time of solid sulphate removal whereas the NCR model will underestimate clean-up time due to the complete mixing assumption. In conclusion, the detailed NCR model will give a first assessment of chemical processes at overflow, and in a second time, the PNM model will provide more detailed information on flow and chemical behaviour (dissolved sulphate concentrations, remaining mass of solid sulphate) in the network. Nevertheless, both modelling methods require hydrological and chemical parameters (recharge flow rate, outflows, volume of mine voids, mass of solids, kinetic constants of the dissolution-precipitation reactions), which are commonly not available for a mine and therefore call for calibration data.

  20. Efficient discovery of risk patterns in medical data.

    PubMed

    Li, Jiuyong; Fu, Ada Wai-chee; Fahey, Paul

    2009-01-01

    This paper studies a problem of efficiently discovering risk patterns in medical data. Risk patterns are defined by a statistical metric, relative risk, which has been widely used in epidemiological research. To avoid fruitless search in the complete exploration of risk patterns, we define optimal risk pattern set to exclude superfluous patterns, i.e. complicated patterns with lower relative risk than their corresponding simpler form patterns. We prove that mining optimal risk pattern sets conforms an anti-monotone property that supports an efficient mining algorithm. We propose an efficient algorithm for mining optimal risk pattern sets based on this property. We also propose a hierarchical structure to present discovered patterns for the easy perusal by domain experts. The proposed approach is compared with two well-known rule discovery methods, decision tree and association rule mining approaches on benchmark data sets and applied to a real world application. The proposed method discovers more and better quality risk patterns than a decision tree approach. The decision tree method is not designed for such applications and is inadequate for pattern exploring. The proposed method does not discover a large number of uninteresting superfluous patterns as an association mining approach does. The proposed method is more efficient than an association rule mining method. A real world case study shows that the method reveals some interesting risk patterns to medical practitioners. The proposed method is an efficient approach to explore risk patterns. It quickly identifies cohorts of patients that are vulnerable to a risk outcome from a large data set. The proposed method is useful for exploratory study on large medical data to generate and refine hypotheses. The method is also useful for designing medical surveillance systems.

  1. Extraction and Classification of Emotions for Business Research

    NASA Astrophysics Data System (ADS)

    Verma, Rajib

    The commercial study of emotions has not embraced Internet / social mining yet, even though it has important applications in management. This is surprising since the emotional content is freeform, wide spread, can give a better indication of feelings (for instance with taboo subjects), and is inexpensive compared to other business research methods. A brief framework for applying text mining to this new research domain is shown and classification issues are discussed in an effort to quickly get businessman and researchers to adopt the mining methodology.

  2. Using remote sensing imagery to monitoring sea surface pollution cause by abandoned gold-copper mine

    NASA Astrophysics Data System (ADS)

    Kao, H. M.; Ren, H.; Lee, Y. T.

    2010-08-01

    The Chinkuashih Benshen mine was the largest gold-copper mine in Taiwan before the owner had abandoned the mine in 1987. However, even the mine had been closed, the mineral still interacts with rain and underground water and flowed into the sea. The polluted sea surface had appeared yellow, green and even white color, and the pollutants had carried by the coast current. In this study, we used the optical satellite images to monitoring the sea surface. Several image processing algorithms are employed especial the subpixel technique and linear mixture model to estimate the concentration of pollutants. The change detection approach is also applied to track them. We also conduct the chemical analysis of the polluted water to provide the ground truth validation. By the correlation analysis between the satellite observation and the ground truth chemical analysis, an effective approach to monitoring water pollution could be established.

  3. Activity recognition from minimal distinguishing subsequence mining

    NASA Astrophysics Data System (ADS)

    Iqbal, Mohammad; Pao, Hsing-Kuo

    2017-08-01

    Human activity recognition is one of the most important research topics in the era of Internet of Things. To separate different activities given sensory data, we utilize a Minimal Distinguishing Subsequence (MDS) mining approach to efficiently find distinguishing patterns among different activities. We first transform the sensory data into a series of sensor triggering events and operate the MDS mining procedure afterwards. The gap constraints are also considered in the MDS mining. Given the multi-class nature of most activity recognition tasks, we modify the MDS mining approach from a binary case to a multi-class one to fit the need for multiple activity recognition. We also study how to select the best parameter set including the minimal and the maximal support thresholds in finding the MDSs for effective activity recognition. Overall, the prediction accuracy is 86.59% on the van Kasteren dataset which consists of four different activities for recognition.

  4. Monitoring food safety violation reports from internet forums.

    PubMed

    Kate, Kiran; Negi, Sumit; Kalagnanam, Jayant

    2014-01-01

    Food-borne illness is a growing public health concern in the world. Government bodies, which regulate and monitor the state of food safety, solicit citizen feedback about food hygiene practices followed by food establishments. They use traditional channels like call center, e-mail for such feedback collection. With the growing popularity of Web 2.0 and social media, citizens often post such feedback on internet forums, message boards etc. The system proposed in this paper applies text mining techniques to identify and mine such food safety complaints posted by citizens on web data sources thereby enabling the government agencies to gather more information about the state of food safety. In this paper, we discuss the architecture of our system and the text mining methods used. We also present results which demonstrate the effectiveness of this system in a real-world deployment.

  5. What Online Communities Can Tell Us About Electronic Cigarettes and Hookah Use: A Study Using Text Mining and Visualization Techniques

    PubMed Central

    Zhu, Shu-Hong; Conway, Mike

    2015-01-01

    Background The rise in popularity of electronic cigarettes (e-cigarettes) and hookah over recent years has been accompanied by some confusion and uncertainty regarding the development of an appropriate regulatory response towards these emerging products. Mining online discussion content can lead to insights into people’s experiences, which can in turn further our knowledge of how to address potential health implications. In this work, we take a novel approach to understanding the use and appeal of these emerging products by applying text mining techniques to compare consumer experiences across discussion forums. Objective This study examined content from the websites Vapor Talk, Hookah Forum, and Reddit to understand people’s experiences with different tobacco products. Our investigation involves three parts. First, we identified contextual factors that inform our understanding of tobacco use behaviors, such as setting, time, social relationships, and sensory experience, and compared the forums to identify the ones where content on these factors is most common. Second, we compared how the tobacco use experience differs with combustible cigarettes and e-cigarettes. Third, we investigated differences between e-cigarette and hookah use. Methods In the first part of our study, we employed a lexicon-based extraction approach to estimate prevalence of contextual factors, and then we generated a heat map based on these estimates to compare the forums. In the second and third parts of the study, we employed a text mining technique called topic modeling to identify important topics and then developed a visualization, Topic Bars, to compare topic coverage across forums. Results In the first part of the study, we identified two forums, Vapor Talk Health & Safety and the Stopsmoking subreddit, where discussion concerning contextual factors was particularly common. The second part showed that the discussion in Vapor Talk Health & Safety focused on symptoms and comparisons of combustible cigarettes and e-cigarettes, and the Stopsmoking subreddit focused on psychological aspects of quitting. Last, we examined the discussion content on Vapor Talk and Hookah Forum. Prominent topics included equipment, technique, experiential elements of use, and the buying and selling of equipment. Conclusions This study has three main contributions. Discussion forums differ in the extent to which their content may help us understand behaviors with potential health implications. Identifying dimensions of interest and using a heat map visualization to compare across forums can be helpful for identifying forums with the greatest density of health information. Additionally, our work has shown that the quitting experience can potentially be very different depending on whether or not e-cigarettes are used. Finally, e-cigarette and hookah forums are similar in that members represent a “hobbyist culture” that actively engages in information exchange. These differences have important implications for both tobacco regulation and smoking cessation intervention design. PMID:26420469

  6. LABORATORY EVALUATION OF ZERO-VALENT IRON TO TREAT GROUNDWATER IMPACTED BY ACID MINE DRAINAGE

    EPA Science Inventory

    The generation and release of acidic, metal-rich water from mine wastes continues to be an intractable environmental problem. Although the effects of acid mine drainage (AMD) are most evident in surface waters, there is an obvious need for developing cost-effective approaches fo...

  7. American elm in mine land reforestation

    Treesearch

    M.B. Adams; P. Angel; C. Barton; J. Slavicek

    2015-01-01

    Reforestation of mined land in the Appalachians realizes many important benefits and provides important ecosystem services. Because much of the reclaimed mine lands in Appalachia were previously in forest, reclaiming these drastically disturbed areas to forests is desirable, feasible and cost-effective. The Forestry Reclamation Approach (FRA) provides a five-step...

  8. VALUING ACID MINE DRAINAGE REMEDIATION IN WEST VIRGINIA: A HEDONIC MODELING APPROACH

    EPA Science Inventory

    States with active and abandoned mines face large private and public costs to remediate damage to streams and rivers from acid mine drainage (AMD). Appalachian states have an especially large number of contaminated streams and rivers, and the USGS places AMD as the primary source...

  9. Miners' Misconceptions of Flow Distribution Within Circuits as a Factor Influencing Underground Mining Accidents.

    NASA Astrophysics Data System (ADS)

    Passaro, Perry David

    Misconceptions can be thought of as naive approaches to problem solving that are perceptually appealing but incorrect and inconsistent with scientific evidence (Piaget, 1929). One type of misconception involves flow distributions within circuits. This concept is important because miners' conceptual errors about flow distribution changes within complex circuits may be in part responsible for fatal mine disasters. Based on the theory that misconceptions of flow distribution changes within circuits were responsible for underground mine disasters involving mine ventilation circuits, a series of studies was undertaken with mining engineering students, professional mining engineers, as well as mine foremen, mine supervisors, mine rescue members, mine maintenance personnel, mining researchers and working miners to identify these conceptual errors and errors in mine ventilation procedures. Results indicate that misconceptions of flow distribution changes within circuits exist in over 70 percent of the subjects sampled. It is assumed that these misconceptions of flow distribution changes within circuits result in errors of judgment when miners are faced with inferring and changing ventilation arrangements when two or more mine sections are connected. Furthermore, it is assumed that these misconceptions are pervasive in the mining industry and may be responsible for at least two mine ventilation disasters. The findings of this study are consistent with Piaget's (1929) model of figurative and operative knowledge. This model states that misconceptions are in part due to a lack of knowledge of dynamic transformations and how to apply content information. Recommendations for future research include the development of an interactive expert system for training miners with ventilation arrangements. Such a system would meet the educational recommendations made by Piaget (1973b) by involving a hands-on approach that allows discovery, interaction, the opportunity to make mistakes and to review the cognitive concepts on which the subject relied during his manipulation of the ventilation system.

  10. TEES 2.2: Biomedical Event Extraction for Diverse Corpora

    PubMed Central

    2015-01-01

    Background The Turku Event Extraction System (TEES) is a text mining program developed for the extraction of events, complex biomedical relationships, from scientific literature. Based on a graph-generation approach, the system detects events with the use of a rich feature set built via dependency parsing. The TEES system has achieved record performance in several of the shared tasks of its domain, and continues to be used in a variety of biomedical text mining tasks. Results The TEES system was quickly adapted to the BioNLP'13 Shared Task in order to provide a public baseline for derived systems. An automated approach was developed for learning the underlying annotation rules of event type, allowing immediate adaptation to the various subtasks, and leading to a first place in four out of eight tasks. The system for the automated learning of annotation rules is further enhanced in this paper to the point of requiring no manual adaptation to any of the BioNLP'13 tasks. Further, the scikit-learn machine learning library is integrated into the system, bringing a wide variety of machine learning methods usable with TEES in addition to the default SVM. A scikit-learn ensemble method is also used to analyze the importances of the features in the TEES feature sets. Conclusions The TEES system was introduced for the BioNLP'09 Shared Task and has since then demonstrated good performance in several other shared tasks. By applying the current TEES 2.2 system to multiple corpora from these past shared tasks an overarching analysis of the most promising methods and possible pitfalls in the evolving field of biomedical event extraction are presented. PMID:26551925

  11. The first step in the development of text mining technology for cancer risk assessment: identifying and organizing scientific evidence in risk assessment literature

    PubMed Central

    Korhonen, Anna; Silins, Ilona; Sun, Lin; Stenius, Ulla

    2009-01-01

    Background One of the most neglected areas of biomedical Text Mining (TM) is the development of systems based on carefully assessed user needs. We have recently investigated the user needs of an important task yet to be tackled by TM -- Cancer Risk Assessment (CRA). Here we take the first step towards the development of TM technology for the task: identifying and organizing the scientific evidence required for CRA in a taxonomy which is capable of supporting extensive data gathering from biomedical literature. Results The taxonomy is based on expert annotation of 1297 abstracts downloaded from relevant PubMed journals. It classifies 1742 unique keywords found in the corpus to 48 classes which specify core evidence required for CRA. We report promising results with inter-annotator agreement tests and automatic classification of PubMed abstracts to taxonomy classes. A simple user test is also reported in a near real-world CRA scenario which demonstrates along with other evaluation that the resources we have built are well-defined, accurate, and applicable in practice. Conclusion We present our annotation guidelines and a tool which we have designed for expert annotation of PubMed abstracts. A corpus annotated for keywords and document relevance is also presented, along with the taxonomy which organizes the keywords into classes defining core evidence for CRA. As demonstrated by the evaluation, the materials we have constructed provide a good basis for classification of CRA literature along multiple dimensions. They can support current manual CRA as well as facilitate the development of an approach based on TM. We discuss extending the taxonomy further via manual and machine learning approaches and the subsequent steps required to develop TM technology for the needs of CRA. PMID:19772619

  12. Spectrum of illness among returned Australian travellers from Bali, Indonesia: a 5-year retrospective observational study.

    PubMed

    Sohail, Asma; McGuinness, Sarah L; Lightowler, Rachel; Leder, Karin; Jomon, Bismi; Bain, Christopher A; Peleg, Anton Y

    2018-06-05

    Bali, Indonesia presents significant infectious and non-infectious health risks for Australian travellers. Understanding this spectrum of illnesses has the potential to assist clinicians in evaluating unwell returning travellers and guide provision of pre-travel advice. To describe the spectrum of illnesses seen in returned travellers from Bali. Using a novel text mining approach, we performed a retrospective, observational study of all adult emergency department (ED) presentations to a metropolitan health service in Melbourne, from 2011-2015. Outcome measures included demographic, clinical and laboratory features of travel-related illnesses. A total of 464 patients met inclusion criteria. Gastroenteritis (119/464, 26%), systemic febrile illness (88/464, 19%) and respiratory tract infection (51/464, 11%) were the most common diagnoses. Dengue was the most common laboratory-confirmed diagnosis (25/464, 5%). No cases of malaria were identified. Common non-infectious presentations included traumatic injury (47/464, 10%) and animal bites requiring rabies post-exposure prophylaxis (29/464, 6%). 110 patients (24%) were admitted to hospital; those presenting with systemic febrile illness were more likely to be admitted compared to those presenting with other illnesses (OR 3.42, CI 2.02-5.75, p<0.001). This is the first study to use a text mining approach to identify and describe ED presentations related to diseases acquired in Bali by Australian travellers. While infections are important causes of illness, trauma and animal bites account for a significant number of hospital presentations. Our findings contribute to knowledge on the health risks for travellers to Bali, and will assist clinicians in relevant pre- and post-travel evaluations. This article is protected by copyright. All rights reserved. This article is protected by copyright. All rights reserved.

  13. Core Values in Nursing Care Based on the Experiences of Nurses Engaged in Neonatal Nursing: A Text-mining Approach for Analyzing Reflection Records

    PubMed Central

    Watanabe, Hiromi; Okuda, Reiko; Hagino, Hiroshi

    2018-01-01

    Background Strong feelings about and enthusiasm for nursing care are reflected in nurses’ thoughts and behaviors in clinical practice and affect their profession. This study was conducted to identify the characteristics of core values in nursing care based on the experiences of nurses engaged in neonatal nursing through a process for recognizing the conceptualization of nursing. Methods We conceptualized nursing care in 43 nurses who were involved in neonatal nursing using a reflection sheet. We classified descriptions on a sheet based on the Three-Staged Recognition scheme and analyzed them using a text-mining approach. Results Nurses involved in neonatal nursing recognized that they must take care of the “child,” “mother,” and “family.” Important elements of nursing in nurses with less than 5 years versus 5 or more years of neonatal nursing experience were classified into seven clusters, respectively. These elements were mainly related to family members in both groups. In nurses with less than 5 years of experience, four clusters of one-way communication by nurses were observed in the analysis of the key elements in nursing. On the other hand, five clusters of mutual relationships between patients, their family members, and nurses were observed in nurses with 5 or more years of experience. Conclusion In conclusion, the core value of nurses engaged in neonatal nursing is family-oriented nursing. Nurses with 5 or more years of neonatal nursing experience understand patients and their family members well through establishing relationships and providing comfort and safety while taking care of them. PMID:29599621

  14. Discovery of novel biomarkers and phenotypes by semantic technologies

    PubMed Central

    2013-01-01

    Background Biomarkers and target-specific phenotypes are important to targeted drug design and individualized medicine, thus constituting an important aspect of modern pharmaceutical research and development. More and more, the discovery of relevant biomarkers is aided by in silico techniques based on applying data mining and computational chemistry on large molecular databases. However, there is an even larger source of valuable information available that can potentially be tapped for such discoveries: repositories constituted by research documents. Results This paper reports on a pilot experiment to discover potential novel biomarkers and phenotypes for diabetes and obesity by self-organized text mining of about 120,000 PubMed abstracts, public clinical trial summaries, and internal Merck research documents. These documents were directly analyzed by the InfoCodex semantic engine, without prior human manipulations such as parsing. Recall and precision against established, but different benchmarks lie in ranges up to 30% and 50% respectively. Retrieval of known entities missed by other traditional approaches could be demonstrated. Finally, the InfoCodex semantic engine was shown to discover new diabetes and obesity biomarkers and phenotypes. Amongst these were many interesting candidates with a high potential, although noticeable noise (uninteresting or obvious terms) was generated. Conclusions The reported approach of employing autonomous self-organising semantic engines to aid biomarker discovery, supplemented by appropriate manual curation processes, shows promise and has potential to impact, conservatively, a faster alternative to vocabulary processes dependent on humans having to read and analyze all the texts. More optimistically, it could impact pharmaceutical research, for example to shorten time-to-market of novel drugs, or speed up early recognition of dead ends and adverse reactions. PMID:23402646

  15. TEES 2.2: Biomedical Event Extraction for Diverse Corpora.

    PubMed

    Björne, Jari; Salakoski, Tapio

    2015-01-01

    The Turku Event Extraction System (TEES) is a text mining program developed for the extraction of events, complex biomedical relationships, from scientific literature. Based on a graph-generation approach, the system detects events with the use of a rich feature set built via dependency parsing. The TEES system has achieved record performance in several of the shared tasks of its domain, and continues to be used in a variety of biomedical text mining tasks. The TEES system was quickly adapted to the BioNLP'13 Shared Task in order to provide a public baseline for derived systems. An automated approach was developed for learning the underlying annotation rules of event type, allowing immediate adaptation to the various subtasks, and leading to a first place in four out of eight tasks. The system for the automated learning of annotation rules is further enhanced in this paper to the point of requiring no manual adaptation to any of the BioNLP'13 tasks. Further, the scikit-learn machine learning library is integrated into the system, bringing a wide variety of machine learning methods usable with TEES in addition to the default SVM. A scikit-learn ensemble method is also used to analyze the importances of the features in the TEES feature sets. The TEES system was introduced for the BioNLP'09 Shared Task and has since then demonstrated good performance in several other shared tasks. By applying the current TEES 2.2 system to multiple corpora from these past shared tasks an overarching analysis of the most promising methods and possible pitfalls in the evolving field of biomedical event extraction are presented.

  16. A Need for Systems Architecture Approach for Next Generation Mine Warfare Capability

    DTIC Science & Technology

    2006-09-01

    MRUUV Mission Reconfigurable Unmanned Undersea Vehicle MSC Mine Countermeasures Ship Coastal MSO Mine Countermeasures Ship Open-ocean P3I Preplanned...Helicopter, the Remote Mine Hunting System (RMS), the Mission Reconfigurable Unmanned Undersea Vehicle (MRUUV) and finally the Littoral Combat Ship (LCS...guarding against the sophisticated Soviet blue-water, air, and undersea threats. Yet since World War II, U.S. Naval Forces have suffered significantly

  17. Development and application of biotechnologies in the metal mining industry.

    PubMed

    Johnson, D Barrie

    2013-11-01

    Metal mining faces a number of significant economic and environmental challenges in the twenty-first century for which established and emerging biotechnologies may, at least in part, provide the answers. Bioprocessing of mineral ores and concentrates is already used in variously engineered formats to extract base (e.g., copper, cobalt, and nickel) and precious (gold and silver) metals in mines throughout the world, though it remains a niche technology. However, current projections of an increasing future need to use low-grade primary metal ores, to reprocess mine wastes, and to develop in situ leaching technologies to extract metals from deep-buried ore bodies, all of which are economically more amenable to bioprocessing than conventional approaches (e.g., pyrometallurgy), would suggest that biomining will become more extensively utilized in the future. Recent research has also shown that bioleaching could be used to process a far wider range of metal ores (e.g., oxidized ores) than has previously been the case. Biotechnologies are also being developed to control mine-related pollution, including securing mine wastes (rocks and tailings) by using "ecological engineering" approaches, and also to remediate and recover metals from waste waters, such as acid mine drainage. This article reviews the current status of biotechnologies within the mining sector and considers how these may be developed and applied in future years.

  18. A Study on Environmental Research Trends Using Text-Mining Method - Focus on Spatial information and ICT -

    NASA Astrophysics Data System (ADS)

    Lee, M. J.; Oh, K. Y.; Joung-ho, L.

    2016-12-01

    Recently there are many research about analysing the interaction between entities by text-mining analysis in various fields. In this paper, we aimed to quantitatively analyse research-trends in the area of environmental research relating either spatial information or ICT (Information and Communications Technology) by Text-mining analysis. To do this, we applied low-dimensional embedding method, clustering analysis, and association rule to find meaningful associative patterns of key words frequently appeared in the articles. As the authors suppose that KCI (Korea Citation Index) articles reflect academic demands, total 1228 KCI articles that have been published from 1996 to 2015 were reviewed and analysed by Text-mining method. First, we derived KCI articles from NDSL(National Discovery for Science Leaders) site. And then we pre-processed their key-words elected from abstract and then classified those in separable sectors. We investigated the appearance rates and association rule of key-words for articles in the two fields: spatial-information and ICT. In order to detect historic trends, analysis was conducted separately for the four periods: 1996-2000, 2001-2005, 2006-2010, 2011-2015. These analysis were conducted with the usage of R-software. As a result, we conformed that environmental research relating spatial information mainly focused upon such fields as `GIS(35%)', `Remote-Sensing(25%)', `environmental theme map(15.7%)'. Next, `ICT technology(23.6%)', `ICT service(5.4%)', `mobile(24%)', `big data(10%)', `AI(7%)' are primarily emerging from environmental research relating ICT. Thus, from the analysis results, this paper asserts that research trends and academic progresses are well-structured to review recent spatial information and ICT technology and the outcomes of the analysis can be an adequate guidelines to establish environment policies and strategies. KEY WORDS: Big data, Test-mining, Environmental research, Spatial-information, ICT Acknowledgements: The authors appreciate the support that this study has received from `Building application frame of environmental issues, to respond to the latest ICT trends'.

  19. Text mining-based in silico drug discovery in oral mucositis caused by high-dose cancer therapy.

    PubMed

    Kirk, Jon; Shah, Nirav; Noll, Braxton; Stevens, Craig B; Lawler, Marshall; Mougeot, Farah B; Mougeot, Jean-Luc C

    2018-08-01

    Oral mucositis (OM) is a major dose-limiting side effect of chemotherapy and radiation used in cancer treatment. Due to the complex nature of OM, currently available drug-based treatments are of limited efficacy. Our objectives were (i) to determine genes and molecular pathways associated with OM and wound healing using computational tools and publicly available data and (ii) to identify drugs formulated for topical use targeting the relevant OM molecular pathways. OM and wound healing-associated genes were determined by text mining, and the intersection of the two gene sets was selected for gene ontology analysis using the GeneCodis program. Protein interaction network analysis was performed using STRING-db. Enriched gene sets belonging to the identified pathways were queried against the Drug-Gene Interaction database to find drug candidates for topical use in OM. Our analysis identified 447 genes common to both the "OM" and "wound healing" text mining concepts. Gene enrichment analysis yielded 20 genes representing six pathways and targetable by a total of 32 drugs which could possibly be formulated for topical application. A manual search on ClinicalTrials.gov confirmed no relevant pathway/drug candidate had been overlooked. Twenty-five of the 32 drugs can directly affect the PTGS2 (COX-2) pathway, the pathway that has been targeted in previous clinical trials with limited success. Drug discovery using in silico text mining and pathway analysis tools can facilitate the identification of existing drugs that have the potential of topical administration to improve OM treatment.

  20. Influence on Learning of a Collaborative Learning Method Comprising the Jigsaw Method and Problem-based Learning (PBL).

    PubMed

    Takeda, Kayoko; Takahashi, Kiyoshi; Masukawa, Hiroyuki; Shimamori, Yoshimitsu

    2017-01-01

    Recently, the practice of active learning has spread, increasingly recognized as an essential component of academic studies. Classes incorporating small group discussion (SGD) are conducted at many universities. At present, assessments of the effectiveness of SGD have mostly involved evaluation by questionnaires conducted by teachers, by peer assessment, and by self-evaluation of students. However, qualitative data, such as open-ended descriptions by students, have not been widely evaluated. As a result, we have been unable to analyze the processes and methods involved in how students acquire knowledge in SGD. In recent years, due to advances in information and communication technology (ICT), text mining has enabled the analysis of qualitative data. We therefore investigated whether the introduction of a learning system comprising the jigsaw method and problem-based learning (PBL) would improve student attitudes toward learning; we did this by text mining analysis of the content of student reports. We found that by applying the jigsaw method before PBL, we were able to improve student attitudes toward learning and increase the depth of their understanding of the area of study as a result of working with others. The use of text mining to analyze qualitative data also allowed us to understand the processes and methods by which students acquired knowledge in SGD and also changes in students' understanding and performance based on improvements to the class. This finding suggests that the use of text mining to analyze qualitative data could enable teachers to evaluate the effectiveness of various methods employed to improve learning.

  1. Text feature extraction based on deep learning: a review.

    PubMed

    Liang, Hong; Sun, Xiao; Sun, Yunlei; Gao, Yuan

    2017-01-01

    Selection of text feature item is a basic and important matter for text mining and information retrieval. Traditional methods of feature extraction require handcrafted features. To hand-design, an effective feature is a lengthy process, but aiming at new applications, deep learning enables to acquire new effective feature representation from training data. As a new feature extraction method, deep learning has made achievements in text mining. The major difference between deep learning and conventional methods is that deep learning automatically learns features from big data, instead of adopting handcrafted features, which mainly depends on priori knowledge of designers and is highly impossible to take the advantage of big data. Deep learning can automatically learn feature representation from big data, including millions of parameters. This thesis outlines the common methods used in text feature extraction first, and then expands frequently used deep learning methods in text feature extraction and its applications, and forecasts the application of deep learning in feature extraction.

  2. 30 CFR 730.11 - Inconsistent and more stringent State laws and regulations.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... regulations. 730.11 Section 730.11 Mineral Resources OFFICE OF SURFACE MINING RECLAMATION AND ENFORCEMENT... Register setting forth the text or a summary of any State law or regulation initially determined by him to... stringent land use and environmental controls and regulations of coal exploration and surface coal mining...

  3. Using Syntactic Patterns to Enhance Text Analytics

    ERIC Educational Resources Information Center

    Meyer, Bradley B.

    2017-01-01

    Large scale product and service reviews proliferate and are commonly found across the web. The ability to harvest, digest and analyze a large corpus of reviews from online websites is still however a difficult problem. This problem is referred to as "opinion mining." Opinion mining is an important area of research as advances in the…

  4. 76 FR 5216 - Notice of Availability of Final Supplemental Environmental Impact Statement for the Nichols Ranch...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-01-28

    ... Uranium Recovery Project, located in the Pumpkin Buttes Uranium Mining District within the Powder River.... Alternatives that were considered, but were eliminated from detailed analysis, include conventional mining and... an Agencywide Documents and Management System (ADAMS), which provides text and image files of the NRC...

  5. 76 FR 53500 - Notice of the Nuclear Regulatory Commission Issuance of Materials License SUA-1598 and Record of...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-08-26

    ... (ADAMS), which provides text and image files of the NRC's public documents in the NRC Library at http... considered, but eliminated from detailed analysis, include conventional uranium mining and milling, conventional mining and heap leach processing, alternate lixiviants, and alternative wastewater disposal...

  6. Biogeochemical behaviour and bioremediation of uranium in waters of abandoned mines.

    PubMed

    Mkandawire, Martin

    2013-11-01

    The discharges of uranium and associated radionuclides as well as heavy metals and metalloids from waste and tailing dumps in abandoned uranium mining and processing sites pose contamination risks to surface and groundwater. Although many more are being planned for nuclear energy purposes, most of the abandoned uranium mines are a legacy of uranium production that fuelled arms race during the cold war of the last century. Since the end of cold war, there have been efforts to rehabilitate the mining sites, initially, using classical remediation techniques based on high chemical and civil engineering. Recently, bioremediation technology has been sought as alternatives to the classical approach due to reasons, which include: (a) high demand of sites requiring remediation; (b) the economic implication of running and maintaining the facilities due to high energy and work force demand; and (c) the pattern and characteristics of contaminant discharges in most of the former uranium mining and processing sites prevents the use of classical methods. This review discusses risks of uranium contamination from abandoned uranium mines from the biogeochemical point of view and the potential and limitation of uranium bioremediation technique as alternative to classical approach in abandoned uranium mining and processing sites.

  7. Discovering and visualizing indirect associations between biomedical concepts

    PubMed Central

    Tsuruoka, Yoshimasa; Miwa, Makoto; Hamamoto, Kaisei; Tsujii, Jun'ichi; Ananiadou, Sophia

    2011-01-01

    Motivation: Discovering useful associations between biomedical concepts has been one of the main goals in biomedical text-mining, and understanding their biomedical contexts is crucial in the discovery process. Hence, we need a text-mining system that helps users explore various types of (possibly hidden) associations in an easy and comprehensible manner. Results: This article describes FACTA+, a real-time text-mining system for finding and visualizing indirect associations between biomedical concepts from MEDLINE abstracts. The system can be used as a text search engine like PubMed with additional features to help users discover and visualize indirect associations between important biomedical concepts such as genes, diseases and chemical compounds. FACTA+ inherits all functionality from its predecessor, FACTA, and extends it by incorporating three new features: (i) detecting biomolecular events in text using a machine learning model, (ii) discovering hidden associations using co-occurrence statistics between concepts, and (iii) visualizing associations to improve the interpretability of the output. To the best of our knowledge, FACTA+ is the first real-time web application that offers the functionality of finding concepts involving biomolecular events and visualizing indirect associations of concepts with both their categories and importance. Availability: FACTA+ is available as a web application at http://refine1-nactem.mc.man.ac.uk/facta/, and its visualizer is available at http://refine1-nactem.mc.man.ac.uk/facta-visualizer/. Contact: tsuruoka@jaist.ac.jp PMID:21685059

  8. Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts

    PubMed Central

    Lu, Zhiyong

    2012-01-01

    Today’s biomedical research has become heavily dependent on access to the biological knowledge encoded in expert curated biological databases. As the volume of biological literature grows rapidly, it becomes increasingly difficult for biocurators to keep up with the literature because manual curation is an expensive and time-consuming endeavour. Past research has suggested that computer-assisted curation can improve efficiency, but few text-mining systems have been formally evaluated in this regard. Through participation in the interactive text-mining track of the BioCreative 2012 workshop, we developed PubTator, a PubMed-like system that assists with two specific human curation tasks: document triage and bioconcept annotation. On the basis of evaluation results from two external user groups, we find that the accuracy of PubTator-assisted curation is comparable with that of manual curation and that PubTator can significantly increase human curatorial speed. These encouraging findings warrant further investigation with a larger number of publications to be annotated. Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/ PMID:23160414

  9. Cart'Eaux: an automatic mapping procedure for wastewater networks using machine learning and data mining

    NASA Astrophysics Data System (ADS)

    Bailly, J. S.; Delenne, C.; Chahinian, N.; Bringay, S.; Commandré, B.; Chaumont, M.; Derras, M.; Deruelle, L.; Roche, M.; Rodriguez, F.; Subsol, G.; Teisseire, M.

    2017-12-01

    In France, local government institutions must establish a detailed description of wastewater networks. The information should be available, but it remains fragmented (different formats held by different stakeholders) and incomplete. In the "Cart'Eaux" project, a multidisciplinary team, including an industrial partner, develops a global methodology using Machine Learning and Data Mining approaches applied to various types of large data to recover information in the aim of mapping urban sewage systems for hydraulic modelling. Deep-learning is first applied using a Convolution Neural Network to localize manhole covers on 5 cm resolution aerial RGB images. The detected manhole covers are then automatically connected using a tree-shaped graph constrained by industry rules. Based on a Delaunay triangulation, connections are chosen to minimize a cost function depending on pipe length, slope and possible intersection with roads or buildings. A stochastic version of this algorithm is currently being developed to account for positional uncertainty and detection errors, and generate sets of probable networks. As more information is required for hydraulic modeling (slopes, diameters, materials, etc.), text data mining is used to extract network characteristics from data posted on the Web or available through governmental or specific databases. Using an appropriate list of keywords, the web is scoured for documents which are saved in text format. The thematic entities are identified and linked to the surrounding spatial and temporal entities. The methodology is developed and tested on two towns in southern France. The primary results are encouraging: 54% of manhole covers are detected with few false detections, enabling the reconstruction of probable networks. The data mining results are still being investigated. It is clear at this stage that getting numerical values on specific pipes will be challenging. Thus, when no information is found, decision rules will be used to assign admissible numerical values to enable the final hydraulic modelling. Consequently, sensitivity analysis of the hydraulic model will be performed to take into account the uncertainty associated with each piece of information. Project funded by the European Regional Development Fund and the Occitanie Region.

  10. A new approach to preserve privacy data mining based on fuzzy theory in numerical database

    NASA Astrophysics Data System (ADS)

    Cui, Run; Kim, Hyoung Joong

    2014-01-01

    With the rapid development of information techniques, data mining approaches have become one of the most important tools to discover the in-deep associations of tuples in large-scale database. Hence how to protect the private information is quite a huge challenge, especially during the data mining procedure. In this paper, a new method is proposed for privacy protection which is based on fuzzy theory. The traditional fuzzy approach in this area will apply fuzzification to the data without considering its readability. A new style of obscured data expression is introduced to provide more details of the subsets without reducing the readability. Also we adopt a balance approach between the privacy level and utility when to achieve the suitable subgroups. An experiment is provided to show that this approach is suitable for the classification without a lower accuracy. In the future, this approach can be adapted to the data stream as the low computation complexity of the fuzzy function with a suitable modification.

  11. Integrative genomic mining for enzyme function to enable engineering of a non-natural biosynthetic pathway.

    PubMed

    Mak, Wai Shun; Tran, Stephen; Marcheschi, Ryan; Bertolani, Steve; Thompson, James; Baker, David; Liao, James C; Siegel, Justin B

    2015-11-24

    The ability to biosynthetically produce chemicals beyond what is commonly found in Nature requires the discovery of novel enzyme function. Here we utilize two approaches to discover enzymes that enable specific production of longer-chain (C5-C8) alcohols from sugar. The first approach combines bioinformatics and molecular modelling to mine sequence databases, resulting in a diverse panel of enzymes capable of catalysing the targeted reaction. The median catalytic efficiency of the computationally selected enzymes is 75-fold greater than a panel of naively selected homologues. This integrative genomic mining approach establishes a unique avenue for enzyme function discovery in the rapidly expanding sequence databases. The second approach uses computational enzyme design to reprogramme specificity. Both approaches result in enzymes with >100-fold increase in specificity for the targeted reaction. When enzymes from either approach are integrated in vivo, longer-chain alcohol production increases over 10-fold and represents >95% of the total alcohol products.

  12. New approach to generating insights for aging research based on literature mining and knowledge integration

    PubMed Central

    Kwon, Yeondae; Natori, Yukikazu

    2017-01-01

    The proportion of the elderly population in most countries worldwide is increasing dramatically. Therefore, social interest in the fields of health, longevity, and anti-aging has been increasing as well. However, the basic research results obtained from a reductionist approach in biology and a bioinformatic approach in genome science have limited usefulness for generating insights on future health, longevity, and anti-aging-related research on a case by case basis. We propose a new approach that uses our literature mining technique and bioinformatics, which lead to a better perspective on research trends by providing an expanded knowledge base to work from. We demonstrate that our approach provides useful information that deepens insights on future trends which differs from data obtained conventionally, and this methodology is already paving the way for a new field in aging-related research based on literature mining. One compelling example of this is how our new approach can be a useful tool in drug repositioning. PMID:28817730

  13. An Efficient Pattern Mining Approach for Event Detection in Multivariate Temporal Data

    PubMed Central

    Batal, Iyad; Cooper, Gregory; Fradkin, Dmitriy; Harrison, James; Moerchen, Fabian; Hauskrecht, Milos

    2015-01-01

    This work proposes a pattern mining approach to learn event detection models from complex multivariate temporal data, such as electronic health records. We present Recent Temporal Pattern mining, a novel approach for efficiently finding predictive patterns for event detection problems. This approach first converts the time series data into time-interval sequences of temporal abstractions. It then constructs more complex time-interval patterns backward in time using temporal operators. We also present the Minimal Predictive Recent Temporal Patterns framework for selecting a small set of predictive and non-spurious patterns. We apply our methods for predicting adverse medical events in real-world clinical data. The results demonstrate the benefits of our methods in learning accurate event detection models, which is a key step for developing intelligent patient monitoring and decision support systems. PMID:26752800

  14. An Empirical Model for Mine-Blast Loading

    DTIC Science & Technology

    2014-10-17

    fledged experimental program. The numerical approach however suffers from several drawbacks in the mine blast simulations. First, it is a very...Suffield consisted in a pendulum type device to measure global impulse of buried mine [15]. One of the main purposes of the ONAGER pendulum was to study...TP-1 Terminal effects, KTA 1-34 report, 2004. [15] Bues, R., Hlady, S.L. and Bergeron, D.M., Pendulum Measurement of Land Mine Blast Output, Volume

  15. Relation extraction for biological pathway construction using node2vec.

    PubMed

    Kim, Munui; Baek, Seung Han; Song, Min

    2018-06-13

    Systems biology is an important field for understanding whole biological mechanisms composed of interactions between biological components. One approach for understanding complex and diverse mechanisms is to analyze biological pathways. However, because these pathways consist of important interactions and information on these interactions is disseminated in a large number of biomedical reports, text-mining techniques are essential for extracting these relationships automatically. In this study, we applied node2vec, an algorithmic framework for feature learning in networks, for relationship extraction. To this end, we extracted genes from paper abstracts using pkde4j, a text-mining tool for detecting entities and relationships. Using the extracted genes, a co-occurrence network was constructed and node2vec was used with the network to generate a latent representation. To demonstrate the efficacy of node2vec in extracting relationships between genes, performance was evaluated for gene-gene interactions involved in a type 2 diabetes pathway. Moreover, we compared the results of node2vec to those of baseline methods such as co-occurrence and DeepWalk. Node2vec outperformed existing methods in detecting relationships in the type 2 diabetes pathway, demonstrating that this method is appropriate for capturing the relatedness between pairs of biological entities involved in biological pathways. The results demonstrated that node2vec is useful for automatic pathway construction.

  16. Knowledge-based identification of the ERK2/STAT3 signal pathway as a therapeutic target for type 2 diabetes and drug discovery.

    PubMed

    Kinoshita, Takayoshi; Doi, Kentaro; Sugiyama, Hajime; Kinoshita, Shuhei; Wada, Mutsuyo; Naruto, Shuji; Tomonaga, Atsushi

    2011-09-01

    Many existing agents for diabetes therapy are unable to restore or maintain normal glucose homeostasis or prevent the eventual emergence of hyperglycemia-related complication. Therefore, agents based on novel mechanisms are sought to complement and extend the current therapeutic approaches. Based on the initial paper research, we focused on active STAT3 as an attractive pharmacological target for type 2 diabetes. The subsequent text mining with a unique query to identify suppressors but not activators of STAT3 revealed the ERK2/STAT3 pathway as a novel diabetes target. The description of ERK2 inhibitors as diabetes target had not been found in our text mining research at present. The mechanism-based peptide inhibitor for ERK2 was identified using the knowledge of the KIM sequence, which has an important role in the recognition of cognate kinases, phosphatases, scaffold proteins, and substrates. The peptide inhibitor was confirmed to exert effects in vitro and in vivo. The peptide inhibitor conferred a significant decrease in HOMA-IR levels on Day 28 compared with that in the vehicle group. Besides lowering the fasting blood glucose level, the peptide inhibitor also attenuated the blood glucose increment in the fed state, as compared with the vehicle group. © 2011 John Wiley & Sons A/S.

  17. A Data Mining Approach to Reveal Representative Collaboration Indicators in Open Collaboration Frameworks

    ERIC Educational Resources Information Center

    Anaya, Antonio R.; Boticario, Jesus G.

    2009-01-01

    Data mining methods are successful in educational environments to discover new knowledge or learner skills or features. Unfortunately, they have not been used in depth with collaboration. We have developed a scalable data mining method, whose objective is to infer information on the collaboration during the collaboration process in a…

  18. VALUING ACID MINE DRAINAGE REMEDIATION IN WEST VIRGINIA: A HEDONIC MODELING APPROACH INCORPORATING GEOGRAPHIC INFORMATION SYSTEMS

    EPA Science Inventory

    States with active and abandoned mines face large private and public costs to remediate damage to streams and rivers from acid mine drainage (AMD). Appalachian states have an especially large number of contaminated streams and rivers, and the USGS places AMD as the primary source...

  19. Sustainable Remediation of Legacy Mine Drainage: A Case Study of the Flight 93 National Memorial.

    PubMed

    Emili, Lisa A; Pizarchik, Joseph; Mahan, Carolyn G

    2016-03-01

    Pollution from mining activities is a global environmental concern, not limited to areas of current resource extraction, but including a broader geographic area of historic (legacy) and abandoned mines. The pollution of surface waters from acid mine drainage is a persistent problem and requires a holistic and sustainable approach to addressing the spatial and temporal complexity of mining-specific problems. In this paper, we focus on the environmental, socio-economic, and legal challenges associated with the concurrent activities to remediate a coal mine site and to develop a national memorial following a catastrophic event. We provide a conceptual construct of a socio-ecological system defined at several spatial, temporal, and organizational scales and a critical synthesis of the technical and social learning processes necessary to achieving sustainable environmental remediation. Our case study is an example of a multi-disciplinary management approach, whereby collaborative interaction of stakeholders, the emergence of functional linkages for information exchange, and mediation led to scientifically informed decision making, creative management solutions, and ultimately environmental policy change.

  20. OSCAR4: a flexible architecture for chemical text-mining

    PubMed Central

    2011-01-01

    The Open-Source Chemistry Analysis Routines (OSCAR) software, a toolkit for the recognition of named entities and data in chemistry publications, has been developed since 2002. Recent work has resulted in the separation of the core OSCAR functionality and its release as the OSCAR4 library. This library features a modular API (based on reduction of surface coupling) that permits client programmers to easily incorporate it into external applications. OSCAR4 offers a domain-independent architecture upon which chemistry specific text-mining tools can be built, and its development and usage are discussed. PMID:21999457

Top