ParaBTM: A Parallel Processing Framework for Biomedical Text Mining on Supercomputers.
Xing, Yuting; Wu, Chengkun; Yang, Xi; Wang, Wei; Zhu, En; Yin, Jianping
2018-04-27
A prevailing way of extracting valuable information from biomedical literature is to apply text mining methods on unstructured texts. However, the massive amount of literature that needs to be analyzed poses a big data challenge to the processing efficiency of text mining. In this paper, we address this challenge by introducing parallel processing on a supercomputer. We developed paraBTM, a runnable framework that enables parallel text mining on the Tianhe-2 supercomputer. It employs a low-cost yet effective load balancing strategy to maximize the efficiency of parallel processing. We evaluated the performance of paraBTM on several datasets, utilizing three types of named entity recognition tasks as demonstration. Results show that, in most cases, the processing efficiency can be greatly improved with parallel processing, and the proposed load balancing strategy is simple and effective. In addition, our framework can be readily applied to other tasks of biomedical text mining besides NER.
NASA Astrophysics Data System (ADS)
Scheele, C. J.; Huang, Q.
2016-12-01
In the past decade, the rise in social media has led to the development of a vast number of social media services and applications. Disaster management represents one of such applications leveraging massive data generated for event detection, response, and recovery. In order to find disaster relevant social media data, current approaches utilize natural language processing (NLP) methods based on keywords, or machine learning algorithms relying on text only. However, these approaches cannot be perfectly accurate due to the variability and uncertainty in language used on social media. To improve current methods, the enhanced text-mining framework is proposed to incorporate location information from social media and authoritative remote sensing datasets for detecting disaster relevant social media posts, which are determined by assessing the textual content using common text mining methods and how the post relates spatiotemporally to the disaster event. To assess the framework, geo-tagged Tweets were collected for three different spatial and temporal disaster events: hurricane, flood, and tornado. Remote sensing data and products for each event were then collected using RealEarthTM. Both Naive Bayes and Logistic Regression classifiers were used to compare the accuracy within the enhanced text-mining framework. Finally, the accuracies from the enhanced text-mining framework were compared to the current text-only methods for each of the case study disaster events. The results from this study address the need for more authoritative data when using social media in disaster management applications.
SparkText: Biomedical Text Mining on Big Data Framework.
Ye, Zhan; Tafti, Ahmad P; He, Karen Y; Wang, Kai; He, Max M
Many new biomedical research articles are published every day, accumulating rich information, such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment. In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance for classifying cancer types, we extracted information (e.g., breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes. This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time update from new articles published daily. SparkText can be extended to other areas of biomedical research.
SparkText: Biomedical Text Mining on Big Data Framework
He, Karen Y.; Wang, Kai
2016-01-01
Background Many new biomedical research articles are published every day, accumulating rich information, such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment. Results In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance for classifying cancer types, we extracted information (e.g., breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes. Conclusions This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time update from new articles published daily. SparkText can be extended to other areas of biomedical research. PMID:27685652
PubRunner: A light-weight framework for updating text mining results.
Anekalla, Kishore R; Courneya, J P; Fiorini, Nicolas; Lever, Jake; Muchow, Michael; Busby, Ben
2017-01-01
Biomedical text mining promises to assist biologists in quickly navigating the combined knowledge in their domain. This would allow improved understanding of the complex interactions within biological systems and faster hypothesis generation. New biomedical research articles are published daily and text mining tools are only as good as the corpus from which they work. Many text mining tools are underused because their results are static and do not reflect the constantly expanding knowledge in the field. In order for biomedical text mining to become an indispensable tool used by researchers, this problem must be addressed. To this end, we present PubRunner, a framework for regularly running text mining tools on the latest publications. PubRunner is lightweight, simple to use, and can be integrated with an existing text mining tool. The workflow involves downloading the latest abstracts from PubMed, executing a user-defined tool, pushing the resulting data to a public FTP or Zenodo dataset, and publicizing the location of these results on the public PubRunner website. We illustrate the use of this tool by re-running the commonly used word2vec tool on the latest PubMed abstracts to generate up-to-date word vector representations for the biomedical domain. This shows a proof of concept that we hope will encourage text mining developers to build tools that truly will aid biologists in exploring the latest publications.
Adaptive semantic tag mining from heterogeneous clinical research texts.
Hao, T; Weng, C
2015-01-01
To develop an adaptive approach to mine frequent semantic tags (FSTs) from heterogeneous clinical research texts. We develop a "plug-n-play" framework that integrates replaceable unsupervised kernel algorithms with formatting, functional, and utility wrappers for FST mining. Temporal information identification and semantic equivalence detection were two example functional wrappers. We first compared this approach's recall and efficiency for mining FSTs from ClinicalTrials.gov to that of a recently published tag-mining algorithm. Then we assessed this approach's adaptability to two other types of clinical research texts: clinical data requests and clinical trial protocols, by comparing the prevalence trends of FSTs across three texts. Our approach increased the average recall and speed by 12.8% and 47.02% respectively upon the baseline when mining FSTs from ClinicalTrials.gov, and maintained an overlap in relevant FSTs with the base- line ranging between 76.9% and 100% for varying FST frequency thresholds. The FSTs saturated when the data size reached 200 documents. Consistent trends in the prevalence of FST were observed across the three texts as the data size or frequency threshold changed. This paper contributes an adaptive tag-mining framework that is scalable and adaptable without sacrificing its recall. This component-based architectural design can be potentially generalizable to improve the adaptability of other clinical text mining methods.
A Framework for Text Mining in Scientometric Study: A Case Study in Biomedicine Publications
NASA Astrophysics Data System (ADS)
Silalahi, V. M. M.; Hardiyati, R.; Nadhiroh, I. M.; Handayani, T.; Rahmaida, R.; Amelia, M.
2018-04-01
The data of Indonesians research publications in the domain of biomedicine has been collected to be text mined for the purpose of a scientometric study. The goal is to build a predictive model that provides a classification of research publications on the potency for downstreaming. The model is based on the drug development processes adapted from the literatures. An effort is described to build the conceptual model and the development of a corpus on the research publications in the domain of Indonesian biomedicine. Then an investigation is conducted relating to the problems associated with building a corpus and validating the model. Based on our experience, a framework is proposed to manage the scientometric study based on text mining. Our method shows the effectiveness of conducting a scientometric study based on text mining in order to get a valid classification model. This valid model is mainly supported by the iterative and close interactions with the domain experts starting from identifying the issues, building a conceptual model, to the labelling, validation and results interpretation.
Mining Quality Phrases from Massive Text Corpora
Liu, Jialu; Shang, Jingbo; Wang, Chi; Ren, Xiang; Han, Jiawei
2015-01-01
Text data are ubiquitous and play an essential role in big data applications. However, text data are mostly unstructured. Transforming unstructured text into structured units (e.g., semantically meaningful phrases) will substantially reduce semantic ambiguity and enhance the power and efficiency at manipulating such data using database technology. Thus mining quality phrases is a critical research problem in the field of databases. In this paper, we propose a new framework that extracts quality phrases from text corpora integrated with phrasal segmentation. The framework requires only limited training but the quality of phrases so generated is close to human judgment. Moreover, the method is scalable: both computation time and required space grow linearly as corpus size increases. Our experiments on large text corpora demonstrate the quality and efficiency of the new method. PMID:26705375
U-Compare: share and compare text mining tools with UIMA.
Kano, Yoshinobu; Baumgartner, William A; McCrohon, Luke; Ananiadou, Sophia; Cohen, K Bretonnel; Hunter, Lawrence; Tsujii, Jun'ichi
2009-08-01
Due to the increasing number of text mining resources (tools and corpora) available to biologists, interoperability issues between these resources are becoming significant obstacles to using them effectively. UIMA, the Unstructured Information Management Architecture, is an open framework designed to aid in the construction of more interoperable tools. U-Compare is built on top of the UIMA framework, and provides both a concrete framework for out-of-the-box text mining and a sophisticated evaluation platform allowing users to run specific tools on any target text, generating both detailed statistics and instance-based visualizations of outputs. U-Compare is a joint project, providing the world's largest, and still growing, collection of UIMA-compatible resources. These resources, originally developed by different groups for a variety of domains, include many famous tools and corpora. U-Compare can be launched straight from the web, without needing to be manually installed. All U-Compare components are provided ready-to-use and can be combined easily via a drag-and-drop interface without any programming. External UIMA components can also simply be mixed with U-Compare components, without distinguishing between locally and remotely deployed resources. http://u-compare.org/
A New Framework for Textual Information Mining over Parse Trees. CRESST Report 805
ERIC Educational Resources Information Center
Mousavi, Hamid; Kerr, Deirdre; Iseli, Markus R.
2011-01-01
Textual information mining is a challenging problem that has resulted in the creation of many different rule-based linguistic query languages. However, these languages generally are not optimized for the purpose of text mining. In other words, they usually consider queries as individuals and only return raw results for each query. Moreover they…
A Text-Mining Framework for Supporting Systematic Reviews.
Li, Dingcheng; Wang, Zhen; Wang, Liwei; Sohn, Sunghwan; Shen, Feichen; Murad, Mohammad Hassan; Liu, Hongfang
2016-11-01
Systematic reviews (SRs) involve the identification, appraisal, and synthesis of all relevant studies for focused questions in a structured reproducible manner. High-quality SRs follow strict procedures and require significant resources and time. We investigated advanced text-mining approaches to reduce the burden associated with abstract screening in SRs and provide high-level information summary. A text-mining SR supporting framework consisting of three self-defined semantics-based ranking metrics was proposed, including keyword relevance, indexed-term relevance and topic relevance. Keyword relevance is based on the user-defined keyword list used in the search strategy. Indexed-term relevance is derived from indexed vocabulary developed by domain experts used for indexing journal articles and books. Topic relevance is defined as the semantic similarity among retrieved abstracts in terms of topics generated by latent Dirichlet allocation, a Bayesian-based model for discovering topics. We tested the proposed framework using three published SRs addressing a variety of topics (Mass Media Interventions, Rectal Cancer and Influenza Vaccine). The results showed that when 91.8%, 85.7%, and 49.3% of the abstract screening labor was saved, the recalls were as high as 100% for the three cases; respectively. Relevant studies identified manually showed strong topic similarity through topic analysis, which supported the inclusion of topic analysis as relevance metric. It was demonstrated that advanced text mining approaches can significantly reduce the abstract screening labor of SRs and provide an informative summary of relevant studies.
U-Compare: share and compare text mining tools with UIMA
Kano, Yoshinobu; Baumgartner, William A.; McCrohon, Luke; Ananiadou, Sophia; Cohen, K. Bretonnel; Hunter, Lawrence; Tsujii, Jun'ichi
2009-01-01
Summary: Due to the increasing number of text mining resources (tools and corpora) available to biologists, interoperability issues between these resources are becoming significant obstacles to using them effectively. UIMA, the Unstructured Information Management Architecture, is an open framework designed to aid in the construction of more interoperable tools. U-Compare is built on top of the UIMA framework, and provides both a concrete framework for out-of-the-box text mining and a sophisticated evaluation platform allowing users to run specific tools on any target text, generating both detailed statistics and instance-based visualizations of outputs. U-Compare is a joint project, providing the world's largest, and still growing, collection of UIMA-compatible resources. These resources, originally developed by different groups for a variety of domains, include many famous tools and corpora. U-Compare can be launched straight from the web, without needing to be manually installed. All U-Compare components are provided ready-to-use and can be combined easily via a drag-and-drop interface without any programming. External UIMA components can also simply be mixed with U-Compare components, without distinguishing between locally and remotely deployed resources. Availability: http://u-compare.org/ Contact: kano@is.s.u-tokyo.ac.jp PMID:19414535
An open-source framework for large-scale, flexible evaluation of biomedical text mining systems.
Baumgartner, William A; Cohen, K Bretonnel; Hunter, Lawrence
2008-01-29
Improved evaluation methodologies have been identified as a necessary prerequisite to the improvement of text mining theory and practice. This paper presents a publicly available framework that facilitates thorough, structured, and large-scale evaluations of text mining technologies. The extensibility of this framework and its ability to uncover system-wide characteristics by analyzing component parts as well as its usefulness for facilitating third-party application integration are demonstrated through examples in the biomedical domain. Our evaluation framework was assembled using the Unstructured Information Management Architecture. It was used to analyze a set of gene mention identification systems involving 225 combinations of system, evaluation corpus, and correctness measure. Interactions between all three were found to affect the relative rankings of the systems. A second experiment evaluated gene normalization system performance using as input 4,097 combinations of gene mention systems and gene mention system-combining strategies. Gene mention system recall is shown to affect gene normalization system performance much more than does gene mention system precision, and high gene normalization performance is shown to be achievable with remarkably low levels of gene mention system precision. The software presented in this paper demonstrates the potential for novel discovery resulting from the structured evaluation of biomedical language processing systems, as well as the usefulness of such an evaluation framework for promoting collaboration between developers of biomedical language processing technologies. The code base is available as part of the BioNLP UIMA Component Repository on SourceForge.net.
An open-source framework for large-scale, flexible evaluation of biomedical text mining systems
Baumgartner, William A; Cohen, K Bretonnel; Hunter, Lawrence
2008-01-01
Background Improved evaluation methodologies have been identified as a necessary prerequisite to the improvement of text mining theory and practice. This paper presents a publicly available framework that facilitates thorough, structured, and large-scale evaluations of text mining technologies. The extensibility of this framework and its ability to uncover system-wide characteristics by analyzing component parts as well as its usefulness for facilitating third-party application integration are demonstrated through examples in the biomedical domain. Results Our evaluation framework was assembled using the Unstructured Information Management Architecture. It was used to analyze a set of gene mention identification systems involving 225 combinations of system, evaluation corpus, and correctness measure. Interactions between all three were found to affect the relative rankings of the systems. A second experiment evaluated gene normalization system performance using as input 4,097 combinations of gene mention systems and gene mention system-combining strategies. Gene mention system recall is shown to affect gene normalization system performance much more than does gene mention system precision, and high gene normalization performance is shown to be achievable with remarkably low levels of gene mention system precision. Conclusion The software presented in this paper demonstrates the potential for novel discovery resulting from the structured evaluation of biomedical language processing systems, as well as the usefulness of such an evaluation framework for promoting collaboration between developers of biomedical language processing technologies. The code base is available as part of the BioNLP UIMA Component Repository on SourceForge.net. PMID:18230184
Lucini, Filipe R; S Fogliatto, Flavio; C da Silveira, Giovani J; L Neyeloff, Jeruza; Anzanello, Michel J; de S Kuchenbecker, Ricardo; D Schaan, Beatriz
2017-04-01
Emergency department (ED) overcrowding is a serious issue for hospitals. Early information on short-term inward bed demand from patients receiving care at the ED may reduce the overcrowding problem, and optimize the use of hospital resources. In this study, we use text mining methods to process data from early ED patient records using the SOAP framework, and predict future hospitalizations and discharges. We try different approaches for pre-processing of text records and to predict hospitalization. Sets-of-words are obtained via binary representation, term frequency, and term frequency-inverse document frequency. Unigrams, bigrams and trigrams are tested for feature formation. Feature selection is based on χ 2 and F-score metrics. In the prediction module, eight text mining methods are tested: Decision Tree, Random Forest, Extremely Randomized Tree, AdaBoost, Logistic Regression, Multinomial Naïve Bayes, Support Vector Machine (Kernel linear) and Nu-Support Vector Machine (Kernel linear). Prediction performance is evaluated by F1-scores. Precision and Recall values are also informed for all text mining methods tested. Nu-Support Vector Machine was the text mining method with the best overall performance. Its average F1-score in predicting hospitalization was 77.70%, with a standard deviation (SD) of 0.66%. The method could be used to manage daily routines in EDs such as capacity planning and resource allocation. Text mining could provide valuable information and facilitate decision-making by inward bed management teams. Copyright © 2017 Elsevier Ireland Ltd. All rights reserved.
Extraction and Classification of Emotions for Business Research
NASA Astrophysics Data System (ADS)
Verma, Rajib
The commercial study of emotions has not embraced Internet / social mining yet, even though it has important applications in management. This is surprising since the emotional content is freeform, wide spread, can give a better indication of feelings (for instance with taboo subjects), and is inexpensive compared to other business research methods. A brief framework for applying text mining to this new research domain is shown and classification issues are discussed in an effort to quickly get businessman and researchers to adopt the mining methodology.
Clustering and Dimensionality Reduction to Discover Interesting Patterns in Binary Data
NASA Astrophysics Data System (ADS)
Palumbo, Francesco; D'Enza, Alfonso Iodice
The attention towards binary data coding increased consistently in the last decade due to several reasons. The analysis of binary data characterizes several fields of application, such as market basket analysis, DNA microarray data, image mining, text mining and web-clickstream mining. The paper illustrates two different approaches exploiting a profitable combination of clustering and dimensionality reduction for the identification of non-trivial association structures in binary data. An application in the Association Rules framework supports the theory with the empirical evidence.
Signal Detection Framework Using Semantic Text Mining Techniques
ERIC Educational Resources Information Center
Sudarsan, Sithu D.
2009-01-01
Signal detection is a challenging task for regulatory and intelligence agencies. Subject matter experts in those agencies analyze documents, generally containing narrative text in a time bound manner for signals by identification, evaluation and confirmation, leading to follow-up action e.g., recalling a defective product or public advisory for…
Van Landeghem, Sofie; Abeel, Thomas; Saeys, Yvan; Van de Peer, Yves
2010-09-15
In the field of biomolecular text mining, black box behavior of machine learning systems currently limits understanding of the true nature of the predictions. However, feature selection (FS) is capable of identifying the most relevant features in any supervised learning setting, providing insight into the specific properties of the classification algorithm. This allows us to build more accurate classifiers while at the same time bridging the gap between the black box behavior and the end-user who has to interpret the results. We show that our FS methodology successfully discards a large fraction of machine-generated features, improving classification performance of state-of-the-art text mining algorithms. Furthermore, we illustrate how FS can be applied to gain understanding in the predictions of a framework for biomolecular event extraction from text. We include numerous examples of highly discriminative features that model either biological reality or common linguistic constructs. Finally, we discuss a number of insights from our FS analyses that will provide the opportunity to considerably improve upon current text mining tools. The FS algorithms and classifiers are available in Java-ML (http://java-ml.sf.net). The datasets are publicly available from the BioNLP'09 Shared Task web site (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/).
A Semi-supervised Heat Kernel Pagerank MBO Algorithm for Data Classification
2016-07-01
financial predictions, etc. and is finding growing use in text mining studies. In this paper, we present an efficient algorithm for classification of high...video data, set of images, hyperspectral data, medical data, text data, etc. Moreover, the framework provides a way to analyze data whose different...also be incorporated. For text classification, one can use tfidf (term frequency inverse document frequency) to form feature vectors for each document
Unapparent Information Revelation: Text Mining for Counterterrorism
NASA Astrophysics Data System (ADS)
Srihari, Rohini K.
Unapparent information revelation (UIR) is a special case of text mining that focuses on detecting possible links between concepts across multiple text documents by generating an evidence trail explaining the connection. A traditional search involving, for example, two or more person names will attempt to find documents mentioning both these individuals. This research focuses on a different interpretation of such a query: what is the best evidence trail across documents that explains a connection between these individuals? For example, all may be good golfers. A generalization of this task involves query terms representing general concepts (e.g. indictment, foreign policy). Previous approaches to this problem have focused on graph mining involving hyperlinked documents, and link analysis exploiting named entities. A new robust framework is presented, based on (i) generating concept chain graphs, a hybrid content representation, (ii) performing graph matching to select candidate subgraphs, and (iii) subsequently using graphical models to validate hypotheses using ranked evidence trails. We adapt the DUC data set for cross-document summarization to evaluate evidence trails generated by this approach
Text Mining Metal-Organic Framework Papers.
Park, Sanghoon; Kim, Baekjun; Choi, Sihoon; Boyd, Peter G; Smit, Berend; Kim, Jihan
2018-02-26
We have developed a simple text mining algorithm that allows us to identify surface area and pore volumes of metal-organic frameworks (MOFs) using manuscript html files as inputs. The algorithm searches for common units (e.g., m 2 /g, cm 3 /g) associated with these two quantities to facilitate the search. From the sample set data of over 200 MOFs, the algorithm managed to identify 90% and 88.8% of the correct surface area and pore volume values. Further application to a test set of randomly chosen MOF html files yielded 73.2% and 85.1% accuracies for the two respective quantities. Most of the errors stem from unorthodox sentence structures that made it difficult to identify the correct data as well as bolded notations of MOFs (e.g., 1a) that made it difficult identify its real name. These types of tools will become useful when it comes to discovering structure-property relationships among MOFs as well as collecting a large set of data for references.
Unsupervised text mining for assessing and augmenting GWAS results.
Ailem, Melissa; Role, François; Nadif, Mohamed; Demenais, Florence
2016-04-01
Text mining can assist in the analysis and interpretation of large-scale biomedical data, helping biologists to quickly and cheaply gain confirmation of hypothesized relationships between biological entities. We set this question in the context of genome-wide association studies (GWAS), an actively emerging field that contributed to identify many genes associated with multifactorial diseases. These studies allow to identify groups of genes associated with the same phenotype, but provide no information about the relationships between these genes. Therefore, our objective is to leverage unsupervised text mining techniques using text-based cosine similarity comparisons and clustering applied to candidate and random gene vectors, in order to augment the GWAS results. We propose a generic framework which we used to characterize the relationships between 10 genes reported associated with asthma by a previous GWAS. The results of this experiment showed that the similarities between these 10 genes were significantly stronger than would be expected by chance (one-sided p-value<0.01). The clustering of observed and randomly selected gene also allowed to generate hypotheses about potential functional relationships between these genes and thus contributed to the discovery of new candidate genes for asthma. Copyright © 2016 Elsevier Inc. All rights reserved.
Li, Yanpeng; Hu, Xiaohua; Lin, Hongfei; Yang, Zhihao
2011-01-01
Feature representation is essential to machine learning and text mining. In this paper, we present a feature coupling generalization (FCG) framework for generating new features from unlabeled data. It selects two special types of features, i.e., example-distinguishing features (EDFs) and class-distinguishing features (CDFs) from original feature set, and then generalizes EDFs into higher-level features based on their coupling degrees with CDFs in unlabeled data. The advantage is: EDFs with extreme sparsity in labeled data can be enriched by their co-occurrences with CDFs in unlabeled data so that the performance of these low-frequency features can be greatly boosted and new information from unlabeled can be incorporated. We apply this approach to three tasks in biomedical literature mining: gene named entity recognition (NER), protein-protein interaction extraction (PPIE), and text classification (TC) for gene ontology (GO) annotation. New features are generated from over 20 GB unlabeled PubMed abstracts. The experimental results on BioCreative 2, AIMED corpus, and TREC 2005 Genomics Track show that 1) FCG can utilize well the sparse features ignored by supervised learning. 2) It improves the performance of supervised baselines by 7.8 percent, 5.0 percent, and 5.8 percent, respectively, in the tree tasks. 3) Our methods achieve 89.1, 64.5 F-score, and 60.1 normalized utility on the three benchmark data sets.
Elayavilli, Ravikumar Komandur; Liu, Hongfang
2016-01-01
Computational modeling of biological cascades is of great interest to quantitative biologists. Biomedical text has been a rich source for quantitative information. Gathering quantitative parameters and values from biomedical text is one significant challenge in the early steps of computational modeling as it involves huge manual effort. While automatically extracting such quantitative information from bio-medical text may offer some relief, lack of ontological representation for a subdomain serves as impedance in normalizing textual extractions to a standard representation. This may render textual extractions less meaningful to the domain experts. In this work, we propose a rule-based approach to automatically extract relations involving quantitative data from biomedical text describing ion channel electrophysiology. We further translated the quantitative assertions extracted through text mining to a formal representation that may help in constructing ontology for ion channel events using a rule based approach. We have developed Ion Channel ElectroPhysiology Ontology (ICEPO) by integrating the information represented in closely related ontologies such as, Cell Physiology Ontology (CPO), and Cardiac Electro Physiology Ontology (CPEO) and the knowledge provided by domain experts. The rule-based system achieved an overall F-measure of 68.93% in extracting the quantitative data assertions system on an independently annotated blind data set. We further made an initial attempt in formalizing the quantitative data assertions extracted from the biomedical text into a formal representation that offers potential to facilitate the integration of text mining into ontological workflow, a novel aspect of this study. This work is a case study where we created a platform that provides formal interaction between ontology development and text mining. We have achieved partial success in extracting quantitative assertions from the biomedical text and formalizing them in ontological framework. The ICEPO ontology is available for download at http://openbionlp.org/mutd/supplementarydata/ICEPO/ICEPO.owl.
ERIC Educational Resources Information Center
Mu, Jin; Stegmann, Karsten; Mayfield, Elijah; Rose, Carolyn; Fischer, Frank
2012-01-01
Research related to online discussions frequently faces the problem of analyzing huge corpora. Natural Language Processing (NLP) technologies may allow automating this analysis. However, the state-of-the-art in machine learning and text mining approaches yields models that do not transfer well between corpora related to different topics. Also,…
The Use of Gamification in Education: A Bibliometric and Text Mining Analysis
ERIC Educational Resources Information Center
Martí-Parreño, J.; Méndez-Ibáñez, E.; Alonso-Arroyo, A.
2016-01-01
The use of games in education represents a promising tool to motivate and engage students in their learning process. Most of previous research on the topic has focused to develop theoretical frameworks or to conduct experiments as a means to analyse learning outcomes such as knowledge retention, problem-solving skills gains or attitudes toward…
Jurca, Gabriela; Addam, Omar; Aksac, Alper; Gao, Shang; Özyer, Tansel; Demetrick, Douglas; Alhajj, Reda
2016-04-26
Breast cancer is a serious disease which affects many women and may lead to death. It has received considerable attention from the research community. Thus, biomedical researchers aim to find genetic biomarkers indicative of the disease. Novel biomarkers can be elucidated from the existing literature. However, the vast amount of scientific publications on breast cancer make this a daunting task. This paper presents a framework which investigates existing literature data for informative discoveries. It integrates text mining and social network analysis in order to identify new potential biomarkers for breast cancer. We utilized PubMed for the testing. We investigated gene-gene interactions, as well as novel interactions such as gene-year, gene-country, and abstract-country to find out how the discoveries varied over time and how overlapping/diverse are the discoveries and the interest of various research groups in different countries. Interesting trends have been identified and discussed, e.g., different genes are highlighted in relationship to different countries though the various genes were found to share functionality. Some text analysis based results have been validated against results from other tools that predict gene-gene relations and gene functions.
EXACT2: the semantics of biomedical protocols
2014-01-01
Background The reliability and reproducibility of experimental procedures is a cornerstone of scientific practice. There is a pressing technological need for the better representation of biomedical protocols to enable other agents (human or machine) to better reproduce results. A framework that ensures that all information required for the replication of experimental protocols is essential to achieve reproducibility. Methods We have developed the ontology EXACT2 (EXperimental ACTions) that is designed to capture the full semantics of biomedical protocols required for their reproducibility. To construct EXACT2 we manually inspected hundreds of published and commercial biomedical protocols from several areas of biomedicine. After establishing a clear pattern for extracting the required information we utilized text-mining tools to translate the protocols into a machine amenable format. We have verified the utility of EXACT2 through the successful processing of previously 'unseen' (not used for the construction of EXACT2) protocols. Results The paper reports on a fundamentally new version EXACT2 that supports the semantically-defined representation of biomedical protocols. The ability of EXACT2 to capture the semantics of biomedical procedures was verified through a text mining use case. In this EXACT2 is used as a reference model for text mining tools to identify terms pertinent to experimental actions, and their properties, in biomedical protocols expressed in natural language. An EXACT2-based framework for the translation of biomedical protocols to a machine amenable format is proposed. Conclusions The EXACT2 ontology is sufficient to record, in a machine processable form, the essential information about biomedical protocols. EXACT2 defines explicit semantics of experimental actions, and can be used by various computer applications. It can serve as a reference model for for the translation of biomedical protocols in natural language into a semantically-defined format. PMID:25472549
Transfer Learning beyond Text Classification
NASA Astrophysics Data System (ADS)
Yang, Qiang
Transfer learning is a new machine learning and data mining framework that allows the training and test data to come from different distributions or feature spaces. We can find many novel applications of machine learning and data mining where transfer learning is necessary. While much has been done in transfer learning in text classification and reinforcement learning, there has been a lack of documented success stories of novel applications of transfer learning in other areas. In this invited article, I will argue that transfer learning is in fact quite ubiquitous in many real world applications. In this article, I will illustrate this point through an overview of a broad spectrum of applications of transfer learning that range from collaborative filtering to sensor based location estimation and logical action model learning for AI planning. I will also discuss some potential future directions of transfer learning.
Crowley, Rebecca S; Castine, Melissa; Mitchell, Kevin; Chavan, Girish; McSherry, Tara; Feldman, Michael
2010-01-01
The authors report on the development of the Cancer Tissue Information Extraction System (caTIES)--an application that supports collaborative tissue banking and text mining by leveraging existing natural language processing methods and algorithms, grid communication and security frameworks, and query visualization methods. The system fills an important need for text-derived clinical data in translational research such as tissue-banking and clinical trials. The design of caTIES addresses three critical issues for informatics support of translational research: (1) federation of research data sources derived from clinical systems; (2) expressive graphical interfaces for concept-based text mining; and (3) regulatory and security model for supporting multi-center collaborative research. Implementation of the system at several Cancer Centers across the country is creating a potential network of caTIES repositories that could provide millions of de-identified clinical reports to users. The system provides an end-to-end application of medical natural language processing to support multi-institutional translational research programs.
Text Mining in Biomedical Domain with Emphasis on Document Clustering.
Renganathan, Vinaitheerthan
2017-07-01
With the exponential increase in the number of articles published every year in the biomedical domain, there is a need to build automated systems to extract unknown information from the articles published. Text mining techniques enable the extraction of unknown knowledge from unstructured documents. This paper reviews text mining processes in detail and the software tools available to carry out text mining. It also reviews the roles and applications of text mining in the biomedical domain. Text mining processes, such as search and retrieval of documents, pre-processing of documents, natural language processing, methods for text clustering, and methods for text classification are described in detail. Text mining techniques can facilitate the mining of vast amounts of knowledge on a given topic from published biomedical research articles and draw meaningful conclusions that are not possible otherwise.
Text Mining in Biomedical Domain with Emphasis on Document Clustering
2017-01-01
Objectives With the exponential increase in the number of articles published every year in the biomedical domain, there is a need to build automated systems to extract unknown information from the articles published. Text mining techniques enable the extraction of unknown knowledge from unstructured documents. Methods This paper reviews text mining processes in detail and the software tools available to carry out text mining. It also reviews the roles and applications of text mining in the biomedical domain. Results Text mining processes, such as search and retrieval of documents, pre-processing of documents, natural language processing, methods for text clustering, and methods for text classification are described in detail. Conclusions Text mining techniques can facilitate the mining of vast amounts of knowledge on a given topic from published biomedical research articles and draw meaningful conclusions that are not possible otherwise. PMID:28875048
Deploying and sharing U-Compare workflows as web services.
Kontonatsios, Georgios; Korkontzelos, Ioannis; Kolluru, Balakrishna; Thompson, Paul; Ananiadou, Sophia
2013-02-18
U-Compare is a text mining platform that allows the construction, evaluation and comparison of text mining workflows. U-Compare contains a large library of components that are tuned to the biomedical domain. Users can rapidly develop biomedical text mining workflows by mixing and matching U-Compare's components. Workflows developed using U-Compare can be exported and sent to other users who, in turn, can import and re-use them. However, the resulting workflows are standalone applications, i.e., software tools that run and are accessible only via a local machine, and that can only be run with the U-Compare platform. We address the above issues by extending U-Compare to convert standalone workflows into web services automatically, via a two-click process. The resulting web services can be registered on a central server and made publicly available. Alternatively, users can make web services available on their own servers, after installing the web application framework, which is part of the extension to U-Compare. We have performed a user-oriented evaluation of the proposed extension, by asking users who have tested the enhanced functionality of U-Compare to complete questionnaires that assess its functionality, reliability, usability, efficiency and maintainability. The results obtained reveal that the new functionality is well received by users. The web services produced by U-Compare are built on top of open standards, i.e., REST and SOAP protocols, and therefore, they are decoupled from the underlying platform. Exported workflows can be integrated with any application that supports these open standards. We demonstrate how the newly extended U-Compare enhances the cross-platform interoperability of workflows, by seamlessly importing a number of text mining workflow web services exported from U-Compare into Taverna, i.e., a generic scientific workflow construction platform.
Deploying and sharing U-Compare workflows as web services
2013-01-01
Background U-Compare is a text mining platform that allows the construction, evaluation and comparison of text mining workflows. U-Compare contains a large library of components that are tuned to the biomedical domain. Users can rapidly develop biomedical text mining workflows by mixing and matching U-Compare’s components. Workflows developed using U-Compare can be exported and sent to other users who, in turn, can import and re-use them. However, the resulting workflows are standalone applications, i.e., software tools that run and are accessible only via a local machine, and that can only be run with the U-Compare platform. Results We address the above issues by extending U-Compare to convert standalone workflows into web services automatically, via a two-click process. The resulting web services can be registered on a central server and made publicly available. Alternatively, users can make web services available on their own servers, after installing the web application framework, which is part of the extension to U-Compare. We have performed a user-oriented evaluation of the proposed extension, by asking users who have tested the enhanced functionality of U-Compare to complete questionnaires that assess its functionality, reliability, usability, efficiency and maintainability. The results obtained reveal that the new functionality is well received by users. Conclusions The web services produced by U-Compare are built on top of open standards, i.e., REST and SOAP protocols, and therefore, they are decoupled from the underlying platform. Exported workflows can be integrated with any application that supports these open standards. We demonstrate how the newly extended U-Compare enhances the cross-platform interoperability of workflows, by seamlessly importing a number of text mining workflow web services exported from U-Compare into Taverna, i.e., a generic scientific workflow construction platform. PMID:23419017
Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges
Singhal, Ayush; Leaman, Robert; Catlett, Natalie; Lemberger, Thomas; McEntyre, Johanna; Polson, Shawn; Xenarios, Ioannis; Arighi, Cecilia; Lu, Zhiyong
2016-01-01
Text mining in the biomedical sciences is rapidly transitioning from small-scale evaluation to large-scale application. In this article, we argue that text-mining technologies have become essential tools in real-world biomedical research. We describe four large scale applications of text mining, as showcased during a recent panel discussion at the BioCreative V Challenge Workshop. We draw on these applications as case studies to characterize common requirements for successfully applying text-mining techniques to practical biocuration needs. We note that system ‘accuracy’ remains a challenge and identify several additional common difficulties and potential research directions including (i) the ‘scalability’ issue due to the increasing need of mining information from millions of full-text articles, (ii) the ‘interoperability’ issue of integrating various text-mining systems into existing curation workflows and (iii) the ‘reusability’ issue on the difficulty of applying trained systems to text genres that are not seen previously during development. We then describe related efforts within the text-mining community, with a special focus on the BioCreative series of challenge workshops. We believe that focusing on the near-term challenges identified in this work will amplify the opportunities afforded by the continued adoption of text-mining tools. Finally, in order to sustain the curation ecosystem and have text-mining systems adopted for practical benefits, we call for increased collaboration between text-mining researchers and various stakeholders, including researchers, publishers and biocurators. PMID:28025348
Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges
Singhal, Ayush; Leaman, Robert; Catlett, Natalie; ...
2016-12-26
Text mining in the biomedical sciences is rapidly transitioning from small-scale evaluation to large-scale application. In this article, we argue that text-mining technologies have become essential tools in real-world biomedical research. We describe four large scale applications of text mining, as showcased during a recent panel discussion at the BioCreative V Challenge Workshop. We draw on these applications as case studies to characterize common requirements for successfully applying text-mining techniques to practical biocuration needs. We note that system ‘accuracy’ remains a challenge and identify several additional common difficulties and potential research directions including (i) the ‘scalability’ issue due to themore » increasing need of mining information from millions of full-text articles, (ii) the ‘interoperability’ issue of integrating various text-mining systems into existing curation workflows and (iii) the ‘reusability’ issue on the difficulty of applying trained systems to text genres that are not seen previously during development. We then describe related efforts within the text-mining community, with a special focus on the BioCreative series of challenge workshops. We believe that focusing on the near-term challenges identified in this work will amplify the opportunities afforded by the continued adoption of text-mining tools. In conclusion, in order to sustain the curation ecosystem and have text-mining systems adopted for practical benefits, we call for increased collaboration between text-mining researchers and various stakeholders, including researchers, publishers and biocurators.« less
Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges
DOE Office of Scientific and Technical Information (OSTI.GOV)
Singhal, Ayush; Leaman, Robert; Catlett, Natalie
Text mining in the biomedical sciences is rapidly transitioning from small-scale evaluation to large-scale application. In this article, we argue that text-mining technologies have become essential tools in real-world biomedical research. We describe four large scale applications of text mining, as showcased during a recent panel discussion at the BioCreative V Challenge Workshop. We draw on these applications as case studies to characterize common requirements for successfully applying text-mining techniques to practical biocuration needs. We note that system ‘accuracy’ remains a challenge and identify several additional common difficulties and potential research directions including (i) the ‘scalability’ issue due to themore » increasing need of mining information from millions of full-text articles, (ii) the ‘interoperability’ issue of integrating various text-mining systems into existing curation workflows and (iii) the ‘reusability’ issue on the difficulty of applying trained systems to text genres that are not seen previously during development. We then describe related efforts within the text-mining community, with a special focus on the BioCreative series of challenge workshops. We believe that focusing on the near-term challenges identified in this work will amplify the opportunities afforded by the continued adoption of text-mining tools. In conclusion, in order to sustain the curation ecosystem and have text-mining systems adopted for practical benefits, we call for increased collaboration between text-mining researchers and various stakeholders, including researchers, publishers and biocurators.« less
SOMA: A Proposed Framework for Trend Mining in Large UK Diabetic Retinopathy Temporal Databases
NASA Astrophysics Data System (ADS)
Somaraki, Vassiliki; Harding, Simon; Broadbent, Deborah; Coenen, Frans
In this paper, we present SOMA, a new trend mining framework; and Aretaeus, the associated trend mining algorithm. The proposed framework is able to detect different kinds of trends within longitudinal datasets. The prototype trends are defined mathematically so that they can be mapped onto the temporal patterns. Trends are defined and generated in terms of the frequency of occurrence of pattern changes over time. To evaluate the proposed framework the process was applied to a large collection of medical records, forming part of the diabetic retinopathy screening programme at the Royal Liverpool University Hospital.
Wagland, Richard; Recio-Saucedo, Alejandra; Simon, Michael; Bracher, Michael; Hunt, Katherine; Foster, Claire; Downing, Amy; Glaser, Adam; Corner, Jessica
2016-08-01
Quality of cancer care may greatly impact on patients' health-related quality of life (HRQoL). Free-text responses to patient-reported outcome measures (PROMs) provide rich data but analysis is time and resource-intensive. This study developed and tested a learning-based text-mining approach to facilitate analysis of patients' experiences of care and develop an explanatory model illustrating impact on HRQoL. Respondents to a population-based survey of colorectal cancer survivors provided free-text comments regarding their experience of living with and beyond cancer. An existing coding framework was tested and adapted, which informed learning-based text mining of the data. Machine-learning algorithms were trained to identify comments relating to patients' specific experiences of service quality, which were verified by manual qualitative analysis. Comparisons between coded retrieved comments and a HRQoL measure (EQ5D) were explored. The survey response rate was 63.3% (21 802/34 467), of which 25.8% (n=5634) participants provided free-text comments. Of retrieved comments on experiences of care (n=1688), over half (n=1045, 62%) described positive care experiences. Most negative experiences concerned a lack of post-treatment care (n=191, 11% of retrieved comments) and insufficient information concerning self-management strategies (n=135, 8%) or treatment side effects (n=160, 9%). Associations existed between HRQoL scores and coded algorithm-retrieved comments. Analysis indicated that the mechanism by which service quality impacted on HRQoL was the extent to which services prevented or alleviated challenges associated with disease and treatment burdens. Learning-based text mining techniques were found useful and practical tools to identify specific free-text comments within a large dataset, facilitating resource-efficient qualitative analysis. This method should be considered for future PROM analysis to inform policy and practice. Study findings indicated that perceived care quality directly impacts on HRQoL. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://www.bmj.com/company/products-services/rights-and-licensing/
Mining Recent Temporal Patterns for Event Detection in Multivariate Time Series Data
Batal, Iyad; Fradkin, Dmitriy; Harrison, James; Moerchen, Fabian; Hauskrecht, Milos
2015-01-01
Improving the performance of classifiers using pattern mining techniques has been an active topic of data mining research. In this work we introduce the recent temporal pattern mining framework for finding predictive patterns for monitoring and event detection problems in complex multivariate time series data. This framework first converts time series into time-interval sequences of temporal abstractions. It then constructs more complex temporal patterns backwards in time using temporal operators. We apply our framework to health care data of 13,558 diabetic patients and show its benefits by efficiently finding useful patterns for detecting and diagnosing adverse medical conditions that are associated with diabetes. PMID:25937993
Design of material management system of mining group based on Hadoop
NASA Astrophysics Data System (ADS)
Xia, Zhiyuan; Tan, Zhuoying; Qi, Kuan; Li, Wen
2018-01-01
Under the background of persistent slowdown in mining market at present, improving the management level in mining group has become the key link to improve the economic benefit of the mine. According to the practical material management in mining group, three core components of Hadoop are applied: distributed file system HDFS, distributed computing framework Map/Reduce and distributed database HBase. Material management system of mining group based on Hadoop is constructed with the three core components of Hadoop and SSH framework technology. This system was found to strengthen collaboration between mining group and affiliated companies, and then the problems such as inefficient management, server pressure, hardware equipment performance deficiencies that exist in traditional mining material-management system are solved, and then mining group materials management is optimized, the cost of mining management is saved, the enterprise profit is increased.
Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges.
Singhal, Ayush; Leaman, Robert; Catlett, Natalie; Lemberger, Thomas; McEntyre, Johanna; Polson, Shawn; Xenarios, Ioannis; Arighi, Cecilia; Lu, Zhiyong
2016-01-01
Text mining in the biomedical sciences is rapidly transitioning from small-scale evaluation to large-scale application. In this article, we argue that text-mining technologies have become essential tools in real-world biomedical research. We describe four large scale applications of text mining, as showcased during a recent panel discussion at the BioCreative V Challenge Workshop. We draw on these applications as case studies to characterize common requirements for successfully applying text-mining techniques to practical biocuration needs. We note that system 'accuracy' remains a challenge and identify several additional common difficulties and potential research directions including (i) the 'scalability' issue due to the increasing need of mining information from millions of full-text articles, (ii) the 'interoperability' issue of integrating various text-mining systems into existing curation workflows and (iii) the 'reusability' issue on the difficulty of applying trained systems to text genres that are not seen previously during development. We then describe related efforts within the text-mining community, with a special focus on the BioCreative series of challenge workshops. We believe that focusing on the near-term challenges identified in this work will amplify the opportunities afforded by the continued adoption of text-mining tools. Finally, in order to sustain the curation ecosystem and have text-mining systems adopted for practical benefits, we call for increased collaboration between text-mining researchers and various stakeholders, including researchers, publishers and biocurators. Published by Oxford University Press 2016. This work is written by US Government employees and is in the public domain in the US.
PKDE4J: Entity and relation extraction for public knowledge discovery.
Song, Min; Kim, Won Chul; Lee, Dahee; Heo, Go Eun; Kang, Keun Young
2015-10-01
Due to an enormous number of scientific publications that cannot be handled manually, there is a rising interest in text-mining techniques for automated information extraction, especially in the biomedical field. Such techniques provide effective means of information search, knowledge discovery, and hypothesis generation. Most previous studies have primarily focused on the design and performance improvement of either named entity recognition or relation extraction. In this paper, we present PKDE4J, a comprehensive text-mining system that integrates dictionary-based entity extraction and rule-based relation extraction in a highly flexible and extensible framework. Starting with the Stanford CoreNLP, we developed the system to cope with multiple types of entities and relations. The system also has fairly good performance in terms of accuracy as well as the ability to configure text-processing components. We demonstrate its competitive performance by evaluating it on many corpora and found that it surpasses existing systems with average F-measures of 85% for entity extraction and 81% for relation extraction. Copyright © 2015 Elsevier Inc. All rights reserved.
Text Mining for Adverse Drug Events: the Promise, Challenges, and State of the Art
Harpaz, Rave; Callahan, Alison; Tamang, Suzanne; Low, Yen; Odgers, David; Finlayson, Sam; Jung, Kenneth; LePendu, Paea; Shah, Nigam H.
2014-01-01
Text mining is the computational process of extracting meaningful information from large amounts of unstructured text. Text mining is emerging as a tool to leverage underutilized data sources that can improve pharmacovigilance, including the objective of adverse drug event detection and assessment. This article provides an overview of recent advances in pharmacovigilance driven by the application of text mining, and discusses several data sources—such as biomedical literature, clinical narratives, product labeling, social media, and Web search logs—that are amenable to text-mining for pharmacovigilance. Given the state of the art, it appears text mining can be applied to extract useful ADE-related information from multiple textual sources. Nonetheless, further research is required to address remaining technical challenges associated with the text mining methodologies, and to conclusively determine the relative contribution of each textual source to improving pharmacovigilance. PMID:25151493
ERIC Educational Resources Information Center
Trybula, Walter J.
1999-01-01
Reviews the state of research in text mining, focusing on newer developments. The intent is to describe the disparate investigations currently included under the term text mining and provide a cohesive structure for these efforts. A summary of research identifies key organizations responsible for pushing the development of text mining. A section…
Rinaldi, Fabio; Ellendorff, Tilia Renate; Madan, Sumit; Clematide, Simon; van der Lek, Adrian; Mevissen, Theo; Fluck, Juliane
2016-01-01
Automatic extraction of biological network information is one of the most desired and most complex tasks in biological and medical text mining. Track 4 at BioCreative V attempts to approach this complexity using fragments of large-scale manually curated biological networks, represented in Biological Expression Language (BEL), as training and test data. BEL is an advanced knowledge representation format which has been designed to be both human readable and machine processable. The specific goal of track 4 was to evaluate text mining systems capable of automatically constructing BEL statements from given evidence text, and of retrieving evidence text for given BEL statements. Given the complexity of the task, we designed an evaluation methodology which gives credit to partially correct statements. We identified various levels of information expressed by BEL statements, such as entities, functions, relations, and introduced an evaluation framework which rewards systems capable of delivering useful BEL fragments at each of these levels. The aim of this evaluation method is to help identify the characteristics of the systems which, if combined, would be most useful for achieving the overall goal of automatically constructing causal biological networks from text. © The Author(s) 2016. Published by Oxford University Press.
A Framework for Web Usage Mining in Electronic Government
NASA Astrophysics Data System (ADS)
Zhou, Ping; Le, Zhongjian
Web usage mining has been a major component of management strategy to enhance organizational analysis and decision. The literature on Web usage mining that deals with strategies and technologies for effectively employing Web usage mining is quite vast. In recent years, E-government has received much attention from researchers and practitioners. Huge amounts of user access data are produced in Electronic government Web site everyday. The role of these data in the success of government management cannot be overstated because they affect government analysis, prediction, strategies, tactical, operational planning and control. Web usage miming in E-government has an important role to play in setting government objectives, discovering citizen behavior, and determining future courses of actions. Web usage mining in E-government has not received adequate attention from researchers or practitioners. We developed a framework to promote a better understanding of the importance of Web usage mining in E-government. Using the current literature, we developed the framework presented herein, in hopes that it would stimulate more interest in this important area.
Biomedical text mining and its applications in cancer research.
Zhu, Fei; Patumcharoenpol, Preecha; Zhang, Cheng; Yang, Yang; Chan, Jonathan; Meechai, Asawin; Vongsangnak, Wanwipa; Shen, Bairong
2013-04-01
Cancer is a malignant disease that has caused millions of human deaths. Its study has a long history of well over 100years. There have been an enormous number of publications on cancer research. This integrated but unstructured biomedical text is of great value for cancer diagnostics, treatment, and prevention. The immense body and rapid growth of biomedical text on cancer has led to the appearance of a large number of text mining techniques aimed at extracting novel knowledge from scientific text. Biomedical text mining on cancer research is computationally automatic and high-throughput in nature. However, it is error-prone due to the complexity of natural language processing. In this review, we introduce the basic concepts underlying text mining and examine some frequently used algorithms, tools, and data sets, as well as assessing how much these algorithms have been utilized. We then discuss the current state-of-the-art text mining applications in cancer research and we also provide some resources for cancer text mining. With the development of systems biology, researchers tend to understand complex biomedical systems from a systems biology viewpoint. Thus, the full utilization of text mining to facilitate cancer systems biology research is fast becoming a major concern. To address this issue, we describe the general workflow of text mining in cancer systems biology and each phase of the workflow. We hope that this review can (i) provide a useful overview of the current work of this field; (ii) help researchers to choose text mining tools and datasets; and (iii) highlight how to apply text mining to assist cancer systems biology research. Copyright © 2012 Elsevier Inc. All rights reserved.
ERIC Educational Resources Information Center
Yu, Chong Ho; Jannasch-Pennell, Angel; DiGangi, Samuel
2011-01-01
The objective of this article is to illustrate that text mining and qualitative research are epistemologically compatible. First, like many qualitative research approaches, such as grounded theory, text mining encourages open-mindedness and discourages preconceptions. Contrary to the popular belief that text mining is a linear and fully automated…
Text mining meets workflow: linking U-Compare with Taverna
Kano, Yoshinobu; Dobson, Paul; Nakanishi, Mio; Tsujii, Jun'ichi; Ananiadou, Sophia
2010-01-01
Summary: Text mining from the biomedical literature is of increasing importance, yet it is not easy for the bioinformatics community to create and run text mining workflows due to the lack of accessibility and interoperability of the text mining resources. The U-Compare system provides a wide range of bio text mining resources in a highly interoperable workflow environment where workflows can very easily be created, executed, evaluated and visualized without coding. We have linked U-Compare to Taverna, a generic workflow system, to expose text mining functionality to the bioinformatics community. Availability: http://u-compare.org/taverna.html, http://u-compare.org Contact: kano@is.s.u-tokyo.ac.jp Supplementary information: Supplementary data are available at Bioinformatics online. PMID:20709690
Jiang, Guoqian; Wang, Chen; Zhu, Qian; Chute, Christopher G
2013-01-01
Knowledge-driven text mining is becoming an important research area for identifying pharmacogenomics target genes. However, few of such studies have been focused on the pharmacogenomics targets of adverse drug events (ADEs). The objective of the present study is to build a framework of knowledge integration and discovery that aims to support pharmacogenomics target predication of ADEs. We integrate a semantically annotated literature corpus Semantic MEDLINE with a semantically coded ADE knowledgebase known as ADEpedia using a semantic web based framework. We developed a knowledge discovery approach combining a network analysis of a protein-protein interaction (PPI) network and a gene functional classification approach. We performed a case study of drug-induced long QT syndrome for demonstrating the usefulness of the framework in predicting potential pharmacogenomics targets of ADEs.
An investigation of social media data during a product recall scandal
NASA Astrophysics Data System (ADS)
Tse, Ying Kei; Loh, Hanlin; Ding, Juling; Zhang, Minhao
2018-07-01
As social media has become an important part of modern daily life, users often share product opinions online and these tend to spike when large companies undergo crises. This paper investigates customer online responses to a large company crisis by uncovering hidden insights in social media comments and presents a framework for handling social media data and crisis management. Analysis of textual Facebook data from users responding to the 2013 horsemeat scandal is presented. In this study, we used a novel comprehensive data analysis framework alongside a text-mining framework to objectively classify and understand customer perceptions during this horsemeat scandal. This framework provides an effective approach for investigating customer perception during a company crisis and measures the effectiveness of crisis management practices which the company has adopted. Our analyses show that social media can provide important insights into customer behaviour during crisis communications.
Combined mining: discovering informative knowledge in complex data.
Cao, Longbing; Zhang, Huaifeng; Zhao, Yanchang; Luo, Dan; Zhang, Chengqi
2011-06-01
Enterprise data mining applications often involve complex data such as multiple large heterogeneous data sources, user preferences, and business impact. In such situations, a single method or one-step mining is often limited in discovering informative knowledge. It would also be very time and space consuming, if not impossible, to join relevant large data sources for mining patterns consisting of multiple aspects of information. It is crucial to develop effective approaches for mining patterns combining necessary information from multiple relevant business lines, catering for real business settings and decision-making actions rather than just providing a single line of patterns. The recent years have seen increasing efforts on mining more informative patterns, e.g., integrating frequent pattern mining with classifications to generate frequent pattern-based classifiers. Rather than presenting a specific algorithm, this paper builds on our existing works and proposes combined mining as a general approach to mining for informative patterns combining components from either multiple data sets or multiple features or by multiple methods on demand. We summarize general frameworks, paradigms, and basic processes for multifeature combined mining, multisource combined mining, and multimethod combined mining. Novel types of combined patterns, such as incremental cluster patterns, can result from such frameworks, which cannot be directly produced by the existing methods. A set of real-world case studies has been conducted to test the frameworks, with some of them briefed in this paper. They identify combined patterns for informing government debt prevention and improving government service objectives, which show the flexibility and instantiation capability of combined mining in discovering informative knowledge in complex data.
Survey of Natural Language Processing Techniques in Bioinformatics.
Zeng, Zhiqiang; Shi, Hua; Wu, Yun; Hong, Zhiling
2015-01-01
Informatics methods, such as text mining and natural language processing, are always involved in bioinformatics research. In this study, we discuss text mining and natural language processing methods in bioinformatics from two perspectives. First, we aim to search for knowledge on biology, retrieve references using text mining methods, and reconstruct databases. For example, protein-protein interactions and gene-disease relationship can be mined from PubMed. Then, we analyze the applications of text mining and natural language processing techniques in bioinformatics, including predicting protein structure and function, detecting noncoding RNA. Finally, numerous methods and applications, as well as their contributions to bioinformatics, are discussed for future use by text mining and natural language processing researchers.
Text Mining in Organizational Research
Kobayashi, Vladimer B.; Berkers, Hannah A.; Kismihók, Gábor; Den Hartog, Deanne N.
2017-01-01
Despite the ubiquity of textual data, so far few researchers have applied text mining to answer organizational research questions. Text mining, which essentially entails a quantitative approach to the analysis of (usually) voluminous textual data, helps accelerate knowledge discovery by radically increasing the amount data that can be analyzed. This article aims to acquaint organizational researchers with the fundamental logic underpinning text mining, the analytical stages involved, and contemporary techniques that may be used to achieve different types of objectives. The specific analytical techniques reviewed are (a) dimensionality reduction, (b) distance and similarity computing, (c) clustering, (d) topic modeling, and (e) classification. We describe how text mining may extend contemporary organizational research by allowing the testing of existing or new research questions with data that are likely to be rich, contextualized, and ecologically valid. After an exploration of how evidence for the validity of text mining output may be generated, we conclude the article by illustrating the text mining process in a job analysis setting using a dataset composed of job vacancies. PMID:29881248
Text Mining in Organizational Research.
Kobayashi, Vladimer B; Mol, Stefan T; Berkers, Hannah A; Kismihók, Gábor; Den Hartog, Deanne N
2018-07-01
Despite the ubiquity of textual data, so far few researchers have applied text mining to answer organizational research questions. Text mining, which essentially entails a quantitative approach to the analysis of (usually) voluminous textual data, helps accelerate knowledge discovery by radically increasing the amount data that can be analyzed. This article aims to acquaint organizational researchers with the fundamental logic underpinning text mining, the analytical stages involved, and contemporary techniques that may be used to achieve different types of objectives. The specific analytical techniques reviewed are (a) dimensionality reduction, (b) distance and similarity computing, (c) clustering, (d) topic modeling, and (e) classification. We describe how text mining may extend contemporary organizational research by allowing the testing of existing or new research questions with data that are likely to be rich, contextualized, and ecologically valid. After an exploration of how evidence for the validity of text mining output may be generated, we conclude the article by illustrating the text mining process in a job analysis setting using a dataset composed of job vacancies.
ChemBrowser: a flexible framework for mining chemical documents.
Wu, Xian; Zhang, Li; Chen, Ying; Rhodes, James; Griffin, Thomas D; Boyer, Stephen K; Alba, Alfredo; Cai, Keke
2010-01-01
The ability to extract chemical and biological entities and relations from text documents automatically has great value to biochemical research and development activities. The growing maturity of text mining and artificial intelligence technologies shows promise in enabling such automatic chemical entity extraction capabilities (called "Chemical Annotation" in this paper). Many techniques have been reported in the literature, ranging from dictionary and rule-based techniques to machine learning approaches. In practice, we found that no single technique works well in all cases. A combinatorial approach that allows one to quickly compose different annotation techniques together for a given situation is most effective. In this paper, we describe the key challenges we face in real-world chemical annotation scenarios. We then present a solution called ChemBrowser which has a flexible framework for chemical annotation. ChemBrowser includes a suite of customizable processing units that might be utilized in a chemical annotator, a high-level language that describes the composition of various processing units that would form a chemical annotator, and an execution engine that translates the composition language to an actual annotator that can generate annotation results for a given set of documents. We demonstrate the impact of this approach by tailoring an annotator for extracting chemical names from patent documents and show how this annotator can be easily modified with simple configuration alone.
Doukas, Charalampos; Goudas, Theodosis; Fischer, Simon; Mierswa, Ingo; Chatziioannou, Aristotle; Maglogiannis, Ilias
2010-01-01
This paper presents an open image-mining framework that provides access to tools and methods for the characterization of medical images. Several image processing and feature extraction operators have been implemented and exposed through Web Services. Rapid-Miner, an open source data mining system has been utilized for applying classification operators and creating the essential processing workflows. The proposed framework has been applied for the detection of salient objects in Obstructive Nephropathy microscopy images. Initial classification results are quite promising demonstrating the feasibility of automated characterization of kidney biopsy images.
Peng, Yifan; Torii, Manabu; Wu, Cathy H; Vijay-Shanker, K
2014-08-23
Text mining is increasingly used in the biomedical domain because of its ability to automatically gather information from large amount of scientific articles. One important task in biomedical text mining is relation extraction, which aims to identify designated relations among biological entities reported in literature. A relation extraction system achieving high performance is expensive to develop because of the substantial time and effort required for its design and implementation. Here, we report a novel framework to facilitate the development of a pattern-based biomedical relation extraction system. It has several unique design features: (1) leveraging syntactic variations possible in a language and automatically generating extraction patterns in a systematic manner, (2) applying sentence simplification to improve the coverage of extraction patterns, and (3) identifying referential relations between a syntactic argument of a predicate and the actual target expected in the relation extraction task. A relation extraction system derived using the proposed framework achieved overall F-scores of 72.66% for the Simple events and 55.57% for the Binding events on the BioNLP-ST 2011 GE test set, comparing favorably with the top performing systems that participated in the BioNLP-ST 2011 GE task. We obtained similar results on the BioNLP-ST 2013 GE test set (80.07% and 60.58%, respectively). We conducted additional experiments on the training and development sets to provide a more detailed analysis of the system and its individual modules. This analysis indicates that without increasing the number of patterns, simplification and referential relation linking play a key role in the effective extraction of biomedical relations. In this paper, we present a novel framework for fast development of relation extraction systems. The framework requires only a list of triggers as input, and does not need information from an annotated corpus. Thus, we reduce the involvement of domain experts, who would otherwise have to provide manual annotations and help with the design of hand crafted patterns. We demonstrate how our framework is used to develop a system which achieves state-of-the-art performance on a public benchmark corpus.
Text mining for adverse drug events: the promise, challenges, and state of the art.
Harpaz, Rave; Callahan, Alison; Tamang, Suzanne; Low, Yen; Odgers, David; Finlayson, Sam; Jung, Kenneth; LePendu, Paea; Shah, Nigam H
2014-10-01
Text mining is the computational process of extracting meaningful information from large amounts of unstructured text. It is emerging as a tool to leverage underutilized data sources that can improve pharmacovigilance, including the objective of adverse drug event (ADE) detection and assessment. This article provides an overview of recent advances in pharmacovigilance driven by the application of text mining, and discusses several data sources-such as biomedical literature, clinical narratives, product labeling, social media, and Web search logs-that are amenable to text mining for pharmacovigilance. Given the state of the art, it appears text mining can be applied to extract useful ADE-related information from multiple textual sources. Nonetheless, further research is required to address remaining technical challenges associated with the text mining methodologies, and to conclusively determine the relative contribution of each textual source to improving pharmacovigilance.
GeoSegmenter: A statistically learned Chinese word segmenter for the geoscience domain
NASA Astrophysics Data System (ADS)
Huang, Lan; Du, Youfu; Chen, Gongyang
2015-03-01
Unlike English, the Chinese language has no space between words. Segmenting texts into words, known as the Chinese word segmentation (CWS) problem, thus becomes a fundamental issue for processing Chinese documents and the first step in many text mining applications, including information retrieval, machine translation and knowledge acquisition. However, for the geoscience subject domain, the CWS problem remains unsolved. Although a generic segmenter can be applied to process geoscience documents, they lack the domain specific knowledge and consequently their segmentation accuracy drops dramatically. This motivated us to develop a segmenter specifically for the geoscience subject domain: the GeoSegmenter. We first proposed a generic two-step framework for domain specific CWS. Following this framework, we built GeoSegmenter using conditional random fields, a principled statistical framework for sequence learning. Specifically, GeoSegmenter first identifies general terms by using a generic baseline segmenter. Then it recognises geoscience terms by learning and applying a model that can transform the initial segmentation into the goal segmentation. Empirical experimental results on geoscience documents and benchmark datasets showed that GeoSegmenter could effectively recognise both geoscience terms and general terms.
Using Workflows to Explore and Optimise Named Entity Recognition for Chemistry
Kolluru, BalaKrishna; Hawizy, Lezan; Murray-Rust, Peter; Tsujii, Junichi; Ananiadou, Sophia
2011-01-01
Chemistry text mining tools should be interoperable and adaptable regardless of system-level implementation, installation or even programming issues. We aim to abstract the functionality of these tools from the underlying implementation via reconfigurable workflows for automatically identifying chemical names. To achieve this, we refactored an established named entity recogniser (in the chemistry domain), OSCAR and studied the impact of each component on the net performance. We developed two reconfigurable workflows from OSCAR using an interoperable text mining framework, U-Compare. These workflows can be altered using the drag-&-drop mechanism of the graphical user interface of U-Compare. These workflows also provide a platform to study the relationship between text mining components such as tokenisation and named entity recognition (using maximum entropy Markov model (MEMM) and pattern recognition based classifiers). Results indicate that, for chemistry in particular, eliminating noise generated by tokenisation techniques lead to a slightly better performance than others, in terms of named entity recognition (NER) accuracy. Poor tokenisation translates into poorer input to the classifier components which in turn leads to an increase in Type I or Type II errors, thus, lowering the overall performance. On the Sciborg corpus, the workflow based system, which uses a new tokeniser whilst retaining the same MEMM component, increases the F-score from 82.35% to 84.44%. On the PubMed corpus, it recorded an F-score of 84.84% as against 84.23% by OSCAR. PMID:21633495
Using workflows to explore and optimise named entity recognition for chemistry.
Kolluru, Balakrishna; Hawizy, Lezan; Murray-Rust, Peter; Tsujii, Junichi; Ananiadou, Sophia
2011-01-01
Chemistry text mining tools should be interoperable and adaptable regardless of system-level implementation, installation or even programming issues. We aim to abstract the functionality of these tools from the underlying implementation via reconfigurable workflows for automatically identifying chemical names. To achieve this, we refactored an established named entity recogniser (in the chemistry domain), OSCAR and studied the impact of each component on the net performance. We developed two reconfigurable workflows from OSCAR using an interoperable text mining framework, U-Compare. These workflows can be altered using the drag-&-drop mechanism of the graphical user interface of U-Compare. These workflows also provide a platform to study the relationship between text mining components such as tokenisation and named entity recognition (using maximum entropy Markov model (MEMM) and pattern recognition based classifiers). Results indicate that, for chemistry in particular, eliminating noise generated by tokenisation techniques lead to a slightly better performance than others, in terms of named entity recognition (NER) accuracy. Poor tokenisation translates into poorer input to the classifier components which in turn leads to an increase in Type I or Type II errors, thus, lowering the overall performance. On the Sciborg corpus, the workflow based system, which uses a new tokeniser whilst retaining the same MEMM component, increases the F-score from 82.35% to 84.44%. On the PubMed corpus, it recorded an F-score of 84.84% as against 84.23% by OSCAR.
The Mining Minds digital health and wellness framework.
Banos, Oresti; Bilal Amin, Muhammad; Ali Khan, Wajahat; Afzal, Muhammad; Hussain, Maqbool; Kang, Byeong Ho; Lee, Sungyong
2016-07-15
The provision of health and wellness care is undergoing an enormous transformation. A key element of this revolution consists in prioritizing prevention and proactivity based on the analysis of people's conducts and the empowerment of individuals in their self-management. Digital technologies are unquestionably destined to be the main engine of this change, with an increasing number of domain-specific applications and devices commercialized every year; however, there is an apparent lack of frameworks capable of orchestrating and intelligently leveraging, all the data, information and knowledge generated through these systems. This work presents Mining Minds, a novel framework that builds on the core ideas of the digital health and wellness paradigms to enable the provision of personalized support. Mining Minds embraces some of the most prominent digital technologies, ranging from Big Data and Cloud Computing to Wearables and Internet of Things, as well as modern concepts and methods, such as context-awareness, knowledge bases or analytics, to holistically and continuously investigate on people's lifestyles and provide a variety of smart coaching and support services. This paper comprehensively describes the efficient and rational combination and interoperation of these technologies and methods through Mining Minds, while meeting the essential requirements posed by a framework for personalized health and wellness support. Moreover, this work presents a realization of the key architectural components of Mining Minds, as well as various exemplary user applications and expert tools to illustrate some of the potential services supported by the proposed framework. Mining Minds constitutes an innovative holistic means to inspect human behavior and provide personalized health and wellness support. The principles behind this framework uncover new research ideas and may serve as a reference for similar initiatives.
Health Terrain: Visualizing Large Scale Health Data
2015-12-01
Text mining ; Data mining . 16. SECURITY CLASSIFICATION OF: 17... text mining algorithms to construct a concept space. A browser-‐based user interface is developed to...Public health data, Notifiable condition detector, Text mining , Data mining 4 of 29 Disease Patient Location Term
Using natural language processing techniques to inform research on nanotechnology.
Lewinski, Nastassja A; McInnes, Bridget T
2015-01-01
Literature in the field of nanotechnology is exponentially increasing with more and more engineered nanomaterials being created, characterized, and tested for performance and safety. With the deluge of published data, there is a need for natural language processing approaches to semi-automate the cataloguing of engineered nanomaterials and their associated physico-chemical properties, performance, exposure scenarios, and biological effects. In this paper, we review the different informatics methods that have been applied to patent mining, nanomaterial/device characterization, nanomedicine, and environmental risk assessment. Nine natural language processing (NLP)-based tools were identified: NanoPort, NanoMapper, TechPerceptor, a Text Mining Framework, a Nanodevice Analyzer, a Clinical Trial Document Classifier, Nanotoxicity Searcher, NanoSifter, and NEIMiner. We conclude with recommendations for sharing NLP-related tools through online repositories to broaden participation in nanoinformatics.
Introducing Text Analytics as a Graduate Business School Course
ERIC Educational Resources Information Center
Edgington, Theresa M.
2011-01-01
Text analytics refers to the process of analyzing unstructured data from documented sources, including open-ended surveys, blogs, and other types of web dialog. Text analytics has enveloped the concept of text mining, an analysis approach influenced heavily from data mining. While text mining has been covered extensively in various computer…
Role of Knowledge Management and Analytical CRM in Business: Data Mining Based Framework
ERIC Educational Resources Information Center
Ranjan, Jayanthi; Bhatnagar, Vishal
2011-01-01
Purpose: The purpose of the paper is to provide a thorough analysis of the concepts of business intelligence (BI), knowledge management (KM) and analytical CRM (aCRM) and to establish a framework for integrating all the three to each other. The paper also seeks to establish a KM and aCRM based framework using data mining (DM) techniques, which…
Van Landeghem, Sofie; De Bodt, Stefanie; Drebert, Zuzanna J; Inzé, Dirk; Van de Peer, Yves
2013-03-01
Despite the availability of various data repositories for plant research, a wealth of information currently remains hidden within the biomolecular literature. Text mining provides the necessary means to retrieve these data through automated processing of texts. However, only recently has advanced text mining methodology been implemented with sufficient computational power to process texts at a large scale. In this study, we assess the potential of large-scale text mining for plant biology research in general and for network biology in particular using a state-of-the-art text mining system applied to all PubMed abstracts and PubMed Central full texts. We present extensive evaluation of the textual data for Arabidopsis thaliana, assessing the overall accuracy of this new resource for usage in plant network analyses. Furthermore, we combine text mining information with both protein-protein and regulatory interactions from experimental databases. Clusters of tightly connected genes are delineated from the resulting network, illustrating how such an integrative approach is essential to grasp the current knowledge available for Arabidopsis and to uncover gene information through guilt by association. All large-scale data sets, as well as the manually curated textual data, are made publicly available, hereby stimulating the application of text mining data in future plant biology studies.
Integration of Text- and Data-Mining Technologies for Use in Banking Applications
NASA Astrophysics Data System (ADS)
Maslankowski, Jacek
Unstructured data, most of it in the form of text files, typically accounts for 85% of an organization's knowledge stores, but it's not always easy to find, access, analyze or use (Robb 2004). That is why it is important to use solutions based on text and data mining. This solution is known as duo mining. This leads to improve management based on knowledge owned in organization. The results are interesting. Data mining provides to lead with structuralized data, usually powered from data warehouses. Text mining, sometimes called web mining, looks for patterns in unstructured data — memos, document and www. Integrating text-based information with structured data enriches predictive modeling capabilities and provides new stores of insightful and valuable information for driving business and research initiatives forward.
A Domain Independent Framework for Extracting Linked Semantic Data from Tables
2012-01-01
Evidence - based medicine , the essential role of systematic reviews, and the need for automated text mining tools. In: Proc. 1st ACM Int. Health... based medicine : what it is and what it isn’t. Bmj 312(7023), 71 (1996) 23. Sahoo, S.S., Halb, W., Hellmann, S., Idehen, K., Thibodeau Jr, T., Auer, S., Se...Journal of Artificial Intelligence Research 11(1), 95–130 (1999) 22. Sackett, D., Rosenberg, W., Gray, J., Haynes, R., Richardson, W.: Evidence
An Adaptive Sensor Mining Framework for Pervasive Computing Applications
NASA Astrophysics Data System (ADS)
Rashidi, Parisa; Cook, Diane J.
Analyzing sensor data in pervasive computing applications brings unique challenges to the KDD community. The challenge is heightened when the underlying data source is dynamic and the patterns change. We introduce a new adaptive mining framework that detects patterns in sensor data, and more importantly, adapts to the changes in the underlying model. In our framework, the frequent and periodic patterns of data are first discovered by the Frequent and Periodic Pattern Miner (FPPM) algorithm; and then any changes in the discovered patterns over the lifetime of the system are discovered by the Pattern Adaptation Miner (PAM) algorithm, in order to adapt to the changing environment. This framework also captures vital context information present in pervasive computing applications, such as the startup triggers and temporal information. In this paper, we present a description of our mining framework and validate the approach using data collected in the CASAS smart home testbed.
News release from February 10, 2005 announcing a memorandum of understanding (MOU) that offers a joint framework to improve permit application procedures for surface coal mining operations that place dredged or fill material in waters of the United States.
ERIC Educational Resources Information Center
Qin, Jian; Jurisica, Igor; Liddy, Elizabeth D.; Jansen, Bernard J; Spink, Amanda; Priss, Uta; Norton, Melanie J.
2000-01-01
These six articles discuss knowledge discovery in databases (KDD). Topics include data mining; knowledge management systems; applications of knowledge discovery; text and Web mining; text mining and information retrieval; user search patterns through Web log analysis; concept analysis; data collection; and data structure inconsistency. (LRW)
Westergaard, David; Stærfeldt, Hans-Henrik; Tønsberg, Christian; Jensen, Lars Juhl; Brunak, Søren
2018-02-01
Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823-2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein-protein, disease-gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.
Westergaard, David; Stærfeldt, Hans-Henrik
2018-01-01
Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823–2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein–protein, disease–gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only. PMID:29447159
Discovering Knowledge from Noisy Databases Using Genetic Programming.
ERIC Educational Resources Information Center
Wong, Man Leung; Leung, Kwong Sak; Cheng, Jack C. Y.
2000-01-01
Presents a framework that combines Genetic Programming and Inductive Logic Programming, two approaches in data mining, to induce knowledge from noisy databases. The framework is based on a formalism of logic grammars and is implemented as a data mining system called LOGENPRO (Logic Grammar-based Genetic Programming System). (Contains 34…
Text mining for the biocuration workflow
Hirschman, Lynette; Burns, Gully A. P. C; Krallinger, Martin; Arighi, Cecilia; Cohen, K. Bretonnel; Valencia, Alfonso; Wu, Cathy H.; Chatr-Aryamontri, Andrew; Dowell, Karen G.; Huala, Eva; Lourenço, Anália; Nash, Robert; Veuthey, Anne-Lise; Wiegers, Thomas; Winter, Andrew G.
2012-01-01
Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations took place for the workshop on ‘Text Mining for the BioCuration Workflow’ at the third International Biocuration Conference (Berlin, 2009). We interviewed biocurators to obtain workflows from eight biological databases. This initial study revealed high-level commonalities, including (i) selection of documents for curation; (ii) indexing of documents with biologically relevant entities (e.g. genes); and (iii) detailed curation of specific relations (e.g. interactions); however, the detailed workflows also showed many variabilities. Following the workshop, we conducted a survey of biocurators. The survey identified biocurator priorities, including the handling of full text indexed with biological entities and support for the identification and prioritization of documents for curation. It also indicated that two-thirds of the biocuration teams had experimented with text mining and almost half were using text mining at that time. Analysis of our interviews and survey provide a set of requirements for the integration of text mining into the biocuration workflow. These can guide the identification of common needs across curated databases and encourage joint experimentation involving biocurators, text mining developers and the larger biomedical research community. PMID:22513129
Text mining for the biocuration workflow.
Hirschman, Lynette; Burns, Gully A P C; Krallinger, Martin; Arighi, Cecilia; Cohen, K Bretonnel; Valencia, Alfonso; Wu, Cathy H; Chatr-Aryamontri, Andrew; Dowell, Karen G; Huala, Eva; Lourenço, Anália; Nash, Robert; Veuthey, Anne-Lise; Wiegers, Thomas; Winter, Andrew G
2012-01-01
Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations took place for the workshop on 'Text Mining for the BioCuration Workflow' at the third International Biocuration Conference (Berlin, 2009). We interviewed biocurators to obtain workflows from eight biological databases. This initial study revealed high-level commonalities, including (i) selection of documents for curation; (ii) indexing of documents with biologically relevant entities (e.g. genes); and (iii) detailed curation of specific relations (e.g. interactions); however, the detailed workflows also showed many variabilities. Following the workshop, we conducted a survey of biocurators. The survey identified biocurator priorities, including the handling of full text indexed with biological entities and support for the identification and prioritization of documents for curation. It also indicated that two-thirds of the biocuration teams had experimented with text mining and almost half were using text mining at that time. Analysis of our interviews and survey provide a set of requirements for the integration of text mining into the biocuration workflow. These can guide the identification of common needs across curated databases and encourage joint experimentation involving biocurators, text mining developers and the larger biomedical research community.
Van Landeghem, Sofie; De Bodt, Stefanie; Drebert, Zuzanna J.; Inzé, Dirk; Van de Peer, Yves
2013-01-01
Despite the availability of various data repositories for plant research, a wealth of information currently remains hidden within the biomolecular literature. Text mining provides the necessary means to retrieve these data through automated processing of texts. However, only recently has advanced text mining methodology been implemented with sufficient computational power to process texts at a large scale. In this study, we assess the potential of large-scale text mining for plant biology research in general and for network biology in particular using a state-of-the-art text mining system applied to all PubMed abstracts and PubMed Central full texts. We present extensive evaluation of the textual data for Arabidopsis thaliana, assessing the overall accuracy of this new resource for usage in plant network analyses. Furthermore, we combine text mining information with both protein–protein and regulatory interactions from experimental databases. Clusters of tightly connected genes are delineated from the resulting network, illustrating how such an integrative approach is essential to grasp the current knowledge available for Arabidopsis and to uncover gene information through guilt by association. All large-scale data sets, as well as the manually curated textual data, are made publicly available, hereby stimulating the application of text mining data in future plant biology studies. PMID:23532071
Frontiers of biomedical text mining: current progress
Zweigenbaum, Pierre; Demner-Fushman, Dina; Yu, Hong; Cohen, Kevin B.
2008-01-01
It is now almost 15 years since the publication of the first paper on text mining in the genomics domain, and decades since the first paper on text mining in the medical domain. Enormous progress has been made in the areas of information retrieval, evaluation methodologies and resource construction. Some problems, such as abbreviation-handling, can essentially be considered solved problems, and others, such as identification of gene mentions in text, seem likely to be solved soon. However, a number of problems at the frontiers of biomedical text mining continue to present interesting challenges and opportunities for great improvements and interesting research. In this article we review the current state of the art in biomedical text mining or ‘BioNLP’ in general, focusing primarily on papers published within the past year. PMID:17977867
Automated detection of follow-up appointments using text mining of discharge records.
Ruud, Kari L; Johnson, Matthew G; Liesinger, Juliette T; Grafft, Carrie A; Naessens, James M
2010-06-01
To determine whether text mining can accurately detect specific follow-up appointment criteria in free-text hospital discharge records. Cross-sectional study. Mayo Clinic Rochester hospitals. Inpatients discharged from general medicine services in 2006 (n = 6481). Textual hospital dismissal summaries were manually reviewed to determine whether the records contained specific follow-up appointment arrangement elements: date, time and either physician or location for an appointment. The data set was evaluated for the same criteria using SAS Text Miner software. The two assessments were compared to determine the accuracy of text mining for detecting records containing follow-up appointment arrangements. Agreement of text-mined appointment findings with gold standard (manual abstraction) including sensitivity, specificity, positive predictive and negative predictive values (PPV and NPV). About 55.2% (3576) of discharge records contained all criteria for follow-up appointment arrangements according to the manual review, 3.2% (113) of which were missed through text mining. Text mining incorrectly identified 3.7% (107) follow-up appointments that were not considered valid through manual review. Therefore, the text mining analysis concurred with the manual review in 96.6% of the appointment findings. Overall sensitivity and specificity were 96.8 and 96.3%, respectively; and PPV and NPV were 97.0 and 96.1%, respectively. of individual appointment criteria resulted in accuracy rates of 93.5% for date, 97.4% for time, 97.5% for physician and 82.9% for location. Text mining of unstructured hospital dismissal summaries can accurately detect documentation of follow-up appointment arrangement elements, thus saving considerable resources for performance assessment and quality-related research.
Automatic target validation based on neuroscientific literature mining for tractography
Vasques, Xavier; Richardet, Renaud; Hill, Sean L.; Slater, David; Chappelier, Jean-Cedric; Pralong, Etienne; Bloch, Jocelyne; Draganski, Bogdan; Cif, Laura
2015-01-01
Target identification for tractography studies requires solid anatomical knowledge validated by an extensive literature review across species for each seed structure to be studied. Manual literature review to identify targets for a given seed region is tedious and potentially subjective. Therefore, complementary approaches would be useful. We propose to use text-mining models to automatically suggest potential targets from the neuroscientific literature, full-text articles and abstracts, so that they can be used for anatomical connection studies and more specifically for tractography. We applied text-mining models to three structures: two well-studied structures, since validated deep brain stimulation targets, the internal globus pallidus and the subthalamic nucleus and, the nucleus accumbens, an exploratory target for treating psychiatric disorders. We performed a systematic review of the literature to document the projections of the three selected structures and compared it with the targets proposed by text-mining models, both in rat and primate (including human). We ran probabilistic tractography on the nucleus accumbens and compared the output with the results of the text-mining models and literature review. Overall, text-mining the literature could find three times as many targets as two man-weeks of curation could. The overall efficiency of the text-mining against literature review in our study was 98% recall (at 36% precision), meaning that over all the targets for the three selected seeds, only one target has been missed by text-mining. We demonstrate that connectivity for a structure of interest can be extracted from a very large amount of publications and abstracts. We believe this tool will be useful in helping the neuroscience community to facilitate connectivity studies of particular brain regions. The text mining tools used for the study are part of the HBP Neuroinformatics Platform, publicly available at http://connectivity-brainer.rhcloud.com/. PMID:26074781
Jácome, Alberto G; Fdez-Riverola, Florentino; Lourenço, Anália
2016-07-01
Text mining and semantic analysis approaches can be applied to the construction of biomedical domain-specific search engines and provide an attractive alternative to create personalized and enhanced search experiences. Therefore, this work introduces the new open-source BIOMedical Search Engine Framework for the fast and lightweight development of domain-specific search engines. The rationale behind this framework is to incorporate core features typically available in search engine frameworks with flexible and extensible technologies to retrieve biomedical documents, annotate meaningful domain concepts, and develop highly customized Web search interfaces. The BIOMedical Search Engine Framework integrates taggers for major biomedical concepts, such as diseases, drugs, genes, proteins, compounds and organisms, and enables the use of domain-specific controlled vocabulary. Technologies from the Typesafe Reactive Platform, the AngularJS JavaScript framework and the Bootstrap HTML/CSS framework support the customization of the domain-oriented search application. Moreover, the RESTful API of the BIOMedical Search Engine Framework allows the integration of the search engine into existing systems or a complete web interface personalization. The construction of the Smart Drug Search is described as proof-of-concept of the BIOMedical Search Engine Framework. This public search engine catalogs scientific literature about antimicrobial resistance, microbial virulence and topics alike. The keyword-based queries of the users are transformed into concepts and search results are presented and ranked accordingly. The semantic graph view portraits all the concepts found in the results, and the researcher may look into the relevance of different concepts, the strength of direct relations, and non-trivial, indirect relations. The number of occurrences of the concept shows its importance to the query, and the frequency of concept co-occurrence is indicative of biological relations meaningful to that particular scope of research. Conversely, indirect concept associations, i.e. concepts related by other intermediary concepts, can be useful to integrate information from different studies and look into non-trivial relations. The BIOMedical Search Engine Framework supports the development of domain-specific search engines. The key strengths of the framework are modularity and extensibilityin terms of software design, the use of open-source consolidated Web technologies, and the ability to integrate any number of biomedical text mining tools and information resources. Currently, the Smart Drug Search keeps over 1,186,000 documents, containing more than 11,854,000 annotations for 77,200 different concepts. The Smart Drug Search is publicly accessible at http://sing.ei.uvigo.es/sds/. The BIOMedical Search Engine Framework is freely available for non-commercial use at https://github.com/agjacome/biomsef. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
ECO: A Framework for Entity Co-Occurrence Exploration with Faceted Navigation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Halliday, K. D.
2010-08-20
Even as highly structured databases and semantic knowledge bases become more prevalent, a substantial amount of human knowledge is reported as written prose. Typical textual reports, such as news articles, contain information about entities (people, organizations, and locations) and their relationships. Automatically extracting such relationships from large text corpora is a key component of corporate and government knowledge bases. The primary goal of the ECO project is to develop a scalable framework for extracting and presenting these relationships for exploration using an easily navigable faceted user interface. ECO uses entity co-occurrence relationships to identify related entities. The system aggregates andmore » indexes information on each entity pair, allowing the user to rapidly discover and mine relational information.« less
Annotating images by mining image search results.
Wang, Xin-Jing; Zhang, Lei; Li, Xirong; Ma, Wei-Ying
2008-11-01
Although it has been studied for years by the computer vision and machine learning communities, image annotation is still far from practical. In this paper, we propose a novel attempt at model-free image annotation, which is a data-driven approach that annotates images by mining their search results. Some 2.4 million images with their surrounding text are collected from a few photo forums to support this approach. The entire process is formulated in a divide-and-conquer framework where a query keyword is provided along with the uncaptioned image to improve both the effectiveness and efficiency. This is helpful when the collected data set is not dense everywhere. In this sense, our approach contains three steps: 1) the search process to discover visually and semantically similar search results, 2) the mining process to identify salient terms from textual descriptions of the search results, and 3) the annotation rejection process to filter out noisy terms yielded by Step 2. To ensure real-time annotation, two key techniques are leveraged-one is to map the high-dimensional image visual features into hash codes, the other is to implement it as a distributed system, of which the search and mining processes are provided as Web services. As a typical result, the entire process finishes in less than 1 second. Since no training data set is required, our approach enables annotating with unlimited vocabulary and is highly scalable and robust to outliers. Experimental results on both real Web images and a benchmark image data set show the effectiveness and efficiency of the proposed algorithm. It is also worth noting that, although the entire approach is illustrated within the divide-and conquer framework, a query keyword is not crucial to our current implementation. We provide experimental results to prove this.
Effective integrated frameworks for assessing mining sustainability.
Virgone, K M; Ramirez-Andreotta, M; Mainhagu, J; Brusseau, M L
2018-05-28
The objectives of this research are to review existing methods used for assessing mining sustainability, analyze the limited prior research that has evaluated the methods, and identify key characteristics that would constitute an enhanced sustainability framework that would serve to improve sustainability reporting in the mining industry. Five of the most relevant frameworks were selected for comparison in this analysis, and the results show that there are many commonalities among the five, as well as some disparities. In addition, relevant components are missing from all five. An enhanced evaluation system and framework were created to provide a more holistic, comprehensive method for sustainability assessment and reporting. The proposed framework has five components that build from and encompass the twelve evaluation characteristics used in the analysis. The components include Foundation, Focus, Breadth, Quality Assurance, and Relevance. The enhanced framework promotes a comprehensive, location-specific reporting approach with a concise set of well-defined indicators. Built into the framework is quality assurance, as well as a defined method to use information from sustainability reports to inform decisions. The framework incorporates human health and socioeconomic aspects via initiatives such as community-engaged research, economic valuations, and community-initiated environmental monitoring.
A novel water quality data analysis framework based on time-series data mining.
Deng, Weihui; Wang, Guoyin
2017-07-01
The rapid development of time-series data mining provides an emerging method for water resource management research. In this paper, based on the time-series data mining methodology, we propose a novel and general analysis framework for water quality time-series data. It consists of two parts: implementation components and common tasks of time-series data mining in water quality data. In the first part, we propose to granulate the time series into several two-dimensional normal clouds and calculate the similarities in the granulated level. On the basis of the similarity matrix, the similarity search, anomaly detection, and pattern discovery tasks in the water quality time-series instance dataset can be easily implemented in the second part. We present a case study of this analysis framework on weekly Dissolve Oxygen time-series data collected from five monitoring stations on the upper reaches of Yangtze River, China. It discovered the relationship of water quality in the mainstream and tributary as well as the main changing patterns of DO. The experimental results show that the proposed analysis framework is a feasible and efficient method to mine the hidden and valuable knowledge from water quality historical time-series data. Copyright © 2017 Elsevier Ltd. All rights reserved.
Using natural language processing techniques to inform research on nanotechnology
Lewinski, Nastassja A
2015-01-01
Summary Literature in the field of nanotechnology is exponentially increasing with more and more engineered nanomaterials being created, characterized, and tested for performance and safety. With the deluge of published data, there is a need for natural language processing approaches to semi-automate the cataloguing of engineered nanomaterials and their associated physico-chemical properties, performance, exposure scenarios, and biological effects. In this paper, we review the different informatics methods that have been applied to patent mining, nanomaterial/device characterization, nanomedicine, and environmental risk assessment. Nine natural language processing (NLP)-based tools were identified: NanoPort, NanoMapper, TechPerceptor, a Text Mining Framework, a Nanodevice Analyzer, a Clinical Trial Document Classifier, Nanotoxicity Searcher, NanoSifter, and NEIMiner. We conclude with recommendations for sharing NLP-related tools through online repositories to broaden participation in nanoinformatics. PMID:26199848
MPI-FAUN: An MPI-Based Framework for Alternating-Updating Nonnegative Matrix Factorization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kannan, Ramakrishnan; Ballard, Grey; Park, Haesun
Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors W and H, for the given input matrix A, such that A≈WH. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient parallel algorithms to solve the problem for big data sets. The main contribution of this work is a new, high-performance parallel computational framework for a broad class of NMF algorithms thatmore » iteratively solves alternating non-negative least squares (NLS) subproblems for W and H. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). The framework is flexible and able to leverage a variety of NMF and NLS algorithms, including Multiplicative Update, Hierarchical Alternating Least Squares, and Block Principal Pivoting. Our implementation allows us to benchmark and compare different algorithms on massive dense and sparse data matrices of size that spans from few hundreds of millions to billions. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements. The code and the datasets used for conducting the experiments are available online.« less
MPI-FAUN: An MPI-Based Framework for Alternating-Updating Nonnegative Matrix Factorization
Kannan, Ramakrishnan; Ballard, Grey; Park, Haesun
2017-10-30
Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors W and H, for the given input matrix A, such that A≈WH. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient parallel algorithms to solve the problem for big data sets. The main contribution of this work is a new, high-performance parallel computational framework for a broad class of NMF algorithms thatmore » iteratively solves alternating non-negative least squares (NLS) subproblems for W and H. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). The framework is flexible and able to leverage a variety of NMF and NLS algorithms, including Multiplicative Update, Hierarchical Alternating Least Squares, and Block Principal Pivoting. Our implementation allows us to benchmark and compare different algorithms on massive dense and sparse data matrices of size that spans from few hundreds of millions to billions. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements. The code and the datasets used for conducting the experiments are available online.« less
Text mining resources for the life sciences.
Przybyła, Piotr; Shardlow, Matthew; Aubin, Sophie; Bossy, Robert; Eckart de Castilho, Richard; Piperidis, Stelios; McNaught, John; Ananiadou, Sophia
2016-01-01
Text mining is a powerful technology for quickly distilling key information from vast quantities of biomedical literature. However, to harness this power the researcher must be well versed in the availability, suitability, adaptability, interoperability and comparative accuracy of current text mining resources. In this survey, we give an overview of the text mining resources that exist in the life sciences to help researchers, especially those employed in biocuration, to engage with text mining in their own work. We categorize the various resources under three sections: Content Discovery looks at where and how to find biomedical publications for text mining; Knowledge Encoding describes the formats used to represent the different levels of information associated with content that enable text mining, including those formats used to carry such information between processes; Tools and Services gives an overview of workflow management systems that can be used to rapidly configure and compare domain- and task-specific processes, via access to a wide range of pre-built tools. We also provide links to relevant repositories in each section to enable the reader to find resources relevant to their own area of interest. Throughout this work we give a special focus to resources that are interoperable-those that have the crucial ability to share information, enabling smooth integration and reusability. © The Author(s) 2016. Published by Oxford University Press.
Chapter 16: text mining for translational bioinformatics.
Cohen, K Bretonnel; Hunter, Lawrence E
2013-04-01
Text mining for translational bioinformatics is a new field with tremendous research potential. It is a subfield of biomedical natural language processing that concerns itself directly with the problem of relating basic biomedical research to clinical practice, and vice versa. Applications of text mining fall both into the category of T1 translational research-translating basic science results into new interventions-and T2 translational research, or translational research for public health. Potential use cases include better phenotyping of research subjects, and pharmacogenomic research. A variety of methods for evaluating text mining applications exist, including corpora, structured test suites, and post hoc judging. Two basic principles of linguistic structure are relevant for building text mining applications. One is that linguistic structure consists of multiple levels. The other is that every level of linguistic structure is characterized by ambiguity. There are two basic approaches to text mining: rule-based, also known as knowledge-based; and machine-learning-based, also known as statistical. Many systems are hybrids of the two approaches. Shared tasks have had a strong effect on the direction of the field. Like all translational bioinformatics software, text mining software for translational bioinformatics can be considered health-critical and should be subject to the strictest standards of quality assurance and software testing.
Text mining resources for the life sciences
Shardlow, Matthew; Aubin, Sophie; Bossy, Robert; Eckart de Castilho, Richard; Piperidis, Stelios; McNaught, John; Ananiadou, Sophia
2016-01-01
Text mining is a powerful technology for quickly distilling key information from vast quantities of biomedical literature. However, to harness this power the researcher must be well versed in the availability, suitability, adaptability, interoperability and comparative accuracy of current text mining resources. In this survey, we give an overview of the text mining resources that exist in the life sciences to help researchers, especially those employed in biocuration, to engage with text mining in their own work. We categorize the various resources under three sections: Content Discovery looks at where and how to find biomedical publications for text mining; Knowledge Encoding describes the formats used to represent the different levels of information associated with content that enable text mining, including those formats used to carry such information between processes; Tools and Services gives an overview of workflow management systems that can be used to rapidly configure and compare domain- and task-specific processes, via access to a wide range of pre-built tools. We also provide links to relevant repositories in each section to enable the reader to find resources relevant to their own area of interest. Throughout this work we give a special focus to resources that are interoperable—those that have the crucial ability to share information, enabling smooth integration and reusability. PMID:27888231
Cohen, Raphael; Elhadad, Michael; Elhadad, Noémie
2013-01-16
The increasing availability of Electronic Health Record (EHR) data and specifically free-text patient notes presents opportunities for phenotype extraction. Text-mining methods in particular can help disease modeling by mapping named-entities mentions to terminologies and clustering semantically related terms. EHR corpora, however, exhibit specific statistical and linguistic characteristics when compared with corpora in the biomedical literature domain. We focus on copy-and-paste redundancy: clinicians typically copy and paste information from previous notes when documenting a current patient encounter. Thus, within a longitudinal patient record, one expects to observe heavy redundancy. In this paper, we ask three research questions: (i) How can redundancy be quantified in large-scale text corpora? (ii) Conventional wisdom is that larger corpora yield better results in text mining. But how does the observed EHR redundancy affect text mining? Does such redundancy introduce a bias that distorts learned models? Or does the redundancy introduce benefits by highlighting stable and important subsets of the corpus? (iii) How can one mitigate the impact of redundancy on text mining? We analyze a large-scale EHR corpus and quantify redundancy both in terms of word and semantic concept repetition. We observe redundancy levels of about 30% and non-standard distribution of both words and concepts. We measure the impact of redundancy on two standard text-mining applications: collocation identification and topic modeling. We compare the results of these methods on synthetic data with controlled levels of redundancy and observe significant performance variation. Finally, we compare two mitigation strategies to avoid redundancy-induced bias: (i) a baseline strategy, keeping only the last note for each patient in the corpus; (ii) removing redundant notes with an efficient fingerprinting-based algorithm. (a)For text mining, preprocessing the EHR corpus with fingerprinting yields significantly better results. Before applying text-mining techniques, one must pay careful attention to the structure of the analyzed corpora. While the importance of data cleaning has been known for low-level text characteristics (e.g., encoding and spelling), high-level and difficult-to-quantify corpus characteristics, such as naturally occurring redundancy, can also hurt text mining. Fingerprinting enables text-mining techniques to leverage available data in the EHR corpus, while avoiding the bias introduced by redundancy.
MetReS, an Efficient Database for Genomic Applications.
Vilaplana, Jordi; Alves, Rui; Solsona, Francesc; Mateo, Jordi; Teixidó, Ivan; Pifarré, Marc
2018-02-01
MetReS (Metabolic Reconstruction Server) is a genomic database that is shared between two software applications that address important biological problems. Biblio-MetReS is a data-mining tool that enables the reconstruction of molecular networks based on automated text-mining analysis of published scientific literature. Homol-MetReS allows functional (re)annotation of proteomes, to properly identify both the individual proteins involved in the processes of interest and their function. The main goal of this work was to identify the areas where the performance of the MetReS database performance could be improved and to test whether this improvement would scale to larger datasets and more complex types of analysis. The study was started with a relational database, MySQL, which is the current database server used by the applications. We also tested the performance of an alternative data-handling framework, Apache Hadoop. Hadoop is currently used for large-scale data processing. We found that this data handling framework is likely to greatly improve the efficiency of the MetReS applications as the dataset and the processing needs increase by several orders of magnitude, as expected to happen in the near future.
Interpreter of maladies: redescription mining applied to biomedical data analysis.
Waltman, Peter; Pearlman, Alex; Mishra, Bud
2006-04-01
Comprehensive, systematic and integrated data-centric statistical approaches to disease modeling can provide powerful frameworks for understanding disease etiology. Here, one such computational framework based on redescription mining in both its incarnations, static and dynamic, is discussed. The static framework provides bioinformatic tools applicable to multifaceted datasets, containing genetic, transcriptomic, proteomic, and clinical data for diseased patients and normal subjects. The dynamic redescription framework provides systems biology tools to model complex sets of regulatory, metabolic and signaling pathways in the initiation and progression of a disease. As an example, the case of chronic fatigue syndrome (CFS) is considered, which has so far remained intractable and unpredictable in its etiology and nosology. The redescription mining approaches can be applied to the Centers for Disease Control and Prevention's Wichita (KS, USA) dataset, integrating transcriptomic, epidemiological and clinical data, and can also be used to study how pathways in the hypothalamic-pituitary-adrenal axis affect CFS patients.
Text-mining and information-retrieval services for molecular biology
Krallinger, Martin; Valencia, Alfonso
2005-01-01
Text-mining in molecular biology - defined as the automatic extraction of information about genes, proteins and their functional relationships from text documents - has emerged as a hybrid discipline on the edges of the fields of information science, bioinformatics and computational linguistics. A range of text-mining applications have been developed recently that will improve access to knowledge for biologists and database annotators. PMID:15998455
Text mining for traditional Chinese medical knowledge discovery: a survey.
Zhou, Xuezhong; Peng, Yonghong; Liu, Baoyan
2010-08-01
Extracting meaningful information and knowledge from free text is the subject of considerable research interest in the machine learning and data mining fields. Text data mining (or text mining) has become one of the most active research sub-fields in data mining. Significant developments in the area of biomedical text mining during the past years have demonstrated its great promise for supporting scientists in developing novel hypotheses and new knowledge from the biomedical literature. Traditional Chinese medicine (TCM) provides a distinct methodology with which to view human life. It is one of the most complete and distinguished traditional medicines with a history of several thousand years of studying and practicing the diagnosis and treatment of human disease. It has been shown that the TCM knowledge obtained from clinical practice has become a significant complementary source of information for modern biomedical sciences. TCM literature obtained from the historical period and from modern clinical studies has recently been transformed into digital data in the form of relational databases or text documents, which provide an effective platform for information sharing and retrieval. This motivates and facilitates research and development into knowledge discovery approaches and to modernize TCM. In order to contribute to this still growing field, this paper presents (1) a comparative introduction to TCM and modern biomedicine, (2) a survey of the related information sources of TCM, (3) a review and discussion of the state of the art and the development of text mining techniques with applications to TCM, (4) a discussion of the research issues around TCM text mining and its future directions. Copyright 2010 Elsevier Inc. All rights reserved.
Managing biological networks by using text mining and computer-aided curation
NASA Astrophysics Data System (ADS)
Yu, Seok Jong; Cho, Yongseong; Lee, Min-Ho; Lim, Jongtae; Yoo, Jaesoo
2015-11-01
In order to understand a biological mechanism in a cell, a researcher should collect a huge number of protein interactions with experimental data from experiments and the literature. Text mining systems that extract biological interactions from papers have been used to construct biological networks for a few decades. Even though the text mining of literature is necessary to construct a biological network, few systems with a text mining tool are available for biologists who want to construct their own biological networks. We have developed a biological network construction system called BioKnowledge Viewer that can generate a biological interaction network by using a text mining tool and biological taggers. It also Boolean simulation software to provide a biological modeling system to simulate the model that is made with the text mining tool. A user can download PubMed articles and construct a biological network by using the Multi-level Knowledge Emergence Model (KMEM), MetaMap, and A Biomedical Named Entity Recognizer (ABNER) as a text mining tool. To evaluate the system, we constructed an aging-related biological network that consist 9,415 nodes (genes) by using manual curation. With network analysis, we found that several genes, including JNK, AP-1, and BCL-2, were highly related in aging biological network. We provide a semi-automatic curation environment so that users can obtain a graph database for managing text mining results that are generated in the server system and can navigate the network with BioKnowledge Viewer, which is freely available at http://bioknowledgeviewer.kisti.re.kr.
An overview of the biocreative 2012 workshop track III: Interactive text mining task
USDA-ARS?s Scientific Manuscript database
An important question is how to make use of text mining to enhance the biocuration workflow. A number of groups have developed tools for text mining from a computer science/linguistics perspective and there are many initiatives to curate some aspect of biology from the literature. In some cases the ...
Text Mining in Cancer Gene and Pathway Prioritization
Luo, Yuan; Riedlinger, Gregory; Szolovits, Peter
2014-01-01
Prioritization of cancer implicated genes has received growing attention as an effective way to reduce wet lab cost by computational analysis that ranks candidate genes according to the likelihood that experimental verifications will succeed. A multitude of gene prioritization tools have been developed, each integrating different data sources covering gene sequences, differential expressions, function annotations, gene regulations, protein domains, protein interactions, and pathways. This review places existing gene prioritization tools against the backdrop of an integrative Omic hierarchy view toward cancer and focuses on the analysis of their text mining components. We explain the relatively slow progress of text mining in gene prioritization, identify several challenges to current text mining methods, and highlight a few directions where more effective text mining algorithms may improve the overall prioritization task and where prioritizing the pathways may be more desirable than prioritizing only genes. PMID:25392685
Text mining in cancer gene and pathway prioritization.
Luo, Yuan; Riedlinger, Gregory; Szolovits, Peter
2014-01-01
Prioritization of cancer implicated genes has received growing attention as an effective way to reduce wet lab cost by computational analysis that ranks candidate genes according to the likelihood that experimental verifications will succeed. A multitude of gene prioritization tools have been developed, each integrating different data sources covering gene sequences, differential expressions, function annotations, gene regulations, protein domains, protein interactions, and pathways. This review places existing gene prioritization tools against the backdrop of an integrative Omic hierarchy view toward cancer and focuses on the analysis of their text mining components. We explain the relatively slow progress of text mining in gene prioritization, identify several challenges to current text mining methods, and highlight a few directions where more effective text mining algorithms may improve the overall prioritization task and where prioritizing the pathways may be more desirable than prioritizing only genes.
Citation Mining: Integrating Text Mining and Bibliometrics for Research User Profiling.
ERIC Educational Resources Information Center
Kostoff, Ronald N.; del Rio, J. Antonio; Humenik, James A.; Garcia, Esther Ofilia; Ramirez, Ana Maria
2001-01-01
Discusses the importance of identifying the users and impact of research, and describes an approach for identifying the pathways through which research can impact other research, technology development, and applications. Describes a study that used citation mining, an integration of citation bibliometrics and text mining, on articles from the…
Lu, Songjian; Jin, Bo; Cowart, L Ashley; Lu, Xinghua
2013-01-01
Genetic and pharmacological perturbation experiments, such as deleting a gene and monitoring gene expression responses, are powerful tools for studying cellular signal transduction pathways. However, it remains a challenge to automatically derive knowledge of a cellular signaling system at a conceptual level from systematic perturbation-response data. In this study, we explored a framework that unifies knowledge mining and data mining towards the goal. The framework consists of the following automated processes: 1) applying an ontology-driven knowledge mining approach to identify functional modules among the genes responding to a perturbation in order to reveal potential signals affected by the perturbation; 2) applying a graph-based data mining approach to search for perturbations that affect a common signal; and 3) revealing the architecture of a signaling system by organizing signaling units into a hierarchy based on their relationships. Applying this framework to a compendium of yeast perturbation-response data, we have successfully recovered many well-known signal transduction pathways; in addition, our analysis has led to many new hypotheses regarding the yeast signal transduction system; finally, our analysis automatically organized perturbed genes as a graph reflecting the architecture of the yeast signaling system. Importantly, this framework transformed molecular findings from a gene level to a conceptual level, which can be readily translated into computable knowledge in the form of rules regarding the yeast signaling system, such as "if genes involved in the MAPK signaling are perturbed, genes involved in pheromone responses will be differentially expressed."
Using text-mining techniques in electronic patient records to identify ADRs from medicine use.
Warrer, Pernille; Hansen, Ebba Holme; Juhl-Jensen, Lars; Aagaard, Lise
2012-05-01
This literature review included studies that use text-mining techniques in narrative documents stored in electronic patient records (EPRs) to investigate ADRs. We searched PubMed, Embase, Web of Science and International Pharmaceutical Abstracts without restrictions from origin until July 2011. We included empirically based studies on text mining of electronic patient records (EPRs) that focused on detecting ADRs, excluding those that investigated adverse events not related to medicine use. We extracted information on study populations, EPR data sources, frequencies and types of the identified ADRs, medicines associated with ADRs, text-mining algorithms used and their performance. Seven studies, all from the United States, were eligible for inclusion in the review. Studies were published from 2001, the majority between 2009 and 2010. Text-mining techniques varied over time from simple free text searching of outpatient visit notes and inpatient discharge summaries to more advanced techniques involving natural language processing (NLP) of inpatient discharge summaries. Performance appeared to increase with the use of NLP, although many ADRs were still missed. Due to differences in study design and populations, various types of ADRs were identified and thus we could not make comparisons across studies. The review underscores the feasibility and potential of text mining to investigate narrative documents in EPRs for ADRs. However, more empirical studies are needed to evaluate whether text mining of EPRs can be used systematically to collect new information about ADRs. © 2011 The Authors. British Journal of Clinical Pharmacology © 2011 The British Pharmacological Society.
Using text-mining techniques in electronic patient records to identify ADRs from medicine use
Warrer, Pernille; Hansen, Ebba Holme; Juhl-Jensen, Lars; Aagaard, Lise
2012-01-01
This literature review included studies that use text-mining techniques in narrative documents stored in electronic patient records (EPRs) to investigate ADRs. We searched PubMed, Embase, Web of Science and International Pharmaceutical Abstracts without restrictions from origin until July 2011. We included empirically based studies on text mining of electronic patient records (EPRs) that focused on detecting ADRs, excluding those that investigated adverse events not related to medicine use. We extracted information on study populations, EPR data sources, frequencies and types of the identified ADRs, medicines associated with ADRs, text-mining algorithms used and their performance. Seven studies, all from the United States, were eligible for inclusion in the review. Studies were published from 2001, the majority between 2009 and 2010. Text-mining techniques varied over time from simple free text searching of outpatient visit notes and inpatient discharge summaries to more advanced techniques involving natural language processing (NLP) of inpatient discharge summaries. Performance appeared to increase with the use of NLP, although many ADRs were still missed. Due to differences in study design and populations, various types of ADRs were identified and thus we could not make comparisons across studies. The review underscores the feasibility and potential of text mining to investigate narrative documents in EPRs for ADRs. However, more empirical studies are needed to evaluate whether text mining of EPRs can be used systematically to collect new information about ADRs. PMID:22122057
Gene prioritization and clustering by multi-view text mining
2010-01-01
Background Text mining has become a useful tool for biologists trying to understand the genetics of diseases. In particular, it can help identify the most interesting candidate genes for a disease for further experimental analysis. Many text mining approaches have been introduced, but the effect of disease-gene identification varies in different text mining models. Thus, the idea of incorporating more text mining models may be beneficial to obtain more refined and accurate knowledge. However, how to effectively combine these models still remains a challenging question in machine learning. In particular, it is a non-trivial issue to guarantee that the integrated model performs better than the best individual model. Results We present a multi-view approach to retrieve biomedical knowledge using different controlled vocabularies. These controlled vocabularies are selected on the basis of nine well-known bio-ontologies and are applied to index the vast amounts of gene-based free-text information available in the MEDLINE repository. The text mining result specified by a vocabulary is considered as a view and the obtained multiple views are integrated by multi-source learning algorithms. We investigate the effect of integration in two fundamental computational disease gene identification tasks: gene prioritization and gene clustering. The performance of the proposed approach is systematically evaluated and compared on real benchmark data sets. In both tasks, the multi-view approach demonstrates significantly better performance than other comparing methods. Conclusions In practical research, the relevance of specific vocabulary pertaining to the task is usually unknown. In such case, multi-view text mining is a superior and promising strategy for text-based disease gene identification. PMID:20074336
A life-cycle description of underground coal mining
NASA Technical Reports Server (NTRS)
Lavin, M. L.; Borden, C. S.; Duda, J. R.
1978-01-01
An initial effort to relate the major technological and economic variables which impact conventional underground coal mining systems, in order to help identify promising areas for advanced mining technology is described. The point of departure is a series of investment analyses published by the United States Bureau of Mines, which provide both the analytical framework and guidance on a choice of variables.
Shatkay, Hagit; Pan, Fengxia; Rzhetsky, Andrey; Wilbur, W. John
2008-01-01
Motivation: Much current research in biomedical text mining is concerned with serving biologists by extracting certain information from scientific text. We note that there is no ‘average biologist’ client; different users have distinct needs. For instance, as noted in past evaluation efforts (BioCreative, TREC, KDD) database curators are often interested in sentences showing experimental evidence and methods. Conversely, lab scientists searching for known information about a protein may seek facts, typically stated with high confidence. Text-mining systems can target specific end-users and become more effective, if the system can first identify text regions rich in the type of scientific content that is of interest to the user, retrieve documents that have many such regions, and focus on fact extraction from these regions. Here, we study the ability to characterize and classify such text automatically. We have recently introduced a multi-dimensional categorization and annotation scheme, developed to be applicable to a wide variety of biomedical documents and scientific statements, while intended to support specific biomedical retrieval and extraction tasks. Results: The annotation scheme was applied to a large corpus in a controlled effort by eight independent annotators, where three individual annotators independently tagged each sentence. We then trained and tested machine learning classifiers to automatically categorize sentence fragments based on the annotation. We discuss here the issues involved in this task, and present an overview of the results. The latter strongly suggest that automatic annotation along most of the dimensions is highly feasible, and that this new framework for scientific sentence categorization is applicable in practice. Contact: shatkay@cs.queensu.ca PMID:18718948
ERIC Educational Resources Information Center
Mei, Qiaozhu
2009-01-01
With the dramatic growth of text information, there is an increasing need for powerful text mining systems that can automatically discover useful knowledge from text. Text is generally associated with all kinds of contextual information. Those contexts can be explicit, such as the time and the location where a blog article is written, and the…
Small, Aeron M; Kiss, Daniel H; Zlatsin, Yevgeny; Birtwell, David L; Williams, Heather; Guerraty, Marie A; Han, Yuchi; Anwaruddin, Saif; Holmes, John H; Chirinos, Julio A; Wilensky, Robert L; Giri, Jay; Rader, Daniel J
2017-08-01
Interrogation of the electronic health record (EHR) using billing codes as a surrogate for diagnoses of interest has been widely used for clinical research. However, the accuracy of this methodology is variable, as it reflects billing codes rather than severity of disease, and depends on the disease and the accuracy of the coding practitioner. Systematic application of text mining to the EHR has had variable success for the detection of cardiovascular phenotypes. We hypothesize that the application of text mining algorithms to cardiovascular procedure reports may be a superior method to identify patients with cardiovascular conditions of interest. We adapted the Oracle product Endeca, which utilizes text mining to identify terms of interest from a NoSQL-like database, for purposes of searching cardiovascular procedure reports and termed the tool "PennSeek". We imported 282,569 echocardiography reports representing 81,164 individuals and 27,205 cardiac catheterization reports representing 14,567 individuals from non-searchable databases into PennSeek. We then applied clinical criteria to these reports in PennSeek to identify patients with trileaflet aortic stenosis (TAS) and coronary artery disease (CAD). Accuracy of patient identification by text mining through PennSeek was compared with ICD-9 billing codes. Text mining identified 7115 patients with TAS and 9247 patients with CAD. ICD-9 codes identified 8272 patients with TAS and 6913 patients with CAD. 4346 patients with AS and 6024 patients with CAD were identified by both approaches. A randomly selected sample of 200-250 patients uniquely identified by text mining was compared with 200-250 patients uniquely identified by billing codes for both diseases. We demonstrate that text mining was superior, with a positive predictive value (PPV) of 0.95 compared to 0.53 by ICD-9 for TAS, and a PPV of 0.97 compared to 0.86 for CAD. These results highlight the superiority of text mining algorithms applied to electronic cardiovascular procedure reports in the identification of phenotypes of interest for cardiovascular research. Copyright © 2017. Published by Elsevier Inc.
Effective Filtering of Query Results on Updated User Behavioral Profiles in Web Mining
Sadesh, S.; Suganthe, R. C.
2015-01-01
Web with tremendous volume of information retrieves result for user related queries. With the rapid growth of web page recommendation, results retrieved based on data mining techniques did not offer higher performance filtering rate because relationships between user profile and queries were not analyzed in an extensive manner. At the same time, existing user profile based prediction in web data mining is not exhaustive in producing personalized result rate. To improve the query result rate on dynamics of user behavior over time, Hamilton Filtered Regime Switching User Query Probability (HFRS-UQP) framework is proposed. HFRS-UQP framework is split into two processes, where filtering and switching are carried out. The data mining based filtering in our research work uses the Hamilton Filtering framework to filter user result based on personalized information on automatic updated profiles through search engine. Maximized result is fetched, that is, filtered out with respect to user behavior profiles. The switching performs accurate filtering updated profiles using regime switching. The updating in profile change (i.e., switches) regime in HFRS-UQP framework identifies the second- and higher-order association of query result on the updated profiles. Experiment is conducted on factors such as personalized information search retrieval rate, filtering efficiency, and precision ratio. PMID:26221626
Biocuration workflows and text mining: overview of the BioCreative 2012 Workshop Track II.
Lu, Zhiyong; Hirschman, Lynette
2012-01-01
Manual curation of data from the biomedical literature is a rate-limiting factor for many expert curated databases. Despite the continuing advances in biomedical text mining and the pressing needs of biocurators for better tools, few existing text-mining tools have been successfully integrated into production literature curation systems such as those used by the expert curated databases. To close this gap and better understand all aspects of literature curation, we invited submissions of written descriptions of curation workflows from expert curated databases for the BioCreative 2012 Workshop Track II. We received seven qualified contributions, primarily from model organism databases. Based on these descriptions, we identified commonalities and differences across the workflows, the common ontologies and controlled vocabularies used and the current and desired uses of text mining for biocuration. Compared to a survey done in 2009, our 2012 results show that many more databases are now using text mining in parts of their curation workflows. In addition, the workshop participants identified text-mining aids for finding gene names and symbols (gene indexing), prioritization of documents for curation (document triage) and ontology concept assignment as those most desired by the biocurators. DATABASE URL: http://www.biocreative.org/tasks/bc-workshop-2012/workflow/.
NASA Astrophysics Data System (ADS)
Znikina, Ludmila; Rozhneva, Elena
2017-11-01
The article deals with the distribution of informative intensity of the English-language scientific text based on its structural features contributing to the process of formalization of the scientific text and the preservation of the adequacy of the text with derived semantic information in relation to the primary. Discourse analysis is built on specific compositional and meaningful examples of scientific texts taken from the mining field. It also analyzes the adequacy of the translation of foreign texts into another language, the relationships between elements of linguistic systems, the degree of a formal conformance, translation with the specific objectives and information needs of the recipient. Some key words and ideas are emphasized in the paragraphs of the English-language mining scientific texts. The article gives the characteristic features of the structure of paragraphs of technical text and examples of constructions in English scientific texts based on a mining theme with the aim to explain the possible ways of their adequate translation.
A UIMA wrapper for the NCBO annotator.
Roeder, Christophe; Jonquet, Clement; Shah, Nigam H; Baumgartner, William A; Verspoor, Karin; Hunter, Lawrence
2010-07-15
The Unstructured Information Management Architecture (UIMA) framework and web services are emerging as useful tools for integrating biomedical text mining tools. This note describes our work, which wraps the National Center for Biomedical Ontology (NCBO) Annotator-an ontology-based annotation service-to make it available as a component in UIMA workflows. This wrapper is freely available on the web at http://bionlp-uima.sourceforge.net/ as part of the UIMA tools distribution from the Center for Computational Pharmacology (CCP) at the University of Colorado School of Medicine. It has been implemented in Java for support on Mac OS X, Linux and MS Windows.
ERIC Educational Resources Information Center
Anaya, Antonio R.; Boticario, Jesus G.
2009-01-01
Data mining methods are successful in educational environments to discover new knowledge or learner skills or features. Unfortunately, they have not been used in depth with collaboration. We have developed a scalable data mining method, whose objective is to infer information on the collaboration during the collaboration process in a…
An Evaluation of Text Mining Tools as Applied to Selected Scientific and Engineering Literature.
ERIC Educational Resources Information Center
Trybula, Walter J.; Wyllys, Ronald E.
2000-01-01
Addresses an approach to the discovery of scientific knowledge through an examination of data mining and text mining techniques. Presents the results of experiments that investigated knowledge acquisition from a selected set of technical documents by domain experts. (Contains 15 references.) (Author/LRW)
ERIC Educational Resources Information Center
Chen, Hsinchun
2003-01-01
Discusses information retrieval techniques used on the World Wide Web. Topics include machine learning in information extraction; relevance feedback; information filtering and recommendation; text classification and text clustering; Web mining, based on data mining techniques; hyperlink structure; and Web size. (LRW)
Application of text mining in the biomedical domain.
Fleuren, Wilco W M; Alkema, Wynand
2015-03-01
In recent years the amount of experimental data that is produced in biomedical research and the number of papers that are being published in this field have grown rapidly. In order to keep up to date with developments in their field of interest and to interpret the outcome of experiments in light of all available literature, researchers turn more and more to the use of automated literature mining. As a consequence, text mining tools have evolved considerably in number and quality and nowadays can be used to address a variety of research questions ranging from de novo drug target discovery to enhanced biological interpretation of the results from high throughput experiments. In this paper we introduce the most important techniques that are used for a text mining and give an overview of the text mining tools that are currently being used and the type of problems they are typically applied for. Copyright © 2015 Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Delgado, J.; Juncosa, R.
2009-04-01
Coal mining in Galicia (NW Spain) has been an important activity which came to an end in December, 2007. Hence, for different reasons, the two large brown coal mines in Galicia (the As Pontes mine, run by ENDESA GENERACIÓN, and the Meirama mine, owned by Lignitos de Meirama, S.A., LIMEISA), have started closure procedures, both of which are considering the flooding of the mine pits to create two large lakes (~8 km2 in As Pontes and ~2 km2 in Meirama). They will be unique in Galicia, a nearly lake-free territory. An important point to consider as regards the flooding of the lignite mine pits in Galicia is how the process of the creation of a body of artificial water will adapt to the strict legal demands put forth in the Water Framework Directive. This problem has been carefully examined by different authors in other countries and it raises the question of the need to adapt sampling surveys to monitor a number of key parameters -priority substances, physical and chemical parameters, biological indicators, etc.- that cannot be overlooked. Flooding, in both cases consider the preferential entrance into the mine holes of river-diverted surface waters, in detriment of ground waters in order to minimize acidic inputs. Although both mines are located in the same hydraulic demarcation (i.e. administrative units that, in Spain, are in charge of the public administration and the enforcement of natural water-related laws) the problems facing the corresponding mine managers are different. In the case of Meirama, the mine hole covers the upper third part of the Barcés river catchment, which is a major source of water for the Cecebre reservoir. That reservoir constitutes the only supply of drinking water for the city of A Coruña (~250.000 inhabitants) and its surrounding towns. In this contribution we will discuss how mine managers and the administration have addressed the uncertainties derived from the implementation of the Water Framework Directive in the particular case of Meirama.
Graph Mining Meets the Semantic Web
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lee, Sangkeun; Sukumar, Sreenivas R; Lim, Seung-Hwan
The Resource Description Framework (RDF) and SPARQL Protocol and RDF Query Language (SPARQL) were introduced about a decade ago to enable flexible schema-free data interchange on the Semantic Web. Today, data scientists use the framework as a scalable graph representation for integrating, querying, exploring and analyzing data sets hosted at different sources. With increasing adoption, the need for graph mining capabilities for the Semantic Web has emerged. We address that need through implementation of three popular iterative Graph Mining algorithms (Triangle count, Connected component analysis, and PageRank). We implement these algorithms as SPARQL queries, wrapped within Python scripts. We evaluatemore » the performance of our implementation on 6 real world data sets and show graph mining algorithms (that have a linear-algebra formulation) can indeed be unleashed on data represented as RDF graphs using the SPARQL query interface.« less
The Islamic State Battle Plan: Press Release Natural Language Processing
2016-06-01
Processing, text mining , corpus, generalized linear model, cascade, R Shiny, leaflet, data visualization 15. NUMBER OF PAGES 83 16. PRICE CODE...Terrorism and Responses to Terrorism TDM Term Document Matrix TF Term Frequency TF-IDF Term Frequency-Inverse Document Frequency tm text mining (R...package=leaflet. Feinerer I, Hornik K (2015) Text Mining Package “tm,” Version 0.6-2. (Jul 3) https://cran.r-project.org/web/packages/tm/tm.pdf
OntoGene web services for biomedical text mining.
Rinaldi, Fabio; Clematide, Simon; Marques, Hernani; Ellendorff, Tilia; Romacker, Martin; Rodriguez-Esteban, Raul
2014-01-01
Text mining services are rapidly becoming a crucial component of various knowledge management pipelines, for example in the process of database curation, or for exploration and enrichment of biomedical data within the pharmaceutical industry. Traditional architectures, based on monolithic applications, do not offer sufficient flexibility for a wide range of use case scenarios, and therefore open architectures, as provided by web services, are attracting increased interest. We present an approach towards providing advanced text mining capabilities through web services, using a recently proposed standard for textual data interchange (BioC). The web services leverage a state-of-the-art platform for text mining (OntoGene) which has been tested in several community-organized evaluation challenges,with top ranked results in several of them.
Text mining patents for biomedical knowledge.
Rodriguez-Esteban, Raul; Bundschus, Markus
2016-06-01
Biomedical text mining of scientific knowledge bases, such as Medline, has received much attention in recent years. Given that text mining is able to automatically extract biomedical facts that revolve around entities such as genes, proteins, and drugs, from unstructured text sources, it is seen as a major enabler to foster biomedical research and drug discovery. In contrast to the biomedical literature, research into the mining of biomedical patents has not reached the same level of maturity. Here, we review existing work and highlight the associated technical challenges that emerge from automatically extracting facts from patents. We conclude by outlining potential future directions in this domain that could help drive biomedical research and drug discovery. Copyright © 2016 Elsevier Ltd. All rights reserved.
OntoGene web services for biomedical text mining
2014-01-01
Text mining services are rapidly becoming a crucial component of various knowledge management pipelines, for example in the process of database curation, or for exploration and enrichment of biomedical data within the pharmaceutical industry. Traditional architectures, based on monolithic applications, do not offer sufficient flexibility for a wide range of use case scenarios, and therefore open architectures, as provided by web services, are attracting increased interest. We present an approach towards providing advanced text mining capabilities through web services, using a recently proposed standard for textual data interchange (BioC). The web services leverage a state-of-the-art platform for text mining (OntoGene) which has been tested in several community-organized evaluation challenges, with top ranked results in several of them. PMID:25472638
Mine waste management legislation. Gold mining areas in Romania
NASA Astrophysics Data System (ADS)
Maftei, Raluca-Mihaela; Filipciuc, Constantina; Tudor, Elena
2014-05-01
Problems in the post-mining regions of Eastern Europe range from degraded land and landscapes, huge insecure dumps, surface cracks, soil pollution, lowering groundwater table, deforestation, and damaged cultural potentials to socio economic problems like unemployment or population decline. There is no common prescription for tackling the development of post-mining regions after mine closure nor is there a common definition of good practices or policy in this field. Key words : waste management, legislation, EU Directive, post mining Rosia Montana is a common oh 16 villages; one of them is also called Rosia Montana, a traditional mining Community, located in the Apuseni Mountains in the North-Western Romania. Beneath part of the village area lays one of the largest gold and silver deposits in Europe. In the Rosia Montana area mining had begun ever since the height of the Roman Empire. While the modern approach to mining demands careful remediation of environmental impacts, historically disused mines in this region have been abandoned, leaving widespread environmental damage. General legislative framework Strict regulations and procedures govern modern mining activity, including mitigation of all environmental impacts. Precious metals exploitation is put under GO no. 190/2000 re-published in 2004. The institutional framework was established and organized based on specific regulations, being represented by the following bodies: • The Ministry of Economy and Commerce (MEC), a public institution which develops the Government policy in the mining area, also provides the management of the public property in the mineral resources area; • The National Agency for the development and implementation of the mining Regions Reconstruction Programs (NAD), responsible with promotion of social mitigation measures and actions; • The Office for Industry Privatization, within the Education Ministry, responsible with privatization of companies under the CEM; • The National Agency for Mineral Resources (NAMR) manages, on behalf of the state, the mineral resources. Waste management framework Nowadays, Romania, is trying to align its regulation concerning mining activity to the European legislation taking into consideration waste management and their impact on the environment. Therefore the European Waste Catalog (Commission Decision 2001/118/EC) has been updated and published in the form of HG 856/2002 Waste management inventory and approved wastes list, including dangerous wastes. The HG 349/2005 establishes the legal framework for waste storage activity as well as for the monitoring of the closing and post-closing existing deposits, taking into account the environment protection and the health of the general population. Based on Directive 2000/60/EC the Ministry of Waters Administration, Forests and Environment Protection from Romania issued the GO No 756/1997 (amended by GO 532/2002 and GO 1144/2002),"Regulations for environment pollution assessment" that contains alarm and intervention rates for soil pollution for contaminants such as metals, metalloids (Sb, Ag, As, Be, Bi, B, Cd, Co, Cr, Cu, Hg, Mo, Ni, Pb, Se, Sn, TI, V, Zn) and cyanides. Also GO No 756/1997 was amended and updated by Law No 310/2004 and 112/2006 in witch technical instructions concerning general framework for the use of water sources in the human activities including mining industry, are approved. Chemical compounds contained in industrial waters are fully regulated by H. G. 352/2005 concerning the contents of waste water discharged. Directive 2006/21/EC of the European Parliament and of the Council relating to the management of waste from extractive industries and amending Directive 2004/35/EC is transposed into the national law of the Romanian Government under Decision No 856/2008. The 856/2008 Decision on the management of waste from extractive industries establishes "the legal framework concerning the guidelines, measures and procedures to prevent or reduce as far as possible any adverse effects on the environment, in particular water, air, soil, fauna, flora and landscape, and any health risks to the population, arising as a result of waste management in extractive industries". Based on the Commission decision 2009/339/EC concerning the waste management facilities - classification criteria - Romanian Government issued GO 2042/2010 witch states the procedures for approving the plan of waste management in extractive industries and its applications norms. Law No. 22/2001 fallows the regulations from the Espoo Convention on assessing the impact of mining on the environment sector in a cross-border context. This work is presented within the framework of SUSMIN project.
NASA Astrophysics Data System (ADS)
Baba, R.; Iijima, A.
2014-12-01
Conservation of biodiversity is one of the key issues in the environmental studies. As means to solve this issue, education is becoming increasingly important. In the previous work, we have developed a course of workshops on the conservation of biodiversity. To disseminate the course as a tool for environmental education, determination of the educational effect is essential. A text mining enables analyses of frequency and co-occurrence of words in the freely described texts. This study is intended to evaluate the effect of workshop by using text mining technique. We hosted the originally developed workshop on the conservation of biodiversity for 22 college students. The aim of the workshop was to inform the definition of biodiversity. Generally, biodiversity refers to the diversity of ecosystem, diversity between species, and diversity within species. To facilitate discussion, supplementary materials were used. For instance, field guides of wildlife species were used to discuss about the diversity of ecosystem. Moreover, a hierarchical framework in an ecological pyramid was shown for understanding the role of diversity between species. Besides, we offered a document material on the historical affair of Potato Famine in Ireland to discuss about the diversity within species from the genetic viewpoint. Before and after the workshop, we asked students for free description on the definition of biodiversity, and analyzed by using Tiny Text Miner. This technique enables Japanese language morphological analysis. Frequently-used words were sorted into some categories. Moreover, a principle component analysis was carried out. After the workshop, frequency of the words tagged to diversity between species and diversity within species has significantly increased. From a principle component analysis, the 1st component consists of the words such as producer, consumer, decomposer, and food chain. This indicates that the students have comprehended the close relationship between biodiversity and ecological pyramid. The 2nd component consists of the words such as gene, species, and ecosystem, suggesting that the students have correctly understood the 3 aspects of biodiversity. Consequently, this workshop shows an effect on acquirement of basic knowledge about the biodiversity.
Selection of remedial alternatives for mine sites: a multicriteria decision analysis approach.
Betrie, Getnet D; Sadiq, Rehan; Morin, Kevin A; Tesfamariam, Solomon
2013-04-15
The selection of remedial alternatives for mine sites is a complex task because it involves multiple criteria and often with conflicting objectives. However, an existing framework used to select remedial alternatives lacks multicriteria decision analysis (MCDA) aids and does not consider uncertainty in the selection of alternatives. The objective of this paper is to improve the existing framework by introducing deterministic and probabilistic MCDA methods. The Preference Ranking Organization Method for Enrichment Evaluation (PROMETHEE) methods have been implemented in this study. The MCDA analysis involves processing inputs to the PROMETHEE methods that are identifying the alternatives, defining the criteria, defining the criteria weights using analytical hierarchical process (AHP), defining the probability distribution of criteria weights, and conducting Monte Carlo Simulation (MCS); running the PROMETHEE methods using these inputs; and conducting a sensitivity analysis. A case study was presented to demonstrate the improved framework at a mine site. The results showed that the improved framework provides a reliable way of selecting remedial alternatives as well as quantifying the impact of different criteria on selecting alternatives. Copyright © 2013 Elsevier Ltd. All rights reserved.
Text and Structural Data Mining of Influenza Mentions in Web and Social Media
DOE Office of Scientific and Technical Information (OSTI.GOV)
Corley, Courtney D.; Cook, Diane; Mikler, Armin R.
Text and structural data mining of Web and social media (WSM) provides a novel disease surveillance resource and can identify online communities for targeted public health communications (PHC) to assure wide dissemination of pertinent information. WSM that mention influenza are harvested over a 24-week period, 5-October-2008 to 21-March-2009. Link analysis reveals communities for targeted PHC. Text mining is shown to identify trends in flu posts that correlate to real-world influenza-like-illness patient report data. We also bring to bear a graph-based data mining technique to detect anomalies among flu blogs connected by publisher type, links, and user-tags.
Vaccine adverse event text mining system for extracting features from vaccine safety reports.
Botsis, Taxiarchis; Buttolph, Thomas; Nguyen, Michael D; Winiecki, Scott; Woo, Emily Jane; Ball, Robert
2012-01-01
To develop and evaluate a text mining system for extracting key clinical features from vaccine adverse event reporting system (VAERS) narratives to aid in the automated review of adverse event reports. Based upon clinical significance to VAERS reviewing physicians, we defined the primary (diagnosis and cause of death) and secondary features (eg, symptoms) for extraction. We built a novel vaccine adverse event text mining (VaeTM) system based on a semantic text mining strategy. The performance of VaeTM was evaluated using a total of 300 VAERS reports in three sequential evaluations of 100 reports each. Moreover, we evaluated the VaeTM contribution to case classification; an information retrieval-based approach was used for the identification of anaphylaxis cases in a set of reports and was compared with two other methods: a dedicated text classifier and an online tool. The performance metrics of VaeTM were text mining metrics: recall, precision and F-measure. We also conducted a qualitative difference analysis and calculated sensitivity and specificity for classification of anaphylaxis cases based on the above three approaches. VaeTM performed best in extracting diagnosis, second level diagnosis, drug, vaccine, and lot number features (lenient F-measure in the third evaluation: 0.897, 0.817, 0.858, 0.874, and 0.914, respectively). In terms of case classification, high sensitivity was achieved (83.1%); this was equal and better compared to the text classifier (83.1%) and the online tool (40.7%), respectively. Our VaeTM implementation of a semantic text mining strategy shows promise in providing accurate and efficient extraction of key features from VAERS narratives.
NASA Astrophysics Data System (ADS)
Löwe, Peter; Klump, Jens; Robertson, Jesse
2015-04-01
Text mining is commonly employed as a tool in data science to investigate and chart emergent information from corpora of research abstracts, such as the Geophysical Research Abstracts (GRA) published by Copernicus. In this context current standards, such as persistent identifiers like DOI and ORCID, allow us to trace, cite and map links between journal publications, the underlying research data and scientific software. This network can be expressed as a directed graph which enables us to chart networks of cooperation and innovation, thematic foci and the locations of research communities in time and space. However, this approach of data science, focusing on the research process in a self-referential manner, rather than the topical work, is still in a developing stage. Scientific work presented at the EGU General Assembly is often the first step towards new approaches and innovative ideas to the geospatial community. It represents a rich, deep and heterogeneous source of geoscientific thought. This corpus is a significant data source for data science, which has not been analysed on this scale previously. In this work, the corpus of the Geophysical Research Abstracts is used for the first time as a data base for analyses of topical text mining. For this, we used a sturdy and customizable software framework, based on the work of Schmitt et al. [1]. For the analysis we used the High Performance Computing infrastructure of the German Research Centre for Geosciences GFZ in Potsdam, Germany. Here, we report on the first results from the analysis of the continuous spreading the of use of Free and Open Source Software Tools (FOSS) within the EGU communities, mapping the general increase of FOSS-themed GRA articles in the last decade and the developing spatial patterns of involved parties and FOSS topics. References: [1] Schmitt, L. M., Christianson, K.T, Gupta R..: Linguistic Computing with UNIX Tools, in Kao, A., Poteet S.R. (Eds.): Natural Language processing and Text Mining, Springer, 2007. doi:10.1007/978-1-84628-754-1_12.
DISEASES: text mining and data integration of disease-gene associations.
Pletscher-Frankild, Sune; Pallejà, Albert; Tsafou, Kalliopi; Binder, Janos X; Jensen, Lars Juhl
2015-03-01
Text mining is a flexible technology that can be applied to numerous different tasks in biology and medicine. We present a system for extracting disease-gene associations from biomedical abstracts. The system consists of a highly efficient dictionary-based tagger for named entity recognition of human genes and diseases, which we combine with a scoring scheme that takes into account co-occurrences both within and between sentences. We show that this approach is able to extract half of all manually curated associations with a false positive rate of only 0.16%. Nonetheless, text mining should not stand alone, but be combined with other types of evidence. For this reason, we have developed the DISEASES resource, which integrates the results from text mining with manually curated disease-gene associations, cancer mutation data, and genome-wide association studies from existing databases. The DISEASES resource is accessible through a web interface at http://diseases.jensenlab.org/, where the text-mining software and all associations are also freely available for download. Copyright © 2014 The Authors. Published by Elsevier Inc. All rights reserved.
Simmons, Michael; Singhal, Ayush; Lu, Zhiyong
2018-01-01
The key question of precision medicine is whether it is possible to find clinically actionable granularity in diagnosing disease and classifying patient risk. The advent of next generation sequencing and the widespread adoption of electronic health records (EHRs) have provided clinicians and researchers a wealth of data and made possible the precise characterization of individual patient genotypes and phenotypes. Unstructured text — found in biomedical publications and clinical notes — is an important component of genotype and phenotype knowledge. Publications in the biomedical literature provide essential information for interpreting genetic data. Likewise, clinical notes contain the richest source of phenotype information in EHRs. Text mining can render these texts computationally accessible and support information extraction and hypothesis generation. This chapter reviews the mechanics of text mining in precision medicine and discusses several specific use cases, including database curation for personalized cancer medicine, patient outcome prediction from EHR-derived cohorts, and pharmacogenomic research. Taken as a whole, these use cases demonstrate how text mining enables effective utilization of existing knowledge sources and thus promotes increased value for patients and healthcare systems. Text mining is an indispensable tool for translating genotype-phenotype data into effective clinical care that will undoubtedly play an important role in the eventual realization of precision medicine. PMID:27807747
Simmons, Michael; Singhal, Ayush; Lu, Zhiyong
2016-01-01
The key question of precision medicine is whether it is possible to find clinically actionable granularity in diagnosing disease and classifying patient risk. The advent of next-generation sequencing and the widespread adoption of electronic health records (EHRs) have provided clinicians and researchers a wealth of data and made possible the precise characterization of individual patient genotypes and phenotypes. Unstructured text-found in biomedical publications and clinical notes-is an important component of genotype and phenotype knowledge. Publications in the biomedical literature provide essential information for interpreting genetic data. Likewise, clinical notes contain the richest source of phenotype information in EHRs. Text mining can render these texts computationally accessible and support information extraction and hypothesis generation. This chapter reviews the mechanics of text mining in precision medicine and discusses several specific use cases, including database curation for personalized cancer medicine, patient outcome prediction from EHR-derived cohorts, and pharmacogenomic research. Taken as a whole, these use cases demonstrate how text mining enables effective utilization of existing knowledge sources and thus promotes increased value for patients and healthcare systems. Text mining is an indispensable tool for translating genotype-phenotype data into effective clinical care that will undoubtedly play an important role in the eventual realization of precision medicine.
Using Text Mining to Uncover Students' Technology-Related Problems in Live Video Streaming
ERIC Educational Resources Information Center
Abdous, M'hammed; He, Wu
2011-01-01
Because of their capacity to sift through large amounts of data, text mining and data mining are enabling higher education institutions to reveal valuable patterns in students' learning behaviours without having to resort to traditional survey methods. In an effort to uncover live video streaming (LVS) students' technology related-problems and to…
Hammond, Kenric W; Ben-Ari, Alon Y; Laundry, Ryan J; Boyko, Edward J; Samore, Matthew H
2015-12-01
Free text in electronic health records resists large-scale analysis. Text records facts of interest not found in encoded data, and text mining enables their retrieval and quantification. The U.S. Department of Veterans Affairs (VA) clinical data repository affords an opportunity to apply text-mining methodology to study clinical questions in large populations. To assess the feasibility of text mining, investigation of the relationship between exposure to adverse childhood experiences (ACEs) and recorded diagnoses was conducted among all VA-treated Gulf war veterans, utilizing all progress notes recorded from 2000-2011. Text processing extracted ACE exposures recorded among 44.7 million clinical notes belonging to 243,973 veterans. The relationship of ACE exposure to adult illnesses was analyzed using logistic regression. Bias considerations were assessed. ACE score was strongly associated with suicide attempts and serious mental disorders (ORs = 1.84 to 1.97), and less so with behaviorally mediated and somatic conditions (ORs = 1.02 to 1.36) per unit. Bias adjustments did not remove persistent associations between ACE score and most illnesses. Text mining to detect ACE exposure in a large population was feasible. Analysis of the relationship between ACE score and adult health conditions yielded patterns of association consistent with prior research. Copyright © 2015 International Society for Traumatic Stress Studies.
NASA Astrophysics Data System (ADS)
Prno, Jason; Slocombe, D. Scott
2014-03-01
The concept of a "social license to operate" (SLO) was coined in the 1990s and gained popularity as one way in which "social" considerations can be addressed in mineral development decision making. The need for a SLO implies that developers require the widespread approval of local community members for their projects to avoid exposure to potentially costly conflict and business risks. Only a limited amount of scholarship exists on the topic, and there is a need for research that specifically addresses the complex and changeable nature of SLO outcomes. In response to these challenges, this paper advances a novel, systems-based conceptual framework for assessing SLO determinants and outcomes in the mining industry. Two strands of systems theory are specifically highlighted—complex adaptive systems and resilience—and the roles of context, key system variables, emergence, change, uncertainty, feedbacks, cross-scale effects, multiple stable states, thresholds, and resilience are discussed. The framework was developed from the results of a multi-year research project which involved international mining case study investigations, a comprehensive literature review, and interviews conducted with mining stakeholders and observers. The framework can help guide SLO analysis and management efforts, by encouraging users to account for important contextual and complexity-oriented elements present in SLO settings. We apply the framework to a case study in Alaska, USA before discussing its merits and challenges. We also illustrate knowledge gaps associated with applications of complex adaptive systems and resilience theories to the study of SLO dynamics, and discuss opportunities for future research.
Prno, Jason; Slocombe, D Scott
2014-03-01
The concept of a "social license to operate" (SLO) was coined in the 1990s and gained popularity as one way in which "social" considerations can be addressed in mineral development decision making. The need for a SLO implies that developers require the widespread approval of local community members for their projects to avoid exposure to potentially costly conflict and business risks. Only a limited amount of scholarship exists on the topic, and there is a need for research that specifically addresses the complex and changeable nature of SLO outcomes. In response to these challenges, this paper advances a novel, systems-based conceptual framework for assessing SLO determinants and outcomes in the mining industry. Two strands of systems theory are specifically highlighted-complex adaptive systems and resilience-and the roles of context, key system variables, emergence, change, uncertainty, feedbacks, cross-scale effects, multiple stable states, thresholds, and resilience are discussed. The framework was developed from the results of a multi-year research project which involved international mining case study investigations, a comprehensive literature review, and interviews conducted with mining stakeholders and observers. The framework can help guide SLO analysis and management efforts, by encouraging users to account for important contextual and complexity-oriented elements present in SLO settings. We apply the framework to a case study in Alaska, USA before discussing its merits and challenges. We also illustrate knowledge gaps associated with applications of complex adaptive systems and resilience theories to the study of SLO dynamics, and discuss opportunities for future research.
Text mining and its potential applications in systems biology.
Ananiadou, Sophia; Kell, Douglas B; Tsujii, Jun-ichi
2006-12-01
With biomedical literature increasing at a rate of several thousand papers per week, it is impossible to keep abreast of all developments; therefore, automated means to manage the information overload are required. Text mining techniques, which involve the processes of information retrieval, information extraction and data mining, provide a means of solving this. By adding meaning to text, these techniques produce a more structured analysis of textual knowledge than simple word searches, and can provide powerful tools for the production and analysis of systems biology models.
PubMedPortable: A Framework for Supporting the Development of Text Mining Applications.
Döring, Kersten; Grüning, Björn A; Telukunta, Kiran K; Thomas, Philippe; Günther, Stefan
2016-01-01
Information extraction from biomedical literature is continuously growing in scope and importance. Many tools exist that perform named entity recognition, e.g. of proteins, chemical compounds, and diseases. Furthermore, several approaches deal with the extraction of relations between identified entities. The BioCreative community supports these developments with yearly open challenges, which led to a standardised XML text annotation format called BioC. PubMed provides access to the largest open biomedical literature repository, but there is no unified way of connecting its data to natural language processing tools. Therefore, an appropriate data environment is needed as a basis to combine different software solutions and to develop customised text mining applications. PubMedPortable builds a relational database and a full text index on PubMed citations. It can be applied either to the complete PubMed data set or an arbitrary subset of downloaded PubMed XML files. The software provides the infrastructure to combine stand-alone applications by exporting different data formats, e.g. BioC. The presented workflows show how to use PubMedPortable to retrieve, store, and analyse a disease-specific data set. The provided use cases are well documented in the PubMedPortable wiki. The open-source software library is small, easy to use, and scalable to the user's system requirements. It is freely available for Linux on the web at https://github.com/KerstenDoering/PubMedPortable and for other operating systems as a virtual container. The approach was tested extensively and applied successfully in several projects.
PubMedPortable: A Framework for Supporting the Development of Text Mining Applications
Döring, Kersten; Grüning, Björn A.; Telukunta, Kiran K.; Thomas, Philippe; Günther, Stefan
2016-01-01
Information extraction from biomedical literature is continuously growing in scope and importance. Many tools exist that perform named entity recognition, e.g. of proteins, chemical compounds, and diseases. Furthermore, several approaches deal with the extraction of relations between identified entities. The BioCreative community supports these developments with yearly open challenges, which led to a standardised XML text annotation format called BioC. PubMed provides access to the largest open biomedical literature repository, but there is no unified way of connecting its data to natural language processing tools. Therefore, an appropriate data environment is needed as a basis to combine different software solutions and to develop customised text mining applications. PubMedPortable builds a relational database and a full text index on PubMed citations. It can be applied either to the complete PubMed data set or an arbitrary subset of downloaded PubMed XML files. The software provides the infrastructure to combine stand-alone applications by exporting different data formats, e.g. BioC. The presented workflows show how to use PubMedPortable to retrieve, store, and analyse a disease-specific data set. The provided use cases are well documented in the PubMedPortable wiki. The open-source software library is small, easy to use, and scalable to the user’s system requirements. It is freely available for Linux on the web at https://github.com/KerstenDoering/PubMedPortable and for other operating systems as a virtual container. The approach was tested extensively and applied successfully in several projects. PMID:27706202
Model architecture of intelligent data mining oriented urban transportation information
NASA Astrophysics Data System (ADS)
Yang, Bogang; Tao, Yingchun; Sui, Jianbo; Zhang, Feizhou
2007-06-01
Aiming at solving practical problems in urban traffic, the paper presents model architecture of intelligent data mining from hierarchical view. With artificial intelligent technologies used in the framework, the intelligent data mining technology improves, which is more suitable for the change of real-time road condition. It also provides efficient technology support for the urban transport information distribution, transmission and display.
VisualUrText: A Text Analytics Tool for Unstructured Textual Data
NASA Astrophysics Data System (ADS)
Zainol, Zuraini; Jaymes, Mohd T. H.; Nohuddin, Puteri N. E.
2018-05-01
The growing amount of unstructured text over Internet is tremendous. Text repositories come from Web 2.0, business intelligence and social networking applications. It is also believed that 80-90% of future growth data is available in the form of unstructured text databases that may potentially contain interesting patterns and trends. Text Mining is well known technique for discovering interesting patterns and trends which are non-trivial knowledge from massive unstructured text data. Text Mining covers multidisciplinary fields involving information retrieval (IR), text analysis, natural language processing (NLP), data mining, machine learning statistics and computational linguistics. This paper discusses the development of text analytics tool that is proficient in extracting, processing, analyzing the unstructured text data and visualizing cleaned text data into multiple forms such as Document Term Matrix (DTM), Frequency Graph, Network Analysis Graph, Word Cloud and Dendogram. This tool, VisualUrText, is developed to assist students and researchers for extracting interesting patterns and trends in document analyses.
A UIMA wrapper for the NCBO annotator
Roeder, Christophe; Jonquet, Clement; Shah, Nigam H.; Baumgartner, William A.; Verspoor, Karin; Hunter, Lawrence
2010-01-01
Summary: The Unstructured Information Management Architecture (UIMA) framework and web services are emerging as useful tools for integrating biomedical text mining tools. This note describes our work, which wraps the National Center for Biomedical Ontology (NCBO) Annotator—an ontology-based annotation service—to make it available as a component in UIMA workflows. Availability: This wrapper is freely available on the web at http://bionlp-uima.sourceforge.net/ as part of the UIMA tools distribution from the Center for Computational Pharmacology (CCP) at the University of Colorado School of Medicine. It has been implemented in Java for support on Mac OS X, Linux and MS Windows. Contact: chris.roeder@ucdenver.edu PMID:20505005
BioCreative Workshops for DOE Genome Sciences: Text Mining for Metagenomics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wu, Cathy H.; Hirschman, Lynette
The objective of this project was to host BioCreative workshops to define and develop text mining tasks to meet the needs of the Genome Sciences community, focusing on metadata information extraction in metagenomics. Following the successful introduction of metagenomics at the BioCreative IV workshop, members of the metagenomics community and BioCreative communities continued discussion to identify candidate topics for a BioCreative metagenomics track for BioCreative V. Of particular interest was the capture of environmental and isolation source information from text. The outcome was to form a “community of interest” around work on the interactive EXTRACT system, which supported interactive taggingmore » of environmental and species data. This experiment is included in the BioCreative V virtual issue of Database. In addition, there was broad participation by members of the metagenomics community in the panels held at BioCreative V, leading to valuable exchanges between the text mining developers and members of the metagenomics research community. These exchanges are reflected in a number of the overview and perspective pieces also being captured in the BioCreative V virtual issue. Overall, this conversation has exposed the metagenomics researchers to the possibilities of text mining, and educated the text mining developers to the specific needs of the metagenomics community.« less
Benchmarking infrastructure for mutation text mining
2014-01-01
Background Experimental research on the automatic extraction of information about mutations from texts is greatly hindered by the lack of consensus evaluation infrastructure for the testing and benchmarking of mutation text mining systems. Results We propose a community-oriented annotation and benchmarking infrastructure to support development, testing, benchmarking, and comparison of mutation text mining systems. The design is based on semantic standards, where RDF is used to represent annotations, an OWL ontology provides an extensible schema for the data and SPARQL is used to compute various performance metrics, so that in many cases no programming is needed to analyze results from a text mining system. While large benchmark corpora for biological entity and relation extraction are focused mostly on genes, proteins, diseases, and species, our benchmarking infrastructure fills the gap for mutation information. The core infrastructure comprises (1) an ontology for modelling annotations, (2) SPARQL queries for computing performance metrics, and (3) a sizeable collection of manually curated documents, that can support mutation grounding and mutation impact extraction experiments. Conclusion We have developed the principal infrastructure for the benchmarking of mutation text mining tasks. The use of RDF and OWL as the representation for corpora ensures extensibility. The infrastructure is suitable for out-of-the-box use in several important scenarios and is ready, in its current state, for initial community adoption. PMID:24568600
Benchmarking infrastructure for mutation text mining.
Klein, Artjom; Riazanov, Alexandre; Hindle, Matthew M; Baker, Christopher Jo
2014-02-25
Experimental research on the automatic extraction of information about mutations from texts is greatly hindered by the lack of consensus evaluation infrastructure for the testing and benchmarking of mutation text mining systems. We propose a community-oriented annotation and benchmarking infrastructure to support development, testing, benchmarking, and comparison of mutation text mining systems. The design is based on semantic standards, where RDF is used to represent annotations, an OWL ontology provides an extensible schema for the data and SPARQL is used to compute various performance metrics, so that in many cases no programming is needed to analyze results from a text mining system. While large benchmark corpora for biological entity and relation extraction are focused mostly on genes, proteins, diseases, and species, our benchmarking infrastructure fills the gap for mutation information. The core infrastructure comprises (1) an ontology for modelling annotations, (2) SPARQL queries for computing performance metrics, and (3) a sizeable collection of manually curated documents, that can support mutation grounding and mutation impact extraction experiments. We have developed the principal infrastructure for the benchmarking of mutation text mining tasks. The use of RDF and OWL as the representation for corpora ensures extensibility. The infrastructure is suitable for out-of-the-box use in several important scenarios and is ready, in its current state, for initial community adoption.
DrugQuest - a text mining workflow for drug association discovery.
Papanikolaou, Nikolas; Pavlopoulos, Georgios A; Theodosiou, Theodosios; Vizirianakis, Ioannis S; Iliopoulos, Ioannis
2016-06-06
Text mining and data integration methods are gaining ground in the field of health sciences due to the exponential growth of bio-medical literature and information stored in biological databases. While such methods mostly try to extract bioentity associations from PubMed, very few of them are dedicated in mining other types of repositories such as chemical databases. Herein, we apply a text mining approach on the DrugBank database in order to explore drug associations based on the DrugBank "Description", "Indication", "Pharmacodynamics" and "Mechanism of Action" text fields. We apply Name Entity Recognition (NER) techniques on these fields to identify chemicals, proteins, genes, pathways, diseases, and we utilize the TextQuest algorithm to find additional biologically significant words. Using a plethora of similarity and partitional clustering techniques, we group the DrugBank records based on their common terms and investigate possible scenarios why these records are clustered together. Different views such as clustered chemicals based on their textual information, tag clouds consisting of Significant Terms along with the terms that were used for clustering are delivered to the user through a user-friendly web interface. DrugQuest is a text mining tool for knowledge discovery: it is designed to cluster DrugBank records based on text attributes in order to find new associations between drugs. The service is freely available at http://bioinformatics.med.uoc.gr/drugquest .
ERIC Educational Resources Information Center
Kong, Siu Cheung; Li, Ping; Song, Yanjie
2018-01-01
This study evaluated a bilingual text-mining system, which incorporated a bilingual taxonomy of key words and provided hierarchical visualization, for understanding learner-generated text in the learning management systems through automatic identification and counting of matching key words. A class of 27 in-service teachers studied a course…
Beyond accuracy: creating interoperable and scalable text-mining web services.
Wei, Chih-Hsuan; Leaman, Robert; Lu, Zhiyong
2016-06-15
The biomedical literature is a knowledge-rich resource and an important foundation for future research. With over 24 million articles in PubMed and an increasing growth rate, research in automated text processing is becoming increasingly important. We report here our recently developed web-based text mining services for biomedical concept recognition and normalization. Unlike most text-mining software tools, our web services integrate several state-of-the-art entity tagging systems (DNorm, GNormPlus, SR4GN, tmChem and tmVar) and offer a batch-processing mode able to process arbitrary text input (e.g. scholarly publications, patents and medical records) in multiple formats (e.g. BioC). We support multiple standards to make our service interoperable and allow simpler integration with other text-processing pipelines. To maximize scalability, we have preprocessed all PubMed articles, and use a computer cluster for processing large requests of arbitrary text. Our text-mining web service is freely available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/#curl : Zhiyong.Lu@nih.gov. Published by Oxford University Press 2016. This work is written by US Government employees and is in the public domain in the US.
75 FR 51291 - National Science Board: Sunshine Act Meetings; Notice
Federal Register 2010, 2011, 2012, 2013, 2014
2010-08-19
...-Gathering Activities. [cir] COV Report Text-Mining. [cir] Design of Research Questions for External Input. [cir] SBE/CISE Text-Mining Projects. [cir] Using a Blog for Informal Input. Committee on Education and...
Argo: an integrative, interactive, text mining-based workbench supporting curation
Rak, Rafal; Rowley, Andrew; Black, William; Ananiadou, Sophia
2012-01-01
Curation of biomedical literature is often supported by the automatic analysis of textual content that generally involves a sequence of individual processing components. Text mining (TM) has been used to enhance the process of manual biocuration, but has been focused on specific databases and tasks rather than an environment integrating TM tools into the curation pipeline, catering for a variety of tasks, types of information and applications. Processing components usually come from different sources and often lack interoperability. The well established Unstructured Information Management Architecture is a framework that addresses interoperability by defining common data structures and interfaces. However, most of the efforts are targeted towards software developers and are not suitable for curators, or are otherwise inconvenient to use on a higher level of abstraction. To overcome these issues we introduce Argo, an interoperable, integrative, interactive and collaborative system for text analysis with a convenient graphic user interface to ease the development of processing workflows and boost productivity in labour-intensive manual curation. Robust, scalable text analytics follow a modular approach, adopting component modules for distinct levels of text analysis. The user interface is available entirely through a web browser that saves the user from going through often complicated and platform-dependent installation procedures. Argo comes with a predefined set of processing components commonly used in text analysis, while giving the users the ability to deposit their own components. The system accommodates various areas and levels of user expertise, from TM and computational linguistics to ontology-based curation. One of the key functionalities of Argo is its ability to seamlessly incorporate user-interactive components, such as manual annotation editors, into otherwise completely automatic pipelines. As a use case, we demonstrate the functionality of an in-built manual annotation editor that is well suited for in-text corpus annotation tasks. Database URL: http://www.nactem.ac.uk/Argo PMID:22434844
Application of data mining in performance measures
NASA Astrophysics Data System (ADS)
Chan, Michael F. S.; Chung, Walter W.; Wong, Tai Sun
2001-10-01
This paper proposes a structured framework for exploiting data mining application for performance measures. The context is set in an airline company is illustrated for the use of such framework. The framework takes in consideration of how a knowledge worker interacts with performance information at the enterprise level to support them to make informed decision in managing the effectiveness of operations. A case study of applying data mining technology for performance data in an airline company is illustrated. The use of performance measures is specifically applied to assist in the aircraft delay management process. The increasingly dispersed and complex operations of airline operation put much strain on the part of knowledge worker in using search, acquiring and analyzing information to manage performance. One major problem faced with knowledge workers is the identification of root causes of performance deficiency. The large amount of factors involved in the analyze the root causes can be time consuming and the objective of applying data mining technology is to reduce the time and resources needed for such process. The increasing market competition for better performance management in various industries gives rises to need of the intelligent use of data. Because of this, the framework proposed here is very much generalizable to industries such as manufacturing. It could assist knowledge workers who are constantly looking for ways to improve operation effectiveness through new initiatives and the effort is required to be quickly done to gain competitive advantage in the marketplace.
Imitating manual curation of text-mined facts in biomedicine.
Rodriguez-Esteban, Raul; Iossifov, Ivan; Rzhetsky, Andrey
2006-09-08
Text-mining algorithms make mistakes in extracting facts from natural-language texts. In biomedical applications, which rely on use of text-mined data, it is critical to assess the quality (the probability that the message is correctly extracted) of individual facts--to resolve data conflicts and inconsistencies. Using a large set of almost 100,000 manually produced evaluations (most facts were independently reviewed more than once, producing independent evaluations), we implemented and tested a collection of algorithms that mimic human evaluation of facts provided by an automated information-extraction system. The performance of our best automated classifiers closely approached that of our human evaluators (ROC score close to 0.95). Our hypothesis is that, were we to use a larger number of human experts to evaluate any given sentence, we could implement an artificial-intelligence curator that would perform the classification job at least as accurately as an average individual human evaluator. We illustrated our analysis by visualizing the predicted accuracy of the text-mined relations involving the term cocaine.
Assimilating Text-Mining & Bio-Informatics Tools to Analyze Cellulase structures
NASA Astrophysics Data System (ADS)
Satyasree, K. P. N. V., Dr; Lalitha Kumari, B., Dr; Jyotsna Devi, K. S. N. V.; Choudri, S. M. Roy; Pratap Joshi, K.
2017-08-01
Text-mining is one of the best potential way of automatically extracting information from the huge biological literature. To exploit its prospective, the knowledge encrypted in the text should be converted to some semantic representation such as entities and relations, which could be analyzed by machines. But large-scale practical systems for this purpose are rare. But text mining could be helpful for generating or validating predictions. Cellulases have abundant applications in various industries. Cellulose degrading enzymes are cellulases and the same producing bacteria - Bacillus subtilis & fungus Pseudomonas putida were isolated from top soil of Guntur Dt. A.P. India. Absolute cultures were conserved on potato dextrose agar medium for molecular studies. In this paper, we presented how well the text mining concepts can be used to analyze cellulase producing bacteria and fungi, their comparative structures are also studied with the aid of well-establised, high quality standard bioinformatic tools such as Bioedit, Swissport, Protparam, EMBOSSwin with which a complete data on Cellulases like structure, constituents of the enzyme has been obtained.
Gurulingappa, Harsha; Toldo, Luca; Rajput, Abdul Mateen; Kors, Jan A; Taweel, Adel; Tayrouz, Yorki
2013-11-01
The aim of this study was to assess the impact of automatically detected adverse event signals from text and open-source data on the prediction of drug label changes. Open-source adverse effect data were collected from FAERS, Yellow Cards and SIDER databases. A shallow linguistic relation extraction system (JSRE) was applied for extraction of adverse effects from MEDLINE case reports. Statistical approach was applied on the extracted datasets for signal detection and subsequent prediction of label changes issued for 29 drugs by the UK Regulatory Authority in 2009. 76% of drug label changes were automatically predicted. Out of these, 6% of drug label changes were detected only by text mining. JSRE enabled precise identification of four adverse drug events from MEDLINE that were undetectable otherwise. Changes in drug labels can be predicted automatically using data and text mining techniques. Text mining technology is mature and well-placed to support the pharmacovigilance tasks. Copyright © 2013 John Wiley & Sons, Ltd.
Mining Adverse Drug Reactions in Social Media with Named Entity Recognition and Semantic Methods.
Chen, Xiaoyi; Deldossi, Myrtille; Aboukhamis, Rim; Faviez, Carole; Dahamna, Badisse; Karapetiantz, Pierre; Guenegou-Arnoux, Armelle; Girardeau, Yannick; Guillemin-Lanne, Sylvie; Lillo-Le-Louët, Agnès; Texier, Nathalie; Burgun, Anita; Katsahian, Sandrine
2017-01-01
Suspected adverse drug reactions (ADR) reported by patients through social media can be a complementary source to current pharmacovigilance systems. However, the performance of text mining tools applied to social media text data to discover ADRs needs to be evaluated. In this paper, we introduce the approach developed to mine ADR from French social media. A protocol of evaluation is highlighted, which includes a detailed sample size determination and evaluation corpus constitution. Our text mining approach provided very encouraging preliminary results with F-measures of 0.94 and 0.81 for recognition of drugs and symptoms respectively, and with F-measure of 0.70 for ADR detection. Therefore, this approach is promising for downstream pharmacovigilance analysis.
Detection and Evaluation of Cheating on College Exams Using Supervised Classification
ERIC Educational Resources Information Center
Cavalcanti, Elmano Ramalho; Pires, Carlos Eduardo; Cavalcanti, Elmano Pontes; Pires, Vládia Freire
2012-01-01
Text mining has been used for various purposes, such as document classification and extraction of domain-specific information from text. In this paper we present a study in which text mining methodology and algorithms were properly employed for academic dishonesty (cheating) detection and evaluation on open-ended college exams, based on document…
Chemical Topic Modeling: Exploring Molecular Data Sets Using a Common Text-Mining Approach.
Schneider, Nadine; Fechner, Nikolas; Landrum, Gregory A; Stiefl, Nikolaus
2017-08-28
Big data is one of the key transformative factors which increasingly influences all aspects of modern life. Although this transformation brings vast opportunities it also generates novel challenges, not the least of which is organizing and searching this data deluge. The field of medicinal chemistry is not different: more and more data are being generated, for instance, by technologies such as DNA encoded libraries, peptide libraries, text mining of large literature corpora, and new in silico enumeration methods. Handling those huge sets of molecules effectively is quite challenging and requires compromises that often come at the expense of the interpretability of the results. In order to find an intuitive and meaningful approach to organizing large molecular data sets, we adopted a probabilistic framework called "topic modeling" from the text-mining field. Here we present the first chemistry-related implementation of this method, which allows large molecule sets to be assigned to "chemical topics" and investigating the relationships between those. In this first study, we thoroughly evaluate this novel method in different experiments and discuss both its disadvantages and advantages. We show very promising results in reproducing human-assigned concepts using the approach to identify and retrieve chemical series from sets of molecules. We have also created an intuitive visualization of the chemical topics output by the algorithm. This is a huge benefit compared to other unsupervised machine-learning methods, like clustering, which are commonly used to group sets of molecules. Finally, we applied the new method to the 1.6 million molecules of the ChEMBL22 data set to test its robustness and efficiency. In about 1 h we built a 100-topic model of this large data set in which we could identify interesting topics like "proteins", "DNA", or "steroids". Along with this publication we provide our data sets and an open-source implementation of the new method (CheTo) which will be part of an upcoming version of the open-source cheminformatics toolkit RDKit.
Data Reduction Algorithm Using Nonnegative Matrix Factorization with Nonlinear Constraints
NASA Astrophysics Data System (ADS)
Sembiring, Pasukat
2017-12-01
Processing ofdata with very large dimensions has been a hot topic in recent decades. Various techniques have been proposed in order to execute the desired information or structure. Non- Negative Matrix Factorization (NMF) based on non-negatives data has become one of the popular methods for shrinking dimensions. The main strength of this method is non-negative object, the object model by a combination of some basic non-negative parts, so as to provide a physical interpretation of the object construction. The NMF is a dimension reduction method thathasbeen used widely for numerous applications including computer vision,text mining, pattern recognitions,and bioinformatics. Mathematical formulation for NMF did not appear as a convex optimization problem and various types of algorithms have been proposed to solve the problem. The Framework of Alternative Nonnegative Least Square(ANLS) are the coordinates of the block formulation approaches that have been proven reliable theoretically and empirically efficient. This paper proposes a new algorithm to solve NMF problem based on the framework of ANLS.This algorithm inherits the convergenceproperty of the ANLS framework to nonlinear constraints NMF formulations.
Thematic clustering of text documents using an EM-based approach
2012-01-01
Clustering textual contents is an important step in mining useful information on the web or other text-based resources. The common task in text clustering is to handle text in a multi-dimensional space, and to partition documents into groups, where each group contains documents that are similar to each other. However, this strategy lacks a comprehensive view for humans in general since it cannot explain the main subject of each cluster. Utilizing semantic information can solve this problem, but it needs a well-defined ontology or pre-labeled gold standard set. In this paper, we present a thematic clustering algorithm for text documents. Given text, subject terms are extracted and used for clustering documents in a probabilistic framework. An EM approach is used to ensure documents are assigned to correct subjects, hence it converges to a locally optimal solution. The proposed method is distinctive because its results are sufficiently explanatory for human understanding as well as efficient for clustering performance. The experimental results show that the proposed method provides a competitive performance compared to other state-of-the-art approaches. We also show that the extracted themes from the MEDLINE® dataset represent the subjects of clusters reasonably well. PMID:23046528
Semantic retrieval and navigation in clinical document collections.
Kreuzthaler, Markus; Daumke, Philipp; Schulz, Stefan
2015-01-01
Patients with chronic diseases undergo numerous in- and outpatient treatment periods, and therefore many documents accumulate in their electronic records. We report on an on-going project focussing on the semantic enrichment of medical texts, in order to support recall-oriented navigation across a patient's complete documentation. A document pool of 1,696 de-identified discharge summaries was used for prototyping. A natural language processing toolset for document annotation (based on the text-mining framework UIMA) and indexing (Solr) was used to support a browser-based platform for document import, search and navigation. The integrated search engine combines free text and concept-based querying, supported by dynamically generated facets (diagnoses, procedures, medications, lab values, and body parts). The prototype demonstrates the feasibility of semantic document enrichment within document collections of a single patient. Originally conceived as an add-on for the clinical workplace, this technology could also be adapted to support personalised health record platforms, as well as cross-patient search for cohort building and other secondary use scenarios.
What the papers say: Text mining for genomics and systems biology
2010-01-01
Keeping up with the rapidly growing literature has become virtually impossible for most scientists. This can have dire consequences. First, we may waste research time and resources on reinventing the wheel simply because we can no longer maintain a reliable grasp on the published literature. Second, and perhaps more detrimental, judicious (or serendipitous) combination of knowledge from different scientific disciplines, which would require following disparate and distinct research literatures, is rapidly becoming impossible for even the most ardent readers of research publications. Text mining -- the automated extraction of information from (electronically) published sources -- could potentially fulfil an important role -- but only if we know how to harness its strengths and overcome its weaknesses. As we do not expect that the rate at which scientific results are published will decrease, text mining tools are now becoming essential in order to cope with, and derive maximum benefit from, this information explosion. In genomics, this is particularly pressing as more and more rare disease-causing variants are found and need to be understood. Not being conversant with this technology may put scientists and biomedical regulators at a severe disadvantage. In this review, we introduce the basic concepts underlying modern text mining and its applications in genomics and systems biology. We hope that this review will serve three purposes: (i) to provide a timely and useful overview of the current status of this field, including a survey of present challenges; (ii) to enable researchers to decide how and when to apply text mining tools in their own research; and (iii) to highlight how the research communities in genomics and systems biology can help to make text mining from biomedical abstracts and texts more straightforward. PMID:21106487
Knowledge based word-concept model estimation and refinement for biomedical text mining.
Jimeno Yepes, Antonio; Berlanga, Rafael
2015-02-01
Text mining of scientific literature has been essential for setting up large public biomedical databases, which are being widely used by the research community. In the biomedical domain, the existence of a large number of terminological resources and knowledge bases (KB) has enabled a myriad of machine learning methods for different text mining related tasks. Unfortunately, KBs have not been devised for text mining tasks but for human interpretation, thus performance of KB-based methods is usually lower when compared to supervised machine learning methods. The disadvantage of supervised methods though is they require labeled training data and therefore not useful for large scale biomedical text mining systems. KB-based methods do not have this limitation. In this paper, we describe a novel method to generate word-concept probabilities from a KB, which can serve as a basis for several text mining tasks. This method not only takes into account the underlying patterns within the descriptions contained in the KB but also those in texts available from large unlabeled corpora such as MEDLINE. The parameters of the model have been estimated without training data. Patterns from MEDLINE have been built using MetaMap for entity recognition and related using co-occurrences. The word-concept probabilities were evaluated on the task of word sense disambiguation (WSD). The results showed that our method obtained a higher degree of accuracy than other state-of-the-art approaches when evaluated on the MSH WSD data set. We also evaluated our method on the task of document ranking using MEDLINE citations. These results also showed an increase in performance over existing baseline retrieval approaches. Copyright © 2014 Elsevier Inc. All rights reserved.
Assessing semantic similarity of texts - Methods and algorithms
NASA Astrophysics Data System (ADS)
Rozeva, Anna; Zerkova, Silvia
2017-12-01
Assessing the semantic similarity of texts is an important part of different text-related applications like educational systems, information retrieval, text summarization, etc. This task is performed by sophisticated analysis, which implements text-mining techniques. Text mining involves several pre-processing steps, which provide for obtaining structured representative model of the documents in a corpus by means of extracting and selecting the features, characterizing their content. Generally the model is vector-based and enables further analysis with knowledge discovery approaches. Algorithms and measures are used for assessing texts at syntactical and semantic level. An important text-mining method and similarity measure is latent semantic analysis (LSA). It provides for reducing the dimensionality of the document vector space and better capturing the text semantics. The mathematical background of LSA for deriving the meaning of the words in a given text by exploring their co-occurrence is examined. The algorithm for obtaining the vector representation of words and their corresponding latent concepts in a reduced multidimensional space as well as similarity calculation are presented.
Evaluating the Impact of Modern Copper Mining on Ecosystem Services in Southern Arizona
NASA Astrophysics Data System (ADS)
Virgone, K.; Brusseau, M. L.; Ramirez-Andreotta, M.; Coeurdray, M.; Poupeau, F.
2014-12-01
Historic mining practices were conducted with little environmental forethought, and hence generated a legacy of environmental and human-health impacts. However, an awareness and understanding of the impacts of mining on ecosystem services has developed over the past few decades. Ecosystem services are defined as benefits that humans obtain from ecosystems, and upon which they are fundamentally dependent for their survival. Ecosystem services are divided into four categories including provisioning services (i.e., food, water, timber, and fiber); regulating services (i.e., climate, floods, disease, wastes, and water quality); supporting services (i.e., soil formation, photosynthesis, and nutrient cycling) and cultural services (i.e., recreational, aesthetic, and spiritual benefits) (Millennium Ecosystem Assessment, 2005). Sustainable mining practices have been and are being developed in an effort to protect and preserve ecosystem services. This and related efforts constitute a new generation of "modern" mines, which are defined as those that are designed and permitted under contemporary environmental legislation. The objective of this study is to develop a framework to monitor and assess the impact of modern mining practices and sustainable mineral development on ecosystem services. Using the sustainability performance indicators from the Global Reporting Initiative (GRI) as a starting point, we develop a framework that is reflective of and adaptive to specific local conditions. Impacts on surface and groundwater water quality and quantity are anticipated to be of most importance to the southern Arizona region, which is struggling to meet urban and environmental water demands due to population growth and climate change. We seek to build a more comprehensive and effective assessment framework by incorporating socio-economic aspects via community engaged research, including economic valuations, community-initiated environmental monitoring, and environmental human-health education programs.
ERIC Educational Resources Information Center
Benoit, Gerald
2002-01-01
Discusses data mining (DM) and knowledge discovery in databases (KDD), taking the view that KDD is the larger view of the entire process, with DM emphasizing the cleaning, warehousing, mining, and visualization of knowledge discovery in databases. Highlights include algorithms; users; the Internet; text mining; and information extraction.…
A Framework for Mining Actionable Navigation Patterns from In-Store RFID Datasets via Indoor Mapping
Shen, Bin; Zheng, Qiuhua; Li, Xingsen; Xu, Libo
2015-01-01
With the quick development of RFID technology and the decreasing prices of RFID devices, RFID is becoming widely used in various intelligent services. Especially in the retail application domain, RFID is increasingly adopted to capture the shopping tracks and behavior of in-store customers. To further enhance the potential of this promising application, in this paper, we propose a unified framework for RFID-based path analytics, which uses both in-store shopping paths and RFID-based purchasing data to mine actionable navigation patterns. Four modules of this framework are discussed, which are: (1) mapping from the physical space to the cyber space, (2) data preprocessing, (3) pattern mining and (4) knowledge understanding and utilization. In the data preprocessing module, the critical problem of how to capture the mainstream shopping path sequences while wiping out unnecessary redundant and repeated details is addressed in detail. To solve this problem, two types of redundant patterns, i.e., loop repeat pattern and palindrome-contained pattern are recognized and the corresponding processing algorithms are proposed. The experimental results show that the redundant pattern filtering functions are effective and scalable. Overall, this work builds a bridge between indoor positioning and advanced data mining technologies, and provides a feasible way to study customers’ shopping behaviors via multi-source RFID data. PMID:25751076
ERIC Educational Resources Information Center
Bowers, Alex J.; Chen, Jingjing
2015-01-01
The purpose of this study is to bring together recent innovations in the research literature around school district capital facility finance, municipal bond elections, statistical models of conditional time-varying outcomes, and data mining algorithms for automated text mining of election ballot proposals to examine the factors that influence the…
New directions in biomedical text annotation: definitions, guidelines and corpus construction
Wilbur, W John; Rzhetsky, Andrey; Shatkay, Hagit
2006-01-01
Background While biomedical text mining is emerging as an important research area, practical results have proven difficult to achieve. We believe that an important first step towards more accurate text-mining lies in the ability to identify and characterize text that satisfies various types of information needs. We report here the results of our inquiry into properties of scientific text that have sufficient generality to transcend the confines of a narrow subject area, while supporting practical mining of text for factual information. Our ultimate goal is to annotate a significant corpus of biomedical text and train machine learning methods to automatically categorize such text along certain dimensions that we have defined. Results We have identified five qualitative dimensions that we believe characterize a broad range of scientific sentences, and are therefore useful for supporting a general approach to text-mining: focus, polarity, certainty, evidence, and directionality. We define these dimensions and describe the guidelines we have developed for annotating text with regard to them. To examine the effectiveness of the guidelines, twelve annotators independently annotated the same set of 101 sentences that were randomly selected from current biomedical periodicals. Analysis of these annotations shows 70–80% inter-annotator agreement, suggesting that our guidelines indeed present a well-defined, executable and reproducible task. Conclusion We present our guidelines defining a text annotation task, along with annotation results from multiple independently produced annotations, demonstrating the feasibility of the task. The annotation of a very large corpus of documents along these guidelines is currently ongoing. These annotations form the basis for the categorization of text along multiple dimensions, to support viable text mining for experimental results, methodology statements, and other forms of information. We are currently developing machine learning methods, to be trained and tested on the annotated corpus, that would allow for the automatic categorization of biomedical text along the general dimensions that we have presented. The guidelines in full detail, along with annotated examples, are publicly available. PMID:16867190
Rare itemsets mining algorithm based on RP-Tree and spark framework
NASA Astrophysics Data System (ADS)
Liu, Sainan; Pan, Haoan
2018-05-01
For the issues of the rare itemsets mining in big data, this paper proposed a rare itemsets mining algorithm based on RP-Tree and Spark framework. Firstly, it arranged the data vertically according to the transaction identifier, in order to solve the defects of scan the entire data set, the vertical datasets are divided into frequent vertical datasets and rare vertical datasets. Then, it adopted the RP-Tree algorithm to construct the frequent pattern tree that contains rare items and generate rare 1-itemsets. After that, it calculated the support of the itemsets by scanning the two vertical data sets, finally, it used the iterative process to generate rare itemsets. The experimental show that the algorithm can effectively excavate rare itemsets and have great superiority in execution time.
Kernel Methods for Mining Instance Data in Ontologies
NASA Astrophysics Data System (ADS)
Bloehdorn, Stephan; Sure, York
The amount of ontologies and meta data available on the Web is constantly growing. The successful application of machine learning techniques for learning of ontologies from textual data, i.e. mining for the Semantic Web, contributes to this trend. However, no principal approaches exist so far for mining from the Semantic Web. We investigate how machine learning algorithms can be made amenable for directly taking advantage of the rich knowledge expressed in ontologies and associated instance data. Kernel methods have been successfully employed in various learning tasks and provide a clean framework for interfacing between non-vectorial data and machine learning algorithms. In this spirit, we express the problem of mining instances in ontologies as the problem of defining valid corresponding kernels. We present a principled framework for designing such kernels by means of decomposing the kernel computation into specialized kernels for selected characteristics of an ontology which can be flexibly assembled and tuned. Initial experiments on real world Semantic Web data enjoy promising results and show the usefulness of our approach.
Web mining in soft computing framework: relevance, state of the art and future directions.
Pal, S K; Talwar, V; Mitra, P
2002-01-01
The paper summarizes the different characteristics of Web data, the basic components of Web mining and its different types, and the current state of the art. The reason for considering Web mining, a separate field from data mining, is explained. The limitations of some of the existing Web mining methods and tools are enunciated, and the significance of soft computing (comprising fuzzy logic (FL), artificial neural networks (ANNs), genetic algorithms (GAs), and rough sets (RSs) are highlighted. A survey of the existing literature on "soft Web mining" is provided along with the commercially available systems. The prospective areas of Web mining where the application of soft computing needs immediate attention are outlined with justification. Scope for future research in developing "soft Web mining" systems is explained. An extensive bibliography is also provided.
NASA Astrophysics Data System (ADS)
Tirupattur, Naveen; Lapish, Christopher C.; Mukhopadhyay, Snehasis
2011-06-01
Text mining, sometimes alternately referred to as text analytics, refers to the process of extracting high-quality knowledge from the analysis of textual data. Text mining has wide variety of applications in areas such as biomedical science, news analysis, and homeland security. In this paper, we describe an approach and some relatively small-scale experiments which apply text mining to neuroscience research literature to find novel associations among a diverse set of entities. Neuroscience is a discipline which encompasses an exceptionally wide range of experimental approaches and rapidly growing interest. This combination results in an overwhelmingly large and often diffuse literature which makes a comprehensive synthesis difficult. Understanding the relations or associations among the entities appearing in the literature not only improves the researchers current understanding of recent advances in their field, but also provides an important computational tool to formulate novel hypotheses and thereby assist in scientific discoveries. We describe a methodology to automatically mine the literature and form novel associations through direct analysis of published texts. The method first retrieves a set of documents from databases such as PubMed using a set of relevant domain terms. In the current study these terms yielded a set of documents ranging from 160,909 to 367,214 documents. Each document is then represented in a numerical vector form from which an Association Graph is computed which represents relationships between all pairs of domain terms, based on co-occurrence. Association graphs can then be subjected to various graph theoretic algorithms such as transitive closure and cycle (circuit) detection to derive additional information, and can also be visually presented to a human researcher for understanding. In this paper, we present three relatively small-scale problem-specific case studies to demonstrate that such an approach is very successful in replicating a neuroscience expert's mental model of object-object associations entirely by means of text mining. These preliminary results provide the confidence that this type of text mining based research approach provides an extremely powerful tool to better understand the literature and drive novel discovery for the neuroscience community.
Mining algorithm for association rules in big data based on Hadoop
NASA Astrophysics Data System (ADS)
Fu, Chunhua; Wang, Xiaojing; Zhang, Lijun; Qiao, Liying
2018-04-01
In order to solve the problem that the traditional association rules mining algorithm has been unable to meet the mining needs of large amount of data in the aspect of efficiency and scalability, take FP-Growth as an example, the algorithm is realized in the parallelization based on Hadoop framework and Map Reduce model. On the basis, it is improved using the transaction reduce method for further enhancement of the algorithm's mining efficiency. The experiment, which consists of verification of parallel mining results, comparison on efficiency between serials and parallel, variable relationship between mining time and node number and between mining time and data amount, is carried out in the mining results and efficiency by Hadoop clustering. Experiments show that the paralleled FP-Growth algorithm implemented is able to accurately mine frequent item sets, with a better performance and scalability. It can be better to meet the requirements of big data mining and efficiently mine frequent item sets and association rules from large dataset.
Industrial application of semantic process mining
NASA Astrophysics Data System (ADS)
Espen Ingvaldsen, Jon; Atle Gulla, Jon
2012-05-01
Process mining relates to the extraction of non-trivial and useful information from information system event logs. It is a new research discipline that has evolved significantly since the early work on idealistic process logs. Over the last years, process mining prototypes have incorporated elements from semantics and data mining and targeted visualisation techniques that are more user-friendly to business experts and process owners. In this article, we present a framework for evaluating different aspects of enterprise process flows and address practical challenges of state-of-the-art industrial process mining. We also explore the inherent strengths of the technology for more efficient process optimisation.
Using Open Web APIs in Teaching Web Mining
ERIC Educational Resources Information Center
Chen, Hsinchun; Li, Xin; Chau, M.; Ho, Yi-Jen; Tseng, Chunju
2009-01-01
With the advent of the World Wide Web, many business applications that utilize data mining and text mining techniques to extract useful business information on the Web have evolved from Web searching to Web mining. It is important for students to acquire knowledge and hands-on experience in Web mining during their education in information systems…
Sampling and monitoring for closure
McLemore, Virginia T.; Smith, Kathleen S.; Russell, Carol C.
2007-01-01
An important aspect of planning a new mine or mine expansion within the modern regulatory framework is to design for ultimate closure. Sampling and monitoring for closure is a form of environmental risk management. By implementing a sampling and monitoring program early in the life of the mining operation, major costs can be avoided or minimized. The costs for treating mine drainage in perpetuity are staggering, especially if they are unanticipated. The Metal Mining Sector of the Acid Drainage Technology Initiative (ADTI-MMS), a cooperative government-industry-academia organization, was established to address drainage-quality technologies of metal mining and metallurgical operations. ADTI-MMS recommends that sampling and monitoring programs consider the entire mine-life cycle and that data needed for closure of an operation be collected from exploration through postclosure.
Zhao, Ning; Zheng, Guang; Li, Jian; Zhao, Hong-Yan; Lu, Cheng; Jiang, Miao; Zhang, Chi; Guo, Hong-Tao; Lu, Ai-Ping
2018-01-09
To identify the commonalities between rheumatoid arthritis (RA) and diabetes mellitus (DM) to understand the mechanisms of Chinese medicine (CM) in different diseases with the same treatment. A text mining approach was adopted to analyze the commonalities between RA and DM according to CM and biological elements. The major commonalities were subsequently verifified in RA and DM rat models, in which herbal formula for the treatment of both RA and DM identifified via text mining was used as the intervention. Similarities were identifified between RA and DM regarding the CM approach used for diagnosis and treatment, as well as the networks of biological activities affected by each disease, including the involvement of adhesion molecules, oxidative stress, cytokines, T-lymphocytes, apoptosis, and inflfl ammation. The Ramulus Cinnamomi-Radix Paeoniae Alba-Rhizoma Anemarrhenae is an herbal combination used to treat RA and DM. This formula demonstrated similar effects on oxidative stress and inflfl ammation in rats with collagen-induced arthritis, which supports the text mining results regarding the commonalities between RA and DM. Commonalities between the biological activities involved in RA and DM were identifified through text mining, and both RA and DM might be responsive to the same intervention at a specifific stage.
NASA Astrophysics Data System (ADS)
Shahiri, Amirah Mohamed; Husain, Wahidah; Rashid, Nur'Aini Abd
2017-10-01
Huge amounts of data in educational datasets may cause the problem in producing quality data. Recently, data mining approach are increasingly used by educational data mining researchers for analyzing the data patterns. However, many research studies have concentrated on selecting suitable learning algorithms instead of performing feature selection process. As a result, these data has problem with computational complexity and spend longer computational time for classification. The main objective of this research is to provide an overview of feature selection techniques that have been used to analyze the most significant features. Then, this research will propose a framework to improve the quality of students' dataset. The proposed framework uses filter and wrapper based technique to support prediction process in future study.
Modular Algorithm Testbed Suite (MATS): A Software Framework for Automatic Target Recognition
2017-01-01
004 OFFICE OF NAVAL RESEARCH ATTN JASON STACK MINE WARFARE & OCEAN ENGINEERING PROGRAMS CODE 32, SUITE 1092 875 N RANDOLPH ST ARLINGTON VA 22203 ONR...naval mine countermeasures (MCM) operations by automating a large portion of the data analysis. Successful long-term implementation of ATR requires a...Modular Algorithm Testbed Suite; MATS; Mine Countermeasures Operations U U U SAR 24 Derek R. Kolacinski (850) 230-7218 THIS PAGE INTENTIONALLY LEFT
A. Dennis Lemly
2001-01-01
Beginning in 1996, selenium associated with phosphate mining on Caribou National Forest (CNF) was implicated as the cause of death to horses and sheep grazing on private land adjacent to the national forest. In response to these concerns, the Forest Service began a monitoring study to determine selenium concentrations in and around the mine sites. By 1998, the study...
Detecting causality from online psychiatric texts using inter-sentential language patterns
2012-01-01
Background Online psychiatric texts are natural language texts expressing depressive problems, published by Internet users via community-based web services such as web forums, message boards and blogs. Understanding the cause-effect relations embedded in these psychiatric texts can provide insight into the authors’ problems, thus increasing the effectiveness of online psychiatric services. Methods Previous studies have proposed the use of word pairs extracted from a set of sentence pairs to identify cause-effect relations between sentences. A word pair is made up of two words, with one coming from the cause text span and the other from the effect text span. Analysis of the relationship between these words can be used to capture individual word associations between cause and effect sentences. For instance, (broke up, life) and (boyfriend, meaningless) are two word pairs extracted from the sentence pair: “I broke up with my boyfriend. Life is now meaningless to me”. The major limitation of word pairs is that individual words in sentences usually cannot reflect the exact meaning of the cause and effect events, and thus may produce semantically incomplete word pairs, as the previous examples show. Therefore, this study proposes the use of inter-sentential language patterns such as ≪broke up, boyfriend>,
Text Mining of Journal Articles for Sleep Disorder Terminologies.
Lam, Calvin; Lai, Fu-Chih; Wang, Chia-Hui; Lai, Mei-Hsin; Hsu, Nanly; Chung, Min-Huey
2016-01-01
Research on publication trends in journal articles on sleep disorders (SDs) and the associated methodologies by using text mining has been limited. The present study involved text mining for terms to determine the publication trends in sleep-related journal articles published during 2000-2013 and to identify associations between SD and methodology terms as well as conducting statistical analyses of the text mining findings. SD and methodology terms were extracted from 3,720 sleep-related journal articles in the PubMed database by using MetaMap. The extracted data set was analyzed using hierarchical cluster analyses and adjusted logistic regression models to investigate publication trends and associations between SD and methodology terms. MetaMap had a text mining precision, recall, and false positive rate of 0.70, 0.77, and 11.51%, respectively. The most common SD term was breathing-related sleep disorder, whereas narcolepsy was the least common. Cluster analyses showed similar methodology clusters for each SD term, except narcolepsy. The logistic regression models showed an increasing prevalence of insomnia, parasomnia, and other sleep disorders but a decreasing prevalence of breathing-related sleep disorder during 2000-2013. Different SD terms were positively associated with different methodology terms regarding research design terms, measure terms, and analysis terms. Insomnia-, parasomnia-, and other sleep disorder-related articles showed an increasing publication trend, whereas those related to breathing-related sleep disorder showed a decreasing trend. Furthermore, experimental studies more commonly focused on hypersomnia and other SDs and less commonly on insomnia, breathing-related sleep disorder, narcolepsy, and parasomnia. Thus, text mining may facilitate the exploration of the publication trends in SDs and the associated methodologies.
Text Mining to Support Gene Ontology Curation and Vice Versa.
Ruch, Patrick
2017-01-01
In this chapter, we explain how text mining can support the curation of molecular biology databases dealing with protein functions. We also show how curated data can play a disruptive role in the developments of text mining methods. We review a decade of efforts to improve the automatic assignment of Gene Ontology (GO) descriptors, the reference ontology for the characterization of genes and gene products. To illustrate the high potential of this approach, we compare the performances of an automatic text categorizer and show a large improvement of +225 % in both precision and recall on benchmarked data. We argue that automatic text categorization functions can ultimately be embedded into a Question-Answering (QA) system to answer questions related to protein functions. Because GO descriptors can be relatively long and specific, traditional QA systems cannot answer such questions. A new type of QA system, so-called Deep QA which uses machine learning methods trained with curated contents, is thus emerging. Finally, future advances of text mining instruments are directly dependent on the availability of high-quality annotated contents at every curation step. Databases workflows must start recording explicitly all the data they curate and ideally also some of the data they do not curate.
Hahn, P; Dullweber, F; Unglaub, F; Spies, C K
2014-06-01
Searching for relevant publications is becoming more difficult with the increasing number of scientific articles. Text mining as a specific form of computer-based data analysis may be helpful in this context. Highlighting relations between authors and finding relevant publications concerning a specific subject using text analysis programs are illustrated graphically by 2 performed examples. © Georg Thieme Verlag KG Stuttgart · New York.
Mining of Business-Oriented Conversations at a Call Center
NASA Astrophysics Data System (ADS)
Takeuchi, Hironori; Nasukawa, Tetsuya; Watanabe, Hideo
Recently it has become feasible to transcribe textual records from telephone conversations at call centers by using automatic speech recognition. In this research, we extended a text mining system for call summary records and constructed a conversation mining system for the business-oriented conversations at the call center. To acquire useful business insights from the conversational data through the text mining system, it is critical to identify appropriate textual segments and expressions as the viewpoints to focus on. In the analysis of call summary data using a text mining system, some experts defined the viewpoints for the analysis by looking at some sample records and by preparing the dictionaries based on frequent keywords in the sample dataset. However with conversations it is difficult to identify such viewpoints manually and in advance because the target data consists of complete transcripts that are often lengthy and redundant. In this research, we defined a model of the business-oriented conversations and proposed a mining method to identify segments that have impacts on the outcomes of the conversations and can then extract useful expressions in each of these identified segments. In the experiment, we processed the real datasets from a car rental service center and constructed a mining system. With this system, we show the effectiveness of the method based on the defined conversation model.
Educational Data Mining Application for Estimating Students Performance in Weka Environment
NASA Astrophysics Data System (ADS)
Gowri, G. Shiyamala; Thulasiram, Ramasamy; Amit Baburao, Mahindra
2017-11-01
Educational data mining (EDM) is a multi-disciplinary research area that examines artificial intelligence, statistical modeling and data mining with the data generated from an educational institution. EDM utilizes computational ways to deal with explicate educational information keeping in mind the end goal to examine educational inquiries. To make a country stand unique among the other nations of the world, the education system has to undergo a major transition by redesigning its framework. The concealed patterns and data from various information repositories can be extracted by adopting the techniques of data mining. In order to summarize the performance of students with their credentials, we scrutinize the exploitation of data mining in the field of academics. Apriori algorithmic procedure is extensively applied to the database of students for a wider classification based on various categorizes. K-means procedure is applied to the same set of databases in order to accumulate them into a specific category. Apriori algorithm deals with mining the rules in order to extract patterns that are similar along with their associations in relation to various set of records. The records can be extracted from academic information repositories. The parameters used in this study gives more importance to psychological traits than academic features. The undesirable student conduct can be clearly witnessed if we make use of information mining frameworks. Thus, the algorithms efficiently prove to profile the students in any educational environment. The ultimate objective of the study is to suspect if a student is prone to violence or not.
Recent progress in automatically extracting information from the pharmacogenomic literature
Garten, Yael; Coulet, Adrien; Altman, Russ B
2011-01-01
The biomedical literature holds our understanding of pharmacogenomics, but it is dispersed across many journals. In order to integrate our knowledge, connect important facts across publications and generate new hypotheses we must organize and encode the contents of the literature. By creating databases of structured pharmocogenomic knowledge, we can make the value of the literature much greater than the sum of the individual reports. We can, for example, generate candidate gene lists or interpret surprising hits in genome-wide association studies. Text mining automatically adds structure to the unstructured knowledge embedded in millions of publications, and recent years have seen a surge in work on biomedical text mining, some specific to pharmacogenomics literature. These methods enable extraction of specific types of information and can also provide answers to general, systemic queries. In this article, we describe the main tasks of text mining in the context of pharmacogenomics, summarize recent applications and anticipate the next phase of text mining applications. PMID:21047206
Mining protein function from text using term-based support vector machines
Rice, Simon B; Nenadic, Goran; Stapley, Benjamin J
2005-01-01
Background Text mining has spurred huge interest in the domain of biology. The goal of the BioCreAtIvE exercise was to evaluate the performance of current text mining systems. We participated in Task 2, which addressed assigning Gene Ontology terms to human proteins and selecting relevant evidence from full-text documents. We approached it as a modified form of the document classification task. We used a supervised machine-learning approach (based on support vector machines) to assign protein function and select passages that support the assignments. As classification features, we used a protein's co-occurring terms that were automatically extracted from documents. Results The results evaluated by curators were modest, and quite variable for different problems: in many cases we have relatively good assignment of GO terms to proteins, but the selected supporting text was typically non-relevant (precision spanning from 3% to 50%). The method appears to work best when a substantial set of relevant documents is obtained, while it works poorly on single documents and/or short passages. The initial results suggest that our approach can also mine annotations from text even when an explicit statement relating a protein to a GO term is absent. Conclusion A machine learning approach to mining protein function predictions from text can yield good performance only if sufficient training data is available, and significant amount of supporting data is used for prediction. The most promising results are for combined document retrieval and GO term assignment, which calls for the integration of methods developed in BioCreAtIvE Task 1 and Task 2. PMID:15960835
O'Mara-Eves, Alison; Thomas, James; McNaught, John; Miwa, Makoto; Ananiadou, Sophia
2015-01-14
The large and growing number of published studies, and their increasing rate of publication, makes the task of identifying relevant studies in an unbiased way for inclusion in systematic reviews both complex and time consuming. Text mining has been offered as a potential solution: through automating some of the screening process, reviewer time can be saved. The evidence base around the use of text mining for screening has not yet been pulled together systematically; this systematic review fills that research gap. Focusing mainly on non-technical issues, the review aims to increase awareness of the potential of these technologies and promote further collaborative research between the computer science and systematic review communities. Five research questions led our review: what is the state of the evidence base; how has workload reduction been evaluated; what are the purposes of semi-automation and how effective are they; how have key contextual problems of applying text mining to the systematic review field been addressed; and what challenges to implementation have emerged? We answered these questions using standard systematic review methods: systematic and exhaustive searching, quality-assured data extraction and a narrative synthesis to synthesise findings. The evidence base is active and diverse; there is almost no replication between studies or collaboration between research teams and, whilst it is difficult to establish any overall conclusions about best approaches, it is clear that efficiencies and reductions in workload are potentially achievable. On the whole, most suggested that a saving in workload of between 30% and 70% might be possible, though sometimes the saving in workload is accompanied by the loss of 5% of relevant studies (i.e. a 95% recall). Using text mining to prioritise the order in which items are screened should be considered safe and ready for use in 'live' reviews. The use of text mining as a 'second screener' may also be used cautiously. The use of text mining to eliminate studies automatically should be considered promising, but not yet fully proven. In highly technical/clinical areas, it may be used with a high degree of confidence; but more developmental and evaluative work is needed in other disciplines.
Relation extraction for biological pathway construction using node2vec.
Kim, Munui; Baek, Seung Han; Song, Min
2018-06-13
Systems biology is an important field for understanding whole biological mechanisms composed of interactions between biological components. One approach for understanding complex and diverse mechanisms is to analyze biological pathways. However, because these pathways consist of important interactions and information on these interactions is disseminated in a large number of biomedical reports, text-mining techniques are essential for extracting these relationships automatically. In this study, we applied node2vec, an algorithmic framework for feature learning in networks, for relationship extraction. To this end, we extracted genes from paper abstracts using pkde4j, a text-mining tool for detecting entities and relationships. Using the extracted genes, a co-occurrence network was constructed and node2vec was used with the network to generate a latent representation. To demonstrate the efficacy of node2vec in extracting relationships between genes, performance was evaluated for gene-gene interactions involved in a type 2 diabetes pathway. Moreover, we compared the results of node2vec to those of baseline methods such as co-occurrence and DeepWalk. Node2vec outperformed existing methods in detecting relationships in the type 2 diabetes pathway, demonstrating that this method is appropriate for capturing the relatedness between pairs of biological entities involved in biological pathways. The results demonstrated that node2vec is useful for automatic pathway construction.
Conceptual biology, hypothesis discovery, and text mining: Swanson's legacy.
Bekhuis, Tanja
2006-04-03
Innovative biomedical librarians and information specialists who want to expand their roles as expert searchers need to know about profound changes in biology and parallel trends in text mining. In recent years, conceptual biology has emerged as a complement to empirical biology. This is partly in response to the availability of massive digital resources such as the network of databases for molecular biologists at the National Center for Biotechnology Information. Developments in text mining and hypothesis discovery systems based on the early work of Swanson, a mathematician and information scientist, are coincident with the emergence of conceptual biology. Very little has been written to introduce biomedical digital librarians to these new trends. In this paper, background for data and text mining, as well as for knowledge discovery in databases (KDD) and in text (KDT) is presented, then a brief review of Swanson's ideas, followed by a discussion of recent approaches to hypothesis discovery and testing. 'Testing' in the context of text mining involves partially automated methods for finding evidence in the literature to support hypothetical relationships. Concluding remarks follow regarding (a) the limits of current strategies for evaluation of hypothesis discovery systems and (b) the role of literature-based discovery in concert with empirical research. Report of an informatics-driven literature review for biomarkers of systemic lupus erythematosus is mentioned. Swanson's vision of the hidden value in the literature of science and, by extension, in biomedical digital databases, is still remarkably generative for information scientists, biologists, and physicians.
Environmental considerations related to mining of nonfuel minerals
Seal, Robert R.; Piatak, Nadine M.; Kimball, Bryn E.; Hammarstrom, Jane M.; Schulz, Klaus J.; DeYoung,, John H.; Seal, Robert R.; Bradley, Dwight C.
2017-12-19
Throughout most of human history, environmental stewardship during mining has not been a priority partly because of the lack of applicable laws and regulations and partly because of ignorance about the effects that mining can have on the environment. In the United States, the National Environmental Policy Act of 1969, in conjunction with related laws, codified a more modern approach to mining, including the responsibility for environmental stewardship, and provided a framework for incorporating environmental protection into mine planning. Today, similar frameworks are in place in the other developed countries of the world, and international mining companies generally follow similar procedures wherever they work in the world. The regulatory guidance has fostered an international effort among all stakeholders to identify best practices for environmental stewardship.The modern approach to mining using best practices involves the following: (a) establishment of a pre-mining baseline from which to monitor environmental effects during mining and help establish geologically reasonable closure goals; (b) identification of environmental risks related to mining through standardized approaches; and (c) formulation of an environmental closure plan before the start of mining. A key aspect of identifying the environmental risks and mitigating those risks is understanding how the risks vary from one deposit type to another—a concept that forms the basis for geoenvironmental mineral-deposit models.Accompanying the quest for best practices is the goal of making mining sustainable into the future. Sustainable mine development is generally considered to be development that meets the needs of the present generation without compromising the ability of future generations to meet their own needs. The concept extends beyond the availability of nonrenewable mineral commodities and includes the environmental and social effects of mine development.Global population growth, meanwhile, has decreased the percentage of inhabitable land available to support society’s material needs. Presently, the land area available to supply the mineral resources, energy resources, water, food, shelter, and waste disposal needs of all Earth’s inhabitants is estimated to be 135 square meters per person. Continued global population growth will only increase the challenges of sustainable mining.Current trends in mining are also expected to lead to new environmental challenges in the future, among which are mine-waste management issues related to mining larger deposits for lower ore grade; water-management issues related to both the mining of larger deposits and the changes in precipitation brought about by climate change; and greenhouse gas issues related to reducing the carbon footprint of larger, more energy-intensive mining operations.
Recent Advances and Emerging Applications in Text and Data Mining for Biomedical Discovery.
Gonzalez, Graciela H; Tahsin, Tasnia; Goodale, Britton C; Greene, Anna C; Greene, Casey S
2016-01-01
Precision medicine will revolutionize the way we treat and prevent disease. A major barrier to the implementation of precision medicine that clinicians and translational scientists face is understanding the underlying mechanisms of disease. We are starting to address this challenge through automatic approaches for information extraction, representation and analysis. Recent advances in text and data mining have been applied to a broad spectrum of key biomedical questions in genomics, pharmacogenomics and other fields. We present an overview of the fundamental methods for text and data mining, as well as recent advances and emerging applications toward precision medicine. © The Author 2015. Published by Oxford University Press.
Recent Advances and Emerging Applications in Text and Data Mining for Biomedical Discovery
Gonzalez, Graciela H.; Tahsin, Tasnia; Goodale, Britton C.; Greene, Anna C.
2016-01-01
Precision medicine will revolutionize the way we treat and prevent disease. A major barrier to the implementation of precision medicine that clinicians and translational scientists face is understanding the underlying mechanisms of disease. We are starting to address this challenge through automatic approaches for information extraction, representation and analysis. Recent advances in text and data mining have been applied to a broad spectrum of key biomedical questions in genomics, pharmacogenomics and other fields. We present an overview of the fundamental methods for text and data mining, as well as recent advances and emerging applications toward precision medicine. PMID:26420781
Data Mining and Complex Problems: Case Study in Composite Materials
NASA Technical Reports Server (NTRS)
Rabelo, Luis; Marin, Mario
2009-01-01
Data mining is defined as the discovery of useful, possibly unexpected, patterns and relationships in data using statistical and non-statistical techniques in order to develop schemes for decision and policy making. Data mining can be used to discover the sources and causes of problems in complex systems. In addition, data mining can support simulation strategies by finding the different constants and parameters to be used in the development of simulation models. This paper introduces a framework for data mining and its application to complex problems. To further explain some of the concepts outlined in this paper, the potential application to the NASA Shuttle Reinforced Carbon-Carbon structures and genetic programming is used as an illustration.
Application of text mining for customer evaluations in commercial banking
NASA Astrophysics Data System (ADS)
Tan, Jing; Du, Xiaojiang; Hao, Pengpeng; Wang, Yanbo J.
2015-07-01
Nowadays customer attrition is increasingly serious in commercial banks. To combat this problem roundly, mining customer evaluation texts is as important as mining customer structured data. In order to extract hidden information from customer evaluations, Textual Feature Selection, Classification and Association Rule Mining are necessary techniques. This paper presents all three techniques by using Chinese Word Segmentation, C5.0 and Apriori, and a set of experiments were run based on a collection of real textual data that includes 823 customer evaluations taken from a Chinese commercial bank. Results, consequent solutions, some advice for the commercial bank are given in this paper.
Extracting semantically enriched events from biomedical literature
2012-01-01
Background Research into event-based text mining from the biomedical literature has been growing in popularity to facilitate the development of advanced biomedical text mining systems. Such technology permits advanced search, which goes beyond document or sentence-based retrieval. However, existing event-based systems typically ignore additional information within the textual context of events that can determine, amongst other things, whether an event represents a fact, hypothesis, experimental result or analysis of results, whether it describes new or previously reported knowledge, and whether it is speculated or negated. We refer to such contextual information as meta-knowledge. The automatic recognition of such information can permit the training of systems allowing finer-grained searching of events according to the meta-knowledge that is associated with them. Results Based on a corpus of 1,000 MEDLINE abstracts, fully manually annotated with both events and associated meta-knowledge, we have constructed a machine learning-based system that automatically assigns meta-knowledge information to events. This system has been integrated into EventMine, a state-of-the-art event extraction system, in order to create a more advanced system (EventMine-MK) that not only extracts events from text automatically, but also assigns five different types of meta-knowledge to these events. The meta-knowledge assignment module of EventMine-MK performs with macro-averaged F-scores in the range of 57-87% on the BioNLP’09 Shared Task corpus. EventMine-MK has been evaluated on the BioNLP’09 Shared Task subtask of detecting negated and speculated events. Our results show that EventMine-MK can outperform other state-of-the-art systems that participated in this task. Conclusions We have constructed the first practical system that extracts both events and associated, detailed meta-knowledge information from biomedical literature. The automatically assigned meta-knowledge information can be used to refine search systems, in order to provide an extra search layer beyond entities and assertions, dealing with phenomena such as rhetorical intent, speculations, contradictions and negations. This finer grained search functionality can assist in several important tasks, e.g., database curation (by locating new experimental knowledge) and pathway enrichment (by providing information for inference). To allow easy integration into text mining systems, EventMine-MK is provided as a UIMA component that can be used in the interoperable text mining infrastructure, U-Compare. PMID:22621266
Extracting semantically enriched events from biomedical literature.
Miwa, Makoto; Thompson, Paul; McNaught, John; Kell, Douglas B; Ananiadou, Sophia
2012-05-23
Research into event-based text mining from the biomedical literature has been growing in popularity to facilitate the development of advanced biomedical text mining systems. Such technology permits advanced search, which goes beyond document or sentence-based retrieval. However, existing event-based systems typically ignore additional information within the textual context of events that can determine, amongst other things, whether an event represents a fact, hypothesis, experimental result or analysis of results, whether it describes new or previously reported knowledge, and whether it is speculated or negated. We refer to such contextual information as meta-knowledge. The automatic recognition of such information can permit the training of systems allowing finer-grained searching of events according to the meta-knowledge that is associated with them. Based on a corpus of 1,000 MEDLINE abstracts, fully manually annotated with both events and associated meta-knowledge, we have constructed a machine learning-based system that automatically assigns meta-knowledge information to events. This system has been integrated into EventMine, a state-of-the-art event extraction system, in order to create a more advanced system (EventMine-MK) that not only extracts events from text automatically, but also assigns five different types of meta-knowledge to these events. The meta-knowledge assignment module of EventMine-MK performs with macro-averaged F-scores in the range of 57-87% on the BioNLP'09 Shared Task corpus. EventMine-MK has been evaluated on the BioNLP'09 Shared Task subtask of detecting negated and speculated events. Our results show that EventMine-MK can outperform other state-of-the-art systems that participated in this task. We have constructed the first practical system that extracts both events and associated, detailed meta-knowledge information from biomedical literature. The automatically assigned meta-knowledge information can be used to refine search systems, in order to provide an extra search layer beyond entities and assertions, dealing with phenomena such as rhetorical intent, speculations, contradictions and negations. This finer grained search functionality can assist in several important tasks, e.g., database curation (by locating new experimental knowledge) and pathway enrichment (by providing information for inference). To allow easy integration into text mining systems, EventMine-MK is provided as a UIMA component that can be used in the interoperable text mining infrastructure, U-Compare.
Individual Profiling Using Text Analysis
2016-04-15
Mining a Text for Errors. . . . on Knowledge discovery in data mining , pages 624–628, 2005. [12] Michal Kosinski, David Stillwell, and Thore Graepel...AFRL-AFOSR-UK-TR-2016-0011 Individual Profiling using Text Analysis 140333 Mark Stevenson UNIVERSITY OF SHEFFIELD, DEPARTMENT OF PSYCHOLOGY Final...REPORT TYPE Final 3. DATES COVERED (From - To) 15 Sep 2014 to 14 Sep 2015 4. TITLE AND SUBTITLE Individual Profiling using Text Analysis
Mining the pharmacogenomics literature—a survey of the state of the art
Cohen, K. Bretonnel; Garten, Yael; Shah, Nigam H.
2012-01-01
This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research. PMID:22833496
Mining the pharmacogenomics literature--a survey of the state of the art.
Hahn, Udo; Cohen, K Bretonnel; Garten, Yael; Shah, Nigam H
2012-07-01
This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research.
Using Text Mining to Characterize Online Discussion Facilitation
ERIC Educational Resources Information Center
Ming, Norma; Baumer, Eric
2011-01-01
Facilitating class discussions effectively is a critical yet challenging component of instruction, particularly in online environments where student and faculty interaction is limited. Our goals in this research were to identify facilitation strategies that encourage productive discussion, and to explore text mining techniques that can help…
40 CFR 372.23 - SIC and NAICS codes to which this Part applies.
Code of Federal Regulations, 2010 CFR
2010-07-01
... facilities primarily engaged in reproducing text, drawings, plans, maps, or other copy, by blueprinting...)); 212324Kaolin and Ball Clay Mining Limited to facilities operating without a mine or quarry and that are...)); 212393Other Chemical and Fertilizer Mineral Mining Limited to facilities operating without a mine or quarry...
pubmed.mineR: an R package with text-mining algorithms to analyse PubMed abstracts.
Rani, Jyoti; Shah, A B Rauf; Ramachandran, Srinivasan
2015-10-01
The PubMed literature database is a valuable source of information for scientific research. It is rich in biomedical literature with more than 24 million citations. Data-mining of voluminous literature is a challenging task. Although several text-mining algorithms have been developed in recent years with focus on data visualization, they have limitations such as speed, are rigid and are not available in the open source. We have developed an R package, pubmed.mineR, wherein we have combined the advantages of existing algorithms, overcome their limitations, and offer user flexibility and link with other packages in Bioconductor and the Comprehensive R Network (CRAN) in order to expand the user capabilities for executing multifaceted approaches. Three case studies are presented, namely, 'Evolving role of diabetes educators', 'Cancer risk assessment' and 'Dynamic concepts on disease and comorbidity' to illustrate the use of pubmed.mineR. The package generally runs fast with small elapsed times in regular workstations even on large corpus sizes and with compute intensive functions. The pubmed.mineR is available at http://cran.rproject. org/web/packages/pubmed.mineR.
Chen, Chou-Cheng; Ho, Chung-Liang
2014-01-01
While a huge amount of information about biological literature can be obtained by searching the PubMed database, reading through all the titles and abstracts resulting from such a search for useful information is inefficient. Text mining makes it possible to increase this efficiency. Some websites use text mining to gather information from the PubMed database; however, they are database-oriented, using pre-defined search keywords while lacking a query interface for user-defined search inputs. We present the PubMed Abstract Reading Helper (PubstractHelper) website which combines text mining and reading assistance for an efficient PubMed search. PubstractHelper can accept a maximum of ten groups of keywords, within each group containing up to ten keywords. The principle behind the text-mining function of PubstractHelper is that keywords contained in the same sentence are likely to be related. PubstractHelper highlights sentences with co-occurring keywords in different colors. The user can download the PMID and the abstracts with color markings to be reviewed later. The PubstractHelper website can help users to identify relevant publications based on the presence of related keywords, which should be a handy tool for their research. http://bio.yungyun.com.tw/ATM/PubstractHelper.aspx and http://holab.med.ncku.edu.tw/ATM/PubstractHelper.aspx.
Kovalchuk, Sergey V; Funkner, Anastasia A; Metsker, Oleg G; Yakovlev, Aleksey N
2018-06-01
An approach to building a hybrid simulation of patient flow is introduced with a combination of data-driven methods for automation of model identification. The approach is described with a conceptual framework and basic methods for combination of different techniques. The implementation of the proposed approach for simulation of the acute coronary syndrome (ACS) was developed and used in an experimental study. A combination of data, text, process mining techniques, and machine learning approaches for the analysis of electronic health records (EHRs) with discrete-event simulation (DES) and queueing theory for the simulation of patient flow was proposed. The performed analysis of EHRs for ACS patients enabled identification of several classes of clinical pathways (CPs) which were used to implement a more realistic simulation of the patient flow. The developed solution was implemented using Python libraries (SimPy, SciPy, and others). The proposed approach enables more a realistic and detailed simulation of the patient flow within a group of related departments. An experimental study shows an improved simulation of patient length of stay for ACS patient flow obtained from EHRs in Almazov National Medical Research Centre in Saint Petersburg, Russia. The proposed approach, methods, and solutions provide a conceptual, methodological, and programming framework for the implementation of a simulation of complex and diverse scenarios within a flow of patients for different purposes: decision making, training, management optimization, and others. Copyright © 2018 Elsevier Inc. All rights reserved.
Text Classification for Organizational Researchers
Kobayashi, Vladimer B.; Mol, Stefan T.; Berkers, Hannah A.; Kismihók, Gábor; Den Hartog, Deanne N.
2017-01-01
Organizations are increasingly interested in classifying texts or parts thereof into categories, as this enables more effective use of their information. Manual procedures for text classification work well for up to a few hundred documents. However, when the number of documents is larger, manual procedures become laborious, time-consuming, and potentially unreliable. Techniques from text mining facilitate the automatic assignment of text strings to categories, making classification expedient, fast, and reliable, which creates potential for its application in organizational research. The purpose of this article is to familiarize organizational researchers with text mining techniques from machine learning and statistics. We describe the text classification process in several roughly sequential steps, namely training data preparation, preprocessing, transformation, application of classification techniques, and validation, and provide concrete recommendations at each step. To help researchers develop their own text classifiers, the R code associated with each step is presented in a tutorial. The tutorial draws from our own work on job vacancy mining. We end the article by discussing how researchers can validate a text classification model and the associated output. PMID:29881249
Graphics-based intelligent search and abstracting using Data Modeling
NASA Astrophysics Data System (ADS)
Jaenisch, Holger M.; Handley, James W.; Case, Carl T.; Songy, Claude G.
2002-11-01
This paper presents an autonomous text and context-mining algorithm that converts text documents into point clouds for visual search cues. This algorithm is applied to the task of data-mining a scriptural database comprised of the Old and New Testaments from the Bible and the Book of Mormon, Doctrine and Covenants, and the Pearl of Great Price. Results are generated which graphically show the scripture that represents the average concept of the database and the mining of the documents down to the verse level.
Schouten, Kim; van der Weijde, Onne; Frasincar, Flavius; Dekker, Rommert
2018-04-01
Using online consumer reviews as electronic word of mouth to assist purchase-decision making has become increasingly popular. The Web provides an extensive source of consumer reviews, but one can hardly read all reviews to obtain a fair evaluation of a product or service. A text processing framework that can summarize reviews, would therefore be desirable. A subtask to be performed by such a framework would be to find the general aspect categories addressed in review sentences, for which this paper presents two methods. In contrast to most existing approaches, the first method presented is an unsupervised method that applies association rule mining on co-occurrence frequency data obtained from a corpus to find these aspect categories. While not on par with state-of-the-art supervised methods, the proposed unsupervised method performs better than several simple baselines, a similar but supervised method, and a supervised baseline, with an -score of 67%. The second method is a supervised variant that outperforms existing methods with an -score of 84%.
A Data Warehouse Architecture for DoD Healthcare Performance Measurements.
1999-09-01
design, develop, implement, and apply statistical analysis and data mining tools to a Data Warehouse of healthcare metrics. With the DoD healthcare...framework, this thesis defines a methodology to design, develop, implement, and apply statistical analysis and data mining tools to a Data Warehouse...21 F. INABILITY TO CONDUCT HELATHCARE ANALYSIS
NASA Astrophysics Data System (ADS)
Shyu, Mei-Ling; Huang, Zifang; Luo, Hongli
In recent years, pervasive computing infrastructures have greatly improved the interaction between human and system. As we put more reliance on these computing infrastructures, we also face threats of network intrusion and/or any new forms of undesirable IT-based activities. Hence, network security has become an extremely important issue, which is closely connected with homeland security, business transactions, and people's daily life. Accurate and efficient intrusion detection technologies are required to safeguard the network systems and the critical information transmitted in the network systems. In this chapter, a novel network intrusion detection framework for mining and detecting sequential intrusion patterns is proposed. The proposed framework consists of a Collateral Representative Subspace Projection Modeling (C-RSPM) component for supervised classification, and an inter-transactional association rule mining method based on Layer Divided Modeling (LDM) for temporal pattern analysis. Experiments on the KDD99 data set and the traffic data set generated by a private LAN testbed show promising results with high detection rates, low processing time, and low false alarm rates in mining and detecting sequential intrusion detections.
The Labour Welfare Fund Laws (Amendment) Act, 1987 (No. 15 of 1987), 22 May 1987.
1987-01-01
This Act authorizes funds constituted under the Mica Mines Labour Welfare Fund Act, 1946, the Limestone and Dolomite Mines Labour Welfare Fund Act, 1972, the Iron Ore Mines, Manganese Ore Mines and Chrome Mines Labour Welfare Fund Act, 1976, and the Beedi Workers Welfare Fund Act, 1976, to be applied for the provision of family welfare, including family planning education and services. full text
Mining Tasks from the Web Anchor Text Graph: MSR Notebook Paper for the TREC 2015 Tasks Track
2015-11-20
Mining Tasks from the Web Anchor Text Graph: MSR Notebook Paper for the TREC 2015 Tasks Track Paul N. Bennett Microsoft Research Redmond, USA pauben...anchor text graph has proven useful in the general realm of query reformulation [2], we sought to quantify the value of extracting key phrases from...anchor text in the broader setting of the task understanding track. Given a query, our approach considers a simple method for identifying a relevant
Maier, Raina M.; Díaz-Barriga, Fernando; Field, James A.; Hopkins, James; Klein, Bern; Poulton, Mary M.
2016-01-01
Increasing global demand for metals is straining the ability of the mining industry to physically keep up with demand (physical scarcity). On the other hand, social issues including the environmental and human health consequences of mining as well as the disparity in income distribution from mining revenues are disproportionately felt at the local community level. This has created social rifts, particularly in the developing world, between affected communities and both industry and governments. Such rifts can result in a disruption of the steady supply of metals (situational scarcity). Here we discuss the importance of mining in relationship to poverty, identify steps that have been taken to create a framework for socially responsible mining, and then discuss the need for academia to work in partnership with communities, government, and industry to develop trans-disciplinary research-based step change solutions to the intertwined problems of physical and situational scarcity. PMID:24552962
ERIC Educational Resources Information Center
Wang, Yinying; Bowers, Alex J.; Fikis, David J.
2017-01-01
Purpose: The purpose of this study is to describe the underlying topics and the topic evolution in the 50-year history of educational leadership research literature. Method: We used automated text data mining with probabilistic latent topic models to examine the full text of the entire publication history of all 1,539 articles published in…
EAGLE: 'EAGLE'Is an' Algorithmic Graph Library for Exploration
DOE Office of Scientific and Technical Information (OSTI.GOV)
2015-01-16
The Resource Description Framework (RDF) and SPARQL Protocol and RDF Query Language (SPARQL) were introduced about a decade ago to enable flexible schema-free data interchange on the Semantic Web. Today data scientists use the framework as a scalable graph representation for integrating, querying, exploring and analyzing data sets hosted at different sources. With increasing adoption, the need for graph mining capabilities for the Semantic Web has emerged. Today there is no tools to conduct "graph mining" on RDF standard data sets. We address that need through implementation of popular iterative Graph Mining algorithms (Triangle count, Connected component analysis, degree distribution,more » diversity degree, PageRank, etc.). We implement these algorithms as SPARQL queries, wrapped within Python scripts and call our software tool as EAGLE. In RDF style, EAGLE stands for "EAGLE 'Is an' algorithmic graph library for exploration. EAGLE is like 'MATLAB' for 'Linked Data.'« less
Kibinge, Nelson; Ono, Naoaki; Horie, Masafumi; Sato, Tetsuo; Sugiura, Tadao; Altaf-Ul-Amin, Md; Saito, Akira; Kanaya, Shigehiko
2016-06-01
Conventionally, workflows examining transcription regulation networks from gene expression data involve distinct analytical steps. There is a need for pipelines that unify data mining and inference deduction into a singular framework to enhance interpretation and hypotheses generation. We propose a workflow that merges network construction with gene expression data mining focusing on regulation processes in the context of transcription factor driven gene regulation. The pipeline implements pathway-based modularization of expression profiles into functional units to improve biological interpretation. The integrated workflow was implemented as a web application software (TransReguloNet) with functions that enable pathway visualization and comparison of transcription factor activity between sample conditions defined in the experimental design. The pipeline merges differential expression, network construction, pathway-based abstraction, clustering and visualization. The framework was applied in analysis of actual expression datasets related to lung, breast and prostrate cancer. Copyright © 2016 Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Kim, Kwang Hyeon; Lee, Suk; Shim, Jang Bo; Chang, Kyung Hwan; Yang, Dae Sik; Yoon, Won Sup; Park, Young Je; Kim, Chul Yong; Cao, Yuan Jie
2017-08-01
The aim of this study is an integrated research for text-based data mining and toxicity prediction modeling system for clinical decision support system based on big data in radiation oncology as a preliminary research. The structured and unstructured data were prepared by treatment plans and the unstructured data were extracted by dose-volume data image pattern recognition of prostate cancer for research articles crawling through the internet. We modeled an artificial neural network to build a predictor model system for toxicity prediction of organs at risk. We used a text-based data mining approach to build the artificial neural network model for bladder and rectum complication predictions. The pattern recognition method was used to mine the unstructured toxicity data for dose-volume at the detection accuracy of 97.9%. The confusion matrix and training model of the neural network were achieved with 50 modeled plans (n = 50) for validation. The toxicity level was analyzed and the risk factors for 25% bladder, 50% bladder, 20% rectum, and 50% rectum were calculated by the artificial neural network algorithm. As a result, 32 plans could cause complication but 18 plans were designed as non-complication among 50 modeled plans. We integrated data mining and a toxicity modeling method for toxicity prediction using prostate cancer cases. It is shown that a preprocessing analysis using text-based data mining and prediction modeling can be expanded to personalized patient treatment decision support based on big data.
A sentence sliding window approach to extract protein annotations from biomedical articles
Krallinger, Martin; Padron, Maria; Valencia, Alfonso
2005-01-01
Background Within the emerging field of text mining and statistical natural language processing (NLP) applied to biomedical articles, a broad variety of techniques have been developed during the past years. Nevertheless, there is still a great ned of comparative assessment of the performance of the proposed methods and the development of common evaluation criteria. This issue was addressed by the Critical Assessment of Text Mining Methods in Molecular Biology (BioCreative) contest. The aim of this contest was to assess the performance of text mining systems applied to biomedical texts including tools which recognize named entities such as genes and proteins, and tools which automatically extract protein annotations. Results The "sentence sliding window" approach proposed here was found to efficiently extract text fragments from full text articles containing annotations on proteins, providing the highest number of correctly predicted annotations. Moreover, the number of correct extractions of individual entities (i.e. proteins and GO terms) involved in the relationships used for the annotations was significantly higher than the correct extractions of the complete annotations (protein-function relations). Conclusion We explored the use of averaging sentence sliding windows for information extraction, especially in a context where conventional training data is unavailable. The combination of our approach with more refined statistical estimators and machine learning techniques might be a way to improve annotation extraction for future biomedical text mining applications. PMID:15960831
Text Mining Improves Prediction of Protein Functional Sites
Cohn, Judith D.; Ravikumar, Komandur E.
2012-01-01
We present an approach that integrates protein structure analysis and text mining for protein functional site prediction, called LEAP-FS (Literature Enhanced Automated Prediction of Functional Sites). The structure analysis was carried out using Dynamics Perturbation Analysis (DPA), which predicts functional sites at control points where interactions greatly perturb protein vibrations. The text mining extracts mentions of residues in the literature, and predicts that residues mentioned are functionally important. We assessed the significance of each of these methods by analyzing their performance in finding known functional sites (specifically, small-molecule binding sites and catalytic sites) in about 100,000 publicly available protein structures. The DPA predictions recapitulated many of the functional site annotations and preferentially recovered binding sites annotated as biologically relevant vs. those annotated as potentially spurious. The text-based predictions were also substantially supported by the functional site annotations: compared to other residues, residues mentioned in text were roughly six times more likely to be found in a functional site. The overlap of predictions with annotations improved when the text-based and structure-based methods agreed. Our analysis also yielded new high-quality predictions of many functional site residues that were not catalogued in the curated data sources we inspected. We conclude that both DPA and text mining independently provide valuable high-throughput protein functional site predictions, and that integrating the two methods using LEAP-FS further improves the quality of these predictions. PMID:22393388
NASA Astrophysics Data System (ADS)
Soltanmohammadi, Hossein; Osanloo, Morteza; Aghajani Bazzazi, Abbas
2009-08-01
This study intends to take advantage of a previously developed framework for mined land suitability analysis (MLSA) consisted of economical, social, technical and mine site factors to achieve a partial and also a complete pre-order of feasible post-mining land-uses. Analysis by an outranking multi-attribute decision-making (MADM) technique, called PROMETHEE (preference ranking organization method for enrichment evaluation), was taken into consideration because of its clear advantages on the field of MLSA as compared with MADM ranking techniques. Application of the proposed approach on a mined land can be completed through some successive steps. First, performance of the MLSA attributes is scored locally by each individual decision maker (DM). Then the assigned performance scores are normalized and the deviation amplitudes of non-dominated alternatives are calculated. Weights of the attributes are calculated by another MADM technique namely, analytical hierarchy process (AHP) in a separate procedure. Using the Gaussian preference function beside the weights, the preference indexes of the land-use alternatives are obtained. Calculation of the outgoing and entering flows of the alternatives and one by one comparison of these values will lead to partial pre-order of them and calculation of the net flows, will lead to a ranked preference for each land-use. At the final step, utilizing the PROMETHEE group decision support system which incorporates judgments of all the DMs, a consensual ranking can be derived. In this paper, preference order of post-mining land-uses for a hypothetical mined land has been derived according to judgments of one DM to reveal applicability of the proposed approach.
Xia, Jingbo; Zhang, Xing; Yuan, Daojun; Chen, Lingling; Webster, Jonathan; Fang, Alex Chengyu
2013-01-01
To effectively assess the possibility of the unknown rice protein resistant to Xanthomonas oryzae pv. oryzae, a hybrid strategy is proposed to enhance gene prioritization by combining text mining technologies with a sequence-based approach. The text mining technique of term frequency inverse document frequency is used to measure the importance of distinguished terms which reflect biomedical activity in rice before candidate genes are screened and vital terms are produced. Afterwards, a built-in classifier under the chaos games representation algorithm is used to sieve the best possible candidate gene. Our experiment results show that the combination of these two methods achieves enhanced gene prioritization. PMID:24371834
Uncovering text mining: A survey of current work on web-based epidemic intelligence
Collier, Nigel
2012-01-01
Real world pandemics such as SARS 2002 as well as popular fiction like the movie Contagion graphically depict the health threat of a global pandemic and the key role of epidemic intelligence (EI). While EI relies heavily on established indicator sources a new class of methods based on event alerting from unstructured digital Internet media is rapidly becoming acknowledged within the public health community. At the heart of automated information gathering systems is a technology called text mining. My contribution here is to provide an overview of the role that text mining technology plays in detecting epidemics and to synthesise my existing research on the BioCaster project. PMID:22783909
Xia, Jingbo; Zhang, Xing; Yuan, Daojun; Chen, Lingling; Webster, Jonathan; Fang, Alex Chengyu
2013-01-01
To effectively assess the possibility of the unknown rice protein resistant to Xanthomonas oryzae pv. oryzae, a hybrid strategy is proposed to enhance gene prioritization by combining text mining technologies with a sequence-based approach. The text mining technique of term frequency inverse document frequency is used to measure the importance of distinguished terms which reflect biomedical activity in rice before candidate genes are screened and vital terms are produced. Afterwards, a built-in classifier under the chaos games representation algorithm is used to sieve the best possible candidate gene. Our experiment results show that the combination of these two methods achieves enhanced gene prioritization.
A spatial assessment framework for evaluating flood risk under extreme climates.
Chen, Yun; Liu, Rui; Barrett, Damian; Gao, Lei; Zhou, Mingwei; Renzullo, Luigi; Emelyanova, Irina
2015-12-15
Australian coal mines have been facing a major challenge of increasing risk of flooding caused by intensive rainfall events in recent years. In light of growing climate change concerns and the predicted escalation of flooding, estimating flood inundation risk becomes essential for understanding sustainable mine water management in the Australian mining sector. This research develops a spatial multi-criteria decision making prototype for the evaluation of flooding risk at a regional scale using the Bowen Basin and its surroundings in Queensland as a case study. Spatial gridded data, including climate, hydrology, topography, vegetation and soils, were collected and processed in ArcGIS. Several indices were derived based on time series of observations and spatial modeling taking account of extreme rainfall, evapotranspiration, stream flow, potential soil water retention, elevation and slope generated from a digital elevation model (DEM), as well as drainage density and proximity extracted from a river network. These spatial indices were weighted using the analytical hierarchy process (AHP) and integrated in an AHP-based suitability assessment (AHP-SA) model under the spatial risk evaluation framework. A regional flooding risk map was delineated to represent likely impacts of criterion indices at different risk levels, which was verified using the maximum inundation extent detectable by a time series of remote sensing imagery. The result provides baseline information to help Bowen Basin coal mines identify and assess flooding risk when making adaptation strategies and implementing mitigation measures in future. The framework and methodology developed in this research offers the Australian mining industry, and social and environmental studies around the world, an effective way to produce reliable assessment on flood risk for managing uncertainty in water availability under climate change. Copyright © 2015. Published by Elsevier B.V.
Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining.
Hero, Alfred O; Rajaratnam, Bala
2016-01-01
When can reliable inference be drawn in fue "Big Data" context? This paper presents a framework for answering this fundamental question in the context of correlation mining, wifu implications for general large scale inference. In large scale data applications like genomics, connectomics, and eco-informatics fue dataset is often variable-rich but sample-starved: a regime where the number n of acquired samples (statistical replicates) is far fewer than fue number p of observed variables (genes, neurons, voxels, or chemical constituents). Much of recent work has focused on understanding the computational complexity of proposed methods for "Big Data". Sample complexity however has received relatively less attention, especially in the setting when the sample size n is fixed, and the dimension p grows without bound. To address fuis gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime where fue variable dimension is fixed and fue sample size goes to infinity; 2) the mixed asymptotic regime where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high dimensional asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche but only the latter regime applies to exa cale data dimension. We illustrate this high dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables fua t are of interest. Correlation mining arises in numerous applications and subsumes the regression context as a special case. we demonstrate various regimes of correlation mining based on the unifying perspective of high dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.
Kreula, Sanna M; Kaewphan, Suwisa; Ginter, Filip; Jones, Patrik R
2018-01-01
The increasing move towards open access full-text scientific literature enhances our ability to utilize advanced text-mining methods to construct information-rich networks that no human will be able to grasp simply from 'reading the literature'. The utility of text-mining for well-studied species is obvious though the utility for less studied species, or those with no prior track-record at all, is not clear. Here we present a concept for how advanced text-mining can be used to create information-rich networks even for less well studied species and apply it to generate an open-access gene-gene association network resource for Synechocystis sp. PCC 6803, a representative model organism for cyanobacteria and first case-study for the methodology. By merging the text-mining network with networks generated from species-specific experimental data, network integration was used to enhance the accuracy of predicting novel interactions that are biologically relevant. A rule-based algorithm (filter) was constructed in order to automate the search for novel candidate genes with a high degree of likely association to known target genes by (1) ignoring established relationships from the existing literature, as they are already 'known', and (2) demanding multiple independent evidences for every novel and potentially relevant relationship. Using selected case studies, we demonstrate the utility of the network resource and filter to ( i ) discover novel candidate associations between different genes or proteins in the network, and ( ii ) rapidly evaluate the potential role of any one particular gene or protein. The full network is provided as an open-source resource.
Land use-based landscape planning and restoration in mine closure areas.
Zhang, Jianjun; Fu, Meichen; Hassani, Ferri P; Zeng, Hui; Geng, Yuhuan; Bai, Zhongke
2011-05-01
Landscape planning and restoration in mine closure areas is not only an inevitable choice to sustain mining areas but also an important path to maximize landscape resources and to improve ecological function in mine closure areas. The analysis of the present mine development shows that many mines are unavoidably facing closures in China. This paper analyzes the periodic impact of mining activities on landscapes and then proposes planning concepts and principles. According to the landscape characteristics in mine closure areas, this paper classifies available landscape resources in mine closure areas into the landscape for restoration, for limited restoration and for protection, and then summarizes directions for their uses. This paper establishes the framework of spatial control planning and design of landscape elements from "macro control, medium allocation and micro optimization" for the purpose of managing and using this kind of special landscape resources. Finally, this paper applies the theories and methods to a case study in Wu'an from two aspects: the construction of a sustainable land-use pattern on a large scale and the optimized allocation of typical mine landscape resources on a small scale.
Introduction to Agent Mining Interaction and Integration
NASA Astrophysics Data System (ADS)
Cao, Longbing
In recent years, more and more researchers have been involved in research on both agent technology and data mining. A clear disciplinary effort has been activated toward removing the boundary between them, that is the interaction and integration between agent technology and data mining. We refer this to agent mining as a new area. The marriage of agents and data mining is driven by challenges faced by both communities, and the need of developing more advanced intelligence, information processing and systems. This chapter presents an overall picture of agent mining from the perspective of positioning it as an emerging area. We summarize the main driving forces, complementary essence, disciplinary framework, applications, case studies, and trends and directions, as well as brief observation on agent-driven data mining, data mining-driven agents, and mutual issues in agent mining. Arguably, we draw the following conclusions: (1) agent mining emerges as a new area in the scientific family, (2) both agent technology and data mining can greatly benefit from agent mining, (3) it is very promising to result in additional advancement in intelligent information processing and systems. However, as a new open area, there are many issues waiting for research and development from theoretical, technological and practical perspectives.
Code of Federal Regulations, 2010 CFR
2010-07-01
... texts of State and Federal cooperative agreements for regulation of mining on Federal lands. The... Resources OFFICE OF SURFACE MINING RECLAMATION AND ENFORCEMENT, DEPARTMENT OF THE INTERIOR PROGRAMS FOR THE CONDUCT OF SURFACE MINING OPERATIONS WITHIN EACH STATE INTRODUCTION § 900.2 Objectives. The objective of...
76 FR 40649 - Indiana Regulatory Program
Federal Register 2010, 2011, 2012, 2013, 2014
2011-07-11
... at 312 IAC 25-6-30 Surface mining; explosives; general requirements. The full text of the program... DEPARTMENT OF THE INTERIOR Office of Surface Mining Reclamation and Enforcement 30 CFR Part 914... Mining Reclamation and Enforcement, Interior. ACTION: Proposed rule; public comment period on proposed...
Complementing the Numbers: A Text Mining Analysis of College Course Withdrawals
ERIC Educational Resources Information Center
Michalski, Greg V.
2011-01-01
Excessive college course withdrawals are costly to the student and the institution in terms of time to degree completion, available classroom space, and other resources. Although generally well quantified, detailed analysis of the reasons given by students for course withdrawal is less common. To address this, a text mining analysis was performed…
Can abstract screening workload be reduced using text mining? User experiences of the tool Rayyan.
Olofsson, Hanna; Brolund, Agneta; Hellberg, Christel; Silverstein, Rebecca; Stenström, Karin; Österberg, Marie; Dagerhamn, Jessica
2017-09-01
One time-consuming aspect of conducting systematic reviews is the task of sifting through abstracts to identify relevant studies. One promising approach for reducing this burden uses text mining technology to identify those abstracts that are potentially most relevant for a project, allowing those abstracts to be screened first. To examine the effectiveness of the text mining functionality of the abstract screening tool Rayyan. User experiences were collected. Rayyan was used to screen abstracts for 6 reviews in 2015. After screening 25%, 50%, and 75% of the abstracts, the screeners logged the relevant references identified. A survey was sent to users. After screening half of the search result with Rayyan, 86% to 99% of the references deemed relevant to the study were identified. Of those studies included in the final reports, 96% to 100% were already identified in the first half of the screening process. Users rated Rayyan 4.5 out of 5. The text mining function in Rayyan successfully helped reviewers identify relevant studies early in the screening process. Copyright © 2017 John Wiley & Sons, Ltd.
Pandey, Abhishek; Kreimeyer, Kory; Foster, Matthew; Botsis, Taxiarchis; Dang, Oanh; Ly, Thomas; Wang, Wei; Forshee, Richard
2018-01-01
Structured Product Labels follow an XML-based document markup standard approved by the Health Level Seven organization and adopted by the US Food and Drug Administration as a mechanism for exchanging medical products information. Their current organization makes their secondary use rather challenging. We used the Side Effect Resource database and DailyMed to generate a comparison dataset of 1159 Structured Product Labels. We processed the Adverse Reaction section of these Structured Product Labels with the Event-based Text-mining of Health Electronic Records system and evaluated its ability to extract and encode Adverse Event terms to Medical Dictionary for Regulatory Activities Preferred Terms. A small sample of 100 labels was then selected for further analysis. Of the 100 labels, Event-based Text-mining of Health Electronic Records achieved a precision and recall of 81 percent and 92 percent, respectively. This study demonstrated Event-based Text-mining of Health Electronic Record's ability to extract and encode Adverse Event terms from Structured Product Labels which may potentially support multiple pharmacoepidemiological tasks.
Data Processing and Text Mining Technologies on Electronic Medical Records: A Review
Sun, Wencheng; Li, Yangyang; Liu, Fang; Fang, Shengqun; Wang, Guoyan
2018-01-01
Currently, medical institutes generally use EMR to record patient's condition, including diagnostic information, procedures performed, and treatment results. EMR has been recognized as a valuable resource for large-scale analysis. However, EMR has the characteristics of diversity, incompleteness, redundancy, and privacy, which make it difficult to carry out data mining and analysis directly. Therefore, it is necessary to preprocess the source data in order to improve data quality and improve the data mining results. Different types of data require different processing technologies. Most structured data commonly needs classic preprocessing technologies, including data cleansing, data integration, data transformation, and data reduction. For semistructured or unstructured data, such as medical text, containing more health information, it requires more complex and challenging processing methods. The task of information extraction for medical texts mainly includes NER (named-entity recognition) and RE (relation extraction). This paper focuses on the process of EMR processing and emphatically analyzes the key techniques. In addition, we make an in-depth study on the applications developed based on text mining together with the open challenges and research issues for future work. PMID:29849998
Trippi, Michael H.; Ruppert, Leslie F.; Attanasi, E.D.; Milici, Robert C.; Freeman, P.A.
2014-01-01
Data from 157 counties in the Appalachian basin of average sulfur content of coal mined for electrical power generation from 1983 through 2005 show a general decrease in the number of counties where coal mining has occurred and a decrease in the number of counties where higher sulfur coals (>2 percent sulfur) were mined. Calculated potential SO2 emissions (assuming no post-combustion SO2 removal) show a corresponding decrease over the same period of time.
NASA Astrophysics Data System (ADS)
Galitsky, Boris; Kovalerchuk, Boris
2006-04-01
We develop a software system Text Scanner for Emotional Distress (TSED) for helping to detect email messages which are suspicious of coming from people under strong emotional distress. It has been confirmed by multiple studies that terrorist attackers have experienced a substantial emotional distress at some points before committing a terrorist attack. Therefore, if an individual in emotional distress can be detected on the basis of email texts, some preventive measures can be taken. The proposed detection machinery is based on extraction and classification of emotional profiles from emails. An emotional profile is a formal representation of a sequence of emotional states through a textual discourse where communicative actions are attached to these emotional states. The issues of extraction of emotional profiles from text and reasoning about it are discussed and illustrated. We then develop an inductive machine learning and reasoning framework to relate an emotional profile to the class "Emotional distress" or "No emotional distress", given a training dataset where the class is assigned by an expert. TSED's machine learning is evaluated using the database of structured customer complaints.
Discovering System Health Anomalies Using Data Mining Techniques
NASA Technical Reports Server (NTRS)
Sriastava, Ashok, N.
2005-01-01
We present a data mining framework for the analysis and discovery of anomalies in high-dimensional time series of sensor measurements that would be found in an Integrated System Health Monitoring system. We specifically treat the problem of discovering anomalous features in the time series that may be indicative of a system anomaly, or in the case of a manned system, an anomaly due to the human. Identification of these anomalies is crucial to building stable, reusable, and cost-efficient systems. The framework consists of an analysis platform and new algorithms that can scale to thousands of sensor streams to discovers temporal anomalies. We discuss the mathematical framework that underlies the system and also describe in detail how this framework is general enough to encompass both discrete and continuous sensor measurements. We also describe a new set of data mining algorithms based on kernel methods and hidden Markov models that allow for the rapid assimilation, analysis, and discovery of system anomalies. We then describe the performance of the system on a real-world problem in the aircraft domain where we analyze the cockpit data from aircraft as well as data from the aircraft propulsion, control, and guidance systems. These data are discrete and continuous sensor measurements and are dealt with seamlessly in order to discover anomalous flights. We conclude with recommendations that describe the tradeoffs in building an integrated scalable platform for robust anomaly detection in ISHM applications.
Friedel, Michael J.
2008-01-01
Mauritania anticipates an increase in mining activities throughout the country and into the foreseeable future. Because mining-induced changes in the landscape are likely to affect their limited ground-water resources and sensitive aquatic ecosystems, a water-quality assessment program was designed for Mauritania that is based on a nationally consistent environmental stratification framework. The primary objectives of this program are to ensure that the environmental monitoring systems can quantify near real-time changes in surface-water chemistry at a local scale, and quantify intermediate- to long-term changes in groundwater and aquatic ecosystems over multiple scales.
Bockholt, Henry J.; Scully, Mark; Courtney, William; Rachakonda, Srinivas; Scott, Adam; Caprihan, Arvind; Fries, Jill; Kalyanam, Ravi; Segall, Judith M.; de la Garza, Raul; Lane, Susan; Calhoun, Vince D.
2009-01-01
A neuroinformatics (NI) system is critical to brain imaging research in order to shorten the time between study conception and results. Such a NI system is required to scale well when large numbers of subjects are studied. Further, when multiple sites participate in research projects organizational issues become increasingly difficult. Optimized NI applications mitigate these problems. Additionally, NI software enables coordination across multiple studies, leveraging advantages potentially leading to exponential research discoveries. The web-based, Mind Research Network (MRN), database system has been designed and improved through our experience with 200 research studies and 250 researchers from seven different institutions. The MRN tools permit the collection, management, reporting and efficient use of large scale, heterogeneous data sources, e.g., multiple institutions, multiple principal investigators, multiple research programs and studies, and multimodal acquisitions. We have collected and analyzed data sets on thousands of research participants and have set up a framework to automatically analyze the data, thereby making efficient, practical data mining of this vast resource possible. This paper presents a comprehensive framework for capturing and analyzing heterogeneous neuroscience research data sources that has been fully optimized for end-users to perform novel data mining. PMID:20461147
Molinari, Filippo; Raghavendra, U; Gudigar, Anjan; Meiburger, Kristen M; Rajendra Acharya, U
2018-02-23
Atherosclerosis is a type of cardiovascular disease which may cause stroke. It is due to the deposition of fatty plaque in the artery walls resulting in the reduction of elasticity gradually and hence restricting the blood flow to the heart. Hence, an early prediction of carotid plaque deposition is important, as it can save lives. This paper proposes a novel data mining framework for the assessment of atherosclerosis in its early stage using ultrasound images. In this work, we are using 1353 symptomatic and 420 asymptomatic carotid plaque ultrasound images. Our proposed method classifies the symptomatic and asymptomatic carotid plaques using bidimensional empirical mode decomposition (BEMD) and entropy features. The unbalanced data samples are compensated using adaptive synthetic sampling (ADASYN), and the developed method yielded a promising accuracy of 91.43%, sensitivity of 97.26%, and specificity of 83.22% using fourteen features. Hence, the proposed method can be used as an assisting tool during the regular screening of carotid arteries in hospitals. Graphical abstract Outline for our efficient data mining framework for the characterization of symptomatic and asymptomatic carotid plaques.
ERIC Educational Resources Information Center
O'Halloran, Kay L.; Tan, Sabine; Pham, Duc-Son; Bateman, John; Vande Moere, Andrew
2018-01-01
This article demonstrates how a digital environment offers new opportunities for transforming qualitative data into quantitative data in order to use data mining and information visualization for mixed methods research. The digital approach to mixed methods research is illustrated by a framework which combines qualitative methods of multimodal…
Edu-Mining for Book Recommendation for Pupils
ERIC Educational Resources Information Center
Nagata, Ryo; Takeda, Keigo; Suda, Koji; Kakegawa, Junichi; Morihiro, Koichiro
2009-01-01
This paper proposes a novel method for recommending books to pupils based on a framework called Edu-mining. One of the properties of the proposed method is that it uses only loan histories (pupil ID, book ID, date of loan) whereas the conventional methods require additional information such as taste information from a great number of users which…
An overview of the BioCreative 2012 Workshop Track III: interactive text mining task
Arighi, Cecilia N.; Carterette, Ben; Cohen, K. Bretonnel; Krallinger, Martin; Wilbur, W. John; Fey, Petra; Dodson, Robert; Cooper, Laurel; Van Slyke, Ceri E.; Dahdul, Wasila; Mabee, Paula; Li, Donghui; Harris, Bethany; Gillespie, Marc; Jimenez, Silvia; Roberts, Phoebe; Matthews, Lisa; Becker, Kevin; Drabkin, Harold; Bello, Susan; Licata, Luana; Chatr-aryamontri, Andrew; Schaeffer, Mary L.; Park, Julie; Haendel, Melissa; Van Auken, Kimberly; Li, Yuling; Chan, Juancarlos; Muller, Hans-Michael; Cui, Hong; Balhoff, James P.; Chi-Yang Wu, Johnny; Lu, Zhiyong; Wei, Chih-Hsuan; Tudor, Catalina O.; Raja, Kalpana; Subramani, Suresh; Natarajan, Jeyakumar; Cejuela, Juan Miguel; Dubey, Pratibha; Wu, Cathy
2013-01-01
In many databases, biocuration primarily involves literature curation, which usually involves retrieving relevant articles, extracting information that will translate into annotations and identifying new incoming literature. As the volume of biological literature increases, the use of text mining to assist in biocuration becomes increasingly relevant. A number of groups have developed tools for text mining from a computer science/linguistics perspective, and there are many initiatives to curate some aspect of biology from the literature. Some biocuration efforts already make use of a text mining tool, but there have not been many broad-based systematic efforts to study which aspects of a text mining tool contribute to its usefulness for a curation task. Here, we report on an effort to bring together text mining tool developers and database biocurators to test the utility and usability of tools. Six text mining systems presenting diverse biocuration tasks participated in a formal evaluation, and appropriate biocurators were recruited for testing. The performance results from this evaluation indicate that some of the systems were able to improve efficiency of curation by speeding up the curation task significantly (∼1.7- to 2.5-fold) over manual curation. In addition, some of the systems were able to improve annotation accuracy when compared with the performance on the manually curated set. In terms of inter-annotator agreement, the factors that contributed to significant differences for some of the systems included the expertise of the biocurator on the given curation task, the inherent difficulty of the curation and attention to annotation guidelines. After the task, annotators were asked to complete a survey to help identify strengths and weaknesses of the various systems. The analysis of this survey highlights how important task completion is to the biocurators’ overall experience of a system, regardless of the system’s high score on design, learnability and usability. In addition, strategies to refine the annotation guidelines and systems documentation, to adapt the tools to the needs and query types the end user might have and to evaluate performance in terms of efficiency, user interface, result export and traditional evaluation metrics have been analyzed during this task. This analysis will help to plan for a more intense study in BioCreative IV. PMID:23327936
An overview of the BioCreative 2012 Workshop Track III: interactive text mining task.
Arighi, Cecilia N; Carterette, Ben; Cohen, K Bretonnel; Krallinger, Martin; Wilbur, W John; Fey, Petra; Dodson, Robert; Cooper, Laurel; Van Slyke, Ceri E; Dahdul, Wasila; Mabee, Paula; Li, Donghui; Harris, Bethany; Gillespie, Marc; Jimenez, Silvia; Roberts, Phoebe; Matthews, Lisa; Becker, Kevin; Drabkin, Harold; Bello, Susan; Licata, Luana; Chatr-aryamontri, Andrew; Schaeffer, Mary L; Park, Julie; Haendel, Melissa; Van Auken, Kimberly; Li, Yuling; Chan, Juancarlos; Muller, Hans-Michael; Cui, Hong; Balhoff, James P; Chi-Yang Wu, Johnny; Lu, Zhiyong; Wei, Chih-Hsuan; Tudor, Catalina O; Raja, Kalpana; Subramani, Suresh; Natarajan, Jeyakumar; Cejuela, Juan Miguel; Dubey, Pratibha; Wu, Cathy
2013-01-01
In many databases, biocuration primarily involves literature curation, which usually involves retrieving relevant articles, extracting information that will translate into annotations and identifying new incoming literature. As the volume of biological literature increases, the use of text mining to assist in biocuration becomes increasingly relevant. A number of groups have developed tools for text mining from a computer science/linguistics perspective, and there are many initiatives to curate some aspect of biology from the literature. Some biocuration efforts already make use of a text mining tool, but there have not been many broad-based systematic efforts to study which aspects of a text mining tool contribute to its usefulness for a curation task. Here, we report on an effort to bring together text mining tool developers and database biocurators to test the utility and usability of tools. Six text mining systems presenting diverse biocuration tasks participated in a formal evaluation, and appropriate biocurators were recruited for testing. The performance results from this evaluation indicate that some of the systems were able to improve efficiency of curation by speeding up the curation task significantly (∼1.7- to 2.5-fold) over manual curation. In addition, some of the systems were able to improve annotation accuracy when compared with the performance on the manually curated set. In terms of inter-annotator agreement, the factors that contributed to significant differences for some of the systems included the expertise of the biocurator on the given curation task, the inherent difficulty of the curation and attention to annotation guidelines. After the task, annotators were asked to complete a survey to help identify strengths and weaknesses of the various systems. The analysis of this survey highlights how important task completion is to the biocurators' overall experience of a system, regardless of the system's high score on design, learnability and usability. In addition, strategies to refine the annotation guidelines and systems documentation, to adapt the tools to the needs and query types the end user might have and to evaluate performance in terms of efficiency, user interface, result export and traditional evaluation metrics have been analyzed during this task. This analysis will help to plan for a more intense study in BioCreative IV.
76 FR 12849 - Kentucky Regulatory Program
Federal Register 2010, 2011, 2012, 2013, 2014
2011-03-09
... (underground mining). The text of the Kentucky regulations can be found in the administrative record and online... DEPARTMENT OF THE INTERIOR Office of Surface Mining Reclamation and Enforcement 30 CFR Part 917 [KY-252-FOR; OSM-2009-0011] Kentucky Regulatory Program AGENCY: Office of Surface Mining Reclamation...
ERIC Educational Resources Information Center
Trumbull, Elise; Rothstein-Fisch, Carrie; Greenfield, Patricia M.
2001-01-01
Describes the Bridging Cultures framework, a tool for understanding how the expectations for a student at school may conflict with the values of the student's family. The framework describes two contrasting value systems (individualism and collectivism) and helps teachers consider where differences may lie to head off potential conflict. (SM)
Research on Classification of Chinese Text Data Based on SVM
NASA Astrophysics Data System (ADS)
Lin, Yuan; Yu, Hongzhi; Wan, Fucheng; Xu, Tao
2017-09-01
Data Mining has important application value in today’s industry and academia. Text classification is a very important technology in data mining. At present, there are many mature algorithms for text classification. KNN, NB, AB, SVM, decision tree and other classification methods all show good classification performance. Support Vector Machine’ (SVM) classification method is a good classifier in machine learning research. This paper will study the classification effect based on the SVM method in the Chinese text data, and use the support vector machine method in the chinese text to achieve the classify chinese text, and to able to combination of academia and practical application.
StemTextSearch: Stem cell gene database with evidence from abstracts.
Chen, Chou-Cheng; Ho, Chung-Liang
2017-05-01
Previous studies have used many methods to find biomarkers in stem cells, including text mining, experimental data and image storage. However, no text-mining methods have yet been developed which can identify whether a gene plays a positive or negative role in stem cells. StemTextSearch identifies the role of a gene in stem cells by using a text-mining method to find combinations of gene regulation, stem-cell regulation and cell processes in the same sentences of biomedical abstracts. The dataset includes 5797 genes, with 1534 genes having positive roles in stem cells, 1335 genes having negative roles, 1654 genes with both positive and negative roles, and 1274 with an uncertain role. The precision of gene role in StemTextSearch is 0.66, and the recall is 0.78. StemTextSearch is a web-based engine with queries that specify (i) gene, (ii) category of stem cell, (iii) gene role, (iv) gene regulation, (v) cell process, (vi) stem-cell regulation, and (vii) species. StemTextSearch is available through http://bio.yungyun.com.tw/StemTextSearch.aspx. Copyright © 2017. Published by Elsevier Inc.
2010-01-01
Background An increase in work on the full text of journal articles and the growth of PubMedCentral have the opportunity to create a major paradigm shift in how biomedical text mining is done. However, until now there has been no comprehensive characterization of how the bodies of full text journal articles differ from the abstracts that until now have been the subject of most biomedical text mining research. Results We examined the structural and linguistic aspects of abstracts and bodies of full text articles, the performance of text mining tools on both, and the distribution of a variety of semantic classes of named entities between them. We found marked structural differences, with longer sentences in the article bodies and much heavier use of parenthesized material in the bodies than in the abstracts. We found content differences with respect to linguistic features. Three out of four of the linguistic features that we examined were statistically significantly differently distributed between the two genres. We also found content differences with respect to the distribution of semantic features. There were significantly different densities per thousand words for three out of four semantic classes, and clear differences in the extent to which they appeared in the two genres. With respect to the performance of text mining tools, we found that a mutation finder performed equally well in both genres, but that a wide variety of gene mention systems performed much worse on article bodies than they did on abstracts. POS tagging was also more accurate in abstracts than in article bodies. Conclusions Aspects of structure and content differ markedly between article abstracts and article bodies. A number of these differences may pose problems as the text mining field moves more into the area of processing full-text articles. However, these differences also present a number of opportunities for the extraction of data types, particularly that found in parenthesized text, that is present in article bodies but not in article abstracts. PMID:20920264
Deriving novel relationships from the scientific literature is an important adjunct to datamining activities for complex datasets in genomics and high-throughput screening activities. Automated text-mining algorithms can be used to extract relevant content from the literature and...
A Feature Mining Based Approach for the Classification of Text Documents into Disjoint Classes.
ERIC Educational Resources Information Center
Nieto Sanchez, Salvador; Triantaphyllou, Evangelos; Kraft, Donald
2002-01-01
Proposes a new approach for classifying text documents into two disjoint classes. Highlights include a brief overview of document clustering; a data mining approach called the One Clause at a Time (OCAT) algorithm which is based on mathematical logic; vector space model (VSM); and comparing the OCAT to the VSM. (Author/LRW)
ERIC Educational Resources Information Center
Hung, Jui-Long; Zhang, Ke
2012-01-01
This study investigated the longitudinal trends of academic articles in Mobile Learning (ML) using text mining techniques. One hundred and nineteen (119) refereed journal articles and proceedings papers from the SCI/SSCI database were retrieved and analyzed. The taxonomies of ML publications were grouped into twelve clusters (topics) and four…
Trends of E-Learning Research from 2000 to 2008: Use of Text Mining and Bibliometrics
ERIC Educational Resources Information Center
Hung, Jui-long
2012-01-01
This study investigated the longitudinal trends of e-learning research using text mining techniques. Six hundred and eighty-nine (689) refereed journal articles and proceedings were retrieved from the Science Citation Index/Social Science Citation Index database in the period from 2000 to 2008. All e-learning publications were grouped into two…
Kreula, Sanna M.; Kaewphan, Suwisa; Ginter, Filip
2018-01-01
The increasing move towards open access full-text scientific literature enhances our ability to utilize advanced text-mining methods to construct information-rich networks that no human will be able to grasp simply from ‘reading the literature’. The utility of text-mining for well-studied species is obvious though the utility for less studied species, or those with no prior track-record at all, is not clear. Here we present a concept for how advanced text-mining can be used to create information-rich networks even for less well studied species and apply it to generate an open-access gene-gene association network resource for Synechocystis sp. PCC 6803, a representative model organism for cyanobacteria and first case-study for the methodology. By merging the text-mining network with networks generated from species-specific experimental data, network integration was used to enhance the accuracy of predicting novel interactions that are biologically relevant. A rule-based algorithm (filter) was constructed in order to automate the search for novel candidate genes with a high degree of likely association to known target genes by (1) ignoring established relationships from the existing literature, as they are already ‘known’, and (2) demanding multiple independent evidences for every novel and potentially relevant relationship. Using selected case studies, we demonstrate the utility of the network resource and filter to (i) discover novel candidate associations between different genes or proteins in the network, and (ii) rapidly evaluate the potential role of any one particular gene or protein. The full network is provided as an open-source resource. PMID:29844966
NASA Astrophysics Data System (ADS)
Mikhalchenko, V. V.; Rubanik, Yu T.
2016-10-01
The work is devoted to the problem of cost-effective adaptation of coal mines to the volatile and uncertain market conditions. Conceptually it can be achieved through alignment of the dynamic characteristics of the coal mining system and power spectrum of market demand for coal product. In practical terms, this ensures the viability and competitiveness of coal mines. Transformation of dynamic characteristics is to be done by changing the structure of production system as well as corporate, logistics and management processes. The proposed methods and algorithms of control are aimed at the development of the theoretical foundations of adaptive optimization as basic methodology for coal mine enterprise management in conditions of high variability and uncertainty of economic and natural environment. Implementation of the proposed methodology requires a revision of the basic principles of open coal mining enterprises design.
Generative Topic Modeling in Image Data Mining and Bioinformatics Studies
ERIC Educational Resources Information Center
Chen, Xin
2012-01-01
Probabilistic topic models have been developed for applications in various domains such as text mining, information retrieval and computer vision and bioinformatics domain. In this thesis, we focus on developing novel probabilistic topic models for image mining and bioinformatics studies. Specifically, a probabilistic topic-connection (PTC) model…
Federal Register 2010, 2011, 2012, 2013, 2014
2013-07-05
... silver mining operation. Most of the infrastructure to support a mining operation was authorized and.... The Proposed Action consists of underground mining, constructing a new production shaft, improving.... Public comments resulted in the addition of clarifying text, but did not significantly change the...
Federal Register 2010, 2011, 2012, 2013, 2014
2012-03-22
... authorization does not exceed one year, an IHA may be issued. Upon making a finding that an application for... establish a framework for authorizing incidental take with one or more future LOAs over a period not to... Countermeasures (MCM) detonations is one function of the U.S. Navy EOD force, which involves mine-hunting and mine...
Ghazizadeh, Mahtab; McDonald, Anthony D; Lee, John D
2014-09-01
This study applies text mining to extract clusters of vehicle problems and associated trends from free-response data in the National Highway Traffic Safety Administration's vehicle owner's complaint database. As the automotive industry adopts new technologies, it is important to systematically assess the effect of these changes on traffic safety. Driving simulators, naturalistic driving data, and crash databases all contribute to a better understanding of how drivers respond to changing vehicle technology, but other approaches, such as automated analysis of incident reports, are needed. Free-response data from incidents representing two severity levels (fatal incidents and incidents involving injury) were analyzed using a text mining approach: latent semantic analysis (LSA). LSA and hierarchical clustering identified clusters of complaints for each severity level, which were compared and analyzed across time. Cluster analysis identified eight clusters of fatal incidents and six clusters of incidents involving injury. Comparisons showed that although the airbag clusters across the two severity levels have the same most frequent terms, the circumstances around the incidents differ. The time trends show clear increases in complaints surrounding the Ford/Firestone tire recall and the Toyota unintended acceleration recall. Increases in complaints may be partially driven by these recall announcements and the associated media attention. Text mining can reveal useful information from free-response databases that would otherwise be prohibitively time-consuming and difficult to summarize manually. Text mining can extend human analysis capabilities for large free-response databases to support earlier detection of problems and more timely safety interventions.
Fuzzy Modelling for Human Dynamics Based on Online Social Networks
Cuenca-Jara, Jesus; Valdes-Vela, Mercedes; Skarmeta, Antonio F.
2017-01-01
Human mobility mining has attracted a lot of attention in the research community due to its multiple implications in the provisioning of innovative services for large metropolises. In this scope, Online Social Networks (OSN) have arisen as a promising source of location data to come up with new mobility models. However, the human nature of this data makes it rather noisy and inaccurate. In order to deal with such limitations, the present work introduces a framework for human mobility mining based on fuzzy logic. Firstly, a fuzzy clustering algorithm extracts the most active OSN areas at different time periods. Next, such clusters are the building blocks to compose mobility patterns. Furthermore, a location prediction service based on a fuzzy rule classifier has been developed on top of the framework. Finally, both the framework and the predictor has been tested with a Twitter and Flickr dataset in two large cities. PMID:28837120
Fuzzy Modelling for Human Dynamics Based on Online Social Networks.
Cuenca-Jara, Jesus; Terroso-Saenz, Fernando; Valdes-Vela, Mercedes; Skarmeta, Antonio F
2017-08-24
Human mobility mining has attracted a lot of attention in the research community due to its multiple implications in the provisioning of innovative services for large metropolises. In this scope, Online Social Networks (OSN) have arisen as a promising source of location data to come up with new mobility models. However, the human nature of this data makes it rather noisy and inaccurate. In order to deal with such limitations, the present work introduces a framework for human mobility mining based on fuzzy logic. Firstly, a fuzzy clustering algorithm extracts the most active OSN areas at different time periods. Next, such clusters are the building blocks to compose mobility patterns. Furthermore, a location prediction service based on a fuzzy rule classifier has been developed on top of the framework. Finally, both the framework and the predictor has been tested with a Twitter and Flickr dataset in two large cities.
Analysis of Nature of Science Included in Recent Popular Writing Using Text Mining Techniques
ERIC Educational Resources Information Center
Jiang, Feng; McComas, William F.
2014-01-01
This study examined the inclusion of nature of science (NOS) in popular science writing to determine whether it could serve supplementary resource for teaching NOS and to evaluate the accuracy of text mining and classification as a viable research tool in science education research. Four groups of documents published from 2001 to 2010 were…
ERIC Educational Resources Information Center
Cheon, Jongpil; Lee, Sangno; Smith, Walter; Song, Jaeki; Kim, Yongjin
2013-01-01
The purpose of this study was to use text mining analysis of early adolescents' online essays to determine their knowledge of global lunar patterns. Australian and American students in grades five to seven wrote about global lunar patterns they had discovered by sharing observations with each other via the Internet. These essays were analyzed for…
ERIC Educational Resources Information Center
Çepni, Sevcan Bayraktar; Demirel, Elif Tokdemir
2016-01-01
This study aimed to find out the impact of "text mining and imitating" strategies on lexical richness, lexical diversity and general success of students in their compositions in second language writing. The participants were 98 students studying their first year in Karadeniz Technical University in English Language and Literature…
Science and Technology Text Mining: Text Mining of the Journal Cortex
2004-01-01
Amnesia Retrograde Amnesia GENERAL Semantic Memory Episodic Memory Working Memory TEST Serial Position Curve...in Cortex can be reasonably divided into four categories (papers in each category in parenthesis): Semantic Memory (151); Handedness (145); Amnesia ... Semantic Memory (151) is divided into Verbal/ Numerical (76) and Visual/ Spatial (75). Amnesia (119) is divided into Amnesia Symptoms (50) and
Experiences with Text Mining Large Collections of Unstructured Systems Development Artifacts at JPL
NASA Technical Reports Server (NTRS)
Port, Dan; Nikora, Allen; Hihn, Jairus; Huang, LiGuo
2011-01-01
Often repositories of systems engineering artifacts at NASA's Jet Propulsion Laboratory (JPL) are so large and poorly structured that they have outgrown our capability to effectively manually process their contents to extract useful information. Sophisticated text mining methods and tools seem a quick, low-effort approach to automating our limited manual efforts. Our experiences of exploring such methods mainly in three areas including historical risk analysis, defect identification based on requirements analysis, and over-time analysis of system anomalies at JPL, have shown that obtaining useful results requires substantial unanticipated efforts - from preprocessing the data to transforming the output for practical applications. We have not observed any quick 'wins' or realized benefit from short-term effort avoidance through automation in this area. Surprisingly we have realized a number of unexpected long-term benefits from the process of applying text mining to our repositories. This paper elaborates some of these benefits and our important lessons learned from the process of preparing and applying text mining to large unstructured system artifacts at JPL aiming to benefit future TM applications in similar problem domains and also in hope for being extended to broader areas of applications.
Biomedical hypothesis generation by text mining and gene prioritization.
Petric, Ingrid; Ligeti, Balazs; Gyorffy, Balazs; Pongor, Sandor
2014-01-01
Text mining methods can facilitate the generation of biomedical hypotheses by suggesting novel associations between diseases and genes. Previously, we developed a rare-term model called RaJoLink (Petric et al, J. Biomed. Inform. 42(2): 219-227, 2009) in which hypotheses are formulated on the basis of terms rarely associated with a target domain. Since many current medical hypotheses are formulated in terms of molecular entities and molecular mechanisms, here we extend the methodology to proteins and genes, using a standardized vocabulary as well as a gene/protein network model. The proposed enhanced RaJoLink rare-term model combines text mining and gene prioritization approaches. Its utility is illustrated by finding known as well as potential gene-disease associations in ovarian cancer using MEDLINE abstracts and the STRING database.
The Functional Genomics Network in the evolution of biological text mining over the past decade.
Blaschke, Christian; Valencia, Alfonso
2013-03-25
Different programs of The European Science Foundation (ESF) have contributed significantly to connect researchers in Europe and beyond through several initiatives. This support was particularly relevant for the development of the areas related with extracting information from papers (text-mining) because it supported the field in its early phases long before it was recognized by the community. We review the historical development of text mining research and how it was introduced in bioinformatics. Specific applications in (functional) genomics are described like it's integration in genome annotation pipelines and the support to the analysis of high-throughput genomics experimental data, and we highlight the activities of evaluation of methods and benchmarking for which the ESF programme support was instrumental. Copyright © 2013 Elsevier B.V. All rights reserved.
Agile Text Mining for the 2014 i2b2/UTHealth Cardiac Risk Factors Challenge
Cormack, James; Nath, Chinmoy; Milward, David; Raja, Kalpana; Jonnalagadda, Siddhartha R
2016-01-01
This paper describes the use of an agile text mining platform (Linguamatics’ Interactive Information Extraction Platform, I2E) to extract document-level cardiac risk factors in patient records as defined in the i2b2/UTHealth 2014 Challenge. The approach uses a data-driven rule-based methodology with the addition of a simple supervised classifier. We demonstrate that agile text mining allows for rapid optimization of extraction strategies, while post-processing can leverage annotation guidelines, corpus statistics and logic inferred from the gold standard data. We also show how data imbalance in a training set affects performance. Evaluation of this approach on the test data gave an F-Score of 91.7%, one percent behind the top performing system. PMID:26209007
Fu, Xiao; Batista-Navarro, Riza; Rak, Rafal; Ananiadou, Sophia
2015-01-01
Chronic obstructive pulmonary disease (COPD) is a life-threatening lung disorder whose recent prevalence has led to an increasing burden on public healthcare. Phenotypic information in electronic clinical records is essential in providing suitable personalised treatment to patients with COPD. However, as phenotypes are often "hidden" within free text in clinical records, clinicians could benefit from text mining systems that facilitate their prompt recognition. This paper reports on a semi-automatic methodology for producing a corpus that can ultimately support the development of text mining tools that, in turn, will expedite the process of identifying groups of COPD patients. A corpus of 30 full-text papers was formed based on selection criteria informed by the expertise of COPD specialists. We developed an annotation scheme that is aimed at producing fine-grained, expressive and computable COPD annotations without burdening our curators with a highly complicated task. This was implemented in the Argo platform by means of a semi-automatic annotation workflow that integrates several text mining tools, including a graphical user interface for marking up documents. When evaluated using gold standard (i.e., manually validated) annotations, the semi-automatic workflow was shown to obtain a micro-averaged F-score of 45.70% (with relaxed matching). Utilising the gold standard data to train new concept recognisers, we demonstrated that our corpus, although still a work in progress, can foster the development of significantly better performing COPD phenotype extractors. We describe in this work the means by which we aim to eventually support the process of COPD phenotype curation, i.e., by the application of various text mining tools integrated into an annotation workflow. Although the corpus being described is still under development, our results thus far are encouraging and show great potential in stimulating the development of further automatic COPD phenotype extractors.
78 FR 64397 - Mississippi Regulatory Program
Federal Register 2010, 2011, 2012, 2013, 2014
2013-10-29
... text of the program amendment available at www.regulations.gov . A. Mississippi Surface Coal Mining... DEPARTMENT OF THE INTERIOR Office of Surface Mining Reclamation and Enforcement 30 CFR Part 924...; S2D2SSS08011000SX066A00033F13XS501520] Mississippi Regulatory Program AGENCY: Office of Surface Mining Reclamation and Enforcement...
Redundancy and Novelty Mining in the Business Blogosphere
ERIC Educational Resources Information Center
Tsai, Flora S.; Chan, Kap Luk
2010-01-01
Purpose: The paper aims to explore the performance of redundancy and novelty mining in the business blogosphere, which has not been studied before. Design/methodology/approach: Novelty mining techniques are implemented to single out novel information out of a massive set of text documents. This paper adopted the mixed metric approach which…
A framework of knowledge creation processes in participatory simulation of hospital work systems.
Andersen, Simone Nyholm; Broberg, Ole
2017-04-01
Participatory simulation (PS) is a method to involve workers in simulating and designing their own future work system. Existing PS studies have focused on analysing the outcome, and minimal attention has been devoted to the process of creating this outcome. In order to study this process, we suggest applying a knowledge creation perspective. The aim of this study was to develop a framework describing the process of how ergonomics knowledge is created in PS. Video recordings from three projects applying PS of hospital work systems constituted the foundation of process mining analysis. The analysis resulted in a framework revealing the sources of ergonomics knowledge creation as sequential relationships between the activities of simulation participants sharing work experiences; experimenting with scenarios; and reflecting on ergonomics consequences. We argue that this framework reveals the hidden steps of PS that are essential when planning and facilitating PS that aims at designing work systems. Practitioner Summary: When facilitating participatory simulation (PS) in work system design, achieving an understanding of the PS process is essential. By applying a knowledge creation perspective and process mining, we investigated the knowledge-creating activities constituting the PS process. The analysis resulted in a framework of the knowledge-creating process in PS.
Practice-Relevant Pedagogy for Mining Software Engineering Curricula Assets
2007-06-20
permits the application of the Lean methods by virtually grouping shared services into eWorkcenters to which only non-routine requests are routed...engineering can be applied to IT shared services improvement and provide precise system improvement methods to complement the ITIL best practice. This...Vertical� or internal service- chain of primary business functions and enabling shared services Framework results - Mined patterns that relate
A Data Analytical Framework for Improving Real-Time, Decision Support Systems in Healthcare
ERIC Educational Resources Information Center
Yahav, Inbal
2010-01-01
In this dissertation we develop a framework that combines data mining, statistics and operations research methods for improving real-time decision support systems in healthcare. Our approach consists of three main concepts: data gathering and preprocessing, modeling, and deployment. We introduce the notion of offline and semi-offline modeling to…
A Visual Analytics Framework for Identifying Topic Drivers in Media Events.
Lu, Yafeng; Wang, Hong; Landis, Steven; Maciejewski, Ross
2017-09-14
Media data has been the subject of large scale analysis with applications of text mining being used to provide overviews of media themes and information flows. Such information extracted from media articles has also shown its contextual value of being integrated with other data, such as criminal records and stock market pricing. In this work, we explore linking textual media data with curated secondary textual data sources through user-guided semantic lexical matching for identifying relationships and data links. In this manner, critical information can be identified and used to annotate media timelines in order to provide a more detailed overview of events that may be driving media topics and frames. These linked events are further analyzed through an application of causality modeling to model temporal drivers between the data series. Such causal links are then annotated through automatic entity extraction which enables the analyst to explore persons, locations, and organizations that may be pertinent to the media topic of interest. To demonstrate the proposed framework, two media datasets and an armed conflict event dataset are explored.
Survey of Analysis of Crime Detection Techniques Using Data Mining and Machine Learning
NASA Astrophysics Data System (ADS)
Prabakaran, S.; Mitra, Shilpa
2018-04-01
Data mining is the field containing procedures for finding designs or patterns in a huge dataset, it includes strategies at the convergence of machine learning and database framework. It can be applied to various fields like future healthcare, market basket analysis, education, manufacturing engineering, crime investigation etc. Among these, crime investigation is an interesting application to process crime characteristics to help the society for a better living. This paper survey various data mining techniques used in this domain. This study may be helpful in designing new strategies for crime prediction and analysis.
SPECTRa-T: machine-based data extraction and semantic searching of chemistry e-theses.
Downing, Jim; Harvey, Matt J; Morgan, Peter B; Murray-Rust, Peter; Rzepa, Henry S; Stewart, Diana C; Tonge, Alan P; Townsend, Joe A
2010-02-22
The SPECTRa-T project has developed text-mining tools to extract named chemical entities (NCEs), such as chemical names and terms, and chemical objects (COs), e.g., experimental spectral assignments and physical chemistry properties, from electronic theses (e-theses). Although NCEs were readily identified within the two major document formats studied, only the use of structured documents enabled identification of chemical objects and their association with the relevant chemical entity (e.g., systematic chemical name). A corpus of theses was analyzed and it is shown that a high degree of semantic information can be extracted from structured documents. This integrated information has been deposited in a persistent Resource Description Framework (RDF) triple-store that allows users to conduct semantic searches. The strength and weaknesses of several document formats are reviewed.
Agarwal, Shashank; Liu, Feifan; Yu, Hong
2011-10-03
Protein-protein interaction (PPI) is an important biomedical phenomenon. Automatically detecting PPI-relevant articles and identifying methods that are used to study PPI are important text mining tasks. In this study, we have explored domain independent features to develop two open source machine learning frameworks. One performs binary classification to determine whether the given article is PPI relevant or not, named "Simple Classifier", and the other one maps the PPI relevant articles with corresponding interaction method nodes in a standardized PSI-MI (Proteomics Standards Initiative-Molecular Interactions) ontology, named "OntoNorm". We evaluated our system in the context of BioCreative challenge competition using the standardized data set. Our systems are amongst the top systems reported by the organizers, attaining 60.8% F1-score for identifying relevant documents, and 52.3% F1-score for mapping articles to interaction method ontology. Our results show that domain-independent machine learning frameworks can perform competitively well at the tasks of detecting PPI relevant articles and identifying the methods that were used to study the interaction in such articles. Simple Classifier is available at http://sourceforge.net/p/simpleclassify/home/ and OntoNorm at http://sourceforge.net/p/ontonorm/home/.
Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hero, Alfred O.; Rajaratnam, Bala
When can reliable inference be drawn in the ‘‘Big Data’’ context? This article presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large-scale inference. In large-scale data applications like genomics, connectomics, and eco-informatics, the data set is often variable rich but sample starved: a regime where the number n of acquired samples (statistical replicates) is far fewer than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much of recent work has focused on understanding the computational complexity of proposed methods for ‘‘Big Data.’’ Sample complexity, however, hasmore » received relatively less attention, especially in the setting when the sample size n is fixed, and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime where both variable dimension and sample size go to infinity at comparable rates; and 3) the purely high-dimensional asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche but only the latter regime applies to exa-scale data dimension. We illustrate this high-dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that are of interest. Correlation mining arises in numerous applications and subsumes the regression context as a special case. We demonstrate various regimes of correlation mining based on the unifying perspective of high-dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.« less
Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining
Hero, Alfred O.; Rajaratnam, Bala
2015-01-01
When can reliable inference be drawn in fue “Big Data” context? This paper presents a framework for answering this fundamental question in the context of correlation mining, wifu implications for general large scale inference. In large scale data applications like genomics, connectomics, and eco-informatics fue dataset is often variable-rich but sample-starved: a regime where the number n of acquired samples (statistical replicates) is far fewer than fue number p of observed variables (genes, neurons, voxels, or chemical constituents). Much of recent work has focused on understanding the computational complexity of proposed methods for “Big Data”. Sample complexity however has received relatively less attention, especially in the setting when the sample size n is fixed, and the dimension p grows without bound. To address fuis gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime where fue variable dimension is fixed and fue sample size goes to infinity; 2) the mixed asymptotic regime where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high dimensional asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche but only the latter regime applies to exa cale data dimension. We illustrate this high dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables fua t are of interest. Correlation mining arises in numerous applications and subsumes the regression context as a special case. we demonstrate various regimes of correlation mining based on the unifying perspective of high dimensional learning rates and sample complexity for different structured covariance models and different inference tasks. PMID:27087700
Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining
Hero, Alfred O.; Rajaratnam, Bala
2015-12-09
When can reliable inference be drawn in the ‘‘Big Data’’ context? This article presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large-scale inference. In large-scale data applications like genomics, connectomics, and eco-informatics, the data set is often variable rich but sample starved: a regime where the number n of acquired samples (statistical replicates) is far fewer than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much of recent work has focused on understanding the computational complexity of proposed methods for ‘‘Big Data.’’ Sample complexity, however, hasmore » received relatively less attention, especially in the setting when the sample size n is fixed, and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime where both variable dimension and sample size go to infinity at comparable rates; and 3) the purely high-dimensional asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche but only the latter regime applies to exa-scale data dimension. We illustrate this high-dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that are of interest. Correlation mining arises in numerous applications and subsumes the regression context as a special case. We demonstrate various regimes of correlation mining based on the unifying perspective of high-dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.« less
NASA Astrophysics Data System (ADS)
Boulicaut, Jean-Francois; Jeudy, Baptiste
Knowledge Discovery in Databases (KDD) is a complex interactive process. The promising theoretical framework of inductive databases considers this is essentially a querying process. It is enabled by a query language which can deal either with raw data or patterns which hold in the data. Mining patterns turns to be the so-called inductive query evaluation process for which constraint-based Data Mining techniques have to be designed. An inductive query specifies declaratively the desired constraints and algorithms are used to compute the patterns satisfying the constraints in the data. We survey important results of this active research domain. This chapter emphasizes a real breakthrough for hard problems concerning local pattern mining under various constraints and it points out the current directions of research as well.
Meng, Guilin; Meng, Xiulin; Ma, Xiaoye; Zhang, Gengping; Hu, Xiaolin; Jin, Aiping; Liu, Xueyuan
2018-01-01
Alzheimer’s disease (AD) is an increasing concern in human health. Despite significant research, highly effective drugs to treat AD are lacking. The present study describes the text mining process to identify drug candidates from a traditional Chinese medicine (TCM) database, along with associated protein target mechanisms. We carried out text mining to identify literatures that referenced both AD and TCM and focused on identifying compounds and protein targets of interest. After targeting one potential TCM candidate, corresponding protein-protein interaction (PPI) networks were assembled in STRING to decipher the most possible mechanism of action. This was followed by validation using Western blot and co-immunoprecipitation in an AD cell model. The text mining strategy using a vast amount of AD-related literature and the TCM database identified curcumin, whose major component was ferulic acid (FA). This was used as a key candidate compound for further study. Using the top calculated interaction score in STRING, BACE1 and MMP2 were implicated in the activity of FA in AD. Exposure of SHSY5Y-APP cells to FA resulted in the decrease in expression levels of BACE-1 and APP, while the expression of MMP-2 and MMP-9 increased in a dose-dependent manner. This suggests that FA induced BACE1 and MMP2 pathways maybe novel potential mechanisms involved in AD. The text mining of literature and TCM database related to AD suggested FA as a promising TCM ingredient for the treatment of AD. Potential mechanisms interconnected and integrated with Aβ aggregation inhibition and extracellular matrix remodeling underlying the activity of FA were identified using in vitro studies. PMID:29896095
Meng, Guilin; Meng, Xiulin; Ma, Xiaoye; Zhang, Gengping; Hu, Xiaolin; Jin, Aiping; Zhao, Yanxin; Liu, Xueyuan
2018-01-01
Alzheimer's disease (AD) is an increasing concern in human health. Despite significant research, highly effective drugs to treat AD are lacking. The present study describes the text mining process to identify drug candidates from a traditional Chinese medicine (TCM) database, along with associated protein target mechanisms. We carried out text mining to identify literatures that referenced both AD and TCM and focused on identifying compounds and protein targets of interest. After targeting one potential TCM candidate, corresponding protein-protein interaction (PPI) networks were assembled in STRING to decipher the most possible mechanism of action. This was followed by validation using Western blot and co-immunoprecipitation in an AD cell model. The text mining strategy using a vast amount of AD-related literature and the TCM database identified curcumin, whose major component was ferulic acid (FA). This was used as a key candidate compound for further study. Using the top calculated interaction score in STRING, BACE1 and MMP2 were implicated in the activity of FA in AD. Exposure of SHSY5Y-APP cells to FA resulted in the decrease in expression levels of BACE-1 and APP, while the expression of MMP-2 and MMP-9 increased in a dose-dependent manner. This suggests that FA induced BACE1 and MMP2 pathways maybe novel potential mechanisms involved in AD. The text mining of literature and TCM database related to AD suggested FA as a promising TCM ingredient for the treatment of AD. Potential mechanisms interconnected and integrated with Aβ aggregation inhibition and extracellular matrix remodeling underlying the activity of FA were identified using in vitro studies.
Johnson, Robin J.; Lay, Jean M.; Lennon-Hopkins, Kelley; Saraceni-Richards, Cynthia; Sciaky, Daniela; Murphy, Cynthia Grondin; Mattingly, Carolyn J.
2013-01-01
The Comparative Toxicogenomics Database (CTD; http://ctdbase.org/) is a public resource that curates interactions between environmental chemicals and gene products, and their relationships to diseases, as a means of understanding the effects of environmental chemicals on human health. CTD provides a triad of core information in the form of chemical-gene, chemical-disease, and gene-disease interactions that are manually curated from scientific articles. To increase the efficiency, productivity, and data coverage of manual curation, we have leveraged text mining to help rank and prioritize the triaged literature. Here, we describe our text-mining process that computes and assigns each article a document relevancy score (DRS), wherein a high DRS suggests that an article is more likely to be relevant for curation at CTD. We evaluated our process by first text mining a corpus of 14,904 articles triaged for seven heavy metals (cadmium, cobalt, copper, lead, manganese, mercury, and nickel). Based upon initial analysis, a representative subset corpus of 3,583 articles was then selected from the 14,094 articles and sent to five CTD biocurators for review. The resulting curation of these 3,583 articles was analyzed for a variety of parameters, including article relevancy, novel data content, interaction yield rate, mean average precision, and biological and toxicological interpretability. We show that for all measured parameters, the DRS is an effective indicator for scoring and improving the ranking of literature for the curation of chemical-gene-disease information at CTD. Here, we demonstrate how fully incorporating text mining-based DRS scoring into our curation pipeline enhances manual curation by prioritizing more relevant articles, thereby increasing data content, productivity, and efficiency. PMID:23613709
Data Mining: A Hybrid Methodology for Complex and Dynamic Research
ERIC Educational Resources Information Center
Lang, Susan; Baehr, Craig
2012-01-01
This article provides an overview of the ways in which data and text mining have potential as research methodologies in composition studies. It introduces data mining in the context of the field of composition studies and discusses ways in which this methodology can complement and extend our existing research practices by blending the best of what…
Text mining and medicine: usefulness in respiratory diseases.
Piedra, David; Ferrer, Antoni; Gea, Joaquim
2014-03-01
It is increasingly common to have medical information in electronic format. This includes scientific articles as well as clinical management reviews, and even records from health institutions with patient data. However, traditional instruments, both individual and institutional, are of little use for selecting the most appropriate information in each case, either in the clinical or research field. So-called text or data «mining» enables this huge amount of information to be managed, extracting it from various sources using processing systems (filtration and curation), integrating it and permitting the generation of new knowledge. This review aims to provide an overview of text and data mining, and of the potential usefulness of this bioinformatic technique in the exercise of care in respiratory medicine and in research in the same field. Copyright © 2013 SEPAR. Published by Elsevier Espana. All rights reserved.
Agile text mining for the 2014 i2b2/UTHealth Cardiac risk factors challenge.
Cormack, James; Nath, Chinmoy; Milward, David; Raja, Kalpana; Jonnalagadda, Siddhartha R
2015-12-01
This paper describes the use of an agile text mining platform (Linguamatics' Interactive Information Extraction Platform, I2E) to extract document-level cardiac risk factors in patient records as defined in the i2b2/UTHealth 2014 challenge. The approach uses a data-driven rule-based methodology with the addition of a simple supervised classifier. We demonstrate that agile text mining allows for rapid optimization of extraction strategies, while post-processing can leverage annotation guidelines, corpus statistics and logic inferred from the gold standard data. We also show how data imbalance in a training set affects performance. Evaluation of this approach on the test data gave an F-Score of 91.7%, one percent behind the top performing system. Copyright © 2015 Elsevier Inc. All rights reserved.
Overview of the gene ontology task at BioCreative IV.
Mao, Yuqing; Van Auken, Kimberly; Li, Donghui; Arighi, Cecilia N; McQuilton, Peter; Hayman, G Thomas; Tweedie, Susan; Schaeffer, Mary L; Laulederkind, Stanley J F; Wang, Shur-Jen; Gobeill, Julien; Ruch, Patrick; Luu, Anh Tuan; Kim, Jung-Jae; Chiang, Jung-Hsien; Chen, Yu-De; Yang, Chia-Jung; Liu, Hongfang; Zhu, Dongqing; Li, Yanpeng; Yu, Hong; Emadzadeh, Ehsan; Gonzalez, Graciela; Chen, Jian-Ming; Dai, Hong-Jie; Lu, Zhiyong
2014-01-01
Gene ontology (GO) annotation is a common task among model organism databases (MODs) for capturing gene function data from journal articles. It is a time-consuming and labor-intensive task, and is thus often considered as one of the bottlenecks in literature curation. There is a growing need for semiautomated or fully automated GO curation techniques that will help database curators to rapidly and accurately identify gene function information in full-length articles. Despite multiple attempts in the past, few studies have proven to be useful with regard to assisting real-world GO curation. The shortage of sentence-level training data and opportunities for interaction between text-mining developers and GO curators has limited the advances in algorithm development and corresponding use in practical circumstances. To this end, we organized a text-mining challenge task for literature-based GO annotation in BioCreative IV. More specifically, we developed two subtasks: (i) to automatically locate text passages that contain GO-relevant information (a text retrieval task) and (ii) to automatically identify relevant GO terms for the genes in a given article (a concept-recognition task). With the support from five MODs, we provided teams with >4000 unique text passages that served as the basis for each GO annotation in our task data. Such evidence text information has long been recognized as critical for text-mining algorithm development but was never made available because of the high cost of curation. In total, seven teams participated in the challenge task. From the team results, we conclude that the state of the art in automatically mining GO terms from literature has improved over the past decade while much progress is still needed for computer-assisted GO curation. Future work should focus on addressing remaining technical challenges for improved performance of automatic GO concept recognition and incorporating practical benefits of text-mining tools into real-world GO annotation. http://www.biocreative.org/tasks/biocreative-iv/track-4-GO/. Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US.
PPInterFinder--a mining tool for extracting causal relations on human proteins from literature.
Raja, Kalpana; Subramani, Suresh; Natarajan, Jeyakumar
2013-01-01
One of the most common and challenging problem in biomedical text mining is to mine protein-protein interactions (PPIs) from MEDLINE abstracts and full-text research articles because PPIs play a major role in understanding the various biological processes and the impact of proteins in diseases. We implemented, PPInterFinder--a web-based text mining tool to extract human PPIs from biomedical literature. PPInterFinder uses relation keyword co-occurrences with protein names to extract information on PPIs from MEDLINE abstracts and consists of three phases. First, it identifies the relation keyword using a parser with Tregex and a relation keyword dictionary. Next, it automatically identifies the candidate PPI pairs with a set of rules related to PPI recognition. Finally, it extracts the relations by matching the sentence with a set of 11 specific patterns based on the syntactic nature of PPI pair. We find that PPInterFinder is capable of predicting PPIs with the accuracy of 66.05% on AIMED corpus and outperforms most of the existing systems. DATABASE URL: http://www.biomining-bu.in/ppinterfinder/
PPInterFinder—a mining tool for extracting causal relations on human proteins from literature
Raja, Kalpana; Subramani, Suresh; Natarajan, Jeyakumar
2013-01-01
One of the most common and challenging problem in biomedical text mining is to mine protein–protein interactions (PPIs) from MEDLINE abstracts and full-text research articles because PPIs play a major role in understanding the various biological processes and the impact of proteins in diseases. We implemented, PPInterFinder—a web-based text mining tool to extract human PPIs from biomedical literature. PPInterFinder uses relation keyword co-occurrences with protein names to extract information on PPIs from MEDLINE abstracts and consists of three phases. First, it identifies the relation keyword using a parser with Tregex and a relation keyword dictionary. Next, it automatically identifies the candidate PPI pairs with a set of rules related to PPI recognition. Finally, it extracts the relations by matching the sentence with a set of 11 specific patterns based on the syntactic nature of PPI pair. We find that PPInterFinder is capable of predicting PPIs with the accuracy of 66.05% on AIMED corpus and outperforms most of the existing systems. Database URL: http://www.biomining-bu.in/ppinterfinder/ PMID:23325628
Topaz, Maxim; Radhakrishnan, Kavita; Lei, Victor; Zhou, Li
2016-01-01
Effective self-management can decrease up to 50% of heart failure hospitalizations. Unfortunately, self-management by patients with heart failure remains poor. This pilot study aimed to explore the use of text-mining to identify heart failure patients with ineffective self-management. We first built a comprehensive self-management vocabulary based on the literature and clinical notes review. We then randomly selected 545 heart failure patients treated within Partners Healthcare hospitals (Boston, MA, USA) and conducted a regular expression search with the compiled vocabulary within 43,107 interdisciplinary clinical notes of these patients. We found that 38.2% (n = 208) patients had documentation of ineffective heart failure self-management in the domains of poor diet adherence (28.4%), missed medical encounters (26.4%) poor medication adherence (20.2%) and non-specified self-management issues (e.g., "compliance issues", 34.6%). We showed the feasibility of using text-mining to identify patients with ineffective self-management. More natural language processing algorithms are needed to help busy clinicians identify these patients.
Integration of Artificial Market Simulation and Text Mining for Market Analysis
NASA Astrophysics Data System (ADS)
Izumi, Kiyoshi; Matsui, Hiroki; Matsuo, Yutaka
We constructed an evaluation system of the self-impact in a financial market using an artificial market and text-mining technology. Economic trends were first extracted from text data circulating in the real world. Then, the trends were inputted into the market simulation. Our simulation revealed that an operation by intervention could reduce over 70% of rate fluctuation in 1995. By the simulation results, the system was able to help for its user to find the exchange policy which can stabilize the yen-dollar rate.
BioC implementations in Go, Perl, Python and Ruby
Liu, Wanli; Islamaj Doğan, Rezarta; Kwon, Dongseop; Marques, Hernani; Rinaldi, Fabio; Wilbur, W. John; Comeau, Donald C.
2014-01-01
As part of a communitywide effort for evaluating text mining and information extraction systems applied to the biomedical domain, BioC is focused on the goal of interoperability, currently a major barrier to wide-scale adoption of text mining tools. BioC is a simple XML format, specified by DTD, for exchanging data for biomedical natural language processing. With initial implementations in C++ and Java, BioC provides libraries of code for reading and writing BioC text documents and annotations. We extend BioC to Perl, Python, Go and Ruby. We used SWIG to extend the C++ implementation for Perl and one Python implementation. A second Python implementation and the Ruby implementation use native data structures and libraries. BioC is also implemented in the Google language Go. BioC modules are functional in all of these languages, which can facilitate text mining tasks. BioC implementations are freely available through the BioC site: http://bioc.sourceforge.net. Database URL: http://bioc.sourceforge.net/ PMID:24961236
Historical archaeology at the Clarkson Mine, an eastern Ohio mining complex
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keener, C.S.
2003-07-01
This study examines the Clarkson Mine (33BL333), an eastern Ohio coal mine complex dating to the 1910s to 1920s, situated along Wheeling Creek. The results of preliminary surveys and the subsequent mitigation of four structures at the site are presented. The historical archaeology conducted at the site demonstrates the significant research possibilities inherent at many of these early industrial mine complexes. Of particular interest is the findings of depositional patterning around residential structures that revealed the influence of architecture on where and how items were deposited on the land surface. The ceramic and faunal assemblage were analyzed and provide significantmore » details on socioeconomic attributes associated with the workers or staff. Artifacts recovered at the site provide an excellent diagnostic framework from which other similarly aged sites can be compared and dated. The findings at the Clarkson Mine are also placed into a more regional perspective and compared with other contemporary studies.« less
PathText: a text mining integrator for biological pathway visualizations
Kemper, Brian; Matsuzaki, Takuya; Matsuoka, Yukiko; Tsuruoka, Yoshimasa; Kitano, Hiroaki; Ananiadou, Sophia; Tsujii, Jun'ichi
2010-01-01
Motivation: Metabolic and signaling pathways are an increasingly important part of organizing knowledge in systems biology. They serve to integrate collective interpretations of facts scattered throughout literature. Biologists construct a pathway by reading a large number of articles and interpreting them as a consistent network, but most of the models constructed currently lack direct links to those articles. Biologists who want to check the original articles have to spend substantial amounts of time to collect relevant articles and identify the sections relevant to the pathway. Furthermore, with the scientific literature expanding by several thousand papers per week, keeping a model relevant requires a continuous curation effort. In this article, we present a system designed to integrate a pathway visualizer, text mining systems and annotation tools into a seamless environment. This will enable biologists to freely move between parts of a pathway and relevant sections of articles, as well as identify relevant papers from large text bases. The system, PathText, is developed by Systems Biology Institute, Okinawa Institute of Science and Technology, National Centre for Text Mining (University of Manchester) and the University of Tokyo, and is being used by groups of biologists from these locations. Contact: brian@monrovian.com. PMID:20529930
@Note: a workbench for biomedical text mining.
Lourenço, Anália; Carreira, Rafael; Carneiro, Sónia; Maia, Paulo; Glez-Peña, Daniel; Fdez-Riverola, Florentino; Ferreira, Eugénio C; Rocha, Isabel; Rocha, Miguel
2009-08-01
Biomedical Text Mining (BioTM) is providing valuable approaches to the automated curation of scientific literature. However, most efforts have addressed the benchmarking of new algorithms rather than user operational needs. Bridging the gap between BioTM researchers and biologists' needs is crucial to solve real-world problems and promote further research. We present @Note, a platform for BioTM that aims at the effective translation of the advances between three distinct classes of users: biologists, text miners and software developers. Its main functional contributions are the ability to process abstracts and full-texts; an information retrieval module enabling PubMed search and journal crawling; a pre-processing module with PDF-to-text conversion, tokenisation and stopword removal; a semantic annotation schema; a lexicon-based annotator; a user-friendly annotation view that allows to correct annotations and a Text Mining Module supporting dataset preparation and algorithm evaluation. @Note improves the interoperability, modularity and flexibility when integrating in-home and open-source third-party components. Its component-based architecture allows the rapid development of new applications, emphasizing the principles of transparency and simplicity of use. Although it is still on-going, it has already allowed the development of applications that are currently being used.
Sahadevan, S; Hofmann-Apitius, M; Schellander, K; Tesfaye, D; Fluck, J; Friedrich, C M
2012-10-01
In biological research, establishing the prior art by searching and collecting information already present in the domain has equal importance as the experiments done. To obtain a complete overview about the relevant knowledge, researchers mainly rely on 2 major information sources: i) various biological databases and ii) scientific publications in the field. The major difference between the 2 information sources is that information from databases is available, typically well structured and condensed. The information content in scientific literature is vastly unstructured; that is, dispersed among the many different sections of scientific text. The traditional method of information extraction from scientific literature occurs by generating a list of relevant publications in the field of interest and manually scanning these texts for relevant information, which is very time consuming. It is more than likely that in using this "classical" approach the researcher misses some relevant information mentioned in the literature or has to go through biological databases to extract further information. Text mining and named entity recognition methods have already been used in human genomics and related fields as a solution to this problem. These methods can process and extract information from large volumes of scientific text. Text mining is defined as the automatic extraction of previously unknown and potentially useful information from text. Named entity recognition (NER) is defined as the method of identifying named entities (names of real world objects; for example, gene/protein names, drugs, enzymes) in text. In animal sciences, text mining and related methods have been briefly used in murine genomics and associated fields, leaving behind other fields of animal sciences, such as livestock genomics. The aim of this work was to develop an information retrieval platform in the livestock domain focusing on livestock publications and the recognition of relevant data from cattle and pigs. For this purpose, the rather noncomprehensive resources of pig and cattle gene and protein terminologies were enriched with orthologue synonyms, integrated in the NER platform, ProMiner, which is successfully used in human genomics domain. Based on the performance tests done, the present system achieved a fair performance with precision 0.64, recall 0.74, and F(1) measure of 0.69 in a test scenario based on cattle literature.
miRTex: A Text Mining System for miRNA-Gene Relation Extraction
Li, Gang; Ross, Karen E.; Arighi, Cecilia N.; Peng, Yifan; Wu, Cathy H.; Vijay-Shanker, K.
2015-01-01
MicroRNAs (miRNAs) regulate a wide range of cellular and developmental processes through gene expression suppression or mRNA degradation. Experimentally validated miRNA gene targets are often reported in the literature. In this paper, we describe miRTex, a text mining system that extracts miRNA-target relations, as well as miRNA-gene and gene-miRNA regulation relations. The system achieves good precision and recall when evaluated on a literature corpus of 150 abstracts with F-scores close to 0.90 on the three different types of relations. We conducted full-scale text mining using miRTex to process all the Medline abstracts and all the full-length articles in the PubMed Central Open Access Subset. The results for all the Medline abstracts are stored in a database for interactive query and file download via the website at http://proteininformationresource.org/mirtex. Using miRTex, we identified genes potentially regulated by miRNAs in Triple Negative Breast Cancer, as well as miRNA-gene relations that, in conjunction with kinase-substrate relations, regulate the response to abiotic stress in Arabidopsis thaliana. These two use cases demonstrate the usefulness of miRTex text mining in the analysis of miRNA-regulated biological processes. PMID:26407127
Cañada, Andres; Rabal, Obdulia; Oyarzabal, Julen; Valencia, Alfonso
2017-01-01
Abstract A considerable effort has been devoted to retrieve systematically information for genes and proteins as well as relationships between them. Despite the importance of chemical compounds and drugs as a central bio-entity in pharmacological and biological research, only a limited number of freely available chemical text-mining/search engine technologies are currently accessible. Here we present LimTox (Literature Mining for Toxicology), a web-based online biomedical search tool with special focus on adverse hepatobiliary reactions. It integrates a range of text mining, named entity recognition and information extraction components. LimTox relies on machine-learning, rule-based, pattern-based and term lookup strategies. This system processes scientific abstracts, a set of full text articles and medical agency assessment reports. Although the main focus of LimTox is on adverse liver events, it enables also basic searches for other organ level toxicity associations (nephrotoxicity, cardiotoxicity, thyrotoxicity and phospholipidosis). This tool supports specialized search queries for: chemical compounds/drugs, genes (with additional emphasis on key enzymes in drug metabolism, namely P450 cytochromes—CYPs) and biochemical liver markers. The LimTox website is free and open to all users and there is no login requirement. LimTox can be accessed at: http://limtox.bioinfo.cnio.es PMID:28531339
Text mining a self-report back-translation.
Blanch, Angel; Aluja, Anton
2016-06-01
There are several recommendations about the routine to undertake when back translating self-report instruments in cross-cultural research. However, text mining methods have been generally ignored within this field. This work describes a text mining innovative application useful to adapt a personality questionnaire to 12 different languages. The method is divided in 3 different stages, a descriptive analysis of the available back-translated instrument versions, a dissimilarity assessment between the source language instrument and the 12 back-translations, and an item assessment of item meaning equivalence. The suggested method contributes to improve the back-translation process of self-report instruments for cross-cultural research in 2 significant intertwined ways. First, it defines a systematic approach to the back translation issue, allowing for a more orderly and informed evaluation concerning the equivalence of different versions of the same instrument in different languages. Second, it provides more accurate instrument back-translations, which has direct implications for the reliability and validity of the instrument's test scores when used in different cultures/languages. In addition, this procedure can be extended to the back-translation of self-reports measuring psychological constructs in clinical assessment. Future research works could refine the suggested methodology and use additional available text mining tools. (PsycINFO Database Record (c) 2016 APA, all rights reserved).
Distributed data mining on grids: services, tools, and applications.
Cannataro, Mario; Congiusta, Antonio; Pugliese, Andrea; Talia, Domenico; Trunfio, Paolo
2004-12-01
Data mining algorithms are widely used today for the analysis of large corporate and scientific datasets stored in databases and data archives. Industry, science, and commerce fields often need to analyze very large datasets maintained over geographically distributed sites by using the computational power of distributed and parallel systems. The grid can play a significant role in providing an effective computational support for distributed knowledge discovery applications. For the development of data mining applications on grids we designed a system called Knowledge Grid. This paper describes the Knowledge Grid framework and presents the toolset provided by the Knowledge Grid for implementing distributed knowledge discovery. The paper discusses how to design and implement data mining applications by using the Knowledge Grid tools starting from searching grid resources, composing software and data components, and executing the resulting data mining process on a grid. Some performance results are also discussed.
Empirical Models of Zones Protecting Against Coal Dust Explosion
NASA Astrophysics Data System (ADS)
Prostański, Dariusz
2017-09-01
The paper presents predicted use of research' results to specify relations between volume of dust deposition and changes of its concentration in air. These were used to shape zones protecting against coal dust explosion. Methodology of research was presented, including methods of measurement of dust concentration as well as deposition. Measurements were taken in the Brzeszcze Mine within framework of MEZAP, co-financed by The National Centre for Research and Development (NCBR) and performed by the Institute of Mining Technology KOMAG, the Central Mining Institute (GIG) and the Coal Company PLC. The project enables performing of research related to measurements of volume of dust deposition as well as its concentration in air in protective zones in a number of mine workings in the Brzeszcze Mine. Developed model may be supportive tool in form of system located directly in protective zones or as operator tool warning about increasing hazard of coal dust explosion.
Mining of high utility-probability sequential patterns from uncertain databases
Zhang, Binbin; Fournier-Viger, Philippe; Li, Ting
2017-01-01
High-utility sequential pattern mining (HUSPM) has become an important issue in the field of data mining. Several HUSPM algorithms have been designed to mine high-utility sequential patterns (HUPSPs). They have been applied in several real-life situations such as for consumer behavior analysis and event detection in sensor networks. Nonetheless, most studies on HUSPM have focused on mining HUPSPs in precise data. But in real-life, uncertainty is an important factor as data is collected using various types of sensors that are more or less accurate. Hence, data collected in a real-life database can be annotated with existing probabilities. This paper presents a novel pattern mining framework called high utility-probability sequential pattern mining (HUPSPM) for mining high utility-probability sequential patterns (HUPSPs) in uncertain sequence databases. A baseline algorithm with three optional pruning strategies is presented to mine HUPSPs. Moroever, to speed up the mining process, a projection mechanism is designed to create a database projection for each processed sequence, which is smaller than the original database. Thus, the number of unpromising candidates can be greatly reduced, as well as the execution time for mining HUPSPs. Substantial experiments both on real-life and synthetic datasets show that the designed algorithm performs well in terms of runtime, number of candidates, memory usage, and scalability for different minimum utility and minimum probability thresholds. PMID:28742847
Willmer, D R; Haas, E J
2016-01-01
As national and international health and safety management system (HSMS) standards are voluntarily accepted or regulated into practice, organizations are making an effort to modify and integrate strategic elements of a connected management system into their daily risk management practices. In high-risk industries such as mining, that effort takes on added importance. The mining industry has long recognized the importance of a more integrated approach to recognizing and responding to site-specific risks, encouraging the adoption of a risk-based management framework. Recently, the U.S. National Mining Association led the development of an industry-specific HSMS built on the strategic frameworks of ANSI: Z10, OHSAS 18001, The American Chemistry Council's Responsible Care, and ILO-OSH 2001. All of these standards provide strategic guidance and focus on how to incorporate a plan-do-check-act cycle into the identification, management and evaluation of worksite risks. This paper details an exploratory study into whether practices associated with executing a risk-based management framework are visible through the actions of an organization's site-level management of health and safety risks. The results of this study show ways that site-level leaders manage day-to-day risk at their operations that can be characterized according to practices associated with a risk-based management framework. Having tangible operational examples of day-to-day risk management can serve as a starting point for evaluating field-level risk assessment efforts and their alignment to overall company efforts at effective risk mitigation through a HSMS or other processes.
Comprehensive evaluation of ecological security in mining area based on PSR-ANP-GRAY.
He, Gang; Yu, Baohua; Li, Shuzhou; Zhu, Yanna
2017-09-06
With the large exploitation of mineral resources, a series of problems have appeared in the ecological environment of the mining area. Therefore, evaluating the ecological security of mining area is of great significance to promote its healthy development. In this paper, the evaluation index system of ecological security in mining area was constructed from three dimensions of nature, society and economy, combined with Pressure-State-Response framework model. Then network analytic hierarchy process and GRAY relational analysis method were used to evaluate the ecological security of the region, and the weighted correlation degree of ecological security was calculated through the index data of a coal mine from 2012 to 2016 in China. The results show that the ecological security in the coal mine area is on the rise as a whole, though it alternatively rose and dropped from 2012 to 2016. Among them, the ecological security of the study mining area is at the general security level from 2012 to 2015, and at a relatively safe level in 2016. It shows that the ecological environment of the study mining area can basically meet the requirement of the survival and development of the enterprises.
The mining sector of Liberia: current practices and environmental challenges.
Wilson, Samuel T K; Wang, Hongtao; Kabenge, Martin; Qi, Xuejiao
2017-08-01
Liberia is endowed with an impressive stock of mineral reserves and has traditionally relied on mining, namely iron ore, gold, and diamonds, as a major source of income. The recent growth in the mining sector has the potential to contribute significantly to employment, income generation, and infrastructure development. However, the development of these mineral resources has significant environmental impacts that often go unnoticed. This paper presents an overview of the Liberian mining sector from historical, current development, and economic perspectives. The efforts made by government to address issues of environmental management and sustainable development expressed in national and international frameworks, as well as some of the environmental challenges in the mining sector are analyzed. A case study was conducted on one of the iron ore mines (China Union Bong Mines Investment) to analyze the effects of the water quality on the local water environment. The results show that the analyzed water sample concentrations were all above the WHO and Liberia water standard Class I guidelines for drinking water. Finally the paper examines the application of water footprint from a life cycle perspective in the Liberian mining sector and suggests some policy options for water resources management.
Flooded Underground Coal Mines: A Significant Source of Inexpensive Geothermal Energy
DOE Office of Scientific and Technical Information (OSTI.GOV)
Watzlaf, G.R.; Ackman, T.E.
2007-04-01
Many mining regions in the United States contain extensive areas of flooded underground mines. The water within these mines represents a significant and widespread opportunity for extracting low-grade, geothermal energy. Based on current energy prices, geothermal heat pump systems using mine water could reduce the annual costs for heating to over 70 percent compared to conventional heating methods (natural gas or heating oil). These same systems could reduce annual cooling costs by up to 50 percent over standard air conditioning in many areas of the country. (Formatted full-text version is released by permission of publisher)
Kano, Yoshinobu; Nguyen, Ngan; Saetre, Rune; Yoshida, Kazuhiro; Miyao, Yusuke; Tsuruoka, Yoshimasa; Matsubayashi, Yuichiro; Ananiadou, Sophia; Tsujii, Jun'ichi
2008-01-01
Recently, several text mining programs have reached a near-practical level of performance. Some systems are already being used by biologists and database curators. However, it has also been recognized that current Natural Language Processing (NLP) and Text Mining (TM) technology is not easy to deploy, since research groups tend to develop systems that cater specifically to their own requirements. One of the major reasons for the difficulty of deployment of NLP/TM technology is that re-usability and interoperability of software tools are typically not considered during development. While some effort has been invested in making interoperable NLP/TM toolkits, the developers of end-to-end systems still often struggle to reuse NLP/TM tools, and often opt to develop similar programs from scratch instead. This is particularly the case in BioNLP, since the requirements of biologists are so diverse that NLP tools have to be adapted and re-organized in a much more extensive manner than was originally expected. Although generic frameworks like UIMA (Unstructured Information Management Architecture) provide promising ways to solve this problem, the solution that they provide is only partial. In order for truly interoperable toolkits to become a reality, we also need sharable type systems and a developer-friendly environment for software integration that includes functionality for systematic comparisons of available tools, a simple I/O interface, and visualization tools. In this paper, we describe such an environment that was developed based on UIMA, and we show its feasibility through our experience in developing a protein-protein interaction (PPI) extraction system.
Raja, Kalpana; Patrick, Matthew; Gao, Yilin; Madu, Desmond; Yang, Yuyang
2017-01-01
In the past decade, the volume of “omics” data generated by the different high-throughput technologies has expanded exponentially. The managing, storing, and analyzing of this big data have been a great challenge for the researchers, especially when moving towards the goal of generating testable data-driven hypotheses, which has been the promise of the high-throughput experimental techniques. Different bioinformatics approaches have been developed to streamline the downstream analyzes by providing independent information to interpret and provide biological inference. Text mining (also known as literature mining) is one of the commonly used approaches for automated generation of biological knowledge from the huge number of published articles. In this review paper, we discuss the recent advancement in approaches that integrate results from omics data and information generated from text mining approaches to uncover novel biomedical information. PMID:28331849
Systematic drug repositioning through mining adverse event data in ClinicalTrials.gov.
Su, Eric Wen; Sanger, Todd M
2017-01-01
Drug repositioning (i.e., drug repurposing) is the process of discovering new uses for marketed drugs. Historically, such discoveries were serendipitous. However, the rapid growth in electronic clinical data and text mining tools makes it feasible to systematically identify drugs with the potential to be repurposed. Described here is a novel method of drug repositioning by mining ClinicalTrials.gov. The text mining tools I2E (Linguamatics) and PolyAnalyst (Megaputer) were utilized. An I2E query extracts "Serious Adverse Events" (SAE) data from randomized trials in ClinicalTrials.gov. Through a statistical algorithm, a PolyAnalyst workflow ranks the drugs where the treatment arm has fewer predefined SAEs than the control arm, indicating that potentially the drug is reducing the level of SAE. Hypotheses could then be generated for the new use of these drugs based on the predefined SAE that is indicative of disease (for example, cancer).
Matsuda, Yoshio; Manaka, Tomoko; Kobayashi, Makiko; Sato, Shuhei; Ohwada, Michitaka
2016-06-01
The aim of the present study was to examine the possibility of screening apprehensive pregnant women and mothers at risk for post-partum depression from an analysis of the textual data in the Mother and Child Handbook by using the text-mining method. Uncomplicated pregnant women (n = 58) were divided into two groups according to State-Trait Anxiety Inventory grade (high trait [group I, n = 21] and low trait [group II, n = 37]) or Edinburgh Postnatal Depression Scale score (high score [group III, n = 15] and low score [group IV, n = 43]). An exploratory analysis of the textual data from the Maternal and Child Handbook was conducted using the text-mining method with the Word Miner software program. A comparison of the 'structure elements' was made between the two groups. The number of structure elements extracted by separated words from text data was 20 004 and the number of structure elements with a threshold of 2 or more as an initial value was 1168. Fifteen key words related to maternal anxiety, and six key words related to post-partum depression were extracted. The text-mining method is useful for the exploratory analysis of textual data obtained from pregnant woman, and this screening method has been suggested to be useful for apprehensive pregnant women and mothers at risk for post-partum depression. © 2016 Japan Society of Obstetrics and Gynecology.
An Integrated Suite of Text and Data Mining Tools - Phase II
2005-08-30
Riverside, CA, USA Mazda Motor Corp, Jpn Univ of Darmstadt, Darmstadt, Ger Navy Center for Applied Research in Artificial Intelligence Univ of...with Georgia Tech Research Corporation developed a desktop text-mining software tool named TechOASIS (known commercially as VantagePoint). By the...of this dataset and groups the Corporate Source items that co-occur with the found items. He decides he is only interested in the institutions
He, Qiwei; Veldkamp, Bernard P; Glas, Cees A W; de Vries, Theo
2017-03-01
Patients' narratives about traumatic experiences and symptoms are useful in clinical screening and diagnostic procedures. In this study, we presented an automated assessment system to screen patients for posttraumatic stress disorder via a natural language processing and text-mining approach. Four machine-learning algorithms-including decision tree, naive Bayes, support vector machine, and an alternative classification approach called the product score model-were used in combination with n-gram representation models to identify patterns between verbal features in self-narratives and psychiatric diagnoses. With our sample, the product score model with unigrams attained the highest prediction accuracy when compared with practitioners' diagnoses. The addition of multigrams contributed most to balancing the metrics of sensitivity and specificity. This article also demonstrates that text mining is a promising approach for analyzing patients' self-expression behavior, thus helping clinicians identify potential patients from an early stage.
Using ontology network structure in text mining.
Berndt, Donald J; McCart, James A; Luther, Stephen L
2010-11-13
Statistical text mining treats documents as bags of words, with a focus on term frequencies within documents and across document collections. Unlike natural language processing (NLP) techniques that rely on an engineered vocabulary or a full-featured ontology, statistical approaches do not make use of domain-specific knowledge. The freedom from biases can be an advantage, but at the cost of ignoring potentially valuable knowledge. The approach proposed here investigates a hybrid strategy based on computing graph measures of term importance over an entire ontology and injecting the measures into the statistical text mining process. As a starting point, we adapt existing search engine algorithms such as PageRank and HITS to determine term importance within an ontology graph. The graph-theoretic approach is evaluated using a smoking data set from the i2b2 National Center for Biomedical Computing, cast as a simple binary classification task for categorizing smoking-related documents, demonstrating consistent improvements in accuracy.
Sustainable gold mining management waste policy in Romania
NASA Astrophysics Data System (ADS)
Tudor, Elena; Filipciuc, Constantina
2016-04-01
Sustainable mining practices and consistent implementation of the mining for the closure planning approach, within an improved legislative framework, create conditions for the development of creative, profitable, environmentally-sound and socially-responsible management and reuse of mine lands. According to the World Commission on Environment and Development definition, sustainable development is the type of development that meets the needs of the present without compromising the ability of future generations to meet their own needs. Romania has the largest gold reserves in Europe (760 million tons of gold-silver ores, of which 40 million tons in 68 gold deposits in the Apuseni Mountains. New mining projects draw particular attention regarding the environmental risks they cause. Rehabilitation is an ongoing consideration throughout the mine's lifecycle, both from a technical and a financial standpoint. The costs of land rehabilitation are classified as the mine's operating costs. According to Directive 2004/35/EC on environmental liability, the prevention and remedying of environmental damage should be implemented by applying the "polluter pays" principle, in line with the principle of sustainable development. Directive on the management of waste from extractive industries and amending Directive obliges operators to provide (and periodically adjust in size) a financial guarantee for waste facility maintenance and post-closure site restoration, including land rehabilitation. According to the Romanian Mining Law, the license holder has the following obligations related to land use and protection: to provide environmental agreements as one of the prerequisites for a building permit; to regularly update the mine closure plan; to set up and maintain the financial guarantee for environmental rehabilitation; and to execute and finalize the environmental rehabilitation of affected land in the mining site, according to the mine closure plan, including the post-closure monitoring program implementation and financing. Apart from the Mining Law, the Government Decision, which transposes EU Directive on the management of waste from extractive industries, as well as Government Emergency Ordinance, which implements the requirements of EU Directive 2004/35/CE on environmental liability, requests financial guarantees for waste facilities maintenance and for environment restoration in the case of pollution, respectively. In practice, there are problems in the calculation of the financial guarantee and the development of financial security instruments and markets as required by Directive, due to the lack of expertise in financial, economic and liability matters. Mining companies are usually not required to set up a special guarantee for the waste facilities, but only to set up and maintain the financial guarantee regulated under the Mining Law. Romania - because of the structure of its mining sector - has serious environmental legacies, a lack of funds for their restoration and the need to strengthen the administrative capacity in this area, as well as the important tasks on harmonization and/or implementation of the EU mining waste legislation. This work is presented within the framework of SUSMIN project. Key words : sustainable development, waste management, policy
DTMiner: identification of potential disease targets through biomedical literature mining
Xu, Dong; Zhang, Meizhuo; Xie, Yanping; Wang, Fan; Chen, Ming; Zhu, Kenny Q.; Wei, Jia
2016-01-01
Motivation: Biomedical researchers often search through massive catalogues of literature to look for potential relationships between genes and diseases. Given the rapid growth of biomedical literature, automatic relation extraction, a crucial technology in biomedical literature mining, has shown great potential to support research of gene-related diseases. Existing work in this field has produced datasets that are limited both in scale and accuracy. Results: In this study, we propose a reliable and efficient framework that takes large biomedical literature repositories as inputs, identifies credible relationships between diseases and genes, and presents possible genes related to a given disease and possible diseases related to a given gene. The framework incorporates name entity recognition (NER), which identifies occurrences of genes and diseases in texts, association detection whereby we extract and evaluate features from gene–disease pairs, and ranking algorithms that estimate how closely the pairs are related. The F1-score of the NER phase is 0.87, which is higher than existing studies. The association detection phase takes drastically less time than previous work while maintaining a comparable F1-score of 0.86. The end-to-end result achieves a 0.259 F1-score for the top 50 genes associated with a disease, which performs better than previous work. In addition, we released a web service for public use of the dataset. Availability and Implementation: The implementation of the proposed algorithms is publicly available at http://gdr-web.rwebox.com/public_html/index.php?page=download.php. The web service is available at http://gdr-web.rwebox.com/public_html/index.php. Contact: jenny.wei@astrazeneca.com or kzhu@cs.sjtu.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online. PMID:27506226
DTMiner: identification of potential disease targets through biomedical literature mining.
Xu, Dong; Zhang, Meizhuo; Xie, Yanping; Wang, Fan; Chen, Ming; Zhu, Kenny Q; Wei, Jia
2016-12-01
Biomedical researchers often search through massive catalogues of literature to look for potential relationships between genes and diseases. Given the rapid growth of biomedical literature, automatic relation extraction, a crucial technology in biomedical literature mining, has shown great potential to support research of gene-related diseases. Existing work in this field has produced datasets that are limited both in scale and accuracy. In this study, we propose a reliable and efficient framework that takes large biomedical literature repositories as inputs, identifies credible relationships between diseases and genes, and presents possible genes related to a given disease and possible diseases related to a given gene. The framework incorporates name entity recognition (NER), which identifies occurrences of genes and diseases in texts, association detection whereby we extract and evaluate features from gene-disease pairs, and ranking algorithms that estimate how closely the pairs are related. The F1-score of the NER phase is 0.87, which is higher than existing studies. The association detection phase takes drastically less time than previous work while maintaining a comparable F1-score of 0.86. The end-to-end result achieves a 0.259 F1-score for the top 50 genes associated with a disease, which performs better than previous work. In addition, we released a web service for public use of the dataset. The implementation of the proposed algorithms is publicly available at http://gdr-web.rwebox.com/public_html/index.php?page=download.php The web service is available at http://gdr-web.rwebox.com/public_html/index.php CONTACT: jenny.wei@astrazeneca.com or kzhu@cs.sjtu.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
Monitoring of the stability of underground workings in Polish copper mines conditions
NASA Astrophysics Data System (ADS)
Fuławka, Krzysztof; Mertuszka, Piotr; Pytel, Witold
2018-01-01
One of the problems associated with the excavation of deposit in underground mines is the local disturbance in a state of unstable equilibrium results in the sudden release of energy, mainly in the form of roof falls. The scale and intensity of this type of events depends on a number of factors. To minimize the risk of instability occurrence, continuous observations of the roof strata condition are recommended. Different roof strata observation methods used in the Polish copper mines have been analysed within the framework of presented paper. In addition, selected prospective methods, which could significantly increase efficiency of rock fall prevention are presented.
Semantic web for integrated network analysis in biomedicine.
Chen, Huajun; Ding, Li; Wu, Zhaohui; Yu, Tong; Dhanapalan, Lavanya; Chen, Jake Y
2009-03-01
The Semantic Web technology enables integration of heterogeneous data on the World Wide Web by making the semantics of data explicit through formal ontologies. In this article, we survey the feasibility and state of the art of utilizing the Semantic Web technology to represent, integrate and analyze the knowledge in various biomedical networks. We introduce a new conceptual framework, semantic graph mining, to enable researchers to integrate graph mining with ontology reasoning in network data analysis. Through four case studies, we demonstrate how semantic graph mining can be applied to the analysis of disease-causal genes, Gene Ontology category cross-talks, drug efficacy analysis and herb-drug interactions analysis.
Exploiting Early Intent Recognition for Competitive Advantage
2009-01-01
basketball [Bhan- dari et al., 1997; Jug et al., 2003], and Robocup soccer sim- ulations [Riley and Veloso, 2000; 2002; Kuhlmann et al., 2006] and non...actions (e.g. before, after, around). Jug et al. [2003] used a similar framework for offline basketball game analysis. More recently, Hess et al...and K. Ramanujam. Advanced Scout: Data mining and knowledge discovery in NBA data. Data Mining and Knowledge Discovery, 1(1):121–125, 1997. [Chang
Mining Multi-Aspect Reflection of News Events in Twitter: Discovery, Linking and Presentation
Wang, Jingjing; Tong, Wenzhu; Yu, Hongkun; Li, Min; Ma, Xiuli; Cai, Haoyan; Hanratty, Tim; Han, Jiawei
2015-01-01
A major event often has repercussions on both news media and microblogging sites such as Twitter. Reports from mainstream news agencies and discussions from Twitter complement each other to form a complete picture. An event can have multiple aspects (sub-events) describing it from multiple angles, each of which attracts opinions/comments posted on Twitter. Mining such reflections is interesting to both policy makers and ordinary people seeking information. In this paper, we propose a unified framework to mine multi-aspect reflections of news events in Twitter. We propose a novel and efficient dynamic hierarchical entity-aware event discovery model to learn news events and their multiple aspects. The aspects of an event are linked to their reflections in Twitter by a bootstrapped dataless classification scheme, which elegantly handles the challenges of selecting informative tweets under overwhelming noise and bridging the vocabularies of news and tweets. In addition, we demonstrate that our framework naturally generates an informative presentation of each event with entity graphs, time spans, news summaries and tweet highlights to facilitate user digestion. PMID:27034625
Mining Multi-Aspect Reflection of News Events in Twitter: Discovery, Linking and Presentation.
Wang, Jingjing; Tong, Wenzhu; Yu, Hongkun; Li, Min; Ma, Xiuli; Cai, Haoyan; Hanratty, Tim; Han, Jiawei
2015-11-01
A major event often has repercussions on both news media and microblogging sites such as Twitter. Reports from mainstream news agencies and discussions from Twitter complement each other to form a complete picture. An event can have multiple aspects (sub-events) describing it from multiple angles, each of which attracts opinions/comments posted on Twitter. Mining such reflections is interesting to both policy makers and ordinary people seeking information. In this paper, we propose a unified framework to mine multi-aspect reflections of news events in Twitter. We propose a novel and efficient dynamic hierarchical entity-aware event discovery model to learn news events and their multiple aspects. The aspects of an event are linked to their reflections in Twitter by a bootstrapped dataless classification scheme, which elegantly handles the challenges of selecting informative tweets under overwhelming noise and bridging the vocabularies of news and tweets. In addition, we demonstrate that our framework naturally generates an informative presentation of each event with entity graphs, time spans, news summaries and tweet highlights to facilitate user digestion.
The structure and infrastructure of the global nanotechnology literature
NASA Astrophysics Data System (ADS)
Kostoff, Ronald N.; Stump, Jesse A.; Johnson, Dustin; Murday, James S.; Lau, Clifford G. Y.; Tolles, William M.
2006-08-01
Text mining is the extraction of useful information from large volumes of text. A text mining analysis of the global open nanotechnology literature was performed. Records from the Science Citation Index (SCI)/Social SCI were analyzed to provide the infrastructure of the global nanotechnology literature (prolific authors/journals/institutions/countries, most cited authors/papers/journals) and the thematic structure (taxonomy) of the global nanotechnology literature, from a science perspective. Records from the Engineering Compendex (EC) were analyzed to provide a taxonomy from a technology perspective. The Far Eastern countries have expanded nanotechnology publication output dramatically in the past decade.
PubMed-EX: a web browser extension to enhance PubMed search with text mining features.
Tsai, Richard Tzong-Han; Dai, Hong-Jie; Lai, Po-Ting; Huang, Chi-Hsin
2009-11-15
PubMed-EX is a browser extension that marks up PubMed search results with additional text-mining information. PubMed-EX's page mark-up, which includes section categorization and gene/disease and relation mark-up, can help researchers to quickly focus on key terms and provide additional information on them. All text processing is performed server-side, freeing up user resources. PubMed-EX is freely available at http://bws.iis.sinica.edu.tw/PubMed-EX and http://iisr.cse.yzu.edu.tw:8000/PubMed-EX/.
Learning in the context of distribution drift
2017-05-09
published in the leading data mining journal, Data Mining and Knowledge Discovery (Webb et. al., 2016)1. We have shown that the previous qualitative...learner Low-bias learner Aggregated classifier Figure 7: Architecture for learning fr m streaming data in th co text of variable or unknown...Learning limited dependence Bayesian classifiers, in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD
Enhancements for a Dynamic Data Warehousing and Mining System for Large-scale HSCB Data
2016-07-20
Intelligent Automation Incorporated Enhancements for a Dynamic Data Warehousing and Mining ...Page | 2 Intelligent Automation Incorporated Monthly Report No. 4 Enhancements for a Dynamic Data Warehousing and Mining System Large-Scale HSCB...including Top Videos, Top Users, Top Words, and Top Languages, and also applied NER to the text associated with YouTube posts. We have also developed UI for
Enhancements for a Dynamic Data Warehousing and Mining System for Large-Scale HSCB Data
2016-07-20
Intelligent Automation Incorporated Enhancements for a Dynamic Data Warehousing and Mining ...Page | 2 Intelligent Automation Incorporated Monthly Report No. 4 Enhancements for a Dynamic Data Warehousing and Mining System Large-Scale HSCB...including Top Videos, Top Users, Top Words, and Top Languages, and also applied NER to the text associated with YouTube posts. We have also developed UI for
How Agricultural Media Cover Safety Compared with Periodicals in Two Other Hazardous Industries.
Marolf, Amanda; Heiberger, Scott; Evans, James; Joseph, Lura
2017-01-26
This analysis featured a uniquely broad look at challenges and potentials for engaging agricultural and other industrial media more effectively in covering safety. It involved a content analysis of selected industry periodicals serving agriculture, mining, and transportation, which are three of the nation's most hazardous industries, in terms of human safety. Use of the social amplification of risk framework (SARF) provided insight on safety coverage. In particular, it tested previous research indicating that media coverage tends to amplify (increase) more than attenuate (decrease) a sense of risk. Analysis involved 18 periodicals (9 agriculture, 7 transportation, and 2 mining) spanning a five-year period from 2008 to 2012. Full-text digital analysis identified terms found in safety articles across all three industries. A manual review of articles revealed the quantity and nature of safety coverage within and among these industries. Results identified 528 safety-related articles published during the period. Transportation and mining periodicals averaged more than twice as many safety articles as the agricultural periodicals. The amount of coverage within the three industries also varied greatly. Findings on the nature of coverage supported previous media research within the SARF. Coverage across all three industries was clearly oriented more to amplifying than to attenuating risk. This study adds to the understanding of variations, commonalities, challenges, and potentials for enhancing safety coverage by media serving these three industries. It also provides direction for engaging industry media more effectively in the public safety mission. The authors recommend seven areas of opportunity for further research. Copyright© by the American Society of Agricultural Engineers.
Computable visually observed phenotype ontological framework for plants
2011-01-01
Background The ability to search for and precisely compare similar phenotypic appearances within and across species has vast potential in plant science and genetic research. The difficulty in doing so lies in the fact that many visual phenotypic data, especially visually observed phenotypes that often times cannot be directly measured quantitatively, are in the form of text annotations, and these descriptions are plagued by semantic ambiguity, heterogeneity, and low granularity. Though several bio-ontologies have been developed to standardize phenotypic (and genotypic) information and permit comparisons across species, these semantic issues persist and prevent precise analysis and retrieval of information. A framework suitable for the modeling and analysis of precise computable representations of such phenotypic appearances is needed. Results We have developed a new framework called the Computable Visually Observed Phenotype Ontological Framework for plants. This work provides a novel quantitative view of descriptions of plant phenotypes that leverages existing bio-ontologies and utilizes a computational approach to capture and represent domain knowledge in a machine-interpretable form. This is accomplished by means of a robust and accurate semantic mapping module that automatically maps high-level semantics to low-level measurements computed from phenotype imagery. The framework was applied to two different plant species with semantic rules mined and an ontology constructed. Rule quality was evaluated and showed high quality rules for most semantics. This framework also facilitates automatic annotation of phenotype images and can be adopted by different plant communities to aid in their research. Conclusions The Computable Visually Observed Phenotype Ontological Framework for plants has been developed for more efficient and accurate management of visually observed phenotypes, which play a significant role in plant genomics research. The uniqueness of this framework is its ability to bridge the knowledge of informaticians and plant science researchers by translating descriptions of visually observed phenotypes into standardized, machine-understandable representations, thus enabling the development of advanced information retrieval and phenotype annotation analysis tools for the plant science community. PMID:21702966
Empirical advances with text mining of electronic health records.
Delespierre, T; Denormandie, P; Bar-Hen, A; Josseran, L
2017-08-22
Korian is a private group specializing in medical accommodations for elderly and dependent people. A professional data warehouse (DWH) established in 2010 hosts all of the residents' data. Inside this information system (IS), clinical narratives (CNs) were used only by medical staff as a residents' care linking tool. The objective of this study was to show that, through qualitative and quantitative textual analysis of a relatively small physiotherapy and well-defined CN sample, it was possible to build a physiotherapy corpus and, through this process, generate a new body of knowledge by adding relevant information to describe the residents' care and lives. Meaningful words were extracted through Standard Query Language (SQL) with the LIKE function and wildcards to perform pattern matching, followed by text mining and a word cloud using R® packages. Another step involved principal components and multiple correspondence analyses, plus clustering on the same residents' sample as well as on other health data using a health model measuring the residents' care level needs. By combining these techniques, physiotherapy treatments could be characterized by a list of constructed keywords, and the residents' health characteristics were built. Feeding defects or health outlier groups could be detected, physiotherapy residents' data and their health data were matched, and differences in health situations showed qualitative and quantitative differences in physiotherapy narratives. This textual experiment using a textual process in two stages showed that text mining and data mining techniques provide convenient tools to improve residents' health and quality of care by adding new, simple, useable data to the electronic health record (EHR). When used with a normalized physiotherapy problem list, text mining through information extraction (IE), named entity recognition (NER) and data mining (DM) can provide a real advantage to describe health care, adding new medical material and helping to integrate the EHR system into the health staff work environment.
Database citation in full text biomedical articles.
Kafkas, Şenay; Kim, Jee-Hyub; McEntyre, Johanna R
2013-01-01
Molecular biology and literature databases represent essential infrastructure for life science research. Effective integration of these data resources requires that there are structured cross-references at the level of individual articles and biological records. Here, we describe the current patterns of how database entries are cited in research articles, based on analysis of the full text Open Access articles available from Europe PMC. Focusing on citation of entries in the European Nucleotide Archive (ENA), UniProt and Protein Data Bank, Europe (PDBe), we demonstrate that text mining doubles the number of structured annotations of database record citations supplied in journal articles by publishers. Many thousands of new literature-database relationships are found by text mining, since these relationships are also not present in the set of articles cited by database records. We recommend that structured annotation of database records in articles is extended to other databases, such as ArrayExpress and Pfam, entries from which are also cited widely in the literature. The very high precision and high-throughput of this text-mining pipeline makes this activity possible both accurately and at low cost, which will allow the development of new integrated data services.
Database Citation in Full Text Biomedical Articles
Kafkas, Şenay; Kim, Jee-Hyub; McEntyre, Johanna R.
2013-01-01
Molecular biology and literature databases represent essential infrastructure for life science research. Effective integration of these data resources requires that there are structured cross-references at the level of individual articles and biological records. Here, we describe the current patterns of how database entries are cited in research articles, based on analysis of the full text Open Access articles available from Europe PMC. Focusing on citation of entries in the European Nucleotide Archive (ENA), UniProt and Protein Data Bank, Europe (PDBe), we demonstrate that text mining doubles the number of structured annotations of database record citations supplied in journal articles by publishers. Many thousands of new literature-database relationships are found by text mining, since these relationships are also not present in the set of articles cited by database records. We recommend that structured annotation of database records in articles is extended to other databases, such as ArrayExpress and Pfam, entries from which are also cited widely in the literature. The very high precision and high-throughput of this text-mining pipeline makes this activity possible both accurately and at low cost, which will allow the development of new integrated data services. PMID:23734176
Subramani, Suresh; Kalpana, Raja; Monickaraj, Pankaj Moses; Natarajan, Jeyakumar
2015-04-01
The knowledge on protein-protein interactions (PPI) and their related pathways are equally important to understand the biological functions of the living cell. Such information on human proteins is highly desirable to understand the mechanism of several diseases such as cancer, diabetes, and Alzheimer's disease. Because much of that information is buried in biomedical literature, an automated text mining system for visualizing human PPI and pathways is highly desirable. In this paper, we present HPIminer, a text mining system for visualizing human protein interactions and pathways from biomedical literature. HPIminer extracts human PPI information and PPI pairs from biomedical literature, and visualize their associated interactions, networks and pathways using two curated databases HPRD and KEGG. To our knowledge, HPIminer is the first system to build interaction networks from literature as well as curated databases. Further, the new interactions mined only from literature and not reported earlier in databases are highlighted as new. A comparative study with other similar tools shows that the resultant network is more informative and provides additional information on interacting proteins and their associated networks. Copyright © 2015 Elsevier Inc. All rights reserved.
A Framework for Engaging Navajo Women in Clean Energy Development through Applied Theatre
ERIC Educational Resources Information Center
Osnes, Beth; Manygoats, Adrian; Weitkamp, Lindsay
2015-01-01
Through applied theatre, Navajo women can participate in authoring a new story for how energy is mined, produced, developed, disseminated and used in the Navajo Nation. This article is an analysis of a creative process that was utilised with primarily Navajo women to create a Navajo Women's Energy Project (NWEP). The framework for this creative…
2009-06-01
capabilities: web-based, relational/multi-dimensional, client/server, and metadata (data about data) inclusion (pp. 39-40). Text mining, on the other...and Organizational Systems ( CASOS ) (Carley, 2005). Although AutoMap can be used to conduct text-mining, it was utilized only for its visualization...provides insight into how the GMCOI is using the terms, and where there might be redundant terms and need for de -confliction and standardization
Terminologies for text-mining; an experiment in the lipoprotein metabolism domain
Alexopoulou, Dimitra; Wächter, Thomas; Pickersgill, Laura; Eyre, Cecilia; Schroeder, Michael
2008-01-01
Background The engineering of ontologies, especially with a view to a text-mining use, is still a new research field. There does not yet exist a well-defined theory and technology for ontology construction. Many of the ontology design steps remain manual and are based on personal experience and intuition. However, there exist a few efforts on automatic construction of ontologies in the form of extracted lists of terms and relations between them. Results We share experience acquired during the manual development of a lipoprotein metabolism ontology (LMO) to be used for text-mining. We compare the manually created ontology terms with the automatically derived terminology from four different automatic term recognition (ATR) methods. The top 50 predicted terms contain up to 89% relevant terms. For the top 1000 terms the best method still generates 51% relevant terms. In a corpus of 3066 documents 53% of LMO terms are contained and 38% can be generated with one of the methods. Conclusions Given high precision, automatic methods can help decrease development time and provide significant support for the identification of domain-specific vocabulary. The coverage of the domain vocabulary depends strongly on the underlying documents. Ontology development for text mining should be performed in a semi-automatic way; taking ATR results as input and following the guidelines we described. Availability The TFIDF term recognition is available as Web Service, described at PMID:18460175
Abbe, Adeline; Falissard, Bruno
2017-10-23
Internet is a particularly dynamic way to quickly capture the perceptions of a population in real time. Complementary to traditional face-to-face communication, online social networks help patients to improve self-esteem and self-help. The aim of this study was to use text mining on material from an online forum exploring patients' concerns about treatment (antidepressants and anxiolytics). Concerns about treatment were collected from discussion titles in patients' online community related to antidepressants and anxiolytics. To examine the content of these titles automatically, we used text mining methods, such as word frequency in a document-term matrix and co-occurrence of words using a network analysis. It was thus possible to identify topics discussed on the forum. The forum included 2415 discussions on antidepressants and anxiolytics over a period of 3 years. After a preprocessing step, the text mining algorithm identified the 99 most frequently occurring words in titles, among which were escitalopram, withdrawal, antidepressant, venlafaxine, paroxetine, and effect. Patients' concerns were related to antidepressant withdrawal, the need to share experience about symptoms, effects, and questions on weight gain with some drugs. Patients' expression on the Internet is a potential additional resource in addressing patients' concerns about treatment. Patient profiles are close to that of patients treated in psychiatry. ©Adeline Abbe, Bruno Falissard. Originally published in JMIR Mental Health (http://mental.jmir.org), 23.10.2017.
Jonnagaddala, Jitendra; Liaw, Siaw-Teng; Ray, Pradeep; Kumar, Manish; Chang, Nai-Wen; Dai, Hong-Jie
2015-12-01
Coronary artery disease (CAD) often leads to myocardial infarction, which may be fatal. Risk factors can be used to predict CAD, which may subsequently lead to prevention or early intervention. Patient data such as co-morbidities, medication history, social history and family history are required to determine the risk factors for a disease. However, risk factor data are usually embedded in unstructured clinical narratives if the data is not collected specifically for risk assessment purposes. Clinical text mining can be used to extract data related to risk factors from unstructured clinical notes. This study presents methods to extract Framingham risk factors from unstructured electronic health records using clinical text mining and to calculate 10-year coronary artery disease risk scores in a cohort of diabetic patients. We developed a rule-based system to extract risk factors: age, gender, total cholesterol, HDL-C, blood pressure, diabetes history and smoking history. The results showed that the output from the text mining system was reliable, but there was a significant amount of missing data to calculate the Framingham risk score. A systematic approach for understanding missing data was followed by implementation of imputation strategies. An analysis of the 10-year Framingham risk scores for coronary artery disease in this cohort has shown that the majority of the diabetic patients are at moderate risk of CAD. Copyright © 2015 Elsevier Inc. All rights reserved.
Cañada, Andres; Capella-Gutierrez, Salvador; Rabal, Obdulia; Oyarzabal, Julen; Valencia, Alfonso; Krallinger, Martin
2017-07-03
A considerable effort has been devoted to retrieve systematically information for genes and proteins as well as relationships between them. Despite the importance of chemical compounds and drugs as a central bio-entity in pharmacological and biological research, only a limited number of freely available chemical text-mining/search engine technologies are currently accessible. Here we present LimTox (Literature Mining for Toxicology), a web-based online biomedical search tool with special focus on adverse hepatobiliary reactions. It integrates a range of text mining, named entity recognition and information extraction components. LimTox relies on machine-learning, rule-based, pattern-based and term lookup strategies. This system processes scientific abstracts, a set of full text articles and medical agency assessment reports. Although the main focus of LimTox is on adverse liver events, it enables also basic searches for other organ level toxicity associations (nephrotoxicity, cardiotoxicity, thyrotoxicity and phospholipidosis). This tool supports specialized search queries for: chemical compounds/drugs, genes (with additional emphasis on key enzymes in drug metabolism, namely P450 cytochromes-CYPs) and biochemical liver markers. The LimTox website is free and open to all users and there is no login requirement. LimTox can be accessed at: http://limtox.bioinfo.cnio.es. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Text Mining for Drugs and Chemical Compounds: Methods, Tools and Applications.
Vazquez, Miguel; Krallinger, Martin; Leitner, Florian; Valencia, Alfonso
2011-06-01
Providing prior knowledge about biological properties of chemicals, such as kinetic values, protein targets, or toxic effects, can facilitate many aspects of drug development. Chemical information is rapidly accumulating in all sorts of free text documents like patents, industry reports, or scientific articles, which has motivated the development of specifically tailored text mining applications. Despite the potential gains, chemical text mining still faces significant challenges. One of the most salient is the recognition of chemical entities mentioned in text. To help practitioners contribute to this area, a good portion of this review is devoted to this issue, and presents the basic concepts and principles underlying the main strategies. The technical details are introduced and accompanied by relevant bibliographic references. Other tasks discussed are retrieving relevant articles, identifying relationships between chemicals and other entities, or determining the chemical structures of chemicals mentioned in text. This review also introduces a number of published applications that can be used to build pipelines in topics like drug side effects, toxicity, and protein-disease-compound network analysis. We conclude the review with an outlook on how we expect the field to evolve, discussing its possibilities and its current limitations. Copyright © 2011 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Mining free-text medical records for companion animal enteric syndrome surveillance.
Anholt, R M; Berezowski, J; Jamal, I; Ribble, C; Stephen, C
2014-03-01
Large amounts of animal health care data are present in veterinary electronic medical records (EMR) and they present an opportunity for companion animal disease surveillance. Veterinary patient records are largely in free-text without clinical coding or fixed vocabulary. Text-mining, a computer and information technology application, is needed to identify cases of interest and to add structure to the otherwise unstructured data. In this study EMR's were extracted from veterinary management programs of 12 participating veterinary practices and stored in a data warehouse. Using commercially available text-mining software (WordStat™), we developed a categorization dictionary that could be used to automatically classify and extract enteric syndrome cases from the warehoused electronic medical records. The diagnostic accuracy of the text-miner for retrieving cases of enteric syndrome was measured against human reviewers who independently categorized a random sample of 2500 cases as enteric syndrome positive or negative. Compared to the reviewers, the text-miner retrieved cases with enteric signs with a sensitivity of 87.6% (95%CI, 80.4-92.9%) and a specificity of 99.3% (95%CI, 98.9-99.6%). Automatic and accurate detection of enteric syndrome cases provides an opportunity for community surveillance of enteric pathogens in companion animals. Copyright © 2014 Elsevier B.V. All rights reserved.
Mapping genomic features to functional traits through microbial whole genome sequences.
Zhang, Wei; Zeng, Erliang; Liu, Dan; Jones, Stuart E; Emrich, Scott
2014-01-01
Recently, the utility of trait-based approaches for microbial communities has been identified. Increasing availability of whole genome sequences provide the opportunity to explore the genetic foundations of a variety of functional traits. We proposed a machine learning framework to quantitatively link the genomic features with functional traits. Genes from bacteria genomes belonging to different functional traits were grouped to Cluster of Orthologs (COGs), and were used as features. Then, TF-IDF technique from the text mining domain was applied to transform the data to accommodate the abundance and importance of each COG. After TF-IDF processing, COGs were ranked using feature selection methods to identify their relevance to the functional trait of interest. Extensive experimental results demonstrated that functional trait related genes can be detected using our method. Further, the method has the potential to provide novel biological insights.
Kharat, Amit T; Singh, Amarjit; Kulkarni, Vilas M; Shah, Digish
2014-01-01
Data mining facilitates the study of radiology data in various dimensions. It converts large patient image and text datasets into useful information that helps in improving patient care and provides informative reports. Data mining technology analyzes data within the Radiology Information System and Hospital Information System using specialized software which assesses relationships and agreement in available information. By using similar data analysis tools, radiologists can make informed decisions and predict the future outcome of a particular imaging finding. Data, information and knowledge are the components of data mining. Classes, Clusters, Associations, Sequential patterns, Classification, Prediction and Decision tree are the various types of data mining. Data mining has the potential to make delivery of health care affordable and ensure that the best imaging practices are followed. It is a tool for academic research. Data mining is considered to be ethically neutral, however concerns regarding privacy and legality exists which need to be addressed to ensure success of data mining. PMID:25024513
Analyzing asset management data using data and text mining.
DOT National Transportation Integrated Search
2014-07-01
Predictive models using text from a sample competitively bid California highway projects have been used to predict a construction : projects likely level of cost overrun. A text description of the project and the text of the five largest project line...
High-throughput literature mining to support read-across ...
Building scientific confidence in the development and evaluation of read-across remains an ongoing challenge. Approaches include establishing systematic frameworks to identify sources of uncertainty and ways to address them. One source of uncertainty is related to characterizing biological similarity. Many research efforts are underway such as structuring mechanistic data in adverse outcome pathways and investigating the utility of high throughput (HT)/high content (HC) screening data. A largely untapped resource for read-across to date is the biomedical literature. This information has the potential to support read-across by facilitating the identification of valid source analogues with similar biological and toxicological profiles as well as providing the mechanistic understanding for any prediction made. A key challenge in using biomedical literature is to convert and translate its unstructured form into a computable format that can be linked to chemical structure. We developed a novel text-mining strategy to represent literature information for read across. Keywords were used to organize literature into toxicity signatures at the chemical level. These signatures were integrated with HT in vitro data and curated chemical structures. A rule-based algorithm assessed the strength of the literature relationship, providing a mechanism to rank and visualize the signature as literature ToxPIs (LitToxPIs). LitToxPIs were developed for over 6,000 chemicals for a varie
BioC implementations in Go, Perl, Python and Ruby.
Liu, Wanli; Islamaj Doğan, Rezarta; Kwon, Dongseop; Marques, Hernani; Rinaldi, Fabio; Wilbur, W John; Comeau, Donald C
2014-01-01
As part of a communitywide effort for evaluating text mining and information extraction systems applied to the biomedical domain, BioC is focused on the goal of interoperability, currently a major barrier to wide-scale adoption of text mining tools. BioC is a simple XML format, specified by DTD, for exchanging data for biomedical natural language processing. With initial implementations in C++ and Java, BioC provides libraries of code for reading and writing BioC text documents and annotations. We extend BioC to Perl, Python, Go and Ruby. We used SWIG to extend the C++ implementation for Perl and one Python implementation. A second Python implementation and the Ruby implementation use native data structures and libraries. BioC is also implemented in the Google language Go. BioC modules are functional in all of these languages, which can facilitate text mining tasks. BioC implementations are freely available through the BioC site: http://bioc.sourceforge.net. Database URL: http://bioc.sourceforge.net/ Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US.
Urbanski, William M; Condie, Brian G
2009-12-01
Textpresso Site Specific Recombinases (http://ssrc.genetics.uga.edu/) is a text-mining web server for searching a database of more than 9,000 full-text publications. The papers and abstracts in this database represent a wide range of topics related to site-specific recombinase (SSR) research tools. Included in the database are most of the papers that report the characterization or use of mouse strains that express Cre recombinase as well as papers that describe or analyze mouse lines that carry conditional (floxed) alleles or SSR-activated transgenes/knockins. The database also includes reports describing SSR-based cloning methods such as the Gateway or the Creator systems, papers reporting the development or use of SSR-based tools in systems such as Drosophila, bacteria, parasites, stem cells, yeast, plants, zebrafish, and Xenopus as well as publications that describe the biochemistry, genetics, or molecular structure of the SSRs themselves. Textpresso Site Specific Recombinases is the only comprehensive text-mining resource available for the literature describing the biology and technical applications of SSRs. (c) 2009 Wiley-Liss, Inc.
Text mining for metabolic pathways, signaling cascades, and protein networks.
Hoffmann, Robert; Krallinger, Martin; Andres, Eduardo; Tamames, Javier; Blaschke, Christian; Valencia, Alfonso
2005-05-10
The complexity of the information stored in databases and publications on metabolic and signaling pathways, the high throughput of experimental data, and the growing number of publications make it imperative to provide systems to help the researcher navigate through these interrelated information resources. Text-mining methods have started to play a key role in the creation and maintenance of links between the information stored in biological databases and its original sources in the literature. These links will be extremely useful for database updating and curation, especially if a number of technical problems can be solved satisfactorily, including the identification of protein and gene names (entities in general) and the characterization of their types of interactions. The first generation of openly accessible text-mining systems, such as iHOP (Information Hyperlinked over Proteins), provides additional functions to facilitate the reconstruction of protein interaction networks, combine database and text information, and support the scientist in the formulation of novel hypotheses. The next challenge is the generation of comprehensive information regarding the general function of signaling pathways and protein interaction networks.
TOY SAFETY SURVEILLANCE FROM ONLINE REVIEWS
Winkler, Matt; Abrahams, Alan S.; Gruss, Richard; Ehsani, Johnathan P.
2016-01-01
Toy-related injuries account for a significant number of childhood injuries and the prevention of these injuries remains a goal for regulatory agencies and manufacturers. Text-mining is an increasingly prevalent method for uncovering the significance of words using big data. This research sets out to determine the effectiveness of text-mining in uncovering potentially dangerous children’s toys. We develop a danger word list, also known as a ‘smoke word’ list, from injury and recall text narratives. We then use the smoke word lists to score over one million Amazon reviews, with the top scores denoting potential safety concerns. We compare the smoke word list to conventional sentiment analysis techniques, in terms of both word overlap and effectiveness. We find that smoke word lists are highly distinct from conventional sentiment dictionaries and provide a statistically significant method for identifying safety concerns in children’s toy reviews. Our findings indicate that text-mining is, in fact, an effective method for the surveillance of safety concerns in children’s toys and could be a gateway to effective prevention of toy-product-related injuries. PMID:27942092
Public exposure to hazards associated with natural radioactivity in open-pit mining in Ghana.
Darko, E O; Faanu, A; Awudu, A R; Emi-Reynolds, G; Yeboah, J; Oppon, O C; Akaho, E H K
2010-01-01
The results of studies carried out on public exposure contribution from naturally occurring radioactive materials (NORMS) in two open-pit mines in the Western and Ashanti regions of Ghana are reported. The studies were carried out under International Atomic Energy Agency-supported Technical Co-operation Project GHA/9/005. Measurements were made on samples of water, soil, ore, mine tailings and air using gamma spectrometry. Solid-state nuclear track detectors were used for radon concentration measurements. Survey was also carried out to determine the ambient gamma dose rate in the vicinity of the mines and surrounding areas. The effective doses due to external gamma irradiation, ingestion of water and inhalation of radon and ore dusts were calculated for the two mines. The average annual effective dose was found to be 0.30 +/- 0.06 mSv. The result was found to be within the levels published by other countries. The study provides a useful information and data for establishing a comprehensive framework to investigate other mines and develop guidelines for monitoring and control of NORMS in the mining industry and the environment as a whole in Ghana.
Ensuring the Environmental and Industrial Safety in Solid Mineral Deposit Surface Mining
NASA Astrophysics Data System (ADS)
Trubetskoy, Kliment; Rylnikova, Marina; Esina, Ekaterina
2017-11-01
The growing environmental pressure of mineral deposit surface mining and severization of industrial safety requirements dictate the necessity of refining the regulatory framework governing safe and efficient development of underground resources. The applicable regulatory documentation governing the procedure of ore open-pit wall and bench stability design for the stage of pit reaching its final boundary was issued several decades ago. Over recent decades, mining and geomechanical conditions have changed significantly in surface mining operations, numerous new software packages and computer developments have appeared, opportunities of experimental methods of source data collection and processing, grounding of the permissible parameters of open pit walls have changed dramatically, and, thus, methods of risk assessment have been perfected [10-13]. IPKON RAS, with the support of the Federal Service for Environmental Supervision, assumed the role of the initiator of the project for the development of Federal norms and regulations of industrial safety "Rules for ensuring the stability of walls and benches of open pits, open-cast mines and spoil banks", which contribute to the improvement of economic efficiency and safety of mineral deposit surface mining and enhancement of the competitiveness of Russian mines at the international level that is very important in the current situation.
The use of data mining by private health insurance companies and customers' privacy.
Al-Saggaf, Yeslam
2015-07-01
This article examines privacy threats arising from the use of data mining by private Australian health insurance companies. Qualitative interviews were conducted with key experts, and Australian governmental and nongovernmental websites relevant to private health insurance were searched. Using Rationale, a critical thinking tool, the themes and considerations elicited through this empirical approach were developed into an argument about the use of data mining by private health insurance companies. The argument is followed by an ethical analysis guided by classical philosophical theories-utilitarianism, Mill's harm principle, Kant's deontological theory, and Helen Nissenbaum's contextual integrity framework. Both the argument and the ethical analysis find the use of data mining by private health insurance companies in Australia to be unethical. Although private health insurance companies in Australia cannot use data mining for risk rating to cherry-pick customers and cannot use customers' personal information for unintended purposes, this article nonetheless concludes that the secondary use of customers' personal information and the absence of customers' consent still suggest that the use of data mining by private health insurance companies is wrong.
ASCOT: a text mining-based web-service for efficient search and assisted creation of clinical trials
2012-01-01
Clinical trials are mandatory protocols describing medical research on humans and among the most valuable sources of medical practice evidence. Searching for trials relevant to some query is laborious due to the immense number of existing protocols. Apart from search, writing new trials includes composing detailed eligibility criteria, which might be time-consuming, especially for new researchers. In this paper we present ASCOT, an efficient search application customised for clinical trials. ASCOT uses text mining and data mining methods to enrich clinical trials with metadata, that in turn serve as effective tools to narrow down search. In addition, ASCOT integrates a component for recommending eligibility criteria based on a set of selected protocols. PMID:22595088
Korkontzelos, Ioannis; Mu, Tingting; Ananiadou, Sophia
2012-04-30
Clinical trials are mandatory protocols describing medical research on humans and among the most valuable sources of medical practice evidence. Searching for trials relevant to some query is laborious due to the immense number of existing protocols. Apart from search, writing new trials includes composing detailed eligibility criteria, which might be time-consuming, especially for new researchers. In this paper we present ASCOT, an efficient search application customised for clinical trials. ASCOT uses text mining and data mining methods to enrich clinical trials with metadata, that in turn serve as effective tools to narrow down search. In addition, ASCOT integrates a component for recommending eligibility criteria based on a set of selected protocols.
Air Pollution Monitoring and Mining Based on Sensor Grid in London
Ma, Yajie; Richards, Mark; Ghanem, Moustafa; Guo, Yike; Hassard, John
2008-01-01
In this paper, we present a distributed infrastructure based on wireless sensors network and Grid computing technology for air pollution monitoring and mining, which aims to develop low-cost and ubiquitous sensor networks to collect real-time, large scale and comprehensive environmental data from road traffic emissions for air pollution monitoring in urban environment. The main informatics challenges in respect to constructing the high-throughput sensor Grid are discussed in this paper. We present a two-layer network framework, a P2P e-Science Grid architecture, and the distributed data mining algorithm as the solutions to address the challenges. We simulated the system in TinyOS to examine the operation of each sensor as well as the networking performance. We also present the distributed data mining result to examine the effectiveness of the algorithm. PMID:27879895
Towards Cooperative Predictive Data Mining in Competitive Environments
NASA Astrophysics Data System (ADS)
Lisý, Viliam; Jakob, Michal; Benda, Petr; Urban, Štěpán; Pěchouček, Michal
We study the problem of predictive data mining in a competitive multi-agent setting, in which each agent is assumed to have some partial knowledge required for correctly classifying a set of unlabelled examples. The agents are self-interested and therefore need to reason about the trade-offs between increasing their classification accuracy by collaborating with other agents and disclosing their private classification knowledge to other agents through such collaboration. We analyze the problem and propose a set of components which can enable cooperation in this otherwise competitive task. These components include measures for quantifying private knowledge disclosure, data-mining models suitable for multi-agent predictive data mining, and a set of strategies by which agents can improve their classification accuracy through collaboration. The overall framework and its individual components are validated on a synthetic experimental domain.
Air Pollution Monitoring and Mining Based on Sensor Grid in London.
Ma, Yajie; Richards, Mark; Ghanem, Moustafa; Guo, Yike; Hassard, John
2008-06-01
In this paper, we present a distributed infrastructure based on wireless sensors network and Grid computing technology for air pollution monitoring and mining, which aims to develop low-cost and ubiquitous sensor networks to collect real-time, large scale and comprehensive environmental data from road traffic emissions for air pollution monitoring in urban environment. The main informatics challenges in respect to constructing the high-throughput sensor Grid are discussed in this paper. We present a twolayer network framework, a P2P e-Science Grid architecture, and the distributed data mining algorithm as the solutions to address the challenges. We simulated the system in TinyOS to examine the operation of each sensor as well as the networking performance. We also present the distributed data mining result to examine the effectiveness of the algorithm.
Pak, Malk Eun; Kim, Yu Ri; Kim, Ha Neui; Ahn, Sung Min; Shin, Hwa Kyoung; Baek, Jin Ung; Choi, Byung Tae
2016-02-17
In literature on Korean medicine, Dongeuibogam (Treasured Mirror of Eastern Medicine), published in 1613, represents the overall results of the traditional medicines of North-East Asia based on prior medicinal literature of this region. We utilized this medicinal literature by text mining to establish a list of candidate herbs for cognitive enhancement in the elderly and then performed an evaluation of their effects. Text mining was performed for selection of candidate herbs. Cell viability was determined in HT22 hippocampal cells and immunohistochemistry and behavioral analysis was performed in a kainic acid (KA) mice model in order to observe alterations of hippocampal cells and cognition. Twenty four herbs for cognitive enhancement in the elderly were selected by text mining of Dongeuibogam. In HT22 cells, pretreatment with 3 candidate herbs resulted in significantly reduced glutamate-induced cell death. Panax ginseng was the most neuroprotective herb against glutamate-induced cell death. In the hippocampus of a KA mice model, pretreatment with 11 candidate herbs resulted in suppression of caspase-3 expression. Treatment with 7 candidate herbs resulted in significantly enhanced expression levels of phosphorylated cAMP response element binding protein. Number of proliferated cells indicated by BrdU labeling was increased by treatment with 10 candidate herbs. Schisandra chinensis was the most effective herb against cell death and proliferation of progenitor cells and Rehmannia glutinosa in neuroprotection in the hippocampus of a KA mice model. In a KA mice model, we confirmed improved spatial and short memory by treatment with the 3 most effective candidate herbs and these recovered functions were involved in a higher number of newly formed neurons from progenitor cells in the hippocampus. These established herbs and their combinations identified by text-mining technique and evaluation for effectiveness may have value in further experimental and clinical applications for cognitive enhancement in the elderly. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Mining the Text: 34 Text Features that Can Ease or Obstruct Text Comprehension and Use
ERIC Educational Resources Information Center
White, Sheida
2012-01-01
This article presents 34 characteristics of texts and tasks ("text features") that can make continuous (prose), noncontinuous (document), and quantitative texts easier or more difficult for adolescents and adults to comprehend and use. The text features were identified by examining the assessment tasks and associated texts in the national…
Federal Register 2010, 2011, 2012, 2013, 2014
2010-02-24
... Transmittal of Applications: March 26, 2010. Full Text of Announcement I. Funding Opportunity Description... related to industrial health and safety: Mining and mineral engineering, industrial engineering... technology/technician, hazardous materials information systems technology/technician, mining technology...
Tichomirowa, Marion; Heidel, Claudia
2012-01-01
The isotope composition of dissolved sulphate and strontium in atmospheric deposition, groundwater, mine water and river water in the region of Freiberg was investigated to better understand the fate of these components in the regional and global water cycle. Most of the isotope variations of dissolved sulphates in atmospheric deposition from three locations sampled bi- or tri-monthly can be explained by fractionation processes leading to lower [Formula: see text] (of about 2-3‰) and higher [Formula: see text] (of about 8-10‰) values in summer compared with the winter period. These samples showed a negative correlation between [Formula: see text] and [Formula: see text] values and a weak positive correlation between [Formula: see text] and [Formula: see text] values. They reflect the sulphate formed by aqueous oxidation from long-range transport in clouds. However, these isotope variations were superimposed by changes of the dominating atmospheric sulphate source. At two of the sampling points, large variations of mean annual [Formula: see text] values from atmospheric bulk deposition were recorded. From 2008 to 2009, the mean annual [Formula: see text] value increased by about 5‰; and decreased by about 4‰ from 2009 to 2010. A change in the dominating sulphate source or oxidation pathways of SO(2) in the atmosphere is proposed to cause these shifts. No changes were found in corresponding [Formula: see text] values. Groundwater, river water and some mine waters (where groundwater was the dominating sulphate source) also showed temporal shifts in their [Formula: see text] values corresponding to those of bulk atmospheric deposition, albeit to a lower degree. The mean transit time of atmospheric sulphur through the soil into the groundwater and river water was less than a year and therefore much shorter than previously suggested. Mining activities of about 800 years in the Freiberg region may have led to large subsurface areas with an enhanced groundwater flow along fractures and mined-refilled ore lodes which may shorten transit times of sulphate from precipitation through groundwater into river water.
Neural networks for data mining electronic text collections
NASA Astrophysics Data System (ADS)
Walker, Nicholas; Truman, Gregory
1997-04-01
The use of neural networks in information retrieval and text analysis has primarily suffered from the issues of adequate document representation, the ability to scale to very large collections, dynamism in the face of new information and the practical difficulties of basing the design on the use of supervised training sets. Perhaps the most important approach to begin solving these problems is the use of `intermediate entities' which reduce the dimensionality of document representations and the size of documents collections to manageable levels coupled with the use of unsupervised neural network paradigms. This paper describes the issues, a fully configured neural network-based text analysis system--dataHARVEST--aimed at data mining text collections which begins this process, along with the remaining difficulties and potential ways forward.
Beyond Academia - Interrogating Research Impact in the Research Excellence Framework.
Terama, Emma; Smallman, Melanie; Lock, Simon J; Johnson, Charlotte; Austwick, Martin Zaltz
2016-01-01
Big changes to the way in which research funding is allocated to UK universities were brought about in the Research Excellence Framework (REF), overseen by the Higher Education Funding Council, England. Replacing the earlier Research Assessment Exercise, the purpose of the REF was to assess the quality and reach of research in UK universities-and allocate funding accordingly. For the first time, this included an assessment of research 'impact', accounting for 20% of the funding allocation. In this article we use a text mining technique to investigate the interpretations of impact put forward via impact case studies in the REF process. We find that institutions have developed a diverse interpretation of impact, ranging from commercial applications to public and cultural engagement activities. These interpretations of impact vary from discipline to discipline and between institutions, with more broad-based institutions depicting a greater variety of impacts. Comparing the interpretations with the score given by REF, we found no evidence of one particular interpretation being more highly rewarded than another. Importantly, we also found a positive correlation between impact score and [overall research] quality score, suggesting that impact is not being achieved at the expense of research excellence.
Beyond Academia – Interrogating Research Impact in the Research Excellence Framework
Smallman, Melanie; Lock, Simon J.; Johnson, Charlotte; Austwick, Martin Zaltz
2016-01-01
Big changes to the way in which research funding is allocated to UK universities were brought about in the Research Excellence Framework (REF), overseen by the Higher Education Funding Council, England. Replacing the earlier Research Assessment Exercise, the purpose of the REF was to assess the quality and reach of research in UK universities–and allocate funding accordingly. For the first time, this included an assessment of research ‘impact’, accounting for 20% of the funding allocation. In this article we use a text mining technique to investigate the interpretations of impact put forward via impact case studies in the REF process. We find that institutions have developed a diverse interpretation of impact, ranging from commercial applications to public and cultural engagement activities. These interpretations of impact vary from discipline to discipline and between institutions, with more broad-based institutions depicting a greater variety of impacts. Comparing the interpretations with the score given by REF, we found no evidence of one particular interpretation being more highly rewarded than another. Importantly, we also found a positive correlation between impact score and [overall research] quality score, suggesting that impact is not being achieved at the expense of research excellence. PMID:27997599
Data Visualization in Information Retrieval and Data Mining (SIG VIS).
ERIC Educational Resources Information Center
Efthimiadis, Efthimis
2000-01-01
Presents abstracts that discuss using data visualization for information retrieval and data mining, including immersive information space and spatial metaphors; spatial data using multi-dimensional matrices with maps; TREC (Text Retrieval Conference) experiments; users' information needs in cartographic information retrieval; and users' relevance…
Macromolecule mass spectrometry: citation mining of user documents.
Kostoff, Ronald N; Bedford, Clifford D; del Río, J Antonio; Cortes, Héctor D; Karypis, George
2004-03-01
Identifying research users, applications, and impact is important for research performers, managers, evaluators, and sponsors. Identification of the user audience and the research impact is complex and time consuming due to the many indirect pathways through which fundamental research can impact applications. This paper identified the literature pathways through which two highly-cited papers of 2002 Chemistry Nobel Laureates Fenn and Tanaka impacted research, technology development, and applications. Citation Mining, an integration of citation bibliometrics and text mining, was applied to the >1600 first generation Science Citation Index (SCI) citing papers to Fenn's 1989 Science paper on Electrospray Ionization for Mass Spectrometry, and to the >400 first generation SCI citing papers to Tanaka's 1988 Rapid Communications in Mass Spectrometry paper on Laser Ionization Time-of-Flight Mass Spectrometry. Bibliometrics was performed on the citing papers to profile the user characteristics. Text mining was performed on the citing papers to identify the technical areas impacted by the research, and the relationships among these technical areas.
2016-09-26
Intelligent Automation Incorporated Enhancements for a Dynamic Data Warehousing and Mining ...Enhancements for a Dynamic Data Warehousing and Mining System for N00014-16-P-3014 Large-Scale Human Social Cultural Behavioral (HSBC) Data 5b. GRANT NUMBER...Representative Media Gallery View. We perform Scraawl’s NER algorithm to the text associated with YouTube post, which classifies the named entities into
Detecting Malicious Tweets in Twitter Using Runtime Monitoring With Hidden Information
2016-06-01
text mining using Twitter streaming API and python [Online]. Available: http://adilmoujahid.com/posts/2014/07/twitter-analytics/ [22] M. Singh, B...sites with 645,750,000 registered users [3] and has open source public tweets for data mining . 2. Malicious Users and Tweets In the modern world...want to data mine in Twitter, and presents the natural language assertions and corresponding rule patterns. It then describes the steps performed using
Numerical linear algebra in data mining
NASA Astrophysics Data System (ADS)
Eldén, Lars
Ideas and algorithms from numerical linear algebra are important in several areas of data mining. We give an overview of linear algebra methods in text mining (information retrieval), pattern recognition (classification of handwritten digits), and PageRank computations for web search engines. The emphasis is on rank reduction as a method of extracting information from a data matrix, low-rank approximation of matrices using the singular value decomposition and clustering, and on eigenvalue methods for network analysis.
Method to Select Technical Terms for Glossaries in Support of Joint Task Force Operations
2012-01-01
have been prohibitively time-consuming. Instead, we identified two publicly available terminology extractor tools: TerMine (NaCTEM, 2011) and Alchemy ...and that from the latter, by high recall. The Alchemy approach contrasts with that used in TerMine in that Alchemy will process the text with...information categories, such as person, location, and organization, in addition to returning topic keywords. Output from both TerMine and Alchemy
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hirdt, J.A.; Brown, D.A., E-mail: dbrown@bnl.gov
The EXFOR library contains the largest collection of experimental nuclear reaction data available as well as the data's bibliographic information and experimental details. We text-mined the REACTION and MONITOR fields of the ENTRYs in the EXFOR library in order to identify understudied reactions and quantities. Using the results of the text-mining, we created an undirected graph from the EXFOR datasets with each graph node representing a single reaction and quantity and graph links representing the various types of connections between these reactions and quantities. This graph is an abstract representation of the connections in EXFOR, similar to graphs of socialmore » networks, authorship networks, etc. We use various graph theoretical tools to identify important yet understudied reactions and quantities in EXFOR. Although we identified a few cross sections relevant for shielding applications and isotope production, mostly we identified charged particle fluence monitor cross sections. As a side effect of this work, we learn that our abstract graph is typical of other real-world graphs.« less
Systematic analysis of molecular mechanisms for HCC metastasis via text mining approach.
Zhen, Cheng; Zhu, Caizhong; Chen, Haoyang; Xiong, Yiru; Tan, Junyuan; Chen, Dong; Li, Jin
2017-02-21
To systematically explore the molecular mechanism for hepatocellular carcinoma (HCC) metastasis and identify regulatory genes with text mining methods. Genes with highest frequencies and significant pathways related to HCC metastasis were listed. A handful of proteins such as EGFR, MDM2, TP53 and APP, were identified as hub nodes in PPI (protein-protein interaction) network. Compared with unique genes for HBV-HCCs, genes particular to HCV-HCCs were less, but may participate in more extensive signaling processes. VEGFA, PI3KCA, MAPK1, MMP9 and other genes may play important roles in multiple phenotypes of metastasis. Genes in abstracts of HCC-metastasis literatures were identified. Word frequency analysis, KEGG pathway and PPI network analysis were performed. Then co-occurrence analysis between genes and metastasis-related phenotypes were carried out. Text mining is effective for revealing potential regulators or pathways, but the purpose of it should be specific, and the combination of various methods will be more useful.
Knowledge acquisition, semantic text mining, and security risks in health and biomedical informatics
Huang, Jingshan; Dou, Dejing; Dang, Jiangbo; Pardue, J Harold; Qin, Xiao; Huan, Jun; Gerthoffer, William T; Tan, Ming
2012-01-01
Computational techniques have been adopted in medical and biological systems for a long time. There is no doubt that the development and application of computational methods will render great help in better understanding biomedical and biological functions. Large amounts of datasets have been produced by biomedical and biological experiments and simulations. In order for researchers to gain knowledge from original data, nontrivial transformation is necessary, which is regarded as a critical link in the chain of knowledge acquisition, sharing, and reuse. Challenges that have been encountered include: how to efficiently and effectively represent human knowledge in formal computing models, how to take advantage of semantic text mining techniques rather than traditional syntactic text mining, and how to handle security issues during the knowledge sharing and reuse. This paper summarizes the state-of-the-art in these research directions. We aim to provide readers with an introduction of major computing themes to be applied to the medical and biological research. PMID:22371823
Chemical named entities recognition: a review on approaches and applications.
Eltyeb, Safaa; Salim, Naomie
2014-01-01
The rapid increase in the flow rate of published digital information in all disciplines has resulted in a pressing need for techniques that can simplify the use of this information. The chemistry literature is very rich with information about chemical entities. Extracting molecules and their related properties and activities from the scientific literature to "text mine" these extracted data and determine contextual relationships helps research scientists, particularly those in drug development. One of the most important challenges in chemical text mining is the recognition of chemical entities mentioned in the texts. In this review, the authors briefly introduce the fundamental concepts of chemical literature mining, the textual contents of chemical documents, and the methods of naming chemicals in documents. We sketch out dictionary-based, rule-based and machine learning, as well as hybrid chemical named entity recognition approaches with their applied solutions. We end with an outlook on the pros and cons of these approaches and the types of chemical entities extracted.
NASA Astrophysics Data System (ADS)
Hirdt, J. A.; Brown, D. A.
2016-01-01
The EXFOR library contains the largest collection of experimental nuclear reaction data available as well as the data's bibliographic information and experimental details. We text-mined the REACTION and MONITOR fields of the ENTRYs in the EXFOR library in order to identify understudied reactions and quantities. Using the results of the text-mining, we created an undirected graph from the EXFOR datasets with each graph node representing a single reaction and quantity and graph links representing the various types of connections between these reactions and quantities. This graph is an abstract representation of the connections in EXFOR, similar to graphs of social networks, authorship networks, etc. We use various graph theoretical tools to identify important yet understudied reactions and quantities in EXFOR. Although we identified a few cross sections relevant for shielding applications and isotope production, mostly we identified charged particle fluence monitor cross sections. As a side effect of this work, we learn that our abstract graph is typical of other real-world graphs.
NASA Astrophysics Data System (ADS)
Zhironkin, S. A.; Khoreshok, A. A.; Tyulenev, M. A.; Barysheva, G. A.; Hellmer, M. C.
2016-08-01
This article describes the problems and prospects of development of coal mining in Kuzbass - the center of coal production in Siberia and Russia, in the framework of the major initiatives of the National Energy Strategy for the period until 2035. The structural character of the regional coal industry problems, caused by decline in investment activity, high level of fixed assets depreciation, slow development of deep coal processing and technological reduction of coal mining is shown.
Measuring Two-Event Structural Correlations on Graphs
2012-08-01
2012 to 00-00-2012 4. TITLE AND SUBTITLE Measuring Two-Event Structural Correlations on Graphs 5a. CONTRACT NUMBER 5b. GRANT NUMBER 5c. PROGRAM ...by event simulation on the DBLP graph. Then we examine the efficiency and scala - bility of the framework with a Twitter network. The third part of...correlation pattern mining for large graphs. In Proc. of the 8th Workshop on Mining and Learning with Graphs, pages 119–126, 2010. [23] T. Smith. A
Large-Scale Event Extraction from Literature with Multi-Level Gene Normalization
Wei, Chih-Hsuan; Hakala, Kai; Pyysalo, Sampo; Ananiadou, Sophia; Kao, Hung-Yu; Lu, Zhiyong; Salakoski, Tapio; Van de Peer, Yves; Ginter, Filip
2013-01-01
Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique gene and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (http://www.evexdb.org/api/v001/). Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from http://evexdb.org/download/, under the Creative Commons – Attribution – Share Alike (CC BY-SA) license. PMID:23613707
Ravikumar, Komandur Elayavilli; Wagholikar, Kavishwar B; Li, Dingcheng; Kocher, Jean-Pierre; Liu, Hongfang
2015-06-06
Advances in the next generation sequencing technology has accelerated the pace of individualized medicine (IM), which aims to incorporate genetic/genomic information into medicine. One immediate need in interpreting sequencing data is the assembly of information about genetic variants and their corresponding associations with other entities (e.g., diseases or medications). Even with dedicated effort to capture such information in biological databases, much of this information remains 'locked' in the unstructured text of biomedical publications. There is a substantial lag between the publication and the subsequent abstraction of such information into databases. Multiple text mining systems have been developed, but most of them focus on the sentence level association extraction with performance evaluation based on gold standard text annotations specifically prepared for text mining systems. We developed and evaluated a text mining system, MutD, which extracts protein mutation-disease associations from MEDLINE abstracts by incorporating discourse level analysis, using a benchmark data set extracted from curated database records. MutD achieves an F-measure of 64.3% for reconstructing protein mutation disease associations in curated database records. Discourse level analysis component of MutD contributed to a gain of more than 10% in F-measure when compared against the sentence level association extraction. Our error analysis indicates that 23 of the 64 precision errors are true associations that were not captured by database curators and 68 of the 113 recall errors are caused by the absence of associated disease entities in the abstract. After adjusting for the defects in the curated database, the revised F-measure of MutD in association detection reaches 81.5%. Our quantitative analysis reveals that MutD can effectively extract protein mutation disease associations when benchmarking based on curated database records. The analysis also demonstrates that incorporating discourse level analysis significantly improved the performance of extracting the protein-mutation-disease association. Future work includes the extension of MutD for full text articles.
NASA Astrophysics Data System (ADS)
Endsley, K.; McCarty, J. L.
2012-12-01
Data mining techniques have been applied to social media in a variety of contexts, from mapping the evolution of the Tahrir Square protests in Egypt to predicting influenza outbreaks. The Twitter platform is a particular favorite due to its robust application programming interface (API) and high throughput. Twitter, Inc. estimated in 2011 that over 2,200 messages or "tweets" are generated every second. Also helpful is Twitter's semblance in operation to the short message service (SMS), better known as "texting," available on cellular phones and the most popular means of wide telecommunications in many developing countries. In the United States, Twitter has been used by a number of federal, state and local officials as well as motivated individuals to report prescribed burns in advance (sometimes as part of a reporting obligation) or to communicate the emergence, response to, and containment of wildfires. These reports are unstructured and, like all Twitter messages, limited to 140 UTF-8 characters. Through internal research and development at the Michigan Tech Research Institute, the authors have developed a data mining routine that gathers potential tweets of interest using the Twitter API, eliminates duplicates ("retweets"), and extracts relevant information such as the approximate size and condition of the fire. Most importantly, the message is geocoded and/or contains approximate locational information, allowing for prescribed and wildland fires to be mapped. Natural language processing techniques, adapted to improve computational performance, are used to tokenize and tag these elements for each tweet. The entire routine is implemented in the Python programming language, using open-source libraries. As such, it is demonstrated in a web-based framework where prescribed burns and/or wildfires are mapped in real time, visualized through a JavaScript-based mapping client in any web browser. The practices demonstrated here generalize to an SMS platform (or any short text-based platform) and thus provide exciting opportunities for the cultivation of fire or other disaster alerts and response here in the U.S. and in the developing world.
Semantic biomedical resource discovery: a Natural Language Processing framework.
Sfakianaki, Pepi; Koumakis, Lefteris; Sfakianakis, Stelios; Iatraki, Galatia; Zacharioudakis, Giorgos; Graf, Norbert; Marias, Kostas; Tsiknakis, Manolis
2015-09-30
A plethora of publicly available biomedical resources do currently exist and are constantly increasing at a fast rate. In parallel, specialized repositories are been developed, indexing numerous clinical and biomedical tools. The main drawback of such repositories is the difficulty in locating appropriate resources for a clinical or biomedical decision task, especially for non-Information Technology expert users. In parallel, although NLP research in the clinical domain has been active since the 1960s, progress in the development of NLP applications has been slow and lags behind progress in the general NLP domain. The aim of the present study is to investigate the use of semantics for biomedical resources annotation with domain specific ontologies and exploit Natural Language Processing methods in empowering the non-Information Technology expert users to efficiently search for biomedical resources using natural language. A Natural Language Processing engine which can "translate" free text into targeted queries, automatically transforming a clinical research question into a request description that contains only terms of ontologies, has been implemented. The implementation is based on information extraction techniques for text in natural language, guided by integrated ontologies. Furthermore, knowledge from robust text mining methods has been incorporated to map descriptions into suitable domain ontologies in order to ensure that the biomedical resources descriptions are domain oriented and enhance the accuracy of services discovery. The framework is freely available as a web application at ( http://calchas.ics.forth.gr/ ). For our experiments, a range of clinical questions were established based on descriptions of clinical trials from the ClinicalTrials.gov registry as well as recommendations from clinicians. Domain experts manually identified the available tools in a tools repository which are suitable for addressing the clinical questions at hand, either individually or as a set of tools forming a computational pipeline. The results were compared with those obtained from an automated discovery of candidate biomedical tools. For the evaluation of the results, precision and recall measurements were used. Our results indicate that the proposed framework has a high precision and low recall, implying that the system returns essentially more relevant results than irrelevant. There are adequate biomedical ontologies already available, sufficiency of existing NLP tools and quality of biomedical annotation systems for the implementation of a biomedical resources discovery framework, based on the semantic annotation of resources and the use on NLP techniques. The results of the present study demonstrate the clinical utility of the application of the proposed framework which aims to bridge the gap between clinical question in natural language and efficient dynamic biomedical resources discovery.
A semantic model for multimodal data mining in healthcare information systems.
Iakovidis, Dimitris; Smailis, Christos
2012-01-01
Electronic health records (EHRs) are representative examples of multimodal/multisource data collections; including measurements, images and free texts. The diversity of such information sources and the increasing amounts of medical data produced by healthcare institutes annually, pose significant challenges in data mining. In this paper we present a novel semantic model that describes knowledge extracted from the lowest-level of a data mining process, where information is represented by multiple features i.e. measurements or numerical descriptors extracted from measurements, images, texts or other medical data, forming multidimensional feature spaces. Knowledge collected by manual annotation or extracted by unsupervised data mining from one or more feature spaces is modeled through generalized qualitative spatial semantics. This model enables a unified representation of knowledge across multimodal data repositories. It contributes to bridging the semantic gap, by enabling direct links between low-level features and higher-level concepts e.g. describing body parts, anatomies and pathological findings. The proposed model has been developed in web ontology language based on description logics (OWL-DL) and can be applied to a variety of data mining tasks in medical informatics. It utility is demonstrated for automatic annotation of medical data.
Tagawa, Miki; Matsuda, Yoshio; Manaka, Tomoko; Kobayashi, Makiko; Ohwada, Michitaka; Matsubara, Shigeki
2017-01-01
The aim of the study was to examine the possibility of converting subjective textual data written in the free column space of the Mother and Child Handbook (MCH) into objective information using text mining and to compare any monthly changes in the words written by the mothers. Pregnant women without complications (n = 60) were divided into two groups according to State-Trait Anxiety Inventory grade: low trait anxiety (group I, n = 39) and high trait anxiety (group II, n = 21). Exploratory analysis of the textual data from the MCH was conducted by text mining using the Word Miner software program. Using 1203 structural elements extracted after processing, a comparison of monthly changes in the words used in the mothers' comments was made between the two groups. The data was mainly analyzed by a correspondence analysis. The structural elements in groups I and II were divided into seven and six clusters, respectively, by cluster analysis. Correspondence analysis revealed clear monthly changes in the words used in the mothers' comments as the pregnancy progressed in group I, whereas the association was not clear in group II. The text mining method was useful for exploratory analysis of the textual data obtained from pregnant women, and the monthly change in the words used in the mothers' comments as pregnancy progressed differed according to their degree of unease. © 2016 Japan Society of Obstetrics and Gynecology.
Bravo, Àlex; Piñero, Janet; Queralt-Rosinach, Núria; Rautschka, Michael; Furlong, Laura I
2015-02-21
Current biomedical research needs to leverage and exploit the large amount of information reported in scientific publications. Automated text mining approaches, in particular those aimed at finding relationships between entities, are key for identification of actionable knowledge from free text repositories. We present the BeFree system aimed at identifying relationships between biomedical entities with a special focus on genes and their associated diseases. By exploiting morpho-syntactic information of the text, BeFree is able to identify gene-disease, drug-disease and drug-target associations with state-of-the-art performance. The application of BeFree to real-case scenarios shows its effectiveness in extracting information relevant for translational research. We show the value of the gene-disease associations extracted by BeFree through a number of analyses and integration with other data sources. BeFree succeeds in identifying genes associated to a major cause of morbidity worldwide, depression, which are not present in other public resources. Moreover, large-scale extraction and analysis of gene-disease associations, and integration with current biomedical knowledge, provided interesting insights on the kind of information that can be found in the literature, and raised challenges regarding data prioritization and curation. We found that only a small proportion of the gene-disease associations discovered by using BeFree is collected in expert-curated databases. Thus, there is a pressing need to find alternative strategies to manual curation, in order to review, prioritize and curate text-mining data and incorporate it into domain-specific databases. We present our strategy for data prioritization and discuss its implications for supporting biomedical research and applications. BeFree is a novel text mining system that performs competitively for the identification of gene-disease, drug-disease and drug-target associations. Our analyses show that mining only a small fraction of MEDLINE results in a large dataset of gene-disease associations, and only a small proportion of this dataset is actually recorded in curated resources (2%), raising several issues on data prioritization and curation. We propose that joint analysis of text mined data with data curated by experts appears as a suitable approach to both assess data quality and highlight novel and interesting information.
tmBioC: improving interoperability of text-mining tools with BioC.
Khare, Ritu; Wei, Chih-Hsuan; Mao, Yuqing; Leaman, Robert; Lu, Zhiyong
2014-01-01
The lack of interoperability among biomedical text-mining tools is a major bottleneck in creating more complex applications. Despite the availability of numerous methods and techniques for various text-mining tasks, combining different tools requires substantial efforts and time owing to heterogeneity and variety in data formats. In response, BioC is a recent proposal that offers a minimalistic approach to tool interoperability by stipulating minimal changes to existing tools and applications. BioC is a family of XML formats that define how to present text documents and annotations, and also provides easy-to-use functions to read/write documents in the BioC format. In this study, we introduce our text-mining toolkit, which is designed to perform several challenging and significant tasks in the biomedical domain, and repackage the toolkit into BioC to enhance its interoperability. Our toolkit consists of six state-of-the-art tools for named-entity recognition, normalization and annotation (PubTator) of genes (GenNorm), diseases (DNorm), mutations (tmVar), species (SR4GN) and chemicals (tmChem). Although developed within the same group, each tool is designed to process input articles and output annotations in a different format. We modify these tools and enable them to read/write data in the proposed BioC format. We find that, using the BioC family of formats and functions, only minimal changes were required to build the newer versions of the tools. The resulting BioC wrapped toolkit, which we have named tmBioC, consists of our tools in BioC, an annotated full-text corpus in BioC, and a format detection and conversion tool. Furthermore, through participation in the 2013 BioCreative IV Interoperability Track, we empirically demonstrate that the tools in tmBioC can be more efficiently integrated with each other as well as with external tools: Our experimental results show that using BioC reduces >60% in lines of code for text-mining tool integration. The tmBioC toolkit is publicly available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/. Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/. Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US.
What should a radiation regulator do about naturally occurring radioactive material?
Loy, J
2015-06-01
The standard regulatory framework of authorisation, review and assessment, inspection and enforcement, and regulation making is directed principally towards ensuring the regulatory control of planned exposure situations. Some mining and industrial activities involving exposures to naturally occurring radioactive material (NORM), such as uranium mining or the treatment and conditioning of NORM residues, may fit readily within this standard framework. In other cases, such as oil and gas exploration and production, the standard regulatory framework needs to be adjusted. For example, it is not sensible to require that an oil company seek a licence from the radiation regulator before drilling a well. The paper discusses other approaches that a regulator might take to assure protection and safety in such activities involving exposures to NORM, including the use of conditional exemptions from regulatory controls. It also suggests some areas where further guidance from the International Commission on Radiological Protection on application of the system of radiological protection to NORM would assist both regulators and operators. © The International Society for Prosthetics and Orthotics Reprints and permissions: sagepub.co.uk/journalsPermissions.nav.
Mining Health-Related Issues in Consumer Product Reviews by Using Scalable Text Analytics
Torii, Manabu; Tilak, Sameer S.; Doan, Son; Zisook, Daniel S.; Fan, Jung-wei
2016-01-01
In an era when most of our life activities are digitized and recorded, opportunities abound to gain insights about population health. Online product reviews present a unique data source that is currently underexplored. Health-related information, although scarce, can be systematically mined in online product reviews. Leveraging natural language processing and machine learning tools, we were able to mine 1.3 million grocery product reviews for health-related information. The objectives of the study were as follows: (1) conduct quantitative and qualitative analysis on the types of health issues found in consumer product reviews; (2) develop a machine learning classifier to detect reviews that contain health-related issues; and (3) gain insights about the task characteristics and challenges for text analytics to guide future research. PMID:27375358
Mining Health-Related Issues in Consumer Product Reviews by Using Scalable Text Analytics.
Torii, Manabu; Tilak, Sameer S; Doan, Son; Zisook, Daniel S; Fan, Jung-Wei
2016-01-01
In an era when most of our life activities are digitized and recorded, opportunities abound to gain insights about population health. Online product reviews present a unique data source that is currently underexplored. Health-related information, although scarce, can be systematically mined in online product reviews. Leveraging natural language processing and machine learning tools, we were able to mine 1.3 million grocery product reviews for health-related information. The objectives of the study were as follows: (1) conduct quantitative and qualitative analysis on the types of health issues found in consumer product reviews; (2) develop a machine learning classifier to detect reviews that contain health-related issues; and (3) gain insights about the task characteristics and challenges for text analytics to guide future research.
Weighted mining of massive collections of [Formula: see text]-values by convex optimization.
Dobriban, Edgar
2018-06-01
Researchers in data-rich disciplines-think of computational genomics and observational cosmology-often wish to mine large bodies of [Formula: see text]-values looking for significant effects, while controlling the false discovery rate or family-wise error rate. Increasingly, researchers also wish to prioritize certain hypotheses, for example, those thought to have larger effect sizes, by upweighting, and to impose constraints on the underlying mining, such as monotonicity along a certain sequence. We introduce Princessp , a principled method for performing weighted multiple testing by constrained convex optimization. Our method elegantly allows one to prioritize certain hypotheses through upweighting and to discount others through downweighting, while constraining the underlying weights involved in the mining process. When the [Formula: see text]-values derive from monotone likelihood ratio families such as the Gaussian means model, the new method allows exact solution of an important optimal weighting problem previously thought to be non-convex and computationally infeasible. Our method scales to massive data set sizes. We illustrate the applications of Princessp on a series of standard genomics data sets and offer comparisons with several previous 'standard' methods. Princessp offers both ease of operation and the ability to scale to extremely large problem sizes. The method is available as open-source software from github.com/dobriban/pvalue_weighting_matlab (accessed 11 October 2017).
30 CFR 900.15 - Federal lands program cooperative agreements.
Code of Federal Regulations, 2010 CFR
2010-07-01
.... 900.15 Section 900.15 Mineral Resources OFFICE OF SURFACE MINING RECLAMATION AND ENFORCEMENT, DEPARTMENT OF THE INTERIOR PROGRAMS FOR THE CONDUCT OF SURFACE MINING OPERATIONS WITHIN EACH STATE INTRODUCTION § 900.15 Federal lands program cooperative agreements. The full text of any State and Federal...
30 CFR 900.12 - State regulatory programs.
Code of Federal Regulations, 2010 CFR
2010-07-01
....12 Mineral Resources OFFICE OF SURFACE MINING RECLAMATION AND ENFORCEMENT, DEPARTMENT OF THE INTERIOR PROGRAMS FOR THE CONDUCT OF SURFACE MINING OPERATIONS WITHIN EACH STATE INTRODUCTION § 900.12 State... to be codified under the applicable part number assigned to the State. The full text will not appear...
Literature Mining Methods for Toxicology and Construction of ...
Webinar Presentation on text-mining methodologies in use at NCCT and how they can be used to assist with the OECD Retinoid project. Presentation to 1st Workshop/Scientific Expert Group meeting on the OECD Retinoid Project - April 26, 2016 –Brussels, Presented remotely via web.
Untangling Topic Threads in Chat-Based Communication: A Case Study
2011-08-01
learning techniques such as clustering are very popular for analyzing text for topic identification (Anjewierden,, Kollöffel and Hulshof 2007; Adams...Anjewierden, A., Kollöffel, B., and Hulshof , C. (2007). Towards educational data mining: Using data mining methods for automated chat analysis to
Data mining in soft computing framework: a survey.
Mitra, S; Pal, S K; Mitra, P
2002-01-01
The present article provides a survey of the available literature on data mining using soft computing. A categorization has been provided based on the different soft computing tools and their hybridizations used, the data mining function implemented, and the preference criterion selected by the model. The utility of the different soft computing methodologies is highlighted. Generally fuzzy sets are suitable for handling the issues related to understandability of patterns, incomplete/noisy data, mixed media information and human interaction, and can provide approximate solutions faster. Neural networks are nonparametric, robust, and exhibit good learning and generalization capabilities in data-rich environments. Genetic algorithms provide efficient search algorithms to select a model, from mixed media data, based on some preference criterion/objective function. Rough sets are suitable for handling different types of uncertainty in data. Some challenges to data mining and the application of soft computing methodologies are indicated. An extensive bibliography is also included.
Luo, Gang
2017-12-01
For user-friendliness, many software systems offer progress indicators for long-duration tasks. A typical progress indicator continuously estimates the remaining task execution time as well as the portion of the task that has been finished. Building a machine learning model often takes a long time, but no existing machine learning software supplies a non-trivial progress indicator. Similarly, running a data mining algorithm often takes a long time, but no existing data mining software provides a nontrivial progress indicator. In this article, we consider the problem of offering progress indicators for machine learning model building and data mining algorithm execution. We discuss the goals and challenges intrinsic to this problem. Then we describe an initial framework for implementing such progress indicators and two advanced, potential uses of them, with the goal of inspiring future research on this topic.
Study on Mine Emergency Mechanism based on TARP and ICS
NASA Astrophysics Data System (ADS)
Xi, Jian; Wu, Zongzhi
2018-01-01
By analyzing the experiences and practices of mine emergency in China and abroad, especially the United States and Australia, normative principle, risk management principle and adaptability principle of constructing mine emergency mechanism based on Trigger Action Response Plans (TARP) and Incident Command System (ICS) are summarized. Classification method, framework, flow and subject of TARP and ICS which are suitable for the actual situation of domestic mine emergency are proposed. The system dynamics model of TARP and ICS is established. The parameters such as evacuation ratio, response rate, per capita emergency capability and entry rate of rescuers are set up. By simulating the operation process of TARP and ICS, the impact of these parameters on the emergency process are analyzed, which could provide a reference and basis for building emergency capacity, formulating emergency plans and setting up action plans in the emergency process.
Application of modified extended method in CREAM for safety inspector in coal mines
NASA Astrophysics Data System (ADS)
Wang, Jinhe; Zhang, Xiaohong; Zeng, Jianchao
2018-01-01
Safety inspector often performs duties in circumstances contributes to the oc currence of human failures. Therefore, the paper aims at quantifying human failure pro bability (HFP) of safety inspector during the coal mine operation with cognitive reliabi lity and error analysis method (CREAM). Whereas, some shortcomings of this approa ch that lacking considering the applicability of the common performance condition (C PC), and the subjective of evaluating CPC level which weaken the accuracy of the qua ntitative prediction results. A modified extended method in CREAM which is able to a ddress these difficulties with a CPC framework table is proposed, and the proposed me thodology is demonstrated by the virtue of a coal-mine accident example. The results a re expected to be useful in predicting HFP of safety inspector and contribute to the enh ancement of coal mine safety.
Luo, Gang
2017-01-01
For user-friendliness, many software systems offer progress indicators for long-duration tasks. A typical progress indicator continuously estimates the remaining task execution time as well as the portion of the task that has been finished. Building a machine learning model often takes a long time, but no existing machine learning software supplies a non-trivial progress indicator. Similarly, running a data mining algorithm often takes a long time, but no existing data mining software provides a nontrivial progress indicator. In this article, we consider the problem of offering progress indicators for machine learning model building and data mining algorithm execution. We discuss the goals and challenges intrinsic to this problem. Then we describe an initial framework for implementing such progress indicators and two advanced, potential uses of them, with the goal of inspiring future research on this topic. PMID:29177022
Tagliaferri, Roberto; Longo, Giuseppe; Milano, Leopoldo; Acernese, Fausto; Barone, Fabrizio; Ciaramella, Angelo; De Rosa, Rosario; Donalek, Ciro; Eleuteri, Antonio; Raiconi, Giancarlo; Sessa, Salvatore; Staiano, Antonino; Volpicelli, Alfredo
2003-01-01
In the last decade, the use of neural networks (NN) and of other soft computing methods has begun to spread also in the astronomical community which, due to the required accuracy of the measurements, is usually reluctant to use automatic tools to perform even the most common tasks of data reduction and data mining. The federation of heterogeneous large astronomical databases which is foreseen in the framework of the astrophysical virtual observatory and national virtual observatory projects, is, however, posing unprecedented data mining and visualization problems which will find a rather natural and user friendly answer in artificial intelligence tools based on NNs, fuzzy sets or genetic algorithms. This review is aimed to both astronomers (who often have little knowledge of the methodological background) and computer scientists (who often know little about potentially interesting applications), and therefore will be structured as follows: after giving a short introduction to the subject, we shall summarize the methodological background and focus our attention on some of the most interesting fields of application, namely: object extraction and classification, time series analysis, noise identification, and data mining. Most of the original work described in the paper has been performed in the framework of the AstroNeural collaboration (Napoli-Salerno).
Comparison of two drug safety signals in a pharmacovigilance data mining framework.
Tubert-Bitter, Pascale; Bégaud, Bernard; Ahmed, Ismaïl
2016-04-01
Since adverse drug reactions are a major public health concern, early detection of drug safety signals has become a top priority for regulatory agencies and the pharmaceutical industry. Quantitative methods for analyzing spontaneous reporting material recorded in pharmacovigilance databases through data mining have been proposed in the last decades and are increasingly used to flag potential safety problems. While automated data mining is motivated by the usually huge size of pharmacovigilance databases, it does not systematically produce relevant alerts. Moreover, each detected signal requires appropriate assessment that may involve investigation of the whole therapeutic class. The goal of this article is to provide a methodology for comparing two detected signals. It is nested within the automated surveillance framework as (1) no extra information is required and (2) no simple inference on the actual risks can be extrapolated from spontaneous reporting data. We designed our methodology on the basis of two classical methods used for automated signal detection: the Bayesian Gamma Poisson Shrinker and the frequentist Proportional Reporting Ratio. A simulation study was conducted to assess the performances of both proposed methods. The latter were used to compare cardiovascular signals for two HIV treatments from the French pharmacovigilance database. © The Author(s) 2012.
Evidence of Historical Mining Impacts on Saltmarshes from east Cornwall, UK
NASA Astrophysics Data System (ADS)
Iurian, Andra-Rada; Taylor, Alex; Millward, Geoff; Blake, William
2016-04-01
In landscapes with extensive mining history, saltmarshes can become sinks for contaminants that are vulnerable to release with sea-level rise and increased storminess. Given the prolonged residence time of heavy metals in the environment, data is urgently required to contextualise the impacts of past and present mining and pollution events and provide a baseline against which to assess Water Framework Directive (WFD) (2000/60/EC) compliance within an integrated catchment management framework. The geology of east Cornwall, UK (with intrusions of granite into the surrounding sedimentary rocks) was favourable for a prosperous mining industry, although large scale operations did not start until about 1830. Tin, cooper, lead and tungsten were the most important ores in the region. In order to quantify the spatial and temporal extent of contamination from past mining, sediment cores were collected from three saltmarshes, namely: Antony Marsh and Treluggan Marsh on the Lower Basin of River Lynher, and Port Eliot Marsh on the Lower Basin of River Tiddy. Core sections at 1 cm intervals were analysed by gamma-ray spectrometry for Pb-210, Ra-226, Cs-137 and Am-241, and the well-established Constant Rate of Supply (CRS) model was employed to derive Pb-210 geochronology with bomb-derived Cs-137 and Am-241 as independent chronological markers. The geochronological data provided the sedimentary accumulation and temporal context for the study. In terms of sediment quality with respect to mining pollution, core sections were analysed using Q-ICP-MS techniques and, additionally, WD-XRF instrumentation at Plymouth University. Measurements were performed for target elements that are normally associated with mining and smelting activities (e.g. Pb, Cu, Sn, Zn, Cr, Cd, etc.), and lithogenic elements (e.g. Fe, Al, Ti) that allow enrichment factors for the anthropogenically-derived elements to be determined. The grain size distribution was determined to identify storminess events and to detect discontinuities in the sediment record. Downcore trends in metal pollutants are discussed in the context of the chronological data, sediment composition and historic meteorological and river flow records. Acknowledgements: Andra-Rada Iurian acknowledges the support of a Marie Curie Fellowship (H2020-MSCA-IF-2014, Grant Agreement number: 658863) within the Horizon 2020.
Comparison between BIDE, PrefixSpan, and TRuleGrowth for Mining of Indonesian Text
NASA Astrophysics Data System (ADS)
Sa'adillah Maylawati, Dian; Irfan, Mohamad; Budiawan Zulfikar, Wildan
2017-01-01
Mining proscess for Indonesian language still be an interesting research. Multiple of words representation was claimed can keep the meaning of text better than bag of words. In this paper, we compare several sequential pattern algortihm, among others BIDE (BIDirectional Extention), PrefixSpan, and TRuleGrowth. All of those algorithm produce frequent word sequence to keep the meaning of text. However, the experiment result, with 14.006 of Indonesian tweet from Twitter, shows that BIDE can produce more efficient frequent word sequence than PrefixSpan and TRuleGrowth without missing the meaning of text. Then, the average of time process of PrefixSpan is faster than BIDE and TRuleGrowth. In the other hand, PrefixSpan and TRuleGrowth is more efficient in using memory than BIDE.
Jane, Nancy Yesudhas; Nehemiah, Khanna Harichandran; Arputharaj, Kannan
2016-01-01
Clinical time-series data acquired from electronic health records (EHR) are liable to temporal complexities such as irregular observations, missing values and time constrained attributes that make the knowledge discovery process challenging. This paper presents a temporal rough set induced neuro-fuzzy (TRiNF) mining framework that handles these complexities and builds an effective clinical decision-making system. TRiNF provides two functionalities namely temporal data acquisition (TDA) and temporal classification. In TDA, a time-series forecasting model is constructed by adopting an improved double exponential smoothing method. The forecasting model is used in missing value imputation and temporal pattern extraction. The relevant attributes are selected using a temporal pattern based rough set approach. In temporal classification, a classification model is built with the selected attributes using a temporal pattern induced neuro-fuzzy classifier. For experimentation, this work uses two clinical time series dataset of hepatitis and thrombosis patients. The experimental result shows that with the proposed TRiNF framework, there is a significant reduction in the error rate, thereby obtaining the classification accuracy on an average of 92.59% for hepatitis and 91.69% for thrombosis dataset. The obtained classification results prove the efficiency of the proposed framework in terms of its improved classification accuracy.
Geospatial decision support framework for critical infrastructure interdependency assessment
NASA Astrophysics Data System (ADS)
Shih, Chung Yan
Critical infrastructures, such as telecommunications, energy, banking and finance, transportation, water systems and emergency services are the foundations of modern society. There is a heavy dependence on critical infrastructures at multiple levels within the supply chain of any good or service. Any disruptions in the supply chain may cause profound cascading effect to other critical infrastructures. A 1997 report by the President's Commission on Critical Infrastructure Protection states that a serious interruption in freight rail service would bring the coal mining industry to a halt within approximately two weeks and the availability of electric power could be reduced in a matter of one to two months. Therefore, this research aimed at representing and assessing the interdependencies between coal supply, transportation and energy production. A proposed geospatial decision support framework was established and applied to analyze interdependency related disruption impact. By utilizing the data warehousing approach, geospatial and non-geospatial data were retrieved, integrated and analyzed based on the transportation model and geospatial disruption analysis developed in the research. The results showed that by utilizing this framework, disruption impacts can be estimated at various levels (e.g., power plant, county, state, etc.) for preventative or emergency response efforts. The information derived from the framework can be used for data mining analysis (e.g., assessing transportation mode usages; finding alternative coal suppliers, etc.).
Online discourse on fibromyalgia: text-mining to identify clinical distinction and patient concerns.
Park, Jungsik; Ryu, Young Uk
2014-10-07
The purpose of this study was to evaluate the possibility of using text-mining to identify clinical distinctions and patient concerns in online memoires posted by patients with fibromyalgia (FM). A total of 399 memoirs were collected from an FM group website. The unstructured data of memoirs associated with FM were collected through a crawling process and converted into structured data with a concordance, parts of speech tagging, and word frequency. We also conducted a lexical analysis and phrase pattern identification. After examining the data, a set of FM-related keywords were obtained and phrase net relationships were set through a web-based visualization tool. The clinical distinction of FM was verified. Pain is the biggest issue to the FM patients. The pains were affecting body parts including 'muscles,' 'leg,' 'neck,' 'back,' 'joints,' and 'shoulders' with accompanying symptoms such as 'spasms,' 'stiffness,' and 'aching,' and were described as 'sever,' 'chronic,' and 'constant.' This study also demonstrated that it was possible to understand the interests and concerns of FM patients through text-mining. FM patients wanted to escape from the pain and symptoms, so they were interested in medical treatment and help. Also, they seemed to have interest in their work and occupation, and hope to continue to live life through the relationships with the people around them. This research shows the potential for extracting keywords to confirm the clinical distinction of a certain disease, and text-mining can help objectively understand the concerns of patients by generalizing their large number of subjective illness experiences. However, it is believed that there are limitations to the processes and methods for organizing and classifying large amounts of text, so these limits have to be considered when analyzing the results. The development of research methodology to overcome these limitations is greatly needed.
Stansfield, Claire; O'Mara-Eves, Alison; Thomas, James
2017-09-01
Using text mining to aid the development of database search strings for topics described by diverse terminology has potential benefits for systematic reviews; however, methods and tools for accomplishing this are poorly covered in the research methods literature. We briefly review the literature on applications of text mining for search term development for systematic reviewing. We found that the tools can be used in 5 overarching ways: improving the precision of searches; identifying search terms to improve search sensitivity; aiding the translation of search strategies across databases; searching and screening within an integrated system; and developing objectively derived search strategies. Using a case study and selected examples, we then reflect on the utility of certain technologies (term frequency-inverse document frequency and Termine, term frequency, and clustering) in improving the precision and sensitivity of searches. Challenges in using these tools are discussed. The utility of these tools is influenced by the different capabilities of the tools, the way the tools are used, and the text that is analysed. Increased awareness of how the tools perform facilitates the further development of methods for their use in systematic reviews. Copyright © 2017 John Wiley & Sons, Ltd.
Bioinformatics: perspectives for the future.
Costa, Luciano da Fontoura
2004-12-30
I give here a very personal perspective of Bioinformatics and its future, starting by discussing the origin of the term (and area) of bioinformatics and proceeding by trying to foresee the development of related issues, including pattern recognition/data mining, the need to reintegrate biology, the potential of complex networks as a powerful and flexible framework for bioinformatics and the interplay between bio- and neuroinformatics. Human resource formation and market perspective are also addressed. Given the complexity and vastness of these issues and concepts, as well as the limited size of a scientific article and finite patience of the reader, these perspectives are surely incomplete and biased. However, it is expected that some of the questions and trends that are identified will motivate discussions during the IcoBiCoBi round table (with the same name as this article) and perhaps provide a more ample perspective among the participants of that conference and the readers of this text.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kargupta, H.; Stafford, B.; Hamzaoglu, I.
This paper describes an experimental parallel/distributed data mining system PADMA (PArallel Data Mining Agents) that uses software agents for local data accessing and analysis and a web based interface for interactive data visualization. It also presents the results of applying PADMA for detecting patterns in unstructured texts of postmortem reports and laboratory test data for Hepatitis C patients.
29 CFR 570.33 - Prohibited occupations for minors 14 and 15 years of age.
Code of Federal Regulations, 2010 CFR
2010-07-01
... shall apply to all occupations other than the following: (a) Manufacturing, mining, or processing... revised text is set forth as follows: § 570.33 Occupations that are prohibited to minors 14 and 15 years... age: (a) Manufacturing, mining, or processing occupations, including occupations requiring the...
Federal Register 2010, 2011, 2012, 2013, 2014
2011-07-28
... considered but eliminated from detailed analysis include conventional uranium mining and milling, conventional mining and heap leach processing, alternative site location, alternate lixiviants, and alternate...'s Agencywide Document Access and Management System (ADAMS), which provides text and image files of...
30 CFR 732.17 - State program amendments.
Code of Federal Regulations, 2010 CFR
2010-07-01
... Mineral Resources OFFICE OF SURFACE MINING RECLAMATION AND ENFORCEMENT, DEPARTMENT OF THE INTERIOR... the number or size of coal exploration or surface coal mining and reclamation operations in the State... amendment(s) is being reviewed by the Director and will include the following: (i) The text or a summary of...
30 CFR 745.11 - Application and agreement.
Code of Federal Regulations, 2010 CFR
2010-07-01
....11 Mineral Resources OFFICE OF SURFACE MINING RECLAMATION AND ENFORCEMENT, DEPARTMENT OF THE INTERIOR... approval under part 731 of this chapter, and has or may have within the State surface coal mining and... the full text of the terms of the proposed cooperative agreement as submitted or as subsequently...
29 CFR 570.118 - Sixteen-year minimum.
Code of Federal Regulations, 2010 CFR
2010-07-01
... for employment in manufacturing or mining occupations. Furthermore, this age minimum is applicable to... convenience of the user, the revised text is set forth as follows: § 570.118 Sixteen-year minimum. The Act sets a 16-year-age minimum for employment in manufacturing or mining occupations, although under FLSA...
Code of Federal Regulations, 2010 CFR
2010-07-01
... other than the following: (1) Manufacturing, (2) Mining, (3) An occupation found by the Secretary to be..., the revised text is set forth as follows: § 570.122 General. (a) Specific exemptions from the child... sixteen years in any occupation other than manufacturing, mining, or an occupation found by the Secretary...
29 CFR 570.119 - Fourteen-year minimum.
Code of Federal Regulations, 2010 CFR
2010-07-01
... occupations other than manufacturing and mining, the Secretary is authorized to issue regulations or orders... Subpart C of this part. 29-30 [Reserved] (a) Manufacturing, mining, or processing occupations; (b... of the user, the revised text is set forth as follows: § 570.119 Fourteen-year minimum. With respect...
Digital Workflows for a 3d Semantic Representation of AN Ancient Mining Landscape
NASA Astrophysics Data System (ADS)
Hiebel, G.; Hanke, K.
2017-08-01
The ancient mining landscape of Schwaz/Brixlegg in the Tyrol, Austria witnessed mining from prehistoric times to modern times creating a first order cultural landscape when it comes to one of the most important inventions in human history: the production of metal. In 1991 a part of this landscape was lost due to an enormous landslide that reshaped part of the mountain. With our work we want to propose a digital workflow to create a 3D semantic representation of this ancient mining landscape with its mining structures to preserve it for posterity. First, we define a conceptual model to integrate the data. It is based on the CIDOC CRM ontology and CRMgeo for geometric data. To transform our information sources to a formal representation of the classes and properties of the ontology we applied semantic web technologies and created a knowledge graph in RDF (Resource Description Framework). Through the CRMgeo extension coordinate information of mining features can be integrated into the RDF graph and thus related to the detailed digital elevation model that may be visualized together with the mining structures using Geoinformation systems or 3D visualization tools. The RDF network of the triple store can be queried using the SPARQL query language. We created a snapshot of mining, settlement and burial sites in the Bronze Age. The results of the query were loaded into a Geoinformation system and a visualization of known bronze age sites related to mining, settlement and burial activities was created.
Runkel, Robert L.; Verplanck, Philip; Kimball, Briant; Walton-Day, Katie
2018-01-01
Baseline, premining data for streams draining abandoned mine lands is virtually non existent, and indirect methods for estimating premining conditions are needed to establish realistic, cost effective cleanup goals. One such indirect method is the proximal analog approach, in which premining conditions are estimated using data from nearby mineralized areas that are unaffected by mining. In this paper, we combine the proximal analog approach with a quantitative mass balance framework using data from a spatially-detailed synoptic sampling campaign. The combined approach is applied to Cinnamon Gulch, a headwater stream with numerous draining adits. Synoptic sampling results indicate that three of the top five metal sources are affected by mining activities, and stream segments draining these sources account for a large percentage of overall metal loading within the study reach. These initial calculations overestimate the effects of mining, as the affected stream segments were likely acidic and metal rich prior to mining. Premining loads and concentrations were therefore determined through a replacement approach in which the chemistry of each mining-affected stream segment is revised based on proximal analog concentrations. The revised loading profiles indicate that 15–17% of the Al, Cd, Cu, Mn, Ni, and Zn loads are attributable to mining, whereas the mining contribution for Pb is 40%. Premining concentrations of Al, Cd, Cu, Mn, and Zn are estimated to be in excess of aquatic life standards over the length of the study reach.
A Review of Financial Accounting Fraud Detection based on Data Mining Techniques
NASA Astrophysics Data System (ADS)
Sharma, Anuj; Kumar Panigrahi, Prabin
2012-02-01
With an upsurge in financial accounting fraud in the current economic scenario experienced, financial accounting fraud detection (FAFD) has become an emerging topic of great importance for academic, research and industries. The failure of internal auditing system of the organization in identifying the accounting frauds has lead to use of specialized procedures to detect financial accounting fraud, collective known as forensic accounting. Data mining techniques are providing great aid in financial accounting fraud detection, since dealing with the large data volumes and complexities of financial data are big challenges for forensic accounting. This paper presents a comprehensive review of the literature on the application of data mining techniques for the detection of financial accounting fraud and proposes a framework for data mining techniques based accounting fraud detection. The systematic and comprehensive literature review of the data mining techniques applicable to financial accounting fraud detection may provide a foundation to future research in this field. The findings of this review show that data mining techniques like logistic models, neural networks, Bayesian belief network, and decision trees have been applied most extensively to provide primary solutions to the problems inherent in the detection and classification of fraudulent data.
OSCAR4: a flexible architecture for chemical text-mining.
Jessop, David M; Adams, Sam E; Willighagen, Egon L; Hawizy, Lezan; Murray-Rust, Peter
2011-10-14
The Open-Source Chemistry Analysis Routines (OSCAR) software, a toolkit for the recognition of named entities and data in chemistry publications, has been developed since 2002. Recent work has resulted in the separation of the core OSCAR functionality and its release as the OSCAR4 library. This library features a modular API (based on reduction of surface coupling) that permits client programmers to easily incorporate it into external applications. OSCAR4 offers a domain-independent architecture upon which chemistry specific text-mining tools can be built, and its development and usage are discussed.
Comparative Analysis of Document level Text Classification Algorithms using R
NASA Astrophysics Data System (ADS)
Syamala, Maganti; Nalini, N. J., Dr; Maguluri, Lakshamanaphaneendra; Ragupathy, R., Dr.
2017-08-01
From the past few decades there has been tremendous volumes of data available in Internet either in structured or unstructured form. Also, there is an exponential growth of information on Internet, so there is an emergent need of text classifiers. Text mining is an interdisciplinary field which draws attention on information retrieval, data mining, machine learning, statistics and computational linguistics. And to handle this situation, a wide range of supervised learning algorithms has been introduced. Among all these K-Nearest Neighbor(KNN) is efficient and simplest classifier in text classification family. But KNN suffers from imbalanced class distribution and noisy term features. So, to cope up with this challenge we use document based centroid dimensionality reduction(CentroidDR) using R Programming. By combining these two text classification techniques, KNN and Centroid classifiers, we propose a scalable and effective flat classifier, called MCenKNN which works well substantially better than CenKNN.
Mining Consumer Health Vocabulary from Community-Generated Text
Vydiswaran, V.G. Vinod; Mei, Qiaozhu; Hanauer, David A.; Zheng, Kai
2014-01-01
Community-generated text corpora can be a valuable resource to extract consumer health vocabulary (CHV) and link them to professional terminologies and alternative variants. In this research, we propose a pattern-based text-mining approach to identify pairs of CHV and professional terms from Wikipedia, a large text corpus created and maintained by the community. A novel measure, leveraging the ratio of frequency of occurrence, was used to differentiate consumer terms from professional terms. We empirically evaluated the applicability of this approach using a large data sample consisting of MedLine abstracts and all posts from an online health forum, MedHelp. The results show that the proposed approach is able to identify synonymous pairs and label the terms as either consumer or professional term with high accuracy. We conclude that the proposed approach provides great potential to produce a high quality CHV to improve the performance of computational applications in processing consumer-generated health text. PMID:25954426
Blood on the coal: the effect of organizational size and differentiation on coal mine accidents.
Page, Karen
2009-01-01
Each year, there are at least 100,000,000 occupational accidents and 100,000 occupational deaths in the world. In the United States, one of the safest countries in the world in which to work, there were more than 5,400 workplace fatalities and 5.9 million workplace injuries in 2007. The cost to American industry and taxpayers is estimated to be at least $170 billion per year. Further, as illustrated by accidents such as Three Mile Island and Bhopal, industrial accidents potentially impact a much wider sphere than that of the injured worker and his or her employer. As the repercussions of organizational accidents reverberate through organizations and are felt from human resources to accounting, firms are beginning to incorporate messages of safety in their missions and strategies. As firms organize to achieve safer work environments, they are faced with decisions on how to structure their activities in terms of, among other things, size and differentiation. This paper explores the impact on accident rates of size and differentiation at the corporate and mine levels of mining companies in an effort to create a framework for thinking about organizational accidents from a structural perspective. The results suggest that larger mines are safer than smaller mines, and that mines with less task diversity are safer than mines with greater task diversity. The results also suggest that at the corporate level, task diversity decreases mine accidents. These results may help mining executives and engineers structure their corporate activities and individual mines more effectively to help reduce accidents.
Methodology of Estimation of Methane Emissions from Coal Mines in Poland
NASA Astrophysics Data System (ADS)
Patyńska, Renata
2014-03-01
Based on a literature review concerning methane emissions in Poland, it was stated in 2009 that the National Greenhouse Inventory 2007 [13] was published. It was prepared firstly to meet Poland's obligations resulting from point 3.1 Decision no. 280/2004/WE of the European Parliament and of the Council of 11 February 2004, concerning a mechanism for monitoring community greenhouse gas emissions and for implementing the Kyoto Protocol and secondly, for the United Nations Framework Convention on Climate Change (UNFCCC) and Kyoto Protocol. The National Greenhouse Inventory states that there are no detailed data concerning methane emissions in collieries in the Polish mining industry. That is why the methane emission in the methane coal mines of Górnośląskie Zagłębie Węglowe - GZW (Upper Silesian Coal Basin - USCB) in Poland was meticulously studied and evaluated. The applied methodology for estimating methane emission from the GZW coal mining system was used for the four basic sources of its emission. Methane emission during the mining and post-mining process. Such an approach resulted from the IPCC guidelines of 2006 [10]. Updating the proposed methods (IPCC2006) of estimating the methane emissions of hard coal mines (active and abandoned ones) in Poland, assumes that the methane emission factor (EF) is calculated based on methane coal mine output and actual values of absolute methane content. The result of verifying the method of estimating methane emission during the mining process for Polish coal mines is the equation of methane emission factor EF.
Internet of Things Based Combustible Ice Safety Monitoring System Framework
NASA Astrophysics Data System (ADS)
Sun, Enji
2017-05-01
As the development of human society, more energy is requires to meet the need of human daily lives. New energies play a significant role in solving the problems of serious environmental pollution and resources exhaustion in the present world. Combustible ice is essentially frozen natural gas, which can literally be lit on fire bringing a whole new meaning to fire and ice with less pollutant. This paper analysed the advantages and risks on the uses of combustible ice. By compare to other kinds of alternative energies, the advantages of the uses of combustible ice were concluded. The combustible ice basic physical characters and safety risks were analysed. The developments troubles and key utilizations of combustible ice were predicted in the end. A real-time safety monitoring system framework based on the internet of things (IOT) was built to be applied in the future mining, which provide a brand new way to monitoring the combustible ice mining safety.
The BioLexicon: a large-scale terminological resource for biomedical text mining
2011-01-01
Background Due to the rapidly expanding body of biomedical literature, biologists require increasingly sophisticated and efficient systems to help them to search for relevant information. Such systems should account for the multiple written variants used to represent biomedical concepts, and allow the user to search for specific pieces of knowledge (or events) involving these concepts, e.g., protein-protein interactions. Such functionality requires access to detailed information about words used in the biomedical literature. Existing databases and ontologies often have a specific focus and are oriented towards human use. Consequently, biological knowledge is dispersed amongst many resources, which often do not attempt to account for the large and frequently changing set of variants that appear in the literature. Additionally, such resources typically do not provide information about how terms relate to each other in texts to describe events. Results This article provides an overview of the design, construction and evaluation of a large-scale lexical and conceptual resource for the biomedical domain, the BioLexicon. The resource can be exploited by text mining tools at several levels, e.g., part-of-speech tagging, recognition of biomedical entities, and the extraction of events in which they are involved. As such, the BioLexicon must account for real usage of words in biomedical texts. In particular, the BioLexicon gathers together different types of terms from several existing data resources into a single, unified repository, and augments them with new term variants automatically extracted from biomedical literature. Extraction of events is facilitated through the inclusion of biologically pertinent verbs (around which events are typically organized) together with information about typical patterns of grammatical and semantic behaviour, which are acquired from domain-specific texts. In order to foster interoperability, the BioLexicon is modelled using the Lexical Markup Framework, an ISO standard. Conclusions The BioLexicon contains over 2.2 M lexical entries and over 1.8 M terminological variants, as well as over 3.3 M semantic relations, including over 2 M synonymy relations. Its exploitation can benefit both application developers and users. We demonstrate some such benefits by describing integration of the resource into a number of different tools, and evaluating improvements in performance that this can bring. PMID:21992002
The BioLexicon: a large-scale terminological resource for biomedical text mining.
Thompson, Paul; McNaught, John; Montemagni, Simonetta; Calzolari, Nicoletta; del Gratta, Riccardo; Lee, Vivian; Marchi, Simone; Monachini, Monica; Pezik, Piotr; Quochi, Valeria; Rupp, C J; Sasaki, Yutaka; Venturi, Giulia; Rebholz-Schuhmann, Dietrich; Ananiadou, Sophia
2011-10-12
Due to the rapidly expanding body of biomedical literature, biologists require increasingly sophisticated and efficient systems to help them to search for relevant information. Such systems should account for the multiple written variants used to represent biomedical concepts, and allow the user to search for specific pieces of knowledge (or events) involving these concepts, e.g., protein-protein interactions. Such functionality requires access to detailed information about words used in the biomedical literature. Existing databases and ontologies often have a specific focus and are oriented towards human use. Consequently, biological knowledge is dispersed amongst many resources, which often do not attempt to account for the large and frequently changing set of variants that appear in the literature. Additionally, such resources typically do not provide information about how terms relate to each other in texts to describe events. This article provides an overview of the design, construction and evaluation of a large-scale lexical and conceptual resource for the biomedical domain, the BioLexicon. The resource can be exploited by text mining tools at several levels, e.g., part-of-speech tagging, recognition of biomedical entities, and the extraction of events in which they are involved. As such, the BioLexicon must account for real usage of words in biomedical texts. In particular, the BioLexicon gathers together different types of terms from several existing data resources into a single, unified repository, and augments them with new term variants automatically extracted from biomedical literature. Extraction of events is facilitated through the inclusion of biologically pertinent verbs (around which events are typically organized) together with information about typical patterns of grammatical and semantic behaviour, which are acquired from domain-specific texts. In order to foster interoperability, the BioLexicon is modelled using the Lexical Markup Framework, an ISO standard. The BioLexicon contains over 2.2 M lexical entries and over 1.8 M terminological variants, as well as over 3.3 M semantic relations, including over 2 M synonymy relations. Its exploitation can benefit both application developers and users. We demonstrate some such benefits by describing integration of the resource into a number of different tools, and evaluating improvements in performance that this can bring.
NASA Astrophysics Data System (ADS)
Tchindjang, Mesmin; Voundi, Eric; Mbevo Fendoung, Philippes; Haman, Unusa; Saha, Frédéric; Casimir Njombissie Petcheu, Igor
2018-05-01
Mining practices in Cameroon began since the colonial period. The artisanal mining sector before independence contributed to 11-20 % of GDP. From 2000, the rich potential of the Cameroonian subsoil attract many foreign investors with over 600 research and mining permits already granted during the last decade. But, Cameroonian forests also have a long history from the colonial period to the pre-sent. However, mining activities in forest environments are governed by two different legal frameworks, including mining code i.e. Law No. 001 of 16 April 2001 organizing the mining industry and Law No. 94-01 of 20 January 1994 governing forests, wildlife and fisheries. Therefore, in the absence of detailed studies of these laws, there are conflicts of interests, rights and obligations that overlap, requiring research needs and taking appropriate decisions. The objective of this research in the Lom and Djérem division is to study, apart from the proliferation of mining li-censes and actors, the dilemma as well as the impact of the extension of mining activities on the degradation of forest cover. Using geospatial tools through multi-temporal and multisensor satellite images (Landsat from 1976 to 2015, IKONOS, GEOEYE, Google Earth) coupled with field investigations; we mapped the dynamic of different forms of land use (mining permits, FMU and protected areas of permanent forest estate) and highlighted paradoxically the conflict of land use. We came to the conclusion that the rhythm of issuing mining permits and authorizations in this forestall zone is so fast that one can wonder whether we still find a patch of forest within 50 years.
Mining biomedical images towards valuable information retrieval in biomedical and life sciences
Ahmed, Zeeshan; Zeeshan, Saman; Dandekar, Thomas
2016-01-01
Biomedical images are helpful sources for the scientists and practitioners in drawing significant hypotheses, exemplifying approaches and describing experimental results in published biomedical literature. In last decades, there has been an enormous increase in the amount of heterogeneous biomedical image production and publication, which results in a need for bioimaging platforms for feature extraction and analysis of text and content in biomedical images to take advantage in implementing effective information retrieval systems. In this review, we summarize technologies related to data mining of figures. We describe and compare the potential of different approaches in terms of their developmental aspects, used methodologies, produced results, achieved accuracies and limitations. Our comparative conclusions include current challenges for bioimaging software with selective image mining, embedded text extraction and processing of complex natural language queries. PMID:27538578
Monitoring food safety violation reports from internet forums.
Kate, Kiran; Negi, Sumit; Kalagnanam, Jayant
2014-01-01
Food-borne illness is a growing public health concern in the world. Government bodies, which regulate and monitor the state of food safety, solicit citizen feedback about food hygiene practices followed by food establishments. They use traditional channels like call center, e-mail for such feedback collection. With the growing popularity of Web 2.0 and social media, citizens often post such feedback on internet forums, message boards etc. The system proposed in this paper applies text mining techniques to identify and mine such food safety complaints posted by citizens on web data sources thereby enabling the government agencies to gather more information about the state of food safety. In this paper, we discuss the architecture of our system and the text mining methods used. We also present results which demonstrate the effectiveness of this system in a real-world deployment.
A data mining based approach to predict spatiotemporal changes in satellite images
NASA Astrophysics Data System (ADS)
Boulila, W.; Farah, I. R.; Ettabaa, K. Saheb; Solaiman, B.; Ghézala, H. Ben
2011-06-01
The interpretation of remotely sensed images in a spatiotemporal context is becoming a valuable research topic. However, the constant growth of data volume in remote sensing imaging makes reaching conclusions based on collected data a challenging task. Recently, data mining appears to be a promising research field leading to several interesting discoveries in various areas such as marketing, surveillance, fraud detection and scientific discovery. By integrating data mining and image interpretation techniques, accurate and relevant information (i.e. functional relation between observed parcels and a set of informational contents) can be automatically elicited. This study presents a new approach to predict spatiotemporal changes in satellite image databases. The proposed method exploits fuzzy sets and data mining concepts to build predictions and decisions for several remote sensing fields. It takes into account imperfections related to the spatiotemporal mining process in order to provide more accurate and reliable information about land cover changes in satellite images. The proposed approach is validated using SPOT images representing the Saint-Denis region, capital of Reunion Island. Results show good performances of the proposed framework in predicting change for the urban zone.
Wiegers, Thomas C; Davis, Allan Peter; Mattingly, Carolyn J
2014-01-01
The Critical Assessment of Information Extraction systems in Biology (BioCreAtIvE) challenge evaluation tasks collectively represent a community-wide effort to evaluate a variety of text-mining and information extraction systems applied to the biological domain. The BioCreative IV Workshop included five independent subject areas, including Track 3, which focused on named-entity recognition (NER) for the Comparative Toxicogenomics Database (CTD; http://ctdbase.org). Previously, CTD had organized document ranking and NER-related tasks for the BioCreative Workshop 2012; a key finding of that effort was that interoperability and integration complexity were major impediments to the direct application of the systems to CTD's text-mining pipeline. This underscored a prevailing problem with software integration efforts. Major interoperability-related issues included lack of process modularity, operating system incompatibility, tool configuration complexity and lack of standardization of high-level inter-process communications. One approach to potentially mitigate interoperability and general integration issues is the use of Web services to abstract implementation details; rather than integrating NER tools directly, HTTP-based calls from CTD's asynchronous, batch-oriented text-mining pipeline could be made to remote NER Web services for recognition of specific biological terms using BioC (an emerging family of XML formats) for inter-process communications. To test this concept, participating groups developed Representational State Transfer /BioC-compliant Web services tailored to CTD's NER requirements. Participants were provided with a comprehensive set of training materials. CTD evaluated results obtained from the remote Web service-based URLs against a test data set of 510 manually curated scientific articles. Twelve groups participated in the challenge. Recall, precision, balanced F-scores and response times were calculated. Top balanced F-scores for gene, chemical and disease NER were 61, 74 and 51%, respectively. Response times ranged from fractions-of-a-second to over a minute per article. We present a description of the challenge and summary of results, demonstrating how curation groups can effectively use interoperable NER technologies to simplify text-mining pipeline implementation. Database URL: http://ctdbase.org/ © The Author(s) 2014. Published by Oxford University Press.
Building a glaucoma interaction network using a text mining approach.
Soliman, Maha; Nasraoui, Olfa; Cooper, Nigel G F
2016-01-01
The volume of biomedical literature and its underlying knowledge base is rapidly expanding, making it beyond the ability of a single human being to read through all the literature. Several automated methods have been developed to help make sense of this dilemma. The present study reports on the results of a text mining approach to extract gene interactions from the data warehouse of published experimental results which are then used to benchmark an interaction network associated with glaucoma. To the best of our knowledge, there is, as yet, no glaucoma interaction network derived solely from text mining approaches. The presence of such a network could provide a useful summative knowledge base to complement other forms of clinical information related to this disease. A glaucoma corpus was constructed from PubMed Central and a text mining approach was applied to extract genes and their relations from this corpus. The extracted relations between genes were checked using reference interaction databases and classified generally as known or new relations. The extracted genes and relations were then used to construct a glaucoma interaction network. Analysis of the resulting network indicated that it bears the characteristics of a small world interaction network. Our analysis showed the presence of seven glaucoma linked genes that defined the network modularity. A web-based system for browsing and visualizing the extracted glaucoma related interaction networks is made available at http://neurogene.spd.louisville.edu/GlaucomaINViewer/Form1.aspx. This study has reported the first version of a glaucoma interaction network using a text mining approach. The power of such an approach is in its ability to cover a wide range of glaucoma related studies published over many years. Hence, a bigger picture of the disease can be established. To the best of our knowledge, this is the first glaucoma interaction network to summarize the known literature. The major findings were a set of relations that could not be found in existing interaction databases and that were found to be new, in addition to a smaller subnetwork consisting of interconnected clusters of seven glaucoma genes. Future improvements can be applied towards obtaining a better version of this network.
Wiegers, Thomas C.; Davis, Allan Peter; Mattingly, Carolyn J.
2014-01-01
The Critical Assessment of Information Extraction systems in Biology (BioCreAtIvE) challenge evaluation tasks collectively represent a community-wide effort to evaluate a variety of text-mining and information extraction systems applied to the biological domain. The BioCreative IV Workshop included five independent subject areas, including Track 3, which focused on named-entity recognition (NER) for the Comparative Toxicogenomics Database (CTD; http://ctdbase.org). Previously, CTD had organized document ranking and NER-related tasks for the BioCreative Workshop 2012; a key finding of that effort was that interoperability and integration complexity were major impediments to the direct application of the systems to CTD's text-mining pipeline. This underscored a prevailing problem with software integration efforts. Major interoperability-related issues included lack of process modularity, operating system incompatibility, tool configuration complexity and lack of standardization of high-level inter-process communications. One approach to potentially mitigate interoperability and general integration issues is the use of Web services to abstract implementation details; rather than integrating NER tools directly, HTTP-based calls from CTD's asynchronous, batch-oriented text-mining pipeline could be made to remote NER Web services for recognition of specific biological terms using BioC (an emerging family of XML formats) for inter-process communications. To test this concept, participating groups developed Representational State Transfer /BioC-compliant Web services tailored to CTD's NER requirements. Participants were provided with a comprehensive set of training materials. CTD evaluated results obtained from the remote Web service-based URLs against a test data set of 510 manually curated scientific articles. Twelve groups participated in the challenge. Recall, precision, balanced F-scores and response times were calculated. Top balanced F-scores for gene, chemical and disease NER were 61, 74 and 51%, respectively. Response times ranged from fractions-of-a-second to over a minute per article. We present a description of the challenge and summary of results, demonstrating how curation groups can effectively use interoperable NER technologies to simplify text-mining pipeline implementation. Database URL: http://ctdbase.org/ PMID:24919658
NASA Astrophysics Data System (ADS)
Lee, M. J.; Oh, K. Y.; Joung-ho, L.
2016-12-01
Recently there are many research about analysing the interaction between entities by text-mining analysis in various fields. In this paper, we aimed to quantitatively analyse research-trends in the area of environmental research relating either spatial information or ICT (Information and Communications Technology) by Text-mining analysis. To do this, we applied low-dimensional embedding method, clustering analysis, and association rule to find meaningful associative patterns of key words frequently appeared in the articles. As the authors suppose that KCI (Korea Citation Index) articles reflect academic demands, total 1228 KCI articles that have been published from 1996 to 2015 were reviewed and analysed by Text-mining method. First, we derived KCI articles from NDSL(National Discovery for Science Leaders) site. And then we pre-processed their key-words elected from abstract and then classified those in separable sectors. We investigated the appearance rates and association rule of key-words for articles in the two fields: spatial-information and ICT. In order to detect historic trends, analysis was conducted separately for the four periods: 1996-2000, 2001-2005, 2006-2010, 2011-2015. These analysis were conducted with the usage of R-software. As a result, we conformed that environmental research relating spatial information mainly focused upon such fields as `GIS(35%)', `Remote-Sensing(25%)', `environmental theme map(15.7%)'. Next, `ICT technology(23.6%)', `ICT service(5.4%)', `mobile(24%)', `big data(10%)', `AI(7%)' are primarily emerging from environmental research relating ICT. Thus, from the analysis results, this paper asserts that research trends and academic progresses are well-structured to review recent spatial information and ICT technology and the outcomes of the analysis can be an adequate guidelines to establish environment policies and strategies. KEY WORDS: Big data, Test-mining, Environmental research, Spatial-information, ICT Acknowledgements: The authors appreciate the support that this study has received from `Building application frame of environmental issues, to respond to the latest ICT trends'.
Nano- and microsized cubic gel particles from cyclodextrin metal-organic frameworks.
Furukawa, Yuki; Ishiwata, Takumi; Sugikawa, Kouta; Kokado, Kenta; Sada, Kazuki
2012-10-15
Sweet cube o' mine: Bottom-up control of gel particles has been regarded as a great challenge. By employing internal cross-linking of cyclodextrin metal-organic frameworks, cubic sugar gels were formed with sharp edges that reflect the shape of the crystals. This enabled the fabrication of shape- and size-controlled polymer gels from porous crystals (see picture). Copyright © 2012 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Adapting Pipeline Architectures to Track Developing Aftershock Sequences and Recurrent Explosions
2014-02-14
Sumatra earthquake was used to study the performance of subspace detectors to detect and classify events from within a very large (Area = ~250,000 km2... detectors to identify and organize repeating waveforms discovered in multichannel seismic data streams. The framework has been tested and evaluated on...a variety of different test cases from mining blasts in Central Asia to moderate and large earthquake aftershock sequences. The framework performs
Text mining-based in silico drug discovery in oral mucositis caused by high-dose cancer therapy.
Kirk, Jon; Shah, Nirav; Noll, Braxton; Stevens, Craig B; Lawler, Marshall; Mougeot, Farah B; Mougeot, Jean-Luc C
2018-08-01
Oral mucositis (OM) is a major dose-limiting side effect of chemotherapy and radiation used in cancer treatment. Due to the complex nature of OM, currently available drug-based treatments are of limited efficacy. Our objectives were (i) to determine genes and molecular pathways associated with OM and wound healing using computational tools and publicly available data and (ii) to identify drugs formulated for topical use targeting the relevant OM molecular pathways. OM and wound healing-associated genes were determined by text mining, and the intersection of the two gene sets was selected for gene ontology analysis using the GeneCodis program. Protein interaction network analysis was performed using STRING-db. Enriched gene sets belonging to the identified pathways were queried against the Drug-Gene Interaction database to find drug candidates for topical use in OM. Our analysis identified 447 genes common to both the "OM" and "wound healing" text mining concepts. Gene enrichment analysis yielded 20 genes representing six pathways and targetable by a total of 32 drugs which could possibly be formulated for topical application. A manual search on ClinicalTrials.gov confirmed no relevant pathway/drug candidate had been overlooked. Twenty-five of the 32 drugs can directly affect the PTGS2 (COX-2) pathway, the pathway that has been targeted in previous clinical trials with limited success. Drug discovery using in silico text mining and pathway analysis tools can facilitate the identification of existing drugs that have the potential of topical administration to improve OM treatment.
Takeda, Kayoko; Takahashi, Kiyoshi; Masukawa, Hiroyuki; Shimamori, Yoshimitsu
2017-01-01
Recently, the practice of active learning has spread, increasingly recognized as an essential component of academic studies. Classes incorporating small group discussion (SGD) are conducted at many universities. At present, assessments of the effectiveness of SGD have mostly involved evaluation by questionnaires conducted by teachers, by peer assessment, and by self-evaluation of students. However, qualitative data, such as open-ended descriptions by students, have not been widely evaluated. As a result, we have been unable to analyze the processes and methods involved in how students acquire knowledge in SGD. In recent years, due to advances in information and communication technology (ICT), text mining has enabled the analysis of qualitative data. We therefore investigated whether the introduction of a learning system comprising the jigsaw method and problem-based learning (PBL) would improve student attitudes toward learning; we did this by text mining analysis of the content of student reports. We found that by applying the jigsaw method before PBL, we were able to improve student attitudes toward learning and increase the depth of their understanding of the area of study as a result of working with others. The use of text mining to analyze qualitative data also allowed us to understand the processes and methods by which students acquired knowledge in SGD and also changes in students' understanding and performance based on improvements to the class. This finding suggests that the use of text mining to analyze qualitative data could enable teachers to evaluate the effectiveness of various methods employed to improve learning.
NASA Astrophysics Data System (ADS)
Khazaeli, S.; Ravandi, A. G.; Banerji, S.; Bagchi, A.
2016-04-01
Recently, data-driven models for Structural Health Monitoring (SHM) have been of great interest among many researchers. In data-driven models, the sensed data are processed to determine the structural performance and evaluate the damages of an instrumented structure without necessitating the mathematical modeling of the structure. A framework of data-driven models for online assessment of the condition of a structure has been developed here. The developed framework is intended for automated evaluation of the monitoring data and structural performance by the Internet technology and resources. The main challenges in developing such framework include: (a) utilizing the sensor measurements to estimate and localize the induced damage in a structure by means of signal processing and data mining techniques, and (b) optimizing the computing and storage resources with the aid of cloud services. The main focus in this paper is to demonstrate the efficiency of the proposed framework for real-time damage detection of a multi-story shear-building structure in two damage scenarios (change in mass and stiffness) in various locations. Several features are extracted from the sensed data by signal processing techniques and statistical methods. Machine learning algorithms are deployed to select damage-sensitive features as well as classifying the data to trace the anomaly in the response of the structure. Here, the cloud computing resources from Amazon Web Services (AWS) have been used to implement the proposed framework.
Text feature extraction based on deep learning: a review.
Liang, Hong; Sun, Xiao; Sun, Yunlei; Gao, Yuan
2017-01-01
Selection of text feature item is a basic and important matter for text mining and information retrieval. Traditional methods of feature extraction require handcrafted features. To hand-design, an effective feature is a lengthy process, but aiming at new applications, deep learning enables to acquire new effective feature representation from training data. As a new feature extraction method, deep learning has made achievements in text mining. The major difference between deep learning and conventional methods is that deep learning automatically learns features from big data, instead of adopting handcrafted features, which mainly depends on priori knowledge of designers and is highly impossible to take the advantage of big data. Deep learning can automatically learn feature representation from big data, including millions of parameters. This thesis outlines the common methods used in text feature extraction first, and then expands frequently used deep learning methods in text feature extraction and its applications, and forecasts the application of deep learning in feature extraction.
30 CFR 730.11 - Inconsistent and more stringent State laws and regulations.
Code of Federal Regulations, 2010 CFR
2010-07-01
... regulations. 730.11 Section 730.11 Mineral Resources OFFICE OF SURFACE MINING RECLAMATION AND ENFORCEMENT... Register setting forth the text or a summary of any State law or regulation initially determined by him to... stringent land use and environmental controls and regulations of coal exploration and surface coal mining...
Using Syntactic Patterns to Enhance Text Analytics
ERIC Educational Resources Information Center
Meyer, Bradley B.
2017-01-01
Large scale product and service reviews proliferate and are commonly found across the web. The ability to harvest, digest and analyze a large corpus of reviews from online websites is still however a difficult problem. This problem is referred to as "opinion mining." Opinion mining is an important area of research as advances in the…
Federal Register 2010, 2011, 2012, 2013, 2014
2011-01-28
... Uranium Recovery Project, located in the Pumpkin Buttes Uranium Mining District within the Powder River.... Alternatives that were considered, but were eliminated from detailed analysis, include conventional mining and... an Agencywide Documents and Management System (ADAMS), which provides text and image files of the NRC...
Federal Register 2010, 2011, 2012, 2013, 2014
2011-08-26
... (ADAMS), which provides text and image files of the NRC's public documents in the NRC Library at http... considered, but eliminated from detailed analysis, include conventional uranium mining and milling, conventional mining and heap leach processing, alternate lixiviants, and alternative wastewater disposal...
Illuminate Knowledge Elements in Geoscience Literature
NASA Astrophysics Data System (ADS)
Ma, X.; Zheng, J. G.; Wang, H.; Fox, P. A.
2015-12-01
There are numerous dark data hidden in geoscience literature. Efficient retrieval and reuse of those data will greatly benefit geoscience researches of nowadays. Among the works of data rescue, a topic of interest is illuminating the knowledge framework, i.e. entities and relationships, embedded in documents. Entity recognition and linking have received extensive attention in news and social media analysis, as well as in bioinformatics. In the domain of geoscience, however, such works are limited. We will present our work on how to use knowledge bases on the Web, such as ontologies and vocabularies, to facilitate entity recognition and linking in geoscience literature. The work deploys an un-supervised collective inference approach [1] to link entity mentions in unstructured texts to a knowledge base, which leverages the meaningful information and structures in ontologies and vocabularies for similarity computation and entity ranking. Our work is still in the initial stage towards the detection of knowledge frameworks in literature, and we have been collecting geoscience ontologies and vocabularies in order to build a comprehensive geoscience knowledge base [2]. We hope the work will initiate new ideas and collaborations on dark data rescue, as well as on the synthesis of data and knowledge from geoscience literature. References: 1. Zheng, J., Howsmon, D., Zhang, B., Hahn, J., McGuinness, D.L., Hendler, J., and Ji, H. 2014. Entity linking for biomedical literature. In Proceedings of ACM 8th International Workshop on Data and Text Mining in Bioinformatics, Shanghai, China. 2. Ma, X. Zheng, J., 2015. Linking geoscience entity mentions to the Web of Data. ESIP 2015 Summer Meeting, Pacific Grove, CA.
Discovering and visualizing indirect associations between biomedical concepts
Tsuruoka, Yoshimasa; Miwa, Makoto; Hamamoto, Kaisei; Tsujii, Jun'ichi; Ananiadou, Sophia
2011-01-01
Motivation: Discovering useful associations between biomedical concepts has been one of the main goals in biomedical text-mining, and understanding their biomedical contexts is crucial in the discovery process. Hence, we need a text-mining system that helps users explore various types of (possibly hidden) associations in an easy and comprehensible manner. Results: This article describes FACTA+, a real-time text-mining system for finding and visualizing indirect associations between biomedical concepts from MEDLINE abstracts. The system can be used as a text search engine like PubMed with additional features to help users discover and visualize indirect associations between important biomedical concepts such as genes, diseases and chemical compounds. FACTA+ inherits all functionality from its predecessor, FACTA, and extends it by incorporating three new features: (i) detecting biomolecular events in text using a machine learning model, (ii) discovering hidden associations using co-occurrence statistics between concepts, and (iii) visualizing associations to improve the interpretability of the output. To the best of our knowledge, FACTA+ is the first real-time web application that offers the functionality of finding concepts involving biomolecular events and visualizing indirect associations of concepts with both their categories and importance. Availability: FACTA+ is available as a web application at http://refine1-nactem.mc.man.ac.uk/facta/, and its visualizer is available at http://refine1-nactem.mc.man.ac.uk/facta-visualizer/. Contact: tsuruoka@jaist.ac.jp PMID:21685059
Lu, Zhiyong
2012-01-01
Today’s biomedical research has become heavily dependent on access to the biological knowledge encoded in expert curated biological databases. As the volume of biological literature grows rapidly, it becomes increasingly difficult for biocurators to keep up with the literature because manual curation is an expensive and time-consuming endeavour. Past research has suggested that computer-assisted curation can improve efficiency, but few text-mining systems have been formally evaluated in this regard. Through participation in the interactive text-mining track of the BioCreative 2012 workshop, we developed PubTator, a PubMed-like system that assists with two specific human curation tasks: document triage and bioconcept annotation. On the basis of evaluation results from two external user groups, we find that the accuracy of PubTator-assisted curation is comparable with that of manual curation and that PubTator can significantly increase human curatorial speed. These encouraging findings warrant further investigation with a larger number of publications to be annotated. Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/ PMID:23160414
A Framework for Assessing Reading Comprehension of Geometric Construction Texts
ERIC Educational Resources Information Center
Yang, Kai-Lin; Li, Jian-Lin
2018-01-01
This study investigates one issue related to reading mathematical texts by presenting a two-dimensional framework for assessing reading comprehension of geometric construction texts. The two dimensions of the framework were formulated by modifying categories of reading literacy and drawing on key elements of geometric construction texts. Three…
Alanazi, Ibrahim O; AlYahya, Sami A; Ebrahimie, Esmaeil; Mohammadi-Dehcheshmeh, Manijeh
2018-06-15
Exponentially growing scientific knowledge in scientific publications has resulted in the emergence of a new interdisciplinary science of literature mining. In text mining, the machine reads the published literature and transfers the discovered knowledge to mathematical-like formulas. In an integrative approach in this study, we used text mining in combination with network discovery, pathway analysis, and enrichment analysis of genomic regions for better understanding of biomarkers in lung cancer. Particular attention was paid to non-coding biomarkers. In total, 60 MicroRNA biomarkers were reported for lung cancer, including some prognostic biomarkers. MIR21, MIR155, MALAT1, and MIR31 were the top non-coding RNA biomarkers of lung cancer. Text mining identified 447 proteins which have been studied as biomarkers in lung cancer. EGFR (receptor), TP53 (transcription factor), KRAS, CDKN2A, ENO2, KRT19, RASSF1, GRP (ligand), SHOX2 (transcription factor), and ERBB2 (receptor) were the most studied proteins. Within small molecules, thymosin-a1, oestrogen, and 8-OHdG have received more attention. We found some chromosomal bands, such as 7q32.2, 18q12.1, 6p12, 11p15.5, and 3p21.3 that are highly involved in deriving lung cancer biomarkers. Copyright © 2018 Elsevier B.V. All rights reserved.
Generic framework for mining cellular automata models on protein-folding simulations.
Diaz, N; Tischer, I
2016-05-13
Cellular automata model identification is an important way of building simplified simulation models. In this study, we describe a generic architectural framework to ease the development process of new metaheuristic-based algorithms for cellular automata model identification in protein-folding trajectories. Our framework was developed by a methodology based on design patterns that allow an improved experience for new algorithms development. The usefulness of the proposed framework is demonstrated by the implementation of four algorithms, able to obtain extremely precise cellular automata models of the protein-folding process with a protein contact map representation. Dynamic rules obtained by the proposed approach are discussed, and future use for the new tool is outlined.
Literature classification for semi-automated updating of biological knowledgebases
2013-01-01
Background As the output of biological assays increase in resolution and volume, the body of specialized biological data, such as functional annotations of gene and protein sequences, enables extraction of higher-level knowledge needed for practical application in bioinformatics. Whereas common types of biological data, such as sequence data, are extensively stored in biological databases, functional annotations, such as immunological epitopes, are found primarily in semi-structured formats or free text embedded in primary scientific literature. Results We defined and applied a machine learning approach for literature classification to support updating of TANTIGEN, a knowledgebase of tumor T-cell antigens. Abstracts from PubMed were downloaded and classified as either "relevant" or "irrelevant" for database update. Training and five-fold cross-validation of a k-NN classifier on 310 abstracts yielded classification accuracy of 0.95, thus showing significant value in support of data extraction from the literature. Conclusion We here propose a conceptual framework for semi-automated extraction of epitope data embedded in scientific literature using principles from text mining and machine learning. The addition of such data will aid in the transition of biological databases to knowledgebases. PMID:24564403
NASA Astrophysics Data System (ADS)
Tyuleneva, Tatiana
2017-11-01
One of the problems of sustainable development of mining companies is attracting additional investment. To solve it requires access to international capital markets, in this context, enterprises need to prepare financial statements with international requirements based on the data generated by the accounting system. The article considers the basic problems of accounting in the extractive industries due to the nature of the industry, as well as evaluation of the completeness of their solution in the framework of international financial reporting standards. In addition, lists the characteristics of accounting for mining industry, due to the peculiarities of the production process that need to be considered to solve these problems. This sector is extremely important for individual countries and on a global scale.
Figure Text Extraction in Biomedical Literature
Kim, Daehyun; Yu, Hong
2011-01-01
Background Figures are ubiquitous in biomedical full-text articles, and they represent important biomedical knowledge. However, the sheer volume of biomedical publications has made it necessary to develop computational approaches for accessing figures. Therefore, we are developing the Biomedical Figure Search engine (http://figuresearch.askHERMES.org) to allow bioscientists to access figures efficiently. Since text frequently appears in figures, automatically extracting such text may assist the task of mining information from figures. Little research, however, has been conducted exploring text extraction from biomedical figures. Methodology We first evaluated an off-the-shelf Optical Character Recognition (OCR) tool on its ability to extract text from figures appearing in biomedical full-text articles. We then developed a Figure Text Extraction Tool (FigTExT) to improve the performance of the OCR tool for figure text extraction through the use of three innovative components: image preprocessing, character recognition, and text correction. We first developed image preprocessing to enhance image quality and to improve text localization. Then we adapted the off-the-shelf OCR tool on the improved text localization for character recognition. Finally, we developed and evaluated a novel text correction framework by taking advantage of figure-specific lexicons. Results/Conclusions The evaluation on 382 figures (9,643 figure texts in total) randomly selected from PubMed Central full-text articles shows that FigTExT performed with 84% precision, 98% recall, and 90% F1-score for text localization and with 62.5% precision, 51.0% recall and 56.2% F1-score for figure text extraction. When limiting figure texts to those judged by domain experts to be important content, FigTExT performed with 87.3% precision, 68.8% recall, and 77% F1-score. FigTExT significantly improved the performance of the off-the-shelf OCR tool we used, which on its own performed with 36.6% precision, 19.3% recall, and 25.3% F1-score for text extraction. In addition, our results show that FigTExT can extract texts that do not appear in figure captions or other associated text, further suggesting the potential utility of FigTExT for improving figure search. PMID:21249186
Figure text extraction in biomedical literature.
Kim, Daehyun; Yu, Hong
2011-01-13
Figures are ubiquitous in biomedical full-text articles, and they represent important biomedical knowledge. However, the sheer volume of biomedical publications has made it necessary to develop computational approaches for accessing figures. Therefore, we are developing the Biomedical Figure Search engine (http://figuresearch.askHERMES.org) to allow bioscientists to access figures efficiently. Since text frequently appears in figures, automatically extracting such text may assist the task of mining information from figures. Little research, however, has been conducted exploring text extraction from biomedical figures. We first evaluated an off-the-shelf Optical Character Recognition (OCR) tool on its ability to extract text from figures appearing in biomedical full-text articles. We then developed a Figure Text Extraction Tool (FigTExT) to improve the performance of the OCR tool for figure text extraction through the use of three innovative components: image preprocessing, character recognition, and text correction. We first developed image preprocessing to enhance image quality and to improve text localization. Then we adapted the off-the-shelf OCR tool on the improved text localization for character recognition. Finally, we developed and evaluated a novel text correction framework by taking advantage of figure-specific lexicons. The evaluation on 382 figures (9,643 figure texts in total) randomly selected from PubMed Central full-text articles shows that FigTExT performed with 84% precision, 98% recall, and 90% F1-score for text localization and with 62.5% precision, 51.0% recall and 56.2% F1-score for figure text extraction. When limiting figure texts to those judged by domain experts to be important content, FigTExT performed with 87.3% precision, 68.8% recall, and 77% F1-score. FigTExT significantly improved the performance of the off-the-shelf OCR tool we used, which on its own performed with 36.6% precision, 19.3% recall, and 25.3% F1-score for text extraction. In addition, our results show that FigTExT can extract texts that do not appear in figure captions or other associated text, further suggesting the potential utility of FigTExT for improving figure search.
OSCAR4: a flexible architecture for chemical text-mining
2011-01-01
The Open-Source Chemistry Analysis Routines (OSCAR) software, a toolkit for the recognition of named entities and data in chemistry publications, has been developed since 2002. Recent work has resulted in the separation of the core OSCAR functionality and its release as the OSCAR4 library. This library features a modular API (based on reduction of surface coupling) that permits client programmers to easily incorporate it into external applications. OSCAR4 offers a domain-independent architecture upon which chemistry specific text-mining tools can be built, and its development and usage are discussed. PMID:21999457
Literature evidence in open targets - a target validation platform.
Kafkas, Şenay; Dunham, Ian; McEntyre, Johanna
2017-06-06
We present the Europe PMC literature component of Open Targets - a target validation platform that integrates various evidence to aid drug target identification and validation. The component identifies target-disease associations in documents and ranks the documents based on their confidence from the Europe PMC literature database, by using rules utilising expert-provided heuristic information. The confidence score of a given document represents how valuable the document is in the scope of target validation for a given target-disease association by taking into account the credibility of the association based on the properties of the text. The component serves the platform regularly with the up-to-date data since December, 2015. Currently, there are a total number of 1168365 distinct target-disease associations text mined from >26 million PubMed abstracts and >1.2 million Open Access full text articles. Our comparative analyses on the current available evidence data in the platform revealed that 850179 of these associations are exclusively identified by literature mining. This component helps the platform's users by providing the most relevant literature hits for a given target and disease. The text mining evidence along with the other types of evidence can be explored visually through https://www.targetvalidation.org and all the evidence data is available for download in json format from https://www.targetvalidation.org/downloads/data .
Occupancy schedules learning process through a data mining framework
DOE Office of Scientific and Technical Information (OSTI.GOV)
D'Oca, Simona; Hong, Tianzhen
Building occupancy is a paramount factor in building energy simulations. Specifically, lighting, plug loads, HVAC equipment utilization, fresh air requirements and internal heat gain or loss greatly depends on the level of occupancy within a building. Developing the appropriate methodologies to describe and reproduce the intricate network responsible for human-building interactions are needed. Extrapolation of patterns from big data streams is a powerful analysis technique which will allow for a better understanding of energy usage in buildings. A three-step data mining framework is applied to discover occupancy patterns in office spaces. First, a data set of 16 offices with 10more » minute interval occupancy data, over a two year period is mined through a decision tree model which predicts the occupancy presence. Then a rule induction algorithm is used to learn a pruned set of rules on the results from the decision tree model. Finally, a cluster analysis is employed in order to obtain consistent patterns of occupancy schedules. Furthermore, the identified occupancy rules and schedules are representative as four archetypal working profiles that can be used as input to current building energy modeling programs, such as EnergyPlus or IDA-ICE, to investigate impact of occupant presence on design, operation and energy use in office buildings.« less
Occupancy schedules learning process through a data mining framework
D'Oca, Simona; Hong, Tianzhen
2014-12-17
Building occupancy is a paramount factor in building energy simulations. Specifically, lighting, plug loads, HVAC equipment utilization, fresh air requirements and internal heat gain or loss greatly depends on the level of occupancy within a building. Developing the appropriate methodologies to describe and reproduce the intricate network responsible for human-building interactions are needed. Extrapolation of patterns from big data streams is a powerful analysis technique which will allow for a better understanding of energy usage in buildings. A three-step data mining framework is applied to discover occupancy patterns in office spaces. First, a data set of 16 offices with 10more » minute interval occupancy data, over a two year period is mined through a decision tree model which predicts the occupancy presence. Then a rule induction algorithm is used to learn a pruned set of rules on the results from the decision tree model. Finally, a cluster analysis is employed in order to obtain consistent patterns of occupancy schedules. Furthermore, the identified occupancy rules and schedules are representative as four archetypal working profiles that can be used as input to current building energy modeling programs, such as EnergyPlus or IDA-ICE, to investigate impact of occupant presence on design, operation and energy use in office buildings.« less
Zhang, RuiJie; Li, Xia; Jiang, YongShuai; Liu, GuiYou; Li, ChuanXing; Zhang, Fan; Xiao, Yun; Gong, BinSheng
2009-02-01
High-throughout single nucleotide polymorphism detection technology and the existing knowledge provide strong support for mining the disease-related haplotypes and genes. In this study, first, we apply four kinds of haplotype identification methods (Confidence Intervals, Four Gamete Tests, Solid Spine of LD and fusing method of haplotype block) into high-throughout SNP genotype data to identify blocks, then use cluster analysis to verify the effectiveness of the four methods, and select the alcoholism-related SNP haplotypes through risk analysis. Second, we establish a mapping from haplotypes to alcoholism-related genes. Third, we inquire NCBI SNP and gene databases to locate the blocks and identify the candidate genes. In the end, we make gene function annotation by KEGG, Biocarta, and GO database. We find 159 haplotype blocks, which relate to the alcoholism most possibly on chromosome 1 approximately 22, including 227 haplotypes, of which 102 SNP haplotypes may increase the risk of alcoholism. We get 121 alcoholism-related genes and verify their reliability by the functional annotation of biology. In a word, we not only can handle the SNP data easily, but also can locate the disease-related genes precisely by combining our novel strategies of mining alcoholism-related haplotypes and genes with existing knowledge framework.
On the unsupervised analysis of domain-specific Chinese texts
Deng, Ke; Bol, Peter K.; Li, Kate J.; Liu, Jun S.
2016-01-01
With the growing availability of digitized text data both publicly and privately, there is a great need for effective computational tools to automatically extract information from texts. Because the Chinese language differs most significantly from alphabet-based languages in not specifying word boundaries, most existing Chinese text-mining methods require a prespecified vocabulary and/or a large relevant training corpus, which may not be available in some applications. We introduce an unsupervised method, top-down word discovery and segmentation (TopWORDS), for simultaneously discovering and segmenting words and phrases from large volumes of unstructured Chinese texts, and propose ways to order discovered words and conduct higher-level context analyses. TopWORDS is particularly useful for mining online and domain-specific texts where the underlying vocabulary is unknown or the texts of interest differ significantly from available training corpora. When outputs from TopWORDS are fed into context analysis tools such as topic modeling, word embedding, and association pattern finding, the results are as good as or better than that from using outputs of a supervised segmentation method. PMID:27185919
Mining biomedical images towards valuable information retrieval in biomedical and life sciences.
Ahmed, Zeeshan; Zeeshan, Saman; Dandekar, Thomas
2016-01-01
Biomedical images are helpful sources for the scientists and practitioners in drawing significant hypotheses, exemplifying approaches and describing experimental results in published biomedical literature. In last decades, there has been an enormous increase in the amount of heterogeneous biomedical image production and publication, which results in a need for bioimaging platforms for feature extraction and analysis of text and content in biomedical images to take advantage in implementing effective information retrieval systems. In this review, we summarize technologies related to data mining of figures. We describe and compare the potential of different approaches in terms of their developmental aspects, used methodologies, produced results, achieved accuracies and limitations. Our comparative conclusions include current challenges for bioimaging software with selective image mining, embedded text extraction and processing of complex natural language queries. © The Author(s) 2016. Published by Oxford University Press.
NASA Astrophysics Data System (ADS)
Fleischhammel, Petra; Schoenheinz, Dagmar; Grünewald, Uwe
2010-05-01
In terms of the European Water Framework Directive (WFD), post mining lakes are artificial water bodies (AWB). The sustainable integration of post mining lakes in the groundwater and surface water landscape and their consideration in river basin management plans have to be linked with various (geo)hydrological, hydro(geo)chemical, technological and socioeconomic issues. The Lower Lusatian lignite mining district in eastern Germany is part of the major river basins of river Elbe and river Oder. Regionally, the mining area is situated in the sub-basins of river Spree and Schwarze Elster. After the cessation of mining activities and thereby of the artificially created groundwater drawdown in numerous mining pits, a large number of post mining lakes are evolving as consequence of natural groundwater table recovery. The lakes' designated uses vary from water reservoirs to landscape, recreation or fish farming lakes. Groundwater raise is not only substantial for the lake filling, but also for the area rehabilitation and a largely self regulated water balance in post mining landscapes. Since the groundwater flow through soil and dump sites being affected by the former mining activities, groundwater experiences various changes in its hydrochemical properties as e.g. mineralization and acidification. Consequently, downstream located groundwater fed running and standing water bodies will be affected too. Respective the European Water Framework Directive, artificial post mining lakes are not allowed to cause significant adverse impacts on the good ecological status/potential of downstream groundwater and surface water bodies. The high sulphate concentrations of groundwater fed mining lakes which reach partly more than 1,000 mg/l are e.g. damaging concrete constructures in downstream water bodies thereby representing threats for hydraulic facilities and drinking water supply. Due to small amounts of nutrients, the lakes are characterised by oligo¬trophic to slightly mesotrophic conditions. The aquatic flora and fauna are limited to a few well adapted species. Therefore, the issue of hydrochemical constitution of the lakes' waters becomes more and more relevant. The prediction of water quality development in post mining lakes is a key requirement to regulate and manage the later hydrochemical conditions. Initially, this prediction was made by individual case studies for single lakes. By means of an iterative research process during the last years, hydrochemical lake models were developed as prediction tools, which allow a complex processing of interconnected post mining lakes and their integration in natural hydrography with respect to quantitative and qualitative evaluation. To counteract the poor water quality of mining lakes, flooding by surface water from neighbouring river basins, e.g. the river Neisse, shall support a quicker and thereby hydrochemically less damaging lake filling. However, this external flooding is only feasible under conditions of high runoff and therefore only as intermitted practice applicable. Additionally, technological measures of water treatment have to be applied to achieve the required effluent quality and to ensure the designated use. Regrettably, these technologies aren't commercially standard up to now and are not sustainable, while flooding or provides a huge amount itself of positive potential for hydrochemical stabilization. The river basin management of the rivers Spree and Schwarze Elster is attended by a common working group of the Federal States of Brandenburg and Berlin as well as the Free State of Saxony. The quantitative distribution of the regionally available water considers the potential use for drinking water supply, process water, …, and the flooding of open-pits. However, due to the formulated rank order, the flooding of the numerous mining open pits in Lusatia is on the last position. To guarantee a reliable flooding and a continuous water supply of the post mining lakes, additional water resources have to exploited. Additionally, the prospected climate induced changes in water supply have to be taken into account for a sustainable integrated water resources management in the Lusatian post-mining district.
2013-01-01
Background Professionals in the biomedical domain are confronted with an increasing mass of data. Developing methods to assist professional end users in the field of Knowledge Discovery to identify, extract, visualize and understand useful information from these huge amounts of data is a huge challenge. However, there are so many diverse methods and methodologies available, that for biomedical researchers who are inexperienced in the use of even relatively popular knowledge discovery methods, it can be very difficult to select the most appropriate method for their particular research problem. Results A web application, called KNODWAT (KNOwledge Discovery With Advanced Techniques) has been developed, using Java on Spring framework 3.1. and following a user-centered approach. The software runs on Java 1.6 and above and requires a web server such as Apache Tomcat and a database server such as the MySQL Server. For frontend functionality and styling, Twitter Bootstrap was used as well as jQuery for interactive user interface operations. Conclusions The framework presented is user-centric, highly extensible and flexible. Since it enables methods for testing using existing data to assess suitability and performance, it is especially suitable for inexperienced biomedical researchers, new to the field of knowledge discovery and data mining. For testing purposes two algorithms, CART and C4.5 were implemented using the WEKA data mining framework. PMID:23763826
Holzinger, Andreas; Zupan, Mario
2013-06-13
Professionals in the biomedical domain are confronted with an increasing mass of data. Developing methods to assist professional end users in the field of Knowledge Discovery to identify, extract, visualize and understand useful information from these huge amounts of data is a huge challenge. However, there are so many diverse methods and methodologies available, that for biomedical researchers who are inexperienced in the use of even relatively popular knowledge discovery methods, it can be very difficult to select the most appropriate method for their particular research problem. A web application, called KNODWAT (KNOwledge Discovery With Advanced Techniques) has been developed, using Java on Spring framework 3.1. and following a user-centered approach. The software runs on Java 1.6 and above and requires a web server such as Apache Tomcat and a database server such as the MySQL Server. For frontend functionality and styling, Twitter Bootstrap was used as well as jQuery for interactive user interface operations. The framework presented is user-centric, highly extensible and flexible. Since it enables methods for testing using existing data to assess suitability and performance, it is especially suitable for inexperienced biomedical researchers, new to the field of knowledge discovery and data mining. For testing purposes two algorithms, CART and C4.5 were implemented using the WEKA data mining framework.
A Proposed Data Fusion Architecture for Micro-Zone Analysis and Data Mining
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kevin McCarthy; Milos Manic
Data Fusion requires the ability to combine or “fuse” date from multiple data sources. Time Series Analysis is a data mining technique used to predict future values from a data set based upon past values. Unlike other data mining techniques, however, Time Series places special emphasis on periodicity and how seasonal and other time-based factors tend to affect trends over time. One of the difficulties encountered in developing generic time series techniques is the wide variability of the data sets available for analysis. This presents challenges all the way from the data gathering stage to results presentation. This paper presentsmore » an architecture designed and used to facilitate the collection of disparate data sets well suited to Time Series analysis as well as other predictive data mining techniques. Results show this architecture provides a flexible, dynamic framework for the capture and storage of a myriad of dissimilar data sets and can serve as a foundation from which to build a complete data fusion architecture.« less
A new genome-mining tool redefines the lasso peptide biosynthetic landscape
Tietz, Jonathan I.; Schwalen, Christopher J.; Patel, Parth S.; Maxson, Tucker; Blair, Patricia M.; Tai, Hua-Chia; Zakai, Uzma I.; Mitchell, Douglas A.
2016-01-01
Ribosomally synthesized and post-translationally modified peptide (RiPP) natural products are attractive for genome-driven discovery and re-engineering, but limitations in bioinformatic methods and exponentially increasing genomic data make large-scale mining difficult. We report RODEO (Rapid ORF Description and Evaluation Online), which combines hidden Markov model-based analysis, heuristic scoring, and machine learning to identify biosynthetic gene clusters and predict RiPP precursor peptides. We initially focused on lasso peptides, which display intriguing physiochemical properties and bioactivities, but their hypervariability renders them challenging prospects for automated mining. Our approach yielded the most comprehensive mapping of lasso peptide space, revealing >1,300 compounds. We characterized the structures and bioactivities of six lasso peptides, prioritized based on predicted structural novelty, including an unprecedented handcuff-like topology and another with a citrulline modification exceptionally rare among bacteria. These combined insights significantly expand the knowledge of lasso peptides, and more broadly, provide a framework for future genome-mining efforts. PMID:28244986
75 FR 55678 - Minerals Management: Adjustment of Cost Recovery Fees
Federal Register 2010, 2011, 2012, 2013, 2014
2010-09-14
... text to the general cost recovery fee table so that mineral cost recovery fees can be found in one... Coal and Oil Shale) Program's lease renewal fee will increase from $480 to $485; (C) The Mining Law... $2,840; and (D) The Mining Law Administration Program's fee for mineral patent adjudication of 10 or...
Federal Register 2010, 2011, 2012, 2013, 2014
2010-10-15
... Environmental Assessment and Finding of No Significant Impact for License Amendment No. 61 for Rio Algom Mining... amendment to Source Materials License SUA-1473 issued to Rio Algom Mining LLC (Rio Algom, or the Licensee... access the NRC's Agencywide Document Access and Management System (ADAMS), which provides text and image...
ERIC Educational Resources Information Center
Tsai, Yea-Ru; Ouyang, Chen-Sen; Chang, Yukon
2016-01-01
The purpose of this study is to propose a diagnostic approach to identify engineering students' English reading comprehension errors. Student data were collected during the process of reading texts of English for science and technology on a web-based cumulative sentence analysis system. For the analysis, the association-rule, data mining technique…
Information Gain Based Dimensionality Selection for Classifying Text Documents
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dumidu Wijayasekara; Milos Manic; Miles McQueen
2013-06-01
Selecting the optimal dimensions for various knowledge extraction applications is an essential component of data mining. Dimensionality selection techniques are utilized in classification applications to increase the classification accuracy and reduce the computational complexity. In text classification, where the dimensionality of the dataset is extremely high, dimensionality selection is even more important. This paper presents a novel, genetic algorithm based methodology, for dimensionality selection in text mining applications that utilizes information gain. The presented methodology uses information gain of each dimension to change the mutation probability of chromosomes dynamically. Since the information gain is calculated a priori, the computational complexitymore » is not affected. The presented method was tested on a specific text classification problem and compared with conventional genetic algorithm based dimensionality selection. The results show an improvement of 3% in the true positives and 1.6% in the true negatives over conventional dimensionality selection methods.« less
A Novel Multi-Class Ensemble Model for Classifying Imbalanced Biomedical Datasets
NASA Astrophysics Data System (ADS)
Bikku, Thulasi; Sambasiva Rao, N., Dr; Rao, Akepogu Ananda, Dr
2017-08-01
This paper mainly focuseson developing aHadoop based framework for feature selection and classification models to classify high dimensionality data in heterogeneous biomedical databases. Wide research has been performing in the fields of Machine learning, Big data and Data mining for identifying patterns. The main challenge is extracting useful features generated from diverse biological systems. The proposed model can be used for predicting diseases in various applications and identifying the features relevant to particular diseases. There is an exponential growth of biomedical repositories such as PubMed and Medline, an accurate predictive model is essential for knowledge discovery in Hadoop environment. Extracting key features from unstructured documents often lead to uncertain results due to outliers and missing values. In this paper, we proposed a two phase map-reduce framework with text preprocessor and classification model. In the first phase, mapper based preprocessing method was designed to eliminate irrelevant features, missing values and outliers from the biomedical data. In the second phase, a Map-Reduce based multi-class ensemble decision tree model was designed and implemented in the preprocessed mapper data to improve the true positive rate and computational time. The experimental results on the complex biomedical datasets show that the performance of our proposed Hadoop based multi-class ensemble model significantly outperforms state-of-the-art baselines.
DeepMeSH: deep semantic representation for improving large-scale MeSH indexing.
Peng, Shengwen; You, Ronghui; Wang, Hongning; Zhai, Chengxiang; Mamitsuka, Hiroshi; Zhu, Shanfeng
2016-06-15
Medical Subject Headings (MeSH) indexing, which is to assign a set of MeSH main headings to citations, is crucial for many important tasks in biomedical text mining and information retrieval. Large-scale MeSH indexing has two challenging aspects: the citation side and MeSH side. For the citation side, all existing methods, including Medical Text Indexer (MTI) by National Library of Medicine and the state-of-the-art method, MeSHLabeler, deal with text by bag-of-words, which cannot capture semantic and context-dependent information well. We propose DeepMeSH that incorporates deep semantic information for large-scale MeSH indexing. It addresses the two challenges in both citation and MeSH sides. The citation side challenge is solved by a new deep semantic representation, D2V-TFIDF, which concatenates both sparse and dense semantic representations. The MeSH side challenge is solved by using the 'learning to rank' framework of MeSHLabeler, which integrates various types of evidence generated from the new semantic representation. DeepMeSH achieved a Micro F-measure of 0.6323, 2% higher than 0.6218 of MeSHLabeler and 12% higher than 0.5637 of MTI, for BioASQ3 challenge data with 6000 citations. The software is available upon request. zhusf@fudan.edu.cn Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
Biomedical text mining for research rigor and integrity: tasks, challenges, directions.
Kilicoglu, Halil
2017-06-13
An estimated quarter of a trillion US dollars is invested in the biomedical research enterprise annually. There is growing alarm that a significant portion of this investment is wasted because of problems in reproducibility of research findings and in the rigor and integrity of research conduct and reporting. Recent years have seen a flurry of activities focusing on standardization and guideline development to enhance the reproducibility and rigor of biomedical research. Research activity is primarily communicated via textual artifacts, ranging from grant applications to journal publications. These artifacts can be both the source and the manifestation of practices leading to research waste. For example, an article may describe a poorly designed experiment, or the authors may reach conclusions not supported by the evidence presented. In this article, we pose the question of whether biomedical text mining techniques can assist the stakeholders in the biomedical research enterprise in doing their part toward enhancing research integrity and rigor. In particular, we identify four key areas in which text mining techniques can make a significant contribution: plagiarism/fraud detection, ensuring adherence to reporting guidelines, managing information overload and accurate citation/enhanced bibliometrics. We review the existing methods and tools for specific tasks, if they exist, or discuss relevant research that can provide guidance for future work. With the exponential increase in biomedical research output and the ability of text mining approaches to perform automatic tasks at large scale, we propose that such approaches can support tools that promote responsible research practices, providing significant benefits for the biomedical research enterprise. Published by Oxford University Press 2017. This work is written by a US Government employee and is in the public domain in the US.
May, Brian H; Zhang, Anthony; Lu, Yubo; Lu, Chuanjian; Xue, Charlie C L
2014-12-01
This project aimed to develop an approach to evaluating information contained in the premodern Traditional Chinese Medicine (TCM) literature that was (1) comprehensive, systematic, and replicable and (2) able to produce quantifiable output that could be used to answer specific research questions in order to identify natural products for clinical and experimental research. The project involved two stages. In stage 1, 14 TCM collections and compendia were evaluated for suitability as sources for searching; 8 of these were compared in detail. The results were published in the Journal of Alternative and Complementary Medicine. Stage 2 developed a text-mining approach for two of these sources. The text-mining approach was developed for Zhong Hua Yi Dian; Encyclopaedia of Traditional Chinese Medicine, 4th edition) and Zhong Yi Fang Ji Da Ci Dian; Great Compendium of Chinese Medical Formulae). This approach developed procedures for search term selection; methods for screening, classifying, and scoring data; procedures for systematic searching and data extraction; data checking procedures; and approaches for analyzing results. Examples are provided for studies of memory impairment and diabetic nephropathy, and issues relating to data interpretation are discussed. This approach to the analysis of large collections of the premodern TCM literature uses widely available sources and provides a text-mining approach that is systematic, replicable, and adaptable to the requirements of the particular project. Researchers can use these methods to explore changes in the names and conceptions of a disease over time, to identify which therapeutic methods have been more or less frequently used in different eras for particular disorders, and to assist in the selection of natural products for research efforts.
An Integrative data mining approach to identifying Adverse Outcome Pathway (AOP) Signatures
The Adverse Outcome Pathway (AOP) framework is a tool for making biological connections and summarizing key information across different levels of biological organization to connect biological perturbations at the molecular level to adverse outcomes for an individual or populatio...
Using high-throughput literature mining to support read-across predictions of toxicity (SOT)
Building scientific confidence in the development and evaluation of read-across remains an ongoing challenge. Approaches include establishing systematic frameworks to identify sources of uncertainty and ways to address them. One source of uncertainty is related to characterizing ...
High-throughput literature mining to support read-across predictions of toxicity (ASCCT meeting)
Building scientific confidence in the development and evaluation of read-across remains an ongoing challenge. Approaches include establishing systematic frameworks to identify sources of uncertainty and ways to address them. One source of uncertainty is related to characterizing ...
An Integrative Data Mining Approach to Identify Adverse Outcome Pathway Signatures
Adverse Outcome Pathways (AOPs) provide a formal framework for describing the mechanisms underlying the toxicity of chemicals in our environment. This process improves our ability to incorporate high-throughput toxicity testing (HTT) results and biomarker information on early key...
Incorporating linguistic knowledge for learning distributed word representations.
Wang, Yan; Liu, Zhiyuan; Sun, Maosong
2015-01-01
Combined with neural language models, distributed word representations achieve significant advantages in computational linguistics and text mining. Most existing models estimate distributed word vectors from large-scale data in an unsupervised fashion, which, however, do not take rich linguistic knowledge into consideration. Linguistic knowledge can be represented as either link-based knowledge or preference-based knowledge, and we propose knowledge regularized word representation models (KRWR) to incorporate these prior knowledge for learning distributed word representations. Experiment results demonstrate that our estimated word representation achieves better performance in task of semantic relatedness ranking. This indicates that our methods can efficiently encode both prior knowledge from knowledge bases and statistical knowledge from large-scale text corpora into a unified word representation model, which will benefit many tasks in text mining.
Incorporating Linguistic Knowledge for Learning Distributed Word Representations
Wang, Yan; Liu, Zhiyuan; Sun, Maosong
2015-01-01
Combined with neural language models, distributed word representations achieve significant advantages in computational linguistics and text mining. Most existing models estimate distributed word vectors from large-scale data in an unsupervised fashion, which, however, do not take rich linguistic knowledge into consideration. Linguistic knowledge can be represented as either link-based knowledge or preference-based knowledge, and we propose knowledge regularized word representation models (KRWR) to incorporate these prior knowledge for learning distributed word representations. Experiment results demonstrate that our estimated word representation achieves better performance in task of semantic relatedness ranking. This indicates that our methods can efficiently encode both prior knowledge from knowledge bases and statistical knowledge from large-scale text corpora into a unified word representation model, which will benefit many tasks in text mining. PMID:25874581
Text Mining of UU-ITE Implementation in Indonesia
NASA Astrophysics Data System (ADS)
Hakim, Lukmanul; Kusumasari, Tien F.; Lubis, Muharman
2018-04-01
At present, social media and networks act as one of the main platforms for sharing information, idea, thought and opinions. Many people share their knowledge and express their views on the specific topics or current hot issues that interest them. The social media texts have rich information about the complaints, comments, recommendation and suggestion as the automatic reaction or respond to government initiative or policy in order to overcome certain issues.This study examines the sentiment from netizensas part of citizen who has vocal sound about the implementation of UU ITE as the first cyberlaw in Indonesia as a means to identify the current tendency of citizen perception. To perform text mining techniques, this study used Twitter Rest API while R programming was utilized for the purpose of classification analysis based on hierarchical cluster.
Interactive text mining with Pipeline Pilot: a bibliographic web-based tool for PubMed.
Vellay, S G P; Latimer, N E Miller; Paillard, G
2009-06-01
Text mining has become an integral part of all research in the medical field. Many text analysis software platforms support particular use cases and only those. We show an example of a bibliographic tool that can be used to support virtually any use case in an agile manner. Here we focus on a Pipeline Pilot web-based application that interactively analyzes and reports on PubMed search results. This will be of interest to any scientist to help identify the most relevant papers in a topical area more quickly and to evaluate the results of query refinement. Links with Entrez databases help both the biologist and the chemist alike. We illustrate this application with Leishmaniasis, a neglected tropical disease, as a case study.
Spectral signature verification using statistical analysis and text mining
NASA Astrophysics Data System (ADS)
DeCoster, Mallory E.; Firpi, Alexe H.; Jacobs, Samantha K.; Cone, Shelli R.; Tzeng, Nigel H.; Rodriguez, Benjamin M.
2016-05-01
In the spectral science community, numerous spectral signatures are stored in databases representative of many sample materials collected from a variety of spectrometers and spectroscopists. Due to the variety and variability of the spectra that comprise many spectral databases, it is necessary to establish a metric for validating the quality of spectral signatures. This has been an area of great discussion and debate in the spectral science community. This paper discusses a method that independently validates two different aspects of a spectral signature to arrive at a final qualitative assessment; the textual meta-data and numerical spectral data. Results associated with the spectral data stored in the Signature Database1 (SigDB) are proposed. The numerical data comprising a sample material's spectrum is validated based on statistical properties derived from an ideal population set. The quality of the test spectrum is ranked based on a spectral angle mapper (SAM) comparison to the mean spectrum derived from the population set. Additionally, the contextual data of a test spectrum is qualitatively analyzed using lexical analysis text mining. This technique analyzes to understand the syntax of the meta-data to provide local learning patterns and trends within the spectral data, indicative of the test spectrum's quality. Text mining applications have successfully been implemented for security2 (text encryption/decryption), biomedical3 , and marketing4 applications. The text mining lexical analysis algorithm is trained on the meta-data patterns of a subset of high and low quality spectra, in order to have a model to apply to the entire SigDB data set. The statistical and textual methods combine to assess the quality of a test spectrum existing in a database without the need of an expert user. This method has been compared to other validation methods accepted by the spectral science community, and has provided promising results when a baseline spectral signature is present for comparison. The spectral validation method proposed is described from a practical application and analytical perspective.
Chen, Chuyun; Hong, Jiaming; Zhou, Weilin; Lin, Guohua; Wang, Zhengfei; Zhang, Qufei; Lu, Cuina; Lu, Lihong
2017-07-12
To construct a knowledge platform of acupuncture ancient books based on data mining technology, and to provide retrieval service for users. The Oracle 10 g database was applied and JAVA was selected as development language; based on the standard library and ancient books database established by manual entry, a variety of data mining technologies, including word segmentation, speech tagging, dependency analysis, rule extraction, similarity calculation, ambiguity analysis, supervised classification technology were applied to achieve text automatic extraction of ancient books; in the last, through association mining and decision analysis, the comprehensive and intelligent analysis of disease and symptom, meridians, acupoints, rules of acupuncture and moxibustion in acupuncture ancient books were realized, and retrieval service was provided for users through structure of browser/server (B/S). The platform realized full-text retrieval, word frequency analysis and association analysis; when diseases or acupoints were searched, the frequencies of meridian, acupoints (diseases) and techniques were presented from high to low, meanwhile the support degree and confidence coefficient between disease and acupoints (special acupoint), acupoints and acupoints in prescription, disease or acupoints and technique were presented. The experience platform of acupuncture ancient books based on data mining technology could be used as a reference for selection of disease, meridian and acupoint in clinical treatment and education of acupuncture and moxibustion.
Chen, Yi-An; Tripathi, Lokesh P; Mizuguchi, Kenji
2016-01-01
Data analysis is one of the most critical and challenging steps in drug discovery and disease biology. A user-friendly resource to visualize and analyse high-throughput data provides a powerful medium for both experimental and computational biologists to understand vastly different biological data types and obtain a concise, simplified and meaningful output for better knowledge discovery. We have previously developed TargetMine, an integrated data warehouse optimized for target prioritization. Here we describe how upgraded and newly modelled data types in TargetMine can now survey the wider biological and chemical data space, relevant to drug discovery and development. To enhance the scope of TargetMine from target prioritization to broad-based knowledge discovery, we have also developed a new auxiliary toolkit to assist with data analysis and visualization in TargetMine. This toolkit features interactive data analysis tools to query and analyse the biological data compiled within the TargetMine data warehouse. The enhanced system enables users to discover new hypotheses interactively by performing complicated searches with no programming and obtaining the results in an easy to comprehend output format. Database URL: http://targetmine.mizuguchilab.org. © The Author(s) 2016. Published by Oxford University Press.
Chen, Yi-An; Tripathi, Lokesh P.; Mizuguchi, Kenji
2016-01-01
Data analysis is one of the most critical and challenging steps in drug discovery and disease biology. A user-friendly resource to visualize and analyse high-throughput data provides a powerful medium for both experimental and computational biologists to understand vastly different biological data types and obtain a concise, simplified and meaningful output for better knowledge discovery. We have previously developed TargetMine, an integrated data warehouse optimized for target prioritization. Here we describe how upgraded and newly modelled data types in TargetMine can now survey the wider biological and chemical data space, relevant to drug discovery and development. To enhance the scope of TargetMine from target prioritization to broad-based knowledge discovery, we have also developed a new auxiliary toolkit to assist with data analysis and visualization in TargetMine. This toolkit features interactive data analysis tools to query and analyse the biological data compiled within the TargetMine data warehouse. The enhanced system enables users to discover new hypotheses interactively by performing complicated searches with no programming and obtaining the results in an easy to comprehend output format. Database URL: http://targetmine.mizuguchilab.org PMID:26989145
Integration of MOOCs in Advanced Mining Training Programmes
NASA Astrophysics Data System (ADS)
Saveleva, Irina; Greenwald, Oksana; Kolomiets, Svetlana; Medvedeva, Elena
2017-11-01
The paper covers the concept of innovative approaches in education based on incorporating MOOCs options into traditional classroom. It takes a look at the ways higher education instructors working with the mining engineers enrolled in advanced training programmes can brighten, upgrade and facilitate the learning process. The shift of higher education from in-class to online format has changed the learning environment and the methods of teaching including professional retraining courses. In addition, the need of mining companies for managers of a new kind obligates high school retraining centres rapidly move towards the 21st century skill framework. One of widely recognized innovations in the sphere of e-learning is MOOCs (Massive Open Online Courses) that can be used as an effective teaching tool for organizing professional training of managing staff of mining companies within the walls of a university. The authors share their instructional experience and show the benefits of introducing MOOCs options at the courses designed for retraining mining engineers and senior managers of coal enterprises. Though in recent researches the pedagogical value of MOOCs is highly questioned and even negated this invention of the 21st century can become an essential and truly helpful instrument in the hands of educators.
Public reactions to e-cigarette regulations on Twitter: a text mining analysis.
Lazard, Allison J; Wilcox, Gary B; Tuttle, Hannah M; Glowacki, Elizabeth M; Pikowski, Jessica
2017-12-01
In May 2016, the Food and Drug Administration (FDA) issued a final rule that deemed e-cigarettes to be within their regulatory authority as a tobacco product. News and opinions about the regulation were shared on social media platforms, such as Twitter, which can play an important role in shaping the public's attitudes. We analysed information shared on Twitter for insights into initial public reactions. A text mining approach was used to uncover important topics among reactions to the e-cigarette regulations on Twitter. SAS Text Miner V.12.1 software was used for descriptive text mining to uncover the primary topics from tweets collected from May 1 to May 17 2016 using NUVI software to gather the data. A total of nine topics were generated. These topics reveal initial reactions to whether the FDA's e-cigarette regulations will benefit or harm public health, how the regulations will impact the emerging e-cigarette market and efforts to share the news. The topics were dominated by negative or mixed reactions. In the days following the FDA's announcement of the new deeming regulations, the public reaction on Twitter was largely negative. Public health advocates should consider using social media outlets to better communicate the policy's intentions, reach and potential impact for public good to create a more balanced conversation. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2017. All rights reserved. No commercial use is permitted unless otherwise expressly granted.
2015-01-01
Background Sufficient knowledge of molecular and genetic interactions, which comprise the entire basis of the functioning of living systems, is one of the necessary requirements for successfully answering almost any research question in the field of biology and medicine. To date, more than 24 million scientific papers can be found in PubMed, with many of them containing descriptions of a wide range of biological processes. The analysis of such tremendous amounts of data requires the use of automated text-mining approaches. Although a handful of tools have recently been developed to meet this need, none of them provide error-free extraction of highly detailed information. Results The ANDSystem package was developed for the reconstruction and analysis of molecular genetic networks based on an automated text-mining technique. It provides a detailed description of the various types of interactions between genes, proteins, microRNA's, metabolites, cellular components, pathways and diseases, taking into account the specificity of cell lines and organisms. Although the accuracy of ANDSystem is comparable to other well known text-mining tools, such as Pathway Studio and STRING, it outperforms them in having the ability to identify an increased number of interaction types. Conclusion The use of ANDSystem, in combination with Pathway Studio and STRING, can improve the quality of the automated reconstruction of molecular and genetic networks. ANDSystem should provide a useful tool for researchers working in a number of different fields, including biology, biotechnology, pharmacology and medicine. PMID:25881313
Systematic review: Lost-time injuries in the US mining industry.
Nowrouzi-Kia, B; Sharma, B; Dignard, C; Kerekes, Z; Dumond, J; Li, A; Larivière, M
2017-08-01
The mining industry is associated with high levels of accidents, injuries and illnesses. Lost-time injuries are useful measures of health and safety in mines, and the effectiveness of its safety programmes. To identify the type of lost-time injuries in the US mining workforce and to examine predictors of these occupational injuries. Primary papers on lost-time injuries in the US mining sector were identified through a literature search in eight health, geology and mining databases, using a systematic review protocol tailored to each database. The Critical Appraisal Skills Programme (CASP), Framework of Quality Assurance for Administrative Data Source and the Cochrane Collaboration 'Risk of bias' assessment tools were used to assess study quality. A total of 1736 articles were retrieved before duplicates were removed. Fifteen articles were ultimately included with a CASP mean score of 6.33 (SD 0.62) out of 10. Predictors of lost-time injuries included slips and falls, electric injuries, use of mining equipment, working in underground mining, worker's age and occupational experience. This is the first systematic review of lost-time injuries in the US mining sector. The results support the need for further research on factors that contribute to workplace lost-time injuries as there is limited literature on the topic. Safety analytics should also be applied to uncover new trends and predict the likelihood of future incidents before they occur. New insights will allow employers to prevent injuries and foster a safer workplace environment by implementing successful occupational health and safety programmes. © The Author 2017. Published by Oxford University Press on behalf of the Society of Occupational Medicine. All rights reserved. For Permissions, please email: journals.permissions@oup.com
Hammarstrom, Jane M.; Johnson, Adam N.; Seal, Robert R.; Meier, Allen L.; Briggs, Paul L.; Piatak, Nadine M.
2006-01-01
The Virginia gold-pyrite belt, part of the central Virginia volcanic-plutonic belt, hosts numerous abandoned metal mines. The belt extends from about 50 km south of Washington, D.C., for approximately 175 km to the southwest into central Virginia. The rocks that comprise the belt include metamorphosed volcanic and clastic (noncarbonate) sedimentary rocks that were originally deposited during the Ordovician). Deposits that were mined can be classified into three broad categories: 1. volcanic-associated massive sulfide deposits, 2. low-sulfide quartz-gold vein deposits, 3. gold placer deposits, which result from weathering of the vein deposits The massive sulfide deposits were historically mined for iron and pyrite (sulfur), zinc, lead, and copper but also yielded byproduct gold and silver. The most intensely mineralized and mined section of the belt is southwest of Fredericksburg, in the Mineral district of Louisa and Spotsylvania counties. The Valzinco Piatak lead-zinc mine and the Mitchell gold prospect are abandoned sites in Spotsylvania County. As a result of environmental impacts associated with historic mining, both sites were prioritized for reclamation under the Virginia Orphaned Land Program administered by the Virginia Department of Mines, Minerals, and Energy (VDMME). This report summarizes geochemical data for all solid sample media, along with mineralogical data, and results of weathering experiments on Valzinco tailings and field experiments on sediment accumulation in Knights Branch. These data provide a framework for evaluating water-rock interactionsand geoenvironmental signatures of long-abandoned mines developed in massive sulfide deposits and low-sulfide gold-quartz vein deposits in the humid temperate ecosystem domain in the eastern United States.
ThaleMine: A Warehouse for Arabidopsis Data Integration and Discovery.
Krishnakumar, Vivek; Contrino, Sergio; Cheng, Chia-Yi; Belyaeva, Irina; Ferlanti, Erik S; Miller, Jason R; Vaughn, Matthew W; Micklem, Gos; Town, Christopher D; Chan, Agnes P
2017-01-01
ThaleMine (https://apps.araport.org/thalemine/) is a comprehensive data warehouse that integrates a wide array of genomic information of the model plant Arabidopsis thaliana. The data collection currently includes the latest structural and functional annotation from the Araport11 update, the Col-0 genome sequence, RNA-seq and array expression, co-expression, protein interactions, homologs, pathways, publications, alleles, germplasm and phenotypes. The data are collected from a wide variety of public resources. Users can browse gene-specific data through Gene Report pages, identify and create gene lists based on experiments or indexed keywords, and run GO enrichment analysis to investigate the biological significance of selected gene sets. Developed by the Arabidopsis Information Portal project (Araport, https://www.araport.org/), ThaleMine uses the InterMine software framework, which builds well-structured data, and provides powerful data query and analysis functionality. The warehoused data can be accessed by users via graphical interfaces, as well as programmatically via web-services. Here we describe recent developments in ThaleMine including new features and extensions, and discuss future improvements. InterMine has been broadly adopted by the model organism research community including nematode, rat, mouse, zebrafish, budding yeast, the modENCODE project, as well as being used for human data. ThaleMine is the first InterMine developed for a plant model. As additional new plant InterMines are developed by the legume and other plant research communities, the potential of cross-organism integrative data analysis will be further enabled. © The Author 2016. Published by Oxford University Press on behalf of Japanese Society of Plant Physiologists. All rights reserved. For permissions, please email: journals.permissions@oup.com.
An Interactive Visualization Framework to Support Exploration and Analysis of TBI/PTSD Clinical Data
2017-05-01
techniques to overcome some of the challenges and complexities of the data . Our approach uses a novel adaptive window-based frequency sequence mining ...AWARD NUMBER: W81XWH-15-2-0016 TITLE: An Interactive Visualization Framework to Support Exploration and Analysis of TBI/PTSD Clinical Data ...Analysis of TBI/PTSD Clinical Data 5a. CONTRACT NUMBER 5b. GRANT NUMBER W81XWH-15-2-0016 5c. PROGRAM ELEMENT NUMBER 6. AUTHOR(S) Dr. Jesus Caban 5d