Documents Similarity Measurement Using Field Association Terms.
ERIC Educational Resources Information Center
Atlam, El-Sayed; Fuketa, M.; Morita, K.; Aoe, Jun-ichi
2003-01-01
Discussion of text analysis and information retrieval and measurement of document similarity focuses on a new text manipulation system called FA (field association)-Sim that is useful for retrieving information in large heterogeneous texts and for recognizing content similarity in text excerpts. Discusses recall and precision, automatic indexing…
Duplicate document detection in DocBrowse
NASA Astrophysics Data System (ADS)
Chalana, Vikram; Bruce, Andrew G.; Nguyen, Thien
1998-04-01
Duplicate documents are frequently found in large databases of digital documents, such as those found in digital libraries or in the government declassification effort. Efficient duplicate document detection is important not only to allow querying for similar documents, but also to filter out redundant information in large document databases. We have designed three different algorithm to identify duplicate documents. The first algorithm is based on features extracted from the textual content of a document, the second algorithm is based on wavelet features extracted from the document image itself, and the third algorithm is a combination of the first two. These algorithms are integrated within the DocBrowse system for information retrieval from document images which is currently under development at MathSoft. DocBrowse supports duplicate document detection by allowing (1) automatic filtering to hide duplicate documents, and (2) ad hoc querying for similar or duplicate documents. We have tested the duplicate document detection algorithms on 171 documents and found that text-based method has an average 11-point precision of 97.7 percent while the image-based method has an average 11- point precision of 98.9 percent. However, in general, the text-based method performs better when the document contains enough high-quality machine printed text while the image- based method performs better when the document contains little or no quality machine readable text.
Spatial Paradigm for Information Retrieval and Exploration
DOE Office of Scientific and Technical Information (OSTI.GOV)
The SPIRE system consists of software for visual analysis of primarily text based information sources. This technology enables the content analysis of text documents without reading all the documents. It employs several algorithms for text and word proximity analysis. It identifies the key themes within the text documents. From this analysis, it projects the results onto a visual spatial proximity display (Galaxies or Themescape) where items (documents and/or themes) visually close to each other are known to have content which is close to each other. Innovative interaction techniques then allow for dynamic visual analysis of large text based information spaces.
SPIRE1.03. Spatial Paradigm for Information Retrieval and Exploration
DOE Office of Scientific and Technical Information (OSTI.GOV)
Adams, K.J.; Bohn, S.; Crow, V.
The SPIRE system consists of software for visual analysis of primarily text based information sources. This technology enables the content analysis of text documents without reading all the documents. It employs several algorithms for text and word proximity analysis. It identifies the key themes within the text documents. From this analysis, it projects the results onto a visual spatial proximity display (Galaxies or Themescape) where items (documents and/or themes) visually close to each other are known to have content which is close to each other. Innovative interaction techniques then allow for dynamic visual analysis of large text based information spaces.
Development of a full-text information retrieval system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keizo Oyama; AKira Miyazawa, Atsuhiro Takasu; Kouji Shibano
The authors have executed a project to realize a full-text information retrieval system. The system is designed to deal with a document database comprising full text of a large number of documents such as academic papers. The document structures are utilized in searching and extracting appropriate information. The concept of structure handling and the configuration of the system are described in this paper.
Comparing Latent Dirichlet Allocation and Latent Semantic Analysis as Classifiers
ERIC Educational Resources Information Center
Anaya, Leticia H.
2011-01-01
In the Information Age, a proliferation of unstructured text electronic documents exists. Processing these documents by humans is a daunting task as humans have limited cognitive abilities for processing large volumes of documents that can often be extremely lengthy. To address this problem, text data computer algorithms are being developed.…
Scalable ranked retrieval using document images
NASA Astrophysics Data System (ADS)
Jain, Rajiv; Oard, Douglas W.; Doermann, David
2013-12-01
Despite the explosion of text on the Internet, hard copy documents that have been scanned as images still play a significant role for some tasks. The best method to perform ranked retrieval on a large corpus of document images, however, remains an open research question. The most common approach has been to perform text retrieval using terms generated by optical character recognition. This paper, by contrast, examines whether a scalable segmentation-free image retrieval algorithm, which matches sub-images containing text or graphical objects, can provide additional benefit in satisfying a user's information needs on a large, real world dataset. Results on 7 million scanned pages from the CDIP v1.0 test collection show that content based image retrieval finds a substantial number of documents that text retrieval misses, and that when used as a basis for relevance feedback can yield improvements in retrieval effectiveness.
Fuzzy Document Clustering Approach using WordNet Lexical Categories
NASA Astrophysics Data System (ADS)
Gharib, Tarek F.; Fouad, Mohammed M.; Aref, Mostafa M.
Text mining refers generally to the process of extracting interesting information and knowledge from unstructured text. This area is growing rapidly mainly because of the strong need for analysing the huge and large amount of textual data that reside on internal file systems and the Web. Text document clustering provides an effective navigation mechanism to organize this large amount of data by grouping their documents into a small number of meaningful classes. In this paper we proposed a fuzzy text document clustering approach using WordNet lexical categories and Fuzzy c-Means algorithm. Some experiments are performed to compare efficiency of the proposed approach with the recently reported approaches. Experimental results show that Fuzzy clustering leads to great performance results. Fuzzy c-means algorithm overcomes other classical clustering algorithms like k-means and bisecting k-means in both clustering quality and running time efficiency.
Three-Dimensional Dispaly Of Document Set
Lantrip, David B.; Pennock, Kelly A.; Pottier, Marc C.; Schur, Anne; Thomas, James J.; Wise, James A.
2003-06-24
A method for spatializing text content for enhanced visual browsing and analysis. The invention is applied to large text document corpora such as digital libraries, regulations and procedures, archived reports, and the like. The text content from these sources may be transformed to a spatial representation that preserves informational characteristics from the documents. The three-dimensional representation may then be visually browsed and analyzed in ways that avoid language processing and that reduce the analysts' effort.
Three-dimensional display of document set
Lantrip, David B [Oxnard, CA; Pennock, Kelly A [Richland, WA; Pottier, Marc C [Richland, WA; Schur, Anne [Richland, WA; Thomas, James J [Richland, WA; Wise, James A [Richland, WA
2006-09-26
A method for spatializing text content for enhanced visual browsing and analysis. The invention is applied to large text document corpora such as digital libraries, regulations and procedures, archived reports, and the like. The text content from these sources may e transformed to a spatial representation that preserves informational characteristics from the documents. The three-dimensional representation may then be visually browsed and analyzed in ways that avoid language processing and that reduce the analysts' effort.
Three-dimensional display of document set
Lantrip, David B [Oxnard, CA; Pennock, Kelly A [Richland, WA; Pottier, Marc C [Richland, WA; Schur, Anne [Richland, WA; Thomas, James J [Richland, WA; Wise, James A [Richland, WA
2001-10-02
A method for spatializing text content for enhanced visual browsing and analysis. The invention is applied to large text document corpora such as digital libraries, regulations and procedures, archived reports, and the like. The text content from these sources may be transformed to a spatial representation that preserves informational characteristics from the documents. The three-dimensional representation may then be visually browsed and analyzed in ways that avoid language processing and that reduce the analysts' effort.
Three-dimensional display of document set
Lantrip, David B [Oxnard, CA; Pennock, Kelly A [Richland, WA; Pottier, Marc C [Richland, WA; Schur, Anne [Richland, WA; Thomas, James J [Richland, WA; Wise, James A [Richland, WA; York, Jeremy [Bothell, WA
2009-06-30
A method for spatializing text content for enhanced visual browsing and analysis. The invention is applied to large text document corpora such as digital libraries, regulations and procedures, archived reports, and the like. The text content from these sources may be transformed to a spatial representation that preserves informational characteristics from the documents. The three-dimensional representation may then be visually browsed and analyzed in ways that avoid language processing and that reduce the analysts' effort.
Visualizing the semantic content of large text databases using text maps
NASA Technical Reports Server (NTRS)
Combs, Nathan
1993-01-01
A methodology for generating text map representations of the semantic content of text databases is presented. Text maps provide a graphical metaphor for conceptualizing and visualizing the contents and data interrelationships of large text databases. Described are a set of experiments conducted against the TIPSTER corpora of Wall Street Journal articles. These experiments provide an introduction to current work in the representation and visualization of documents by way of their semantic content.
DocCube: Multi-Dimensional Visualization and Exploration of Large Document Sets.
ERIC Educational Resources Information Center
Mothe, Josiane; Chrisment, Claude; Dousset, Bernard; Alaux, Joel
2003-01-01
Describes a user interface that provides global visualizations of large document sets to help users formulate the query that corresponds to their information needs. Highlights include concept hierarchies that users can browse to specify and refine information needs; knowledge discovery in databases and texts; and multidimensional modeling.…
SureChEMBL: a large-scale, chemically annotated patent document database.
Papadatos, George; Davies, Mark; Dedman, Nathan; Chambers, Jon; Gaulton, Anna; Siddle, James; Koks, Richard; Irvine, Sean A; Pettersson, Joe; Goncharoff, Nicko; Hersey, Anne; Overington, John P
2016-01-04
SureChEMBL is a publicly available large-scale resource containing compounds extracted from the full text, images and attachments of patent documents. The data are extracted from the patent literature according to an automated text and image-mining pipeline on a daily basis. SureChEMBL provides access to a previously unavailable, open and timely set of annotated compound-patent associations, complemented with sophisticated combined structure and keyword-based search capabilities against the compound repository and patent document corpus; given the wealth of knowledge hidden in patent documents, analysis of SureChEMBL data has immediate applications in drug discovery, medicinal chemistry and other commercial areas of chemical science. Currently, the database contains 17 million compounds extracted from 14 million patent documents. Access is available through a dedicated web-based interface and data downloads at: https://www.surechembl.org/. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
SureChEMBL: a large-scale, chemically annotated patent document database
Papadatos, George; Davies, Mark; Dedman, Nathan; Chambers, Jon; Gaulton, Anna; Siddle, James; Koks, Richard; Irvine, Sean A.; Pettersson, Joe; Goncharoff, Nicko; Hersey, Anne; Overington, John P.
2016-01-01
SureChEMBL is a publicly available large-scale resource containing compounds extracted from the full text, images and attachments of patent documents. The data are extracted from the patent literature according to an automated text and image-mining pipeline on a daily basis. SureChEMBL provides access to a previously unavailable, open and timely set of annotated compound-patent associations, complemented with sophisticated combined structure and keyword-based search capabilities against the compound repository and patent document corpus; given the wealth of knowledge hidden in patent documents, analysis of SureChEMBL data has immediate applications in drug discovery, medicinal chemistry and other commercial areas of chemical science. Currently, the database contains 17 million compounds extracted from 14 million patent documents. Access is available through a dedicated web-based interface and data downloads at: https://www.surechembl.org/. PMID:26582922
Text line extraction in free style document
NASA Astrophysics Data System (ADS)
Shen, Xiaolu; Liu, Changsong; Ding, Xiaoqing; Zou, Yanming
2009-01-01
This paper addresses to text line extraction in free style document, such as business card, envelope, poster, etc. In free style document, global property such as character size, line direction can hardly be concluded, which reveals a grave limitation in traditional layout analysis. 'Line' is the most prominent and the highest structure in our bottom-up method. First, we apply a novel intensity function found on gradient information to locate text areas where gradient within a window have large magnitude and various directions, and split such areas into text pieces. We build a probability model of lines consist of text pieces via statistics on training data. For an input image, we group text pieces to lines using a simulated annealing algorithm with cost function based on the probability model.
Validating a strategy for psychosocial phenotyping using a large corpus of clinical text.
Gundlapalli, Adi V; Redd, Andrew; Carter, Marjorie; Divita, Guy; Shen, Shuying; Palmer, Miland; Samore, Matthew H
2013-12-01
To develop algorithms to improve efficiency of patient phenotyping using natural language processing (NLP) on text data. Of a large number of note titles available in our database, we sought to determine those with highest yield and precision for psychosocial concepts. From a database of over 1 billion documents from US Department of Veterans Affairs medical facilities, a random sample of 1500 documents from each of 218 enterprise note titles were chosen. Psychosocial concepts were extracted using a UIMA-AS-based NLP pipeline (v3NLP), using a lexicon of relevant concepts with negation and template format annotators. Human reviewers evaluated a subset of documents for false positives and sensitivity. High-yield documents were identified by hit rate and precision. Reasons for false positivity were characterized. A total of 58 707 psychosocial concepts were identified from 316 355 documents for an overall hit rate of 0.2 concepts per document (median 0.1, range 1.6-0). Of 6031 concepts reviewed from a high-yield set of note titles, the overall precision for all concept categories was 80%, with variability among note titles and concept categories. Reasons for false positivity included templating, negation, context, and alternate meaning of words. The sensitivity of the NLP system was noted to be 49% (95% CI 43% to 55%). Phenotyping using NLP need not involve the entire document corpus. Our methods offer a generalizable strategy for scaling NLP pipelines to large free text corpora with complex linguistic annotations in attempts to identify patients of a certain phenotype.
Validating a strategy for psychosocial phenotyping using a large corpus of clinical text
Gundlapalli, Adi V; Redd, Andrew; Carter, Marjorie; Divita, Guy; Shen, Shuying; Palmer, Miland; Samore, Matthew H
2013-01-01
Objective To develop algorithms to improve efficiency of patient phenotyping using natural language processing (NLP) on text data. Of a large number of note titles available in our database, we sought to determine those with highest yield and precision for psychosocial concepts. Materials and methods From a database of over 1 billion documents from US Department of Veterans Affairs medical facilities, a random sample of 1500 documents from each of 218 enterprise note titles were chosen. Psychosocial concepts were extracted using a UIMA-AS-based NLP pipeline (v3NLP), using a lexicon of relevant concepts with negation and template format annotators. Human reviewers evaluated a subset of documents for false positives and sensitivity. High-yield documents were identified by hit rate and precision. Reasons for false positivity were characterized. Results A total of 58 707 psychosocial concepts were identified from 316 355 documents for an overall hit rate of 0.2 concepts per document (median 0.1, range 1.6–0). Of 6031 concepts reviewed from a high-yield set of note titles, the overall precision for all concept categories was 80%, with variability among note titles and concept categories. Reasons for false positivity included templating, negation, context, and alternate meaning of words. The sensitivity of the NLP system was noted to be 49% (95% CI 43% to 55%). Conclusions Phenotyping using NLP need not involve the entire document corpus. Our methods offer a generalizable strategy for scaling NLP pipelines to large free text corpora with complex linguistic annotations in attempts to identify patients of a certain phenotype. PMID:24169276
High-Reproducibility and High-Accuracy Method for Automated Topic Classification
NASA Astrophysics Data System (ADS)
Lancichinetti, Andrea; Sirer, M. Irmak; Wang, Jane X.; Acuna, Daniel; Körding, Konrad; Amaral, Luís A. Nunes
2015-01-01
Much of human knowledge sits in large databases of unstructured text. Leveraging this knowledge requires algorithms that extract and record metadata on unstructured text documents. Assigning topics to documents will enable intelligent searching, statistical characterization, and meaningful classification. Latent Dirichlet allocation (LDA) is the state of the art in topic modeling. Here, we perform a systematic theoretical and numerical analysis that demonstrates that current optimization techniques for LDA often yield results that are not accurate in inferring the most suitable model parameters. Adapting approaches from community detection in networks, we propose a new algorithm that displays high reproducibility and high accuracy and also has high computational efficiency. We apply it to a large set of documents in the English Wikipedia and reveal its hierarchical structure.
Neural networks for data mining electronic text collections
NASA Astrophysics Data System (ADS)
Walker, Nicholas; Truman, Gregory
1997-04-01
The use of neural networks in information retrieval and text analysis has primarily suffered from the issues of adequate document representation, the ability to scale to very large collections, dynamism in the face of new information and the practical difficulties of basing the design on the use of supervised training sets. Perhaps the most important approach to begin solving these problems is the use of `intermediate entities' which reduce the dimensionality of document representations and the size of documents collections to manageable levels coupled with the use of unsupervised neural network paradigms. This paper describes the issues, a fully configured neural network-based text analysis system--dataHARVEST--aimed at data mining text collections which begins this process, along with the remaining difficulties and potential ways forward.
Görg, Carsten; Liu, Zhicheng; Kihm, Jaeyeon; Choo, Jaegul; Park, Haesun; Stasko, John
2013-10-01
Investigators across many disciplines and organizations must sift through large collections of text documents to understand and piece together information. Whether they are fighting crime, curing diseases, deciding what car to buy, or researching a new field, inevitably investigators will encounter text documents. Taking a visual analytics approach, we integrate multiple text analysis algorithms with a suite of interactive visualizations to provide a flexible and powerful environment that allows analysts to explore collections of documents while sensemaking. Our particular focus is on the process of integrating automated analyses with interactive visualizations in a smooth and fluid manner. We illustrate this integration through two example scenarios: an academic researcher examining InfoVis and VAST conference papers and a consumer exploring car reviews while pondering a purchase decision. Finally, we provide lessons learned toward the design and implementation of visual analytics systems for document exploration and understanding.
Document similarity measures and document browsing
NASA Astrophysics Data System (ADS)
Ahmadullin, Ildus; Fan, Jian; Damera-Venkata, Niranjan; Lim, Suk Hwan; Lin, Qian; Liu, Jerry; Liu, Sam; O'Brien-Strain, Eamonn; Allebach, Jan
2011-03-01
Managing large document databases is an important task today. Being able to automatically com- pare document layouts and classify and search documents with respect to their visual appearance proves to be desirable in many applications. We measure single page documents' similarity with respect to distance functions between three document components: background, text, and saliency. Each document component is represented as a Gaussian mixture distribution; and distances between dierent documents' components are calculated as probabilistic similarities between corresponding distributions. The similarity measure between documents is represented as a weighted sum of the components' distances. Using this document similarity measure, we propose a browsing mechanism operating on a document dataset. For these purposes, we use a hierarchical browsing environment which we call the document similarity pyramid. It allows the user to browse a large document dataset and to search for documents in the dataset that are similar to the query. The user can browse the dataset on dierent levels of the pyramid, and zoom into the documents that are of interest.
System of HPC content archiving
NASA Astrophysics Data System (ADS)
Bogdanov, A.; Ivashchenko, A.
2017-12-01
This work is aimed to develop a system, that will effectively solve the problem of storing and analyzing files containing text data, by using modern software development tools, techniques and approaches. The main challenge of storing a large number of text documents defined at the problem formulation stage, have to be resolved with such functionality as full text search and document clustering depends on their contents. Main system features could be described with notions of distributed multilevel architecture, flexibility and interchangeability of components, achieved through the standard functionality incapsulation in independent executable modules.
Taming Big Data: An Information Extraction Strategy for Large Clinical Text Corpora.
Gundlapalli, Adi V; Divita, Guy; Carter, Marjorie E; Redd, Andrew; Samore, Matthew H; Gupta, Kalpana; Trautner, Barbara
2015-01-01
Concepts of interest for clinical and research purposes are not uniformly distributed in clinical text available in electronic medical records. The purpose of our study was to identify filtering techniques to select 'high yield' documents for increased efficacy and throughput. Using two large corpora of clinical text, we demonstrate the identification of 'high yield' document sets in two unrelated domains: homelessness and indwelling urinary catheters. For homelessness, the high yield set includes homeless program and social work notes. For urinary catheters, concepts were more prevalent in notes from hospitalized patients; nursing notes accounted for a majority of the high yield set. This filtering will enable customization and refining of information extraction pipelines to facilitate extraction of relevant concepts for clinical decision support and other uses.
Massively Scalable Near Duplicate Detection in Streams of Documents using MDSH
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bogen, Paul Logasa; Symons, Christopher T; McKenzie, Amber T
2013-01-01
In a world where large-scale text collections are not only becoming ubiquitous but also are growing at increasing rates, near duplicate documents are becoming a growing concern that has the potential to hinder many different information filtering tasks. While others have tried to address this problem, prior techniques have only been used on limited collection sizes and static cases. We will briefly describe the problem in the context of Open Source Intelligence (OSINT) along with our additional constraints for performance. In this work we propose two variations on Multi-dimensional Spectral Hash (MDSH) tailored for working on extremely large, growing setsmore » of text documents. We analyze the memory and runtime characteristics of our techniques and provide an informal analysis of the quality of the near-duplicate clusters produced by our techniques.« less
Computer-Assisted Search Of Large Textual Data Bases
NASA Technical Reports Server (NTRS)
Driscoll, James R.
1995-01-01
"QA" denotes high-speed computer system for searching diverse collections of documents including (but not limited to) technical reference manuals, legal documents, medical documents, news releases, and patents. Incorporates previously available and emerging information-retrieval technology to help user intelligently and rapidly locate information found in large textual data bases. Technology includes provision for inquiries in natural language; statistical ranking of retrieved information; artificial-intelligence implementation of semantics, in which "surface level" knowledge found in text used to improve ranking of retrieved information; and relevance feedback, in which user's judgements of relevance of some retrieved documents used automatically to modify search for further information.
Supporting the education evidence portal via text mining
Ananiadou, Sophia; Thompson, Paul; Thomas, James; Mu, Tingting; Oliver, Sandy; Rickinson, Mark; Sasaki, Yutaka; Weissenbacher, Davy; McNaught, John
2010-01-01
The UK Education Evidence Portal (eep) provides a single, searchable, point of access to the contents of the websites of 33 organizations relating to education, with the aim of revolutionizing work practices for the education community. Use of the portal alleviates the need to spend time searching multiple resources to find relevant information. However, the combined content of the websites of interest is still very large (over 500 000 documents and growing). This means that searches using the portal can produce very large numbers of hits. As users often have limited time, they would benefit from enhanced methods of performing searches and viewing results, allowing them to drill down to information of interest more efficiently, without having to sift through potentially long lists of irrelevant documents. The Joint Information Systems Committee (JISC)-funded ASSIST project has produced a prototype web interface to demonstrate the applicability of integrating a number of text-mining tools and methods into the eep, to facilitate an enhanced searching, browsing and document-viewing experience. New features include automatic classification of documents according to a taxonomy, automatic clustering of search results according to similar document content, and automatic identification and highlighting of key terms within documents. PMID:20643679
Text Mining the History of Medicine.
Thompson, Paul; Batista-Navarro, Riza Theresa; Kontonatsios, Georgios; Carter, Jacob; Toon, Elizabeth; McNaught, John; Timmermann, Carsten; Worboys, Michael; Ananiadou, Sophia
2016-01-01
Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, according to differences and evolutions in vocabulary, terminology, language structure and style, compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid 19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible semantically-oriented search system. The novel resources are available for research purposes, while the processing pipeline and its modules may be used and configured within the Argo TM platform.
Text Mining the History of Medicine
Thompson, Paul; Batista-Navarro, Riza Theresa; Kontonatsios, Georgios; Carter, Jacob; Toon, Elizabeth; McNaught, John; Timmermann, Carsten; Worboys, Michael; Ananiadou, Sophia
2016-01-01
Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, according to differences and evolutions in vocabulary, terminology, language structure and style, compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid 19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible semantically-oriented search system. The novel resources are available for research purposes, while the processing pipeline and its modules may be used and configured within the Argo TM platform. PMID:26734936
Using phrases and document metadata to improve topic modeling of clinical reports.
Speier, William; Ong, Michael K; Arnold, Corey W
2016-06-01
Probabilistic topic models provide an unsupervised method for analyzing unstructured text, which have the potential to be integrated into clinical automatic summarization systems. Clinical documents are accompanied by metadata in a patient's medical history and frequently contains multiword concepts that can be valuable for accurately interpreting the included text. While existing methods have attempted to address these problems individually, we present a unified model for free-text clinical documents that integrates contextual patient- and document-level data, and discovers multi-word concepts. In the proposed model, phrases are represented by chained n-grams and a Dirichlet hyper-parameter is weighted by both document-level and patient-level context. This method and three other Latent Dirichlet allocation models were fit to a large collection of clinical reports. Examples of resulting topics demonstrate the results of the new model and the quality of the representations are evaluated using empirical log likelihood. The proposed model was able to create informative prior probabilities based on patient and document information, and captured phrases that represented various clinical concepts. The representation using the proposed model had a significantly higher empirical log likelihood than the compared methods. Integrating document metadata and capturing phrases in clinical text greatly improves the topic representation of clinical documents. The resulting clinically informative topics may effectively serve as the basis for an automatic summarization system for clinical reports. Copyright © 2016 Elsevier Inc. All rights reserved.
Development and Evaluation of Thesauri-Based Bibliographic Biomedical Search Engine
ERIC Educational Resources Information Center
Alghoson, Abdullah
2017-01-01
Due to the large volume and exponential growth of biomedical documents (e.g., books, journal articles), it has become increasingly challenging for biomedical search engines to retrieve relevant documents based on users' search queries. Part of the challenge is the matching mechanism of free-text indexing that performs matching based on…
Assessing usage patterns of electronic clinical documentation templates.
Vawdrey, David K
2008-11-06
Many vendors of electronic medical records support structured and free-text entry of clinical documents using configurable templates. At a healthcare institution comprising two large academic medical centers, a documentation management data mart and a custom, Web-accessible business intelligence application were developed to track the availability and usage of electronic documentation templates. For each medical center, template availability and usage trends were measured from November 2007 through February 2008. By February 2008, approximately 65,000 electronic notes were authored per week on the two campuses. One site had 934 available templates, with 313 being used to author at least one note. The other site had 765 templates, of which 480 were used. The most commonly used template at both campuses was a free text note called "Miscellaneous Nursing Note," which accounted for 33.3% of total documents generated at one campus and 15.2% at the other.
Transcript mapping for handwritten English documents
NASA Astrophysics Data System (ADS)
Jose, Damien; Bharadwaj, Anurag; Govindaraju, Venu
2008-01-01
Transcript mapping or text alignment with handwritten documents is the automatic alignment of words in a text file with word images in a handwritten document. Such a mapping has several applications in fields ranging from machine learning where large quantities of truth data are required for evaluating handwriting recognition algorithms, to data mining where word image indexes are used in ranked retrieval of scanned documents in a digital library. The alignment also aids "writer identity" verification algorithms. Interfaces which display scanned handwritten documents may use this alignment to highlight manuscript tokens when a person examines the corresponding transcript word. We propose an adaptation of the True DTW dynamic programming algorithm for English handwritten documents. The integration of the dissimilarity scores from a word-model word recognizer and Levenshtein distance between the recognized word and lexicon word, as a cost metric in the DTW algorithm leading to a fast and accurate alignment, is our primary contribution. Results provided, confirm the effectiveness of our approach.
NASA Astrophysics Data System (ADS)
Srinivasa, K. G.; Shree Devi, B. N.
2017-10-01
String searching in documents has become a tedious task with the evolution of Big Data. Generation of large data sets demand for a high performance search algorithm in areas such as text mining, information retrieval and many others. The popularity of GPU's for general purpose computing has been increasing for various applications. Therefore it is of great interest to exploit the thread feature of a GPU to provide a high performance search algorithm. This paper proposes an optimized new approach to N-gram model for string search in a number of lengthy documents and its GPU implementation. The algorithm exploits GPGPUs for searching strings in many documents employing character level N-gram matching with parallel Score Table approach and search using CUDA API. The new approach of Score table used for frequency storage of N-grams in a document, makes the search independent of the document's length and allows faster access to the frequency values, thus decreasing the search complexity. The extensive thread feature in a GPU has been exploited to enable parallel pre-processing of trigrams in a document for Score Table creation and parallel search in huge number of documents, thus speeding up the whole search process even for a large pattern size. Experiments were carried out for many documents of varied length and search strings from the standard Lorem Ipsum text on NVIDIA's GeForce GT 540M GPU with 96 cores. Results prove that the parallel approach for Score Table creation and searching gives a good speed up than the same approach executed serially.
Enforced Sparse Non-Negative Matrix Factorization
2016-01-23
documents to find interesting pieces of information. With limited resources, analysts often employ automated text - mining tools that highlight common...represented as an undirected bipartite graph. It has become a common method for generating topic models of text data because it is known to produce good results...model and the convergence rate of the underlying algorithm. I. Introduction A common analyst challenge is searching through large quantities of text
DOE Office of Scientific and Technical Information (OSTI.GOV)
Crossno, Patricia Joyce; Dunlavy, Daniel M.; Stanton, Eric T.
This report is a summary of the accomplishments of the 'Scalable Solutions for Processing and Searching Very Large Document Collections' LDRD, which ran from FY08 through FY10. Our goal was to investigate scalable text analysis; specifically, methods for information retrieval and visualization that could scale to extremely large document collections. Towards that end, we designed, implemented, and demonstrated a scalable framework for text analysis - ParaText - as a major project deliverable. Further, we demonstrated the benefits of using visual analysis in text analysis algorithm development, improved performance of heterogeneous ensemble models in data classification problems, and the advantages ofmore » information theoretic methods in user analysis and interpretation in cross language information retrieval. The project involved 5 members of the technical staff and 3 summer interns (including one who worked two summers). It resulted in a total of 14 publications, 3 new software libraries (2 open source and 1 internal to Sandia), several new end-user software applications, and over 20 presentations. Several follow-on projects have already begun or will start in FY11, with additional projects currently in proposal.« less
Helios: Understanding Solar Evolution Through Text Analytics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Randazzese, Lucien
This proof-of-concept project focused on developing, testing, and validating a range of bibliometric, text analytic, and machine-learning based methods to explore the evolution of three photovoltaic (PV) technologies: Cadmium Telluride (CdTe), Dye-Sensitized solar cells (DSSC), and Multi-junction solar cells. The analytical approach to the work was inspired by previous work by the same team to measure and predict the scientific prominence of terms and entities within specific research domains. The goal was to create tools that could assist domain-knowledgeable analysts in investigating the history and path of technological developments in general, with a focus on analyzing step-function changes in performance,more » or “breakthroughs,” in particular. The text-analytics platform developed during this project was dubbed Helios. The project relied on computational methods for analyzing large corpora of technical documents. For this project we ingested technical documents from the following sources into Helios: Thomson Scientific Web of Science (papers), the U.S. Patent & Trademark Office (patents), the U.S. Department of Energy (technical documents), the U.S. National Science Foundation (project funding summaries), and a hand curated set of full-text documents from Thomson Scientific and other sources.« less
Typograph: Multiscale Spatial Exploration of Text Documents
DOE Office of Scientific and Technical Information (OSTI.GOV)
Endert, Alexander; Burtner, Edwin R.; Cramer, Nicholas O.
2013-12-01
Visualizing large document collections using a spatial layout of terms can enable quick overviews of information. However, these metaphors (e.g., word clouds, tag clouds, etc.) often lack interactivity to explore the information and the location and rendering of the terms are often not based on mathematical models that maintain relative distances from other information based on similarity metrics. Further, transitioning between levels of detail (i.e., from terms to full documents) can be challanging. In this paper, we present Typograph, a multi-scale spatial exploration visualization for large document collections. Based on the term-based visualization methods, Typograh enables multipel levels of detailmore » (terms, phrases, snippets, and full documents) within the single spatialization. Further, the information is placed based on their relative similarity to other information to create the “near = similar” geography metaphor. This paper discusses the design principles and functionality of Typograph and presents a use case analyzing Wikipedia to demonstrate usage.« less
Highlights from Evaluation of EBCE.
ERIC Educational Resources Information Center
Bucknam, Ronald B.; Brand, Sheara G.
1983-01-01
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT A meta-analysis of 80 third-party evaluations (all that were conducted) of Experience-Based Career Education programs shows that in the large majority of programs: (1) EBCE students made large gains not only in career skills and life attitudes but also in academic skills; (2) EBCE students gained…
Redd, Andrew M; Gundlapalli, Adi V; Divita, Guy; Carter, Marjorie E; Tran, Le-Thuy; Samore, Matthew H
2017-07-01
Templates in text notes pose challenges for automated information extraction algorithms. We propose a method that identifies novel templates in plain text medical notes. The identification can then be used to either include or exclude templates when processing notes for information extraction. The two-module method is based on the framework of information foraging and addresses the hypothesis that documents containing templates and the templates within those documents can be identified by common features. The first module takes documents from the corpus and groups those with common templates. This is accomplished through a binned word count hierarchical clustering algorithm. The second module extracts the templates. It uses the groupings and performs a longest common subsequence (LCS) algorithm to obtain the constituent parts of the templates. The method was developed and tested on a random document corpus of 750 notes derived from a large database of US Department of Veterans Affairs (VA) electronic medical notes. The grouping module, using hierarchical clustering, identified 23 groups with 3 documents or more, consisting of 120 documents from the 750 documents in our test corpus. Of these, 18 groups had at least one common template that was present in all documents in the group for a positive predictive value of 78%. The LCS extraction module performed with 100% positive predictive value, 94% sensitivity, and 83% negative predictive value. The human review determined that in 4 groups the template covered the entire document, with the remaining 14 groups containing a common section template. Among documents with templates, the number of templates per document ranged from 1 to 14. The mean and median number of templates per group was 5.9 and 5, respectively. The grouping method was successful in finding like documents containing templates. Of the groups of documents containing templates, the LCS module was successful in deciphering text belonging to the template and text that was extraneous. Major obstacles to improved performance included documents composed of multiple templates, templates that included other templates embedded within them, and variants of templates. We demonstrate proof of concept of the grouping and extraction method of identifying templates in electronic medical records in this pilot study and propose methods to improve performance and scaling up. Published by Elsevier Inc.
Human Rights Texts: Converting Human Rights Primary Source Documents into Data.
Fariss, Christopher J; Linder, Fridolin J; Jones, Zachary M; Crabtree, Charles D; Biek, Megan A; Ross, Ana-Sophia M; Kaur, Taranamol; Tsai, Michael
2015-01-01
We introduce and make publicly available a large corpus of digitized primary source human rights documents which are published annually by monitoring agencies that include Amnesty International, Human Rights Watch, the Lawyers Committee for Human Rights, and the United States Department of State. In addition to the digitized text, we also make available and describe document-term matrices, which are datasets that systematically organize the word counts from each unique document by each unique term within the corpus of human rights documents. To contextualize the importance of this corpus, we describe the development of coding procedures in the human rights community and several existing categorical indicators that have been created by human coding of the human rights documents contained in the corpus. We then discuss how the new human rights corpus and the existing human rights datasets can be used with a variety of statistical analyses and machine learning algorithms to help scholars understand how human rights practices and reporting have evolved over time. We close with a discussion of our plans for dataset maintenance, updating, and availability.
Human Rights Texts: Converting Human Rights Primary Source Documents into Data
Fariss, Christopher J.; Linder, Fridolin J.; Jones, Zachary M.; Crabtree, Charles D.; Biek, Megan A.; Ross, Ana-Sophia M.; Kaur, Taranamol; Tsai, Michael
2015-01-01
We introduce and make publicly available a large corpus of digitized primary source human rights documents which are published annually by monitoring agencies that include Amnesty International, Human Rights Watch, the Lawyers Committee for Human Rights, and the United States Department of State. In addition to the digitized text, we also make available and describe document-term matrices, which are datasets that systematically organize the word counts from each unique document by each unique term within the corpus of human rights documents. To contextualize the importance of this corpus, we describe the development of coding procedures in the human rights community and several existing categorical indicators that have been created by human coding of the human rights documents contained in the corpus. We then discuss how the new human rights corpus and the existing human rights datasets can be used with a variety of statistical analyses and machine learning algorithms to help scholars understand how human rights practices and reporting have evolved over time. We close with a discussion of our plans for dataset maintenance, updating, and availability. PMID:26418817
Can physicians recognize their own patients in de-identified notes?
Meystre, Stéphane; Shen, Shuying; Hofmann, Deborah; Gundlapalli, Adi
2014-01-01
The adoption of Electronic Health Records is growing at a fast pace, and this growth results in very large quantities of patient clinical information becoming available in electronic format, with tremendous potentials, but also equally growing concern for patient confidentiality breaches. De-identification of patient information has been proposed as a solution to both facilitate secondary uses of clinical information, and protect patient information confidentiality. Automated approaches based on Natural Language Processing have been implemented and evaluated, allowing for much faster text de-identification than manual approaches. A U.S. Veterans Affairs clinical text de-identification project focused on investigating the current state of the art of automatic clinical text de-identification, on developing a best-of-breed de-identification application for clinical documents, and on evaluating its impact on subsequent text uses and the risk for re-identification. To evaluate this risk, we de-identified discharge summaries from 86 patients using our 'best-of-breed' text de-identification application with resynthesis of the identifiers detected. We then asked physicians working in the ward the patients were hospitalized in if they could recognize these patients when reading the de-identified documents. Each document was examined by at least one resident and one attending physician, and with 4.65% of the documents, physicians thought they recognized the patient because of specific clinical information, but after verification, none was correctly re-identified.
"What is relevant in a text document?": An interpretable machine learning approach
Arras, Leila; Horn, Franziska; Montavon, Grégoire; Müller, Klaus-Robert
2017-01-01
Text documents can be described by a number of abstract concepts such as semantic category, writing style, or sentiment. Machine learning (ML) models have been trained to automatically map documents to these abstract concepts, allowing to annotate very large text collections, more than could be processed by a human in a lifetime. Besides predicting the text’s category very accurately, it is also highly desirable to understand how and why the categorization process takes place. In this paper, we demonstrate that such understanding can be achieved by tracing the classification decision back to individual words using layer-wise relevance propagation (LRP), a recently developed technique for explaining predictions of complex non-linear classifiers. We train two word-based ML models, a convolutional neural network (CNN) and a bag-of-words SVM classifier, on a topic categorization task and adapt the LRP method to decompose the predictions of these models onto words. Resulting scores indicate how much individual words contribute to the overall classification decision. This enables one to distill relevant information from text documents without an explicit semantic information extraction step. We further use the word-wise relevance scores for generating novel vector-based document representations which capture semantic information. Based on these document vectors, we introduce a measure of model explanatory power and show that, although the SVM and CNN models perform similarly in terms of classification accuracy, the latter exhibits a higher level of explainability which makes it more comprehensible for humans and potentially more useful for other applications. PMID:28800619
Named Entity Recognition in Chinese Clinical Text Using Deep Neural Network.
Wu, Yonghui; Jiang, Min; Lei, Jianbo; Xu, Hua
2015-01-01
Rapid growth in electronic health records (EHRs) use has led to an unprecedented expansion of available clinical data in electronic formats. However, much of the important healthcare information is locked in the narrative documents. Therefore Natural Language Processing (NLP) technologies, e.g., Named Entity Recognition that identifies boundaries and types of entities, has been extensively studied to unlock important clinical information in free text. In this study, we investigated a novel deep learning method to recognize clinical entities in Chinese clinical documents using the minimal feature engineering approach. We developed a deep neural network (DNN) to generate word embeddings from a large unlabeled corpus through unsupervised learning and another DNN for the NER task. The experiment results showed that the DNN with word embeddings trained from the large unlabeled corpus outperformed the state-of-the-art CRF's model in the minimal feature engineering setting, achieving the highest F1-score of 0.9280. Further analysis showed that word embeddings derived through unsupervised learning from large unlabeled corpus remarkably improved the DNN with randomized embedding, denoting the usefulness of unsupervised feature learning.
Research on aviation unsafe incidents classification with improved TF-IDF algorithm
NASA Astrophysics Data System (ADS)
Wang, Yanhua; Zhang, Zhiyuan; Huo, Weigang
2016-05-01
The text content of Aviation Safety Confidential Reports contains a large number of valuable information. Term frequency-inverse document frequency algorithm is commonly used in text analysis, but it does not take into account the sequential relationship of the words in the text and its role in semantic expression. According to the seven category labels of civil aviation unsafe incidents, aiming at solving the problems of TF-IDF algorithm, this paper improved TF-IDF algorithm based on co-occurrence network; established feature words extraction and words sequential relations for classified incidents. Aviation domain lexicon was used to improve the accuracy rate of classification. Feature words network model was designed for multi-documents unsafe incidents classification, and it was used in the experiment. Finally, the classification accuracy of improved algorithm was verified by the experiments.
Enhancing biomedical text summarization using semantic relation extraction.
Shang, Yue; Li, Yanpeng; Lin, Hongfei; Yang, Zhihao
2011-01-01
Automatic text summarization for a biomedical concept can help researchers to get the key points of a certain topic from large amount of biomedical literature efficiently. In this paper, we present a method for generating text summary for a given biomedical concept, e.g., H1N1 disease, from multiple documents based on semantic relation extraction. Our approach includes three stages: 1) We extract semantic relations in each sentence using the semantic knowledge representation tool SemRep. 2) We develop a relation-level retrieval method to select the relations most relevant to each query concept and visualize them in a graphic representation. 3) For relations in the relevant set, we extract informative sentences that can interpret them from the document collection to generate text summary using an information retrieval based method. Our major focus in this work is to investigate the contribution of semantic relation extraction to the task of biomedical text summarization. The experimental results on summarization for a set of diseases show that the introduction of semantic knowledge improves the performance and our results are better than the MEAD system, a well-known tool for text summarization.
Document cards: a top trumps visualization for documents.
Strobelt, Hendrik; Oelke, Daniela; Rohrdantz, Christian; Stoffel, Andreas; Keim, Daniel A; Deussen, Oliver
2009-01-01
Finding suitable, less space consuming views for a document's main content is crucial to provide convenient access to large document collections on display devices of different size. We present a novel compact visualization which represents the document's key semantic as a mixture of images and important key terms, similar to cards in a top trumps game. The key terms are extracted using an advanced text mining approach based on a fully automatic document structure extraction. The images and their captions are extracted using a graphical heuristic and the captions are used for a semi-semantic image weighting. Furthermore, we use the image color histogram for classification and show at least one representative from each non-empty image class. The approach is demonstrated for the IEEE InfoVis publications of a complete year. The method can easily be applied to other publication collections and sets of documents which contain images.
Typograph: Multiscale Spatial Exploration of Text Documents
DOE Office of Scientific and Technical Information (OSTI.GOV)
Endert, Alexander; Burtner, Edwin R.; Cramer, Nicholas O.
2013-10-06
Visualizing large document collections using a spatial layout of terms can enable quick overviews of information. These visual metaphors (e.g., word clouds, tag clouds, etc.) traditionally show a series of terms organized by space-filling algorithms. However, often lacking in these views is the ability to interactively explore the information to gain more detail, and the location and rendering of the terms are often not based on mathematical models that maintain relative distances from other information based on similarity metrics. In this paper, we present Typograph, a multi-scale spatial exploration visualization for large document collections. Based on the term-based visualization methods,more » Typograh enables multiple levels of detail (terms, phrases, snippets, and full documents) within the single spatialization. Further, the information is placed based on their relative similarity to other information to create the “near = similar” geographic metaphor. This paper discusses the design principles and functionality of Typograph and presents a use case analyzing Wikipedia to demonstrate usage.« less
Resource Document for the Design of Electronic Instrument Approach Procedure Displays
DOT National Transportation Integrated Search
1995-03-01
Instrument approach procedure (IAP) charts play a large role in contributing to the success or failure of approaches and : landings. Paper IAP charts have been criticized for excessive clutter, for text sizes that are too small to read, and : for ina...
NASA Astrophysics Data System (ADS)
Nagy, George
2008-01-01
The fifteenth anniversary of the first SPIE symposium (titled Character Recognition Technologies) on Document Recognition and Retrieval provides an opportunity to examine DRR's contributions to the development of document technologies. Many of the tools taken for granted today, including workable general purpose OCR, large-scale, semi-automatic forms processing, inter-format table conversion, and text mining, followed research presented at this venue. This occasion also affords an opportunity to offer tribute to the conference organizers and proceedings editors and to the coterie of professionals who regularly participate in DRR.
Anonymizing and Sharing Medical Text Records
Li, Xiao-Bai; Qin, Jialun
2017-01-01
Health information technology has increased accessibility of health and medical data and benefited medical research and healthcare management. However, there are rising concerns about patient privacy in sharing medical and healthcare data. A large amount of these data are in free text form. Existing techniques for privacy-preserving data sharing deal largely with structured data. Current privacy approaches for medical text data focus on detection and removal of patient identifiers from the data, which may be inadequate for protecting privacy or preserving data quality. We propose a new systematic approach to extract, cluster, and anonymize medical text records. Our approach integrates methods developed in both data privacy and health informatics fields. The key novel elements of our approach include a recursive partitioning method to cluster medical text records based on the similarity of the health and medical information and a value-enumeration method to anonymize potentially identifying information in the text data. An experimental study is conducted using real-world medical documents. The results of the experiments demonstrate the effectiveness of the proposed approach. PMID:29569650
Automated document analysis system
NASA Astrophysics Data System (ADS)
Black, Jeffrey D.; Dietzel, Robert; Hartnett, David
2002-08-01
A software application has been developed to aid law enforcement and government intelligence gathering organizations in the translation and analysis of foreign language documents with potential intelligence content. The Automated Document Analysis System (ADAS) provides the capability to search (data or text mine) documents in English and the most commonly encountered foreign languages, including Arabic. Hardcopy documents are scanned by a high-speed scanner and are optical character recognized (OCR). Documents obtained in an electronic format bypass the OCR and are copied directly to a working directory. For translation and analysis, the script and the language of the documents are first determined. If the document is not in English, the document is machine translated to English. The documents are searched for keywords and key features in either the native language or translated English. The user can quickly review the document to determine if it has any intelligence content and whether detailed, verbatim human translation is required. The documents and document content are cataloged for potential future analysis. The system allows non-linguists to evaluate foreign language documents and allows for the quick analysis of a large quantity of documents. All document processing can be performed manually or automatically on a single document or a batch of documents.
Learning from technical documents: the role of intermodal referring expressions.
Dupont, Vincent; Bestgen, Yves
2006-01-01
We investigated the impact of two types of intermodal referring expressions on efficiency of instructions for use. User manuals for software or technical devices such as a video recording system frequently combine verbal instructions and illustrations. Much research has shown that the presence of an illustration has a beneficial effect on learning. The present study focuses on a factor that modulates this beneficial effect. The combination of text and an illustration can be effective only if the user integrates the information coming from these two media. This integration depends largely on the intermodal referential expressions, the function of which is to mark explicitly the relations between the text and the illustration. In an experiment (N = 104), we compared the effectiveness of two intermodal referring expressions often used in procedural texts: indexes (numbers introduced in the illustrations and in the instructions to establish cross-references) and icons (visual representations of the components of the device, which are inserted in the verbal instructions). The icons condition led to the most efficient use of the device. This experiment shows that learning from multimedia documents depends on the possibility of effectively connecting the verbal instructions to the illustration. Taking into account the ergonomic properties of the cross-media referring expressions should allow text designers to improve the effectiveness of technical documents.
Latent Dirichlet Allocation (LDA) Model and kNN Algorithm to Classify Research Project Selection
NASA Astrophysics Data System (ADS)
Safi’ie, M. A.; Utami, E.; Fatta, H. A.
2018-03-01
Universitas Sebelas Maret has a teaching staff more than 1500 people, and one of its tasks is to carry out research. In the other side, the funding support for research and service is limited, so there is need to be evaluated to determine the Research proposal submission and devotion on society (P2M). At the selection stage, research proposal documents are collected as unstructured data and the data stored is very large. To extract information contained in the documents therein required text mining technology. This technology applied to gain knowledge to the documents by automating the information extraction. In this articles we use Latent Dirichlet Allocation (LDA) to the documents as a model in feature extraction process, to get terms that represent its documents. Hereafter we use k-Nearest Neighbour (kNN) algorithm to classify the documents based on its terms.
Markó, K; Schulz, S; Hahn, U
2005-01-01
We propose an interlingua-based indexing approach to account for the particular challenges that arise in the design and implementation of cross-language document retrieval systems for the medical domain. Documents, as well as queries, are mapped to a language-independent conceptual layer on which retrieval operations are performed. We contrast this approach with the direct translation of German queries to English ones which, subsequently, are matched against English documents. We evaluate both approaches, interlingua-based and direct translation, on a large medical document collection, the OHSUMED corpus. A substantial benefit for interlingua-based document retrieval using German queries on English texts is found, which amounts to 93% of the (monolingual) English baseline. Most state-of-the-art cross-language information retrieval systems translate user queries to the language(s) of the target documents. In contra-distinction to this approach, translating both documents and user queries into a language-independent, concept-like representation format is more beneficial to enhance cross-language retrieval performance.
Interactive publications: creation and usage
NASA Astrophysics Data System (ADS)
Thoma, George R.; Ford, Glenn; Chung, Michael; Vasudevan, Kirankumar; Antani, Sameer
2006-02-01
As envisioned here, an "interactive publication" has similarities to multimedia documents that have been in existence for a decade or more, but possesses specific differentiating characteristics. In common usage, the latter refers to online entities that, in addition to text, consist of files of images and video clips residing separately in databases, rarely providing immediate context to the document text. While an interactive publication has many media objects as does the "traditional" multimedia document, it is a self-contained document, either as a single file with media files embedded within it, or as a "folder" containing tightly linked media files. The main characteristic that differentiates an interactive publication from a traditional multimedia document is that the reader would be able to reuse the media content for analysis and presentation, and to check the underlying data and possibly derive alternative conclusions leading, for example, to more in-depth peer reviews. We have created prototype publications containing paginated text and several media types encountered in the biomedical literature: 3D animations of anatomic structures; graphs, charts and tabular data; cell development images (video sequences); and clinical images such as CT, MRI and ultrasound in the DICOM format. This paper presents developments to date including: a tool to convert static tables or graphs into interactive entities, authoring procedures followed to create prototypes, and advantages and drawbacks of each of these platforms. It also outlines future work including meeting the challenge of network distribution for these large files.
Overview of Historical Earthquake Document Database in Japan and Future Development
NASA Astrophysics Data System (ADS)
Nishiyama, A.; Satake, K.
2014-12-01
In Japan, damage and disasters from historical large earthquakes have been documented and preserved. Compilation of historical earthquake documents started in the early 20th century and 33 volumes of historical document source books (about 27,000 pages) have been published. However, these source books are not effectively utilized for researchers due to a contamination of low-reliability historical records and a difficulty for keyword searching by characters and dates. To overcome these problems and to promote historical earthquake studies in Japan, construction of text database started in the 21 century. As for historical earthquakes from the beginning of the 7th century to the early 17th century, "Online Database of Historical Documents in Japanese Earthquakes and Eruptions in the Ancient and Medieval Ages" (Ishibashi, 2009) has been already constructed. They investigated the source books or original texts of historical literature, emended the descriptions, and assigned the reliability of each historical document on the basis of written age. Another database compiled the historical documents for seven damaging earthquakes occurred along the Sea of Japan coast in Honshu, central Japan in the Edo period (from the beginning of the 17th century to the middle of the 19th century) and constructed text database and seismic intensity data base. These are now publicized on the web (written only in Japanese). However, only about 9 % of the earthquake source books have been digitized so far. Therefore, we plan to digitize all of the remaining historical documents by the research-program which started in 2014. The specification of the data base will be similar for previous ones. We also plan to combine this database with liquefaction traces database, which will be constructed by other research program, by adding the location information described in historical documents. Constructed database would be utilized to estimate the distributions of seismic intensities and tsunami heights.
Scalable Kernel Methods and Algorithms for General Sequence Analysis
ERIC Educational Resources Information Center
Kuksa, Pavel
2011-01-01
Analysis of large-scale sequential data has become an important task in machine learning and pattern recognition, inspired in part by numerous scientific and technological applications such as the document and text classification or the analysis of biological sequences. However, current computational methods for sequence comparison still lack…
A Theory of Term Importance in Automatic Text Analysis.
ERIC Educational Resources Information Center
Salton, G.; And Others
Most existing automatic content analysis and indexing techniques are based on work frequency characteristics applied largely in an ad hoc manner. Contradictory requirements arise in this connection, in that terms exhibiting high occurrence frequencies in individual documents are often useful for high recall performance (to retrieve many relevant…
Enhancing Biomedical Text Summarization Using Semantic Relation Extraction
Shang, Yue; Li, Yanpeng; Lin, Hongfei; Yang, Zhihao
2011-01-01
Automatic text summarization for a biomedical concept can help researchers to get the key points of a certain topic from large amount of biomedical literature efficiently. In this paper, we present a method for generating text summary for a given biomedical concept, e.g., H1N1 disease, from multiple documents based on semantic relation extraction. Our approach includes three stages: 1) We extract semantic relations in each sentence using the semantic knowledge representation tool SemRep. 2) We develop a relation-level retrieval method to select the relations most relevant to each query concept and visualize them in a graphic representation. 3) For relations in the relevant set, we extract informative sentences that can interpret them from the document collection to generate text summary using an information retrieval based method. Our major focus in this work is to investigate the contribution of semantic relation extraction to the task of biomedical text summarization. The experimental results on summarization for a set of diseases show that the introduction of semantic knowledge improves the performance and our results are better than the MEAD system, a well-known tool for text summarization. PMID:21887336
Sentiment analysis of feature ranking methods for classification accuracy
NASA Astrophysics Data System (ADS)
Joseph, Shashank; Mugauri, Calvin; Sumathy, S.
2017-11-01
Text pre-processing and feature selection are important and critical steps in text mining. Text pre-processing of large volumes of datasets is a difficult task as unstructured raw data is converted into structured format. Traditional methods of processing and weighing took much time and were less accurate. To overcome this challenge, feature ranking techniques have been devised. A feature set from text preprocessing is fed as input for feature selection. Feature selection helps improve text classification accuracy. Of the three feature selection categories available, the filter category will be the focus. Five feature ranking methods namely: document frequency, standard deviation information gain, CHI-SQUARE, and weighted-log likelihood -ratio is analyzed.
AltText: A Showcase of User Centred Design in the Netherlands
NASA Astrophysics Data System (ADS)
Asjes, Kathleen
In the information processing chain many documents are produced that are inaccessible to the reading impaired. The altText project aims to increase the accessibility of this content by: a) raising awareness among content providers about content adaption; b) allowing content providers to deliver content in a way that suits the needs of the information receiver; c) developing an online service that converts written text into several accessible formats (Braille, synthetic speech, large print or DaisyXML). The name of this service is the altText conversion portal. The paper argues that user centred innovation will be crucial to the success of this project.
Preparing PNNL Reports with LaTeX
DOE Office of Scientific and Technical Information (OSTI.GOV)
Waichler, Scott R.
2005-06-01
LaTeX is a mature document preparation system that is the standard in many scientific and academic workplaces. It has been used extensively by scattered individuals and research groups within PNNL for years, but until now there have been no centralized or lab-focused resources to help authors and editors. PNNL authors and editors can produce correctly formatted PNNL or PNWD reports using the LaTeX document preparation system and the available template files. Please visit the PNNL-LaTeX Project (http://stidev.pnl.gov/resources/latex/, inside the PNNL firewall) for additional information and files. In LaTeX, document content is maintained separately from document structure for the most part.more » This means that the author can easily produce the same content in different formats and, more importantly, can focus on the content and write it in a plain text file that doesn't go awry, is easily transferable, and won't become obsolete due to software changes. LaTeX produces the finest print quality output; its typesetting is noticeably better than that of MS Word. This is particularly true for mathematics, tables, and other types of special text. Other benefits of LaTeX: easy handling of large numbers of figures and tables; automatic and error-free captioning, citation, cross-referencing, hyperlinking, and indexing; excellent published and online documentation; free or low-cost distributions for Windows/Linux/Unix/Mac OS X. This document serves two purposes: (1) it provides instructions to produce reports formatted to PNNL requirements using LaTeX, and (2) the document itself is in the form of a PNNL report, providing examples of many solved formatting challenges. Authors can use this document or its skeleton version (with formatting examples removed) as the starting point for their own reports. The pnnreport.cls class file and pnnl.bst bibliography style file contain the required formatting specifications for reports to the Department of Energy. Options are also provided for formatting PNWD (non-1830) reports. This documentation and the referenced files are meant to provide a complete package of PNNL particulars for authors and editors who wish to prepare technical reports using LaTeX. The example material in this document was borrowed from real reports and edited for demonstration purposes. The subject matter content of the example material is not relevant here and generally does not make literal sense in the context of this document. Brackets ''[]'' are used to denote large blocks of example text. The PDF file for this report contains hyperlinks to facilitate navigation. Hyperlinks are provided for all cross-referenced material, including section headings, figures, tables, and references. Not all hyperlinks are colored but will be obvious when you move your mouse over them.« less
Semi automatic indexing of PostScript files using Medical Text Indexer in medical education.
Mollah, Shamim Ara; Cimino, Christopher
2007-10-11
At Albert Einstein College of Medicine a large part of online lecture materials contain PostScript files. As the collection grows it becomes essential to create a digital library to have easy access to relevant sections of the lecture material that is full-text indexed; to create this index it is necessary to extract all the text from the document files that constitute the originals of the lectures. In this study we present a semi automatic indexing method using robust technique for extracting text from PostScript files and National Library of Medicine's Medical Text Indexer (MTI) program for indexing the text. This model can be applied to other medical schools for indexing purposes.
On the Reconstruction of Text Phylogeny Trees: Evaluation and Analysis of Textual Relationships
Marmerola, Guilherme D.; Dias, Zanoni; Goldenstein, Siome; Rocha, Anderson
2016-01-01
Over the history of mankind, textual records change. Sometimes due to mistakes during transcription, sometimes on purpose, as a way to rewrite facts and reinterpret history. There are several classical cases, such as the logarithmic tables, and the transmission of antique and medieval scholarship. Today, text documents are largely edited and redistributed on the Web. Articles on news portals and collaborative platforms (such as Wikipedia), source code, posts on social networks, and even scientific publications or literary works are some examples in which textual content can be subject to changes in an evolutionary process. In this scenario, given a set of near-duplicate documents, it is worthwhile to find which one is the original and the history of changes that created the whole set. Such functionality would have immediate applications on news tracking services, detection of plagiarism, textual criticism, and copyright enforcement, for instance. However, this is not an easy task, as textual features pointing to the documents’ evolutionary direction may not be evident and are often dataset dependent. Moreover, side information, such as time stamps, are neither always available nor reliable. In this paper, we propose a framework for reliably reconstructing text phylogeny trees, and seamlessly exploring new approaches on a wide range of scenarios of text reusage. We employ and evaluate distinct combinations of dissimilarity measures and reconstruction strategies within the proposed framework, and evaluate each approach with extensive experiments, including a set of artificial near-duplicate documents with known phylogeny, and from documents collected from Wikipedia, whose modifications were made by Internet users. We also present results from qualitative experiments in two different applications: text plagiarism and reconstruction of evolutionary trees for manuscripts (stemmatology). PMID:27992446
Munkhdalai, Tsendsuren; Li, Meijing; Batsuren, Khuyagbaatar; Park, Hyeon Ah; Choi, Nak Hyeon; Ryu, Keun Ho
2015-01-01
Chemical and biomedical Named Entity Recognition (NER) is an essential prerequisite task before effective text mining can begin for biochemical-text data. Exploiting unlabeled text data to leverage system performance has been an active and challenging research topic in text mining due to the recent growth in the amount of biomedical literature. We present a semi-supervised learning method that efficiently exploits unlabeled data in order to incorporate domain knowledge into a named entity recognition model and to leverage system performance. The proposed method includes Natural Language Processing (NLP) tasks for text preprocessing, learning word representation features from a large amount of text data for feature extraction, and conditional random fields for token classification. Other than the free text in the domain, the proposed method does not rely on any lexicon nor any dictionary in order to keep the system applicable to other NER tasks in bio-text data. We extended BANNER, a biomedical NER system, with the proposed method. This yields an integrated system that can be applied to chemical and drug NER or biomedical NER. We call our branch of the BANNER system BANNER-CHEMDNER, which is scalable over millions of documents, processing about 530 documents per minute, is configurable via XML, and can be plugged into other systems by using the BANNER Unstructured Information Management Architecture (UIMA) interface. BANNER-CHEMDNER achieved an 85.68% and an 86.47% F-measure on the testing sets of CHEMDNER Chemical Entity Mention (CEM) and Chemical Document Indexing (CDI) subtasks, respectively, and achieved an 87.04% F-measure on the official testing set of the BioCreative II gene mention task, showing remarkable performance in both chemical and biomedical NER. BANNER-CHEMDNER system is available at: https://bitbucket.org/tsendeemts/banner-chemdner.
Beyond Information Retrieval—Medical Question Answering
Lee, Minsuk; Cimino, James; Zhu, Hai Ran; Sable, Carl; Shanker, Vijay; Ely, John; Yu, Hong
2006-01-01
Physicians have many questions when caring for patients, and frequently need to seek answers for their questions. Information retrieval systems (e.g., PubMed) typically return a list of documents in response to a user’s query. Frequently the number of returned documents is large and makes physicians’ information seeking “practical only ‘after hours’ and not in the clinical settings”. Question answering techniques are based on automatically analyzing thousands of electronic documents to generate short-text answers in response to clinical questions that are posed by physicians. The authors address physicians’ information needs and described the design, implementation, and evaluation of the medical question answering system (MedQA). Although our long term goal is to enable MedQA to answer all types of medical questions, currently, we currently implement MedQA to integrate information retrieval, extraction, and summarization techniques to automatically generate paragraph-level text for definitional questions (i.e., “What is X?”). MedQA can be accessed at http://www.dbmi.columbia.edu/~yuh9001/research/MedQA.html. PMID:17238385
Brehmer, Matthew; Ingram, Stephen; Stray, Jonathan; Munzner, Tamara
2014-12-01
For an investigative journalist, a large collection of documents obtained from a Freedom of Information Act request or a leak is both a blessing and a curse: such material may contain multiple newsworthy stories, but it can be difficult and time consuming to find relevant documents. Standard text search is useful, but even if the search target is known it may not be possible to formulate an effective query. In addition, summarization is an important non-search task. We present Overview, an application for the systematic analysis of large document collections based on document clustering, visualization, and tagging. This work contributes to the small set of design studies which evaluate a visualization system "in the wild", and we report on six case studies where Overview was voluntarily used by self-initiated journalists to produce published stories. We find that the frequently-used language of "exploring" a document collection is both too vague and too narrow to capture how journalists actually used our application. Our iterative process, including multiple rounds of deployment and observations of real world usage, led to a much more specific characterization of tasks. We analyze and justify the visual encoding and interaction techniques used in Overview's design with respect to our final task abstractions, and propose generalizable lessons for visualization design methodology.
Vijayakrishnan, Rajakrishnan; Steinhubl, Steven R.; Ng, Kenney; Sun, Jimeng; Byrd, Roy J.; Daar, Zahra; Williams, Brent A.; deFilippi, Christopher; Ebadollahi, Shahram; Stewart, Walter F.
2014-01-01
Background The electronic health record contains a tremendous amount of data that if appropriately detected can lead to earlier identification of disease states such as heart failure (HF). Using a novel text and data analytic tool we explored the longitudinal EHR of over 50,000 primary care patients to identify the documentation of the signs and symptoms of HF in the years preceding its diagnosis. Methods and Results Retrospective analysis consisting of 4,644 incident HF cases and 45,981 group-matched controls. Documentation of Framingham HF signs and symptoms within encounter notes were carried out using a previously validated natural language processing procedure. A total of 892,805 affirmed criteria were documented over an average observation period of 3.4 years. Among eventual HF cases, 85% had at least one criterion within a year prior to their HF diagnosis (as did 55% of controls). Substantial variability in the prevalence of individual signs and symptoms were found in both cases and controls. Conclusions HF signs and symptoms are frequently documented in a primary care population as identified through automated text and data mining of EHRs. Their frequent identification demonstrates the rich data available within EHRs that will allow for future work on automated criterion identification to help develop predictive models for HF. PMID:24709663
Jiang, Xiangying; Ringwald, Martin; Blake, Judith; Shatkay, Hagit
2017-01-01
The Gene Expression Database (GXD) is a comprehensive online database within the Mouse Genome Informatics resource, aiming to provide available information about endogenous gene expression during mouse development. The information stems primarily from many thousands of biomedical publications that database curators must go through and read. Given the very large number of biomedical papers published each year, automatic document classification plays an important role in biomedical research. Specifically, an effective and efficient document classifier is needed for supporting the GXD annotation workflow. We present here an effective yet relatively simple classification scheme, which uses readily available tools while employing feature selection, aiming to assist curators in identifying publications relevant to GXD. We examine the performance of our method over a large manually curated dataset, consisting of more than 25 000 PubMed abstracts, of which about half are curated as relevant to GXD while the other half as irrelevant to GXD. In addition to text from title-and-abstract, we also consider image captions, an important information source that we integrate into our method. We apply a captions-based classifier to a subset of about 3300 documents, for which the full text of the curated articles is available. The results demonstrate that our proposed approach is robust and effectively addresses the GXD document classification. Moreover, using information obtained from image captions clearly improves performance, compared to title and abstract alone, affirming the utility of image captions as a substantial evidence source for automatically determining the relevance of biomedical publications to a specific subject area. www.informatics.jax.org. © The Author(s) 2017. Published by Oxford University Press.
NASA Technical Reports Server (NTRS)
Grams, R. R.
1982-01-01
A system designed to access a large range of available medical textbook information in an online interactive fashion is described. A high level query type database manager, INQUIRE, is used. Operating instructions, system flow diagrams, database descriptions, text generation, and error messages are discussed. User information is provided.
Automatic Word Sense Disambiguation of Acronyms and Abbreviations in Clinical Texts
ERIC Educational Resources Information Center
Moon, Sungrim
2012-01-01
The use of acronyms and abbreviations is increasing profoundly in the clinical domain in large part due to the greater adoption of electronic health record (EHR) systems and increased electronic documentation within healthcare. A single acronym or abbreviation may have multiple different meanings or senses. Comprehending the proper meaning of an…
Benefits of Coaching on Test Scores Seen as Negligible.
ERIC Educational Resources Information Center
Report on Education Research, 1983
1983-01-01
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: A new study by a pair of Harvard University researchers discounts earlier findings that coaching can substantially improve student performance on the Scholastic Aptitude Test (SAT). "There is simply insufficient evidence that large score increases are a result of a coaching program," write…
Vietnamese Document Representation and Classification
NASA Astrophysics Data System (ADS)
Nguyen, Giang-Son; Gao, Xiaoying; Andreae, Peter
Vietnamese is very different from English and little research has been done on Vietnamese document classification, or indeed, on any kind of Vietnamese language processing, and only a few small corpora are available for research. We created a large Vietnamese text corpus with about 18000 documents, and manually classified them based on different criteria such as topics and styles, giving several classification tasks of different difficulty levels. This paper introduces a new syllable-based document representation at the morphological level of the language for efficient classification. We tested the representation on our corpus with different classification tasks using six classification algorithms and two feature selection techniques. Our experiments show that the new representation is effective for Vietnamese categorization, and suggest that best performance can be achieved using syllable-pair document representation, an SVM with a polynomial kernel as the learning algorithm, and using Information gain and an external dictionary for feature selection.
The Implementation of Cosine Similarity to Calculate Text Relevance between Two Documents
NASA Astrophysics Data System (ADS)
Gunawan, D.; Sembiring, C. A.; Budiman, M. A.
2018-03-01
Rapidly increasing number of web pages or documents leads to topic specific filtering in order to find web pages or documents efficiently. This is a preliminary research that uses cosine similarity to implement text relevance in order to find topic specific document. This research is divided into three parts. The first part is text-preprocessing. In this part, the punctuation in a document will be removed, then convert the document to lower case, implement stop word removal and then extracting the root word by using Porter Stemming algorithm. The second part is keywords weighting. Keyword weighting will be used by the next part, the text relevance calculation. Text relevance calculation will result the value between 0 and 1. The closer value to 1, then both documents are more related, vice versa.
NASA Astrophysics Data System (ADS)
Tirupattur, Naveen; Lapish, Christopher C.; Mukhopadhyay, Snehasis
2011-06-01
Text mining, sometimes alternately referred to as text analytics, refers to the process of extracting high-quality knowledge from the analysis of textual data. Text mining has wide variety of applications in areas such as biomedical science, news analysis, and homeland security. In this paper, we describe an approach and some relatively small-scale experiments which apply text mining to neuroscience research literature to find novel associations among a diverse set of entities. Neuroscience is a discipline which encompasses an exceptionally wide range of experimental approaches and rapidly growing interest. This combination results in an overwhelmingly large and often diffuse literature which makes a comprehensive synthesis difficult. Understanding the relations or associations among the entities appearing in the literature not only improves the researchers current understanding of recent advances in their field, but also provides an important computational tool to formulate novel hypotheses and thereby assist in scientific discoveries. We describe a methodology to automatically mine the literature and form novel associations through direct analysis of published texts. The method first retrieves a set of documents from databases such as PubMed using a set of relevant domain terms. In the current study these terms yielded a set of documents ranging from 160,909 to 367,214 documents. Each document is then represented in a numerical vector form from which an Association Graph is computed which represents relationships between all pairs of domain terms, based on co-occurrence. Association graphs can then be subjected to various graph theoretic algorithms such as transitive closure and cycle (circuit) detection to derive additional information, and can also be visually presented to a human researcher for understanding. In this paper, we present three relatively small-scale problem-specific case studies to demonstrate that such an approach is very successful in replicating a neuroscience expert's mental model of object-object associations entirely by means of text mining. These preliminary results provide the confidence that this type of text mining based research approach provides an extremely powerful tool to better understand the literature and drive novel discovery for the neuroscience community.
Text extraction method for historical Tibetan document images based on block projections
NASA Astrophysics Data System (ADS)
Duan, Li-juan; Zhang, Xi-qun; Ma, Long-long; Wu, Jian
2017-11-01
Text extraction is an important initial step in digitizing the historical documents. In this paper, we present a text extraction method for historical Tibetan document images based on block projections. The task of text extraction is considered as text area detection and location problem. The images are divided equally into blocks and the blocks are filtered by the information of the categories of connected components and corner point density. By analyzing the filtered blocks' projections, the approximate text areas can be located, and the text regions are extracted. Experiments on the dataset of historical Tibetan documents demonstrate the effectiveness of the proposed method.
Boyack, Kevin W.; Newman, David; Duhon, Russell J.; Klavans, Richard; Patek, Michael; Biberstine, Joseph R.; Schijvenaars, Bob; Skupin, André; Ma, Nianli; Börner, Katy
2011-01-01
Background We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents. Methodology We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE, and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings. The nine approaches were comprised of five different analytical techniques with two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models – BM25 and PMRA (PubMed Related Articles). The two data sources were a) MeSH subject headings, and b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE. Conclusions PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts. PMID:21437291
Analysis Of The IJCNN 2011 UTL Challenge
2012-01-13
large datasets from various application domains: handwriting recognition, image recognition, video processing, text processing, and ecology. The goal...validation and final evaluation sets consist of 4096 examples each. Dataset Domain Features Sparsity Devel. Transf. AVICENNA Handwriting 120 0% 150205...documents [3]. Transfer learning methods could accelerate the application of handwriting recognizers to historical manuscript by reducing the need for
ERIC Educational Resources Information Center
Zhang, Rui
2013-01-01
The widespread adoption of Electronic Health Record (EHR) has resulted in rapid text proliferation within clinical care. Clinicians' use of copying and pasting functions in EHR systems further compounds this by creating a large amount of redundant clinical information in clinical documents. A mixture of redundant information (especially outdated…
Christmas Program for Elementary School Children.
ERIC Educational Resources Information Center
Taggart, Doris
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: In 1974 Doris Taggart, Public Relations Vice President of Zions First National Bank in Salt Lake City, was serving on the Free Enterprise Committee of the Salt Lake Chamber of Commerce. She developed a plan to involve elementary school children with a large bank by asking the children to make…
10 CFR 2.1013 - Use of the electronic docket during the proceeding.
Code of Federal Regulations, 2012 CFR
2012-01-01
... searchable full text, by header and image, as appropriate. (b) Absent good cause, all exhibits tendered... circumstances where submitters may need to use an image scanned before January 1, 2004, in a document created after January 1, 2004, or the scanning process for a large, one-page image may not successfully complete...
Texture for script identification.
Busch, Andrew; Boles, Wageeh W; Sridharan, Sridha
2005-11-01
The problem of determining the script and language of a document image has a number of important applications in the field of document analysis, such as indexing and sorting of large collections of such images, or as a precursor to optical character recognition (OCR). In this paper, we investigate the use of texture as a tool for determining the script of a document image, based on the observation that text has a distinct visual texture. An experimental evaluation of a number of commonly used texture features is conducted on a newly created script database, providing a qualitative measure of which features are most appropriate for this task. Strategies for improving classification results in situations with limited training data and multiple font types are also proposed.
Large Scale Document Inversion using a Multi-threaded Computing System
Jung, Sungbo; Chang, Dar-Jen; Park, Juw Won
2018-01-01
Current microprocessor architecture is moving towards multi-core/multi-threaded systems. This trend has led to a surge of interest in using multi-threaded computing devices, such as the Graphics Processing Unit (GPU), for general purpose computing. We can utilize the GPU in computation as a massive parallel coprocessor because the GPU consists of multiple cores. The GPU is also an affordable, attractive, and user-programmable commodity. Nowadays a lot of information has been flooded into the digital domain around the world. Huge volume of data, such as digital libraries, social networking services, e-commerce product data, and reviews, etc., is produced or collected every moment with dramatic growth in size. Although the inverted index is a useful data structure that can be used for full text searches or document retrieval, a large number of documents will require a tremendous amount of time to create the index. The performance of document inversion can be improved by multi-thread or multi-core GPU. Our approach is to implement a linear-time, hash-based, single program multiple data (SPMD), document inversion algorithm on the NVIDIA GPU/CUDA programming platform utilizing the huge computational power of the GPU, to develop high performance solutions for document indexing. Our proposed parallel document inversion system shows 2-3 times faster performance than a sequential system on two different test datasets from PubMed abstract and e-commerce product reviews. CCS Concepts •Information systems➝Information retrieval • Computing methodologies➝Massively parallel and high-performance simulations. PMID:29861701
Large Scale Document Inversion using a Multi-threaded Computing System.
Jung, Sungbo; Chang, Dar-Jen; Park, Juw Won
2017-06-01
Current microprocessor architecture is moving towards multi-core/multi-threaded systems. This trend has led to a surge of interest in using multi-threaded computing devices, such as the Graphics Processing Unit (GPU), for general purpose computing. We can utilize the GPU in computation as a massive parallel coprocessor because the GPU consists of multiple cores. The GPU is also an affordable, attractive, and user-programmable commodity. Nowadays a lot of information has been flooded into the digital domain around the world. Huge volume of data, such as digital libraries, social networking services, e-commerce product data, and reviews, etc., is produced or collected every moment with dramatic growth in size. Although the inverted index is a useful data structure that can be used for full text searches or document retrieval, a large number of documents will require a tremendous amount of time to create the index. The performance of document inversion can be improved by multi-thread or multi-core GPU. Our approach is to implement a linear-time, hash-based, single program multiple data (SPMD), document inversion algorithm on the NVIDIA GPU/CUDA programming platform utilizing the huge computational power of the GPU, to develop high performance solutions for document indexing. Our proposed parallel document inversion system shows 2-3 times faster performance than a sequential system on two different test datasets from PubMed abstract and e-commerce product reviews. •Information systems➝Information retrieval • Computing methodologies➝Massively parallel and high-performance simulations.
NASA Astrophysics Data System (ADS)
Davies, Bob; Lienhart, Rainer W.; Yeo, Boon-Lock
1999-08-01
The metaphor of film and TV permeates the design of software to support video on the PC. Simply transplanting the non- interactive, sequential experience of film to the PC fails to exploit the virtues of the new context. Video ont eh PC should be interactive and non-sequential. This paper experiments with a variety of tools for using video on the PC that exploits the new content of the PC. Some feature are more successful than others. Applications that use these tools are explored, including primarily the home video archive but also streaming video servers on the Internet. The ability to browse, edit, abstract and index large volumes of video content such as home video and corporate video is a problem without appropriate solution in today's market. The current tools available are complex, unfriendly video editors, requiring hours of work to prepare a short home video, far more work that a typical home user can be expected to provide. Our proposed solution treats video like a text document, providing functionality similar to a text editor. Users can browse, interact, edit and compose one or more video sequences with the same ease and convenience as handling text documents. With this level of text-like composition, we call what is normally a sequential medium a 'video document'. An important component of the proposed solution is shot detection, the ability to detect when a short started or stopped. When combined with a spreadsheet of key frames, the host become a grid of pictures that can be manipulated and viewed in the same way that a spreadsheet can be edited. Multiple video documents may be viewed, joined, manipulated, and seamlessly played back. Abstracts of unedited video content can be produce automatically to create novel video content for export to other venues. Edited and raw video content can be published to the net or burned to a CD-ROM with a self-installing viewer for Windows 98 and Windows NT 4.0.
NASA Astrophysics Data System (ADS)
Hadyan, Fadhlil; Shaufiah; Arif Bijaksana, Moch.
2017-01-01
Automatic summarization is a system that can help someone to take the core information of a long text instantly. The system can help by summarizing text automatically. there’s Already many summarization systems that have been developed at this time but there are still many problems in those system. In this final task proposed summarization method using document index graph. This method utilizes the PageRank and HITS formula used to assess the web page, adapted to make an assessment of words in the sentences in a text document. The expected outcome of this final task is a system that can do summarization of a single document, by utilizing document index graph with TextRank and HITS to improve the quality of the summary results automatically.
Graph-based biomedical text summarization: An itemset mining and sentence clustering approach.
Nasr Azadani, Mozhgan; Ghadiri, Nasser; Davoodijam, Ensieh
2018-06-12
Automatic text summarization offers an efficient solution to access the ever-growing amounts of both scientific and clinical literature in the biomedical domain by summarizing the source documents while maintaining their most informative contents. In this paper, we propose a novel graph-based summarization method that takes advantage of the domain-specific knowledge and a well-established data mining technique called frequent itemset mining. Our summarizer exploits the Unified Medical Language System (UMLS) to construct a concept-based model of the source document and mapping the document to the concepts. Then, it discovers frequent itemsets to take the correlations among multiple concepts into account. The method uses these correlations to propose a similarity function based on which a represented graph is constructed. The summarizer then employs a minimum spanning tree based clustering algorithm to discover various subthemes of the document. Eventually, it generates the final summary by selecting the most informative and relative sentences from all subthemes within the text. We perform an automatic evaluation over a large number of summaries using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics. The results demonstrate that the proposed summarization system outperforms various baselines and benchmark approaches. The carried out research suggests that the incorporation of domain-specific knowledge and frequent itemset mining equips the summarization system in a better way to address the informativeness measurement of the sentences. Moreover, clustering the graph nodes (sentences) can enable the summarizer to target different main subthemes of a source document efficiently. The evaluation results show that the proposed approach can significantly improve the performance of the summarization systems in the biomedical domain. Copyright © 2018. Published by Elsevier Inc.
The Magic of Words: Teaching Vocabulary in the Early Childhood Classroom
ERIC Educational Resources Information Center
Neuman, Susan B.; Wright, Tanya S.
2014-01-01
Developing a large and rich vocabulary is central to learning to read. Children must know the words that make up written texts in order to understand them, especially as the vocabulary demands of content-related materials increase in the upper grades. Studies have documented that the size of a person's vocabulary is strongly related to how…
HyperCard as a Text Analysis Tool for the Qualitative Researcher.
ERIC Educational Resources Information Center
Handler, Marianne G.; Turner, Sandra V.
HyperCard is a general-purpose program for the Macintosh computer that allows multiple ways of viewing and accessing a large body of information. Two ways in which HyperCard can be used as a research tool are illustrated. One way is to organize and analyze qualitative data from observations, interviews, surveys, and other documents. The other way…
How Can Positive Effects of Pop-Up Windows on Multimedia Learning Be Explained?
ERIC Educational Resources Information Center
Erhel, Severine; Jamet, Eric
2011-01-01
A large body of research has shown that incorporating text in the corresponding sections of an illustration facilitates the learning of illustrated documents. More recently, a series of studies has revealed that the use of interactive windows located close to the illustration causes similar effects. The aim of this paper is to help bring about a…
Biblio-MetReS for user-friendly mining of genes and biological processes in scientific documents.
Usie, Anabel; Karathia, Hiren; Teixidó, Ivan; Alves, Rui; Solsona, Francesc
2014-01-01
One way to initiate the reconstruction of molecular circuits is by using automated text-mining techniques. Developing more efficient methods for such reconstruction is a topic of active research, and those methods are typically included by bioinformaticians in pipelines used to mine and curate large literature datasets. Nevertheless, experimental biologists have a limited number of available user-friendly tools that use text-mining for network reconstruction and require no programming skills to use. One of these tools is Biblio-MetReS. Originally, this tool permitted an on-the-fly analysis of documents contained in a number of web-based literature databases to identify co-occurrence of proteins/genes. This approach ensured results that were always up-to-date with the latest live version of the databases. However, this 'up-to-dateness' came at the cost of large execution times. Here we report an evolution of the application Biblio-MetReS that permits constructing co-occurrence networks for genes, GO processes, Pathways, or any combination of the three types of entities and graphically represent those entities. We show that the performance of Biblio-MetReS in identifying gene co-occurrence is as least as good as that of other comparable applications (STRING and iHOP). In addition, we also show that the identification of GO processes is on par to that reported in the latest BioCreAtIvE challenge. Finally, we also report the implementation of a new strategy that combines on-the-fly analysis of new documents with preprocessed information from documents that were encountered in previous analyses. This combination simultaneously decreases program run time and maintains 'up-to-dateness' of the results. http://metres.udl.cat/index.php/downloads, metres.cmb@gmail.com.
Text Mining in Biomedical Domain with Emphasis on Document Clustering.
Renganathan, Vinaitheerthan
2017-07-01
With the exponential increase in the number of articles published every year in the biomedical domain, there is a need to build automated systems to extract unknown information from the articles published. Text mining techniques enable the extraction of unknown knowledge from unstructured documents. This paper reviews text mining processes in detail and the software tools available to carry out text mining. It also reviews the roles and applications of text mining in the biomedical domain. Text mining processes, such as search and retrieval of documents, pre-processing of documents, natural language processing, methods for text clustering, and methods for text classification are described in detail. Text mining techniques can facilitate the mining of vast amounts of knowledge on a given topic from published biomedical research articles and draw meaningful conclusions that are not possible otherwise.
Feature extraction for document text using Latent Dirichlet Allocation
NASA Astrophysics Data System (ADS)
Prihatini, P. M.; Suryawan, I. K.; Mandia, IN
2018-01-01
Feature extraction is one of stages in the information retrieval system that used to extract the unique feature values of a text document. The process of feature extraction can be done by several methods, one of which is Latent Dirichlet Allocation. However, researches related to text feature extraction using Latent Dirichlet Allocation method are rarely found for Indonesian text. Therefore, through this research, a text feature extraction will be implemented for Indonesian text. The research method consists of data acquisition, text pre-processing, initialization, topic sampling and evaluation. The evaluation is done by comparing Precision, Recall and F-Measure value between Latent Dirichlet Allocation and Term Frequency Inverse Document Frequency KMeans which commonly used for feature extraction. The evaluation results show that Precision, Recall and F-Measure value of Latent Dirichlet Allocation method is higher than Term Frequency Inverse Document Frequency KMeans method. This shows that Latent Dirichlet Allocation method is able to extract features and cluster Indonesian text better than Term Frequency Inverse Document Frequency KMeans method.
Evangelopoulos, Nicholas E
2013-11-01
This article reviews latent semantic analysis (LSA), a theory of meaning as well as a method for extracting that meaning from passages of text, based on statistical computations over a collection of documents. LSA as a theory of meaning defines a latent semantic space where documents and individual words are represented as vectors. LSA as a computational technique uses linear algebra to extract dimensions that represent that space. This representation enables the computation of similarity among terms and documents, categorization of terms and documents, and summarization of large collections of documents using automated procedures that mimic the way humans perform similar cognitive tasks. We present some technical details, various illustrative examples, and discuss a number of applications from linguistics, psychology, cognitive science, education, information science, and analysis of textual data in general. WIREs Cogn Sci 2013, 4:683-692. doi: 10.1002/wcs.1254 CONFLICT OF INTEREST: The author has declared no conflicts of interest for this article. For further resources related to this article, please visit the WIREs website. © 2013 John Wiley & Sons, Ltd.
Script-independent text line segmentation in freestyle handwritten documents.
Li, Yi; Zheng, Yefeng; Doermann, David; Jaeger, Stefan; Li, Yi
2008-08-01
Text line segmentation in freestyle handwritten documents remains an open document analysis problem. Curvilinear text lines and small gaps between neighboring text lines present a challenge to algorithms developed for machine printed or hand-printed documents. In this paper, we propose a novel approach based on density estimation and a state-of-the-art image segmentation technique, the level set method. From an input document image, we estimate a probability map, where each element represents the probability that the underlying pixel belongs to a text line. The level set method is then exploited to determine the boundary of neighboring text lines by evolving an initial estimate. Unlike connected component based methods ( [1], [2] for example), the proposed algorithm does not use any script-specific knowledge. Extensive quantitative experiments on freestyle handwritten documents with diverse scripts, such as Arabic, Chinese, Korean, and Hindi, demonstrate that our algorithm consistently outperforms previous methods [1]-[3]. Further experiments show the proposed algorithm is robust to scale change, rotation, and noise.
Liu, Ying; Lita, Lucian Vlad; Niculescu, Radu Stefan; Mitra, Prasenjit; Giles, C Lee
2008-11-06
Owing to new advances in computer hardware, large text databases have become more prevalent than ever.Automatically mining information from these databases proves to be a challenge due to slow pattern/string matching techniques. In this paper we present a new, fast multi-string pattern matching method based on the well known Aho-Chorasick algorithm. Advantages of our algorithm include:the ability to exploit the natural structure of text, the ability to perform significant character shifting, avoiding backtracking jumps that are not useful, efficiency in terms of matching time and avoiding the typical "sub-string" false positive errors.Our algorithm is applicable to many fields with free text, such as the health care domain and the scientific document field. In this paper, we apply the BSS algorithm to health care data and mine hundreds of thousands of medical concepts from a large Electronic Medical Record (EMR) corpora simultaneously and efficiently. Experimental results show the superiority of our algorithm when compared with the top of the line multi-string matching algorithms.
Document segmentation for high-quality printing
NASA Astrophysics Data System (ADS)
Ancin, Hakan
1997-04-01
A technique to segment dark texts on light background of mixed mode color documents is presented. This process does not perceptually change graphics and photo regions. Color documents are scanned and printed from various media which usually do not have clean background. This is especially the case for the printouts generated from thin magazine samples, these printouts usually include text and figures form the back of the page, which is called bleeding. Removal of bleeding artifacts improves the perceptual quality of the printed document and reduces the color ink usage. By detecting the light background of the document, these artifacts are removed from background regions. Also detection of dark text regions enables the halftoning algorithms to use true black ink for the black text pixels instead of composite black. The processed document contains sharp black text on white background, resulting improved perceptual quality and better ink utilization. The described method is memory efficient and requires a small number of scan lines of high resolution color documents during processing.
Computation of term dominance in text documents
Bauer, Travis L [Albuquerque, NM; Benz, Zachary O [Albuquerque, NM; Verzi, Stephen J [Albuquerque, NM
2012-04-24
An improved entropy-based term dominance metric useful for characterizing a corpus of text documents, and is useful for comparing the term dominance metrics of a first corpus of documents to a second corpus having a different number of documents.
Kabir, Muhammad N.; Alginahi, Yasser M.
2014-01-01
This paper addresses the problems and threats associated with verification of integrity, proof of authenticity, tamper detection, and copyright protection for digital-text content. Such issues were largely addressed in the literature for images, audio, and video, with only a few papers addressing the challenge of sensitive plain-text media under known constraints. Specifically, with text as the predominant online communication medium, it becomes crucial that techniques are deployed to protect such information. A number of digital-signature, hashing, and watermarking schemes have been proposed that essentially bind source data or embed invisible data in a cover media to achieve its goal. While many such complex schemes with resource redundancies are sufficient in offline and less-sensitive texts, this paper proposes a hybrid approach based on zero-watermarking and digital-signature-like manipulations for sensitive text documents in order to achieve content originality and integrity verification without physically modifying the cover text in anyway. The proposed algorithm was implemented and shown to be robust against undetected content modifications and is capable of confirming proof of originality whilst detecting and locating deliberate/nondeliberate tampering. Additionally, enhancements in resource utilisation and reduced redundancies were achieved in comparison to traditional encryption-based approaches. Finally, analysis and remarks are made about the current state of the art, and future research issues are discussed under the given constraints. PMID:25254247
Text Mining in Biomedical Domain with Emphasis on Document Clustering
2017-01-01
Objectives With the exponential increase in the number of articles published every year in the biomedical domain, there is a need to build automated systems to extract unknown information from the articles published. Text mining techniques enable the extraction of unknown knowledge from unstructured documents. Methods This paper reviews text mining processes in detail and the software tools available to carry out text mining. It also reviews the roles and applications of text mining in the biomedical domain. Results Text mining processes, such as search and retrieval of documents, pre-processing of documents, natural language processing, methods for text clustering, and methods for text classification are described in detail. Conclusions Text mining techniques can facilitate the mining of vast amounts of knowledge on a given topic from published biomedical research articles and draw meaningful conclusions that are not possible otherwise. PMID:28875048
Investigation into Text Classification With Kernel Based Schemes
2010-03-01
Document Matrix TDMs Term-Document Matrices TMG Text to Matrix Generator TN True Negative TP True Positive VSM Vector Space Model xxii THIS PAGE...are represented as a term-document matrix, common evaluation metrics, and the software package Text to Matrix Generator ( TMG ). The classifier...AND METRICS This chapter introduces the indexing capabilities of the Text to Matrix Generator ( TMG ) Toolbox. Specific attention is placed on the
Knowledge based word-concept model estimation and refinement for biomedical text mining.
Jimeno Yepes, Antonio; Berlanga, Rafael
2015-02-01
Text mining of scientific literature has been essential for setting up large public biomedical databases, which are being widely used by the research community. In the biomedical domain, the existence of a large number of terminological resources and knowledge bases (KB) has enabled a myriad of machine learning methods for different text mining related tasks. Unfortunately, KBs have not been devised for text mining tasks but for human interpretation, thus performance of KB-based methods is usually lower when compared to supervised machine learning methods. The disadvantage of supervised methods though is they require labeled training data and therefore not useful for large scale biomedical text mining systems. KB-based methods do not have this limitation. In this paper, we describe a novel method to generate word-concept probabilities from a KB, which can serve as a basis for several text mining tasks. This method not only takes into account the underlying patterns within the descriptions contained in the KB but also those in texts available from large unlabeled corpora such as MEDLINE. The parameters of the model have been estimated without training data. Patterns from MEDLINE have been built using MetaMap for entity recognition and related using co-occurrences. The word-concept probabilities were evaluated on the task of word sense disambiguation (WSD). The results showed that our method obtained a higher degree of accuracy than other state-of-the-art approaches when evaluated on the MSH WSD data set. We also evaluated our method on the task of document ranking using MEDLINE citations. These results also showed an increase in performance over existing baseline retrieval approaches. Copyright © 2014 Elsevier Inc. All rights reserved.
Clustering and Recurring Anomaly Identification: Recurring Anomaly Detection System (ReADS)
NASA Technical Reports Server (NTRS)
McIntosh, Dawn
2006-01-01
This viewgraph presentation reviews the Recurring Anomaly Detection System (ReADS). The Recurring Anomaly Detection System is a tool to analyze text reports, such as aviation reports and maintenance records: (1) Text clustering algorithms group large quantities of reports and documents; Reduces human error and fatigue (2) Identifies interconnected reports; Automates the discovery of possible recurring anomalies; (3) Provides a visualization of the clusters and recurring anomalies We have illustrated our techniques on data from Shuttle and ISS discrepancy reports, as well as ASRS data. ReADS has been integrated with a secure online search
Identifying Issue Frames in Text
Sagi, Eyal; Diermeier, Daniel; Kaufmann, Stefan
2013-01-01
Framing, the effect of context on cognitive processes, is a prominent topic of research in psychology and public opinion research. Research on framing has traditionally relied on controlled experiments and manually annotated document collections. In this paper we present a method that allows for quantifying the relative strengths of competing linguistic frames based on corpus analysis. This method requires little human intervention and can therefore be efficiently applied to large bodies of text. We demonstrate its effectiveness by tracking changes in the framing of terror over time and comparing the framing of abortion by Democrats and Republicans in the U.S. PMID:23874909
Video Browsing on Handheld Devices
NASA Astrophysics Data System (ADS)
Hürst, Wolfgang
Recent improvements in processing power, storage space, and video codec development enable users now to playback video on their handheld devices in a reasonable quality. However, given the form factor restrictions of such a mobile device, screen size still remains a natural limit and - as the term "handheld" implies - always will be a critical resource. This is not only true for video but any data that is processed on such devices. For this reason, developers have come up with new and innovative ways to deal with large documents in such limited scenarios. For example, if you look at the iPhone, innovative techniques such as flicking have been introduced to skim large lists of text (e.g. hundreds of entries in your music collection). Automatically adapting the zoom level to, for example, the width of table cells when double tapping on the screen enables reasonable browsing of web pages that have originally been designed for large, desktop PC sized screens. A multi touch interface allows you to easily zoom in and out of large text documents and images using two fingers. In the next section, we will illustrate that advanced techniques to browse large video files have been developed in the past years, as well. However, if you look at state-of-the-art video players on mobile devices, normally just simple, VCR like controls are supported (at least at the time of this writing) that only allow users to just start, stop, and pause video playback. If supported at all, browsing and navigation functionality is often restricted to simple skipping of chapters via two single buttons for backward and forward navigation and a small and thus not very sensitive timeline slider.
Document retrieval on repetitive string collections.
Gagie, Travis; Hartikainen, Aleksi; Karhu, Kalle; Kärkkäinen, Juha; Navarro, Gonzalo; Puglisi, Simon J; Sirén, Jouni
2017-01-01
Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists , that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top- k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple [Formula: see text] model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.
Multi-font printed Mongolian document recognition system
NASA Astrophysics Data System (ADS)
Peng, Liangrui; Liu, Changsong; Ding, Xiaoqing; Wang, Hua; Jin, Jianming
2009-01-01
Mongolian is one of the major ethnic languages in China. Large amount of Mongolian printed documents need to be digitized in digital library and various applications. Traditional Mongolian script has unique writing style and multi-font-type variations, which bring challenges to Mongolian OCR research. As traditional Mongolian script has some characteristics, for example, one character may be part of another character, we define the character set for recognition according to the segmented components, and the components are combined into characters by rule-based post-processing module. For character recognition, a method based on visual directional feature and multi-level classifiers is presented. For character segmentation, a scheme is used to find the segmentation point by analyzing the properties of projection and connected components. As Mongolian has different font-types which are categorized into two major groups, the parameter of segmentation is adjusted for each group. A font-type classification method for the two font-type group is introduced. For recognition of Mongolian text mixed with Chinese and English, language identification and relevant character recognition kernels are integrated. Experiments show that the presented methods are effective. The text recognition rate is 96.9% on the test samples from practical documents with multi-font-types and mixed scripts.
Using Bitmap Indexing Technology for Combined Numerical and TextQueries
DOE Office of Scientific and Technical Information (OSTI.GOV)
Stockinger, Kurt; Cieslewicz, John; Wu, Kesheng
2006-10-16
In this paper, we describe a strategy of using compressedbitmap indices to speed up queries on both numerical data and textdocuments. By using an efficient compression algorithm, these compressedbitmap indices are compact even for indices with millions of distinctterms. Moreover, bitmap indices can be used very efficiently to answerBoolean queries over text documents involving multiple query terms.Existing inverted indices for text searches are usually inefficient forcorpora with a very large number of terms as well as for queriesinvolving a large number of hits. We demonstrate that our compressedbitmap index technology overcomes both of those short-comings. In aperformance comparison against amore » commonly used database system, ourindices answer queries 30 times faster on average. To provide full SQLsupport, we integrated our indexing software, called FastBit, withMonetDB. The integrated system MonetDB/FastBit provides not onlyefficient searches on a single table as FastBit does, but also answersjoin queries efficiently. Furthermore, MonetDB/FastBit also provides avery efficient retrieval mechanism of result records.« less
Techniques of Document Management: A Review of Text Retrieval and Related Technologies.
ERIC Educational Resources Information Center
Veal, D. C.
2001-01-01
Reviews present and possible future developments in the techniques of electronic document management, the major ones being text retrieval and scanning and OCR (optical character recognition). Also addresses document acquisition, indexing and thesauri, publishing and dissemination standards, impact of the Internet, and the document management…
DOE Office of Scientific and Technical Information (OSTI.GOV)
De Volpi, A.; Fenrick, M. R.; Stanford, G. S.
1980-10-01
Documentation often is a primary residual of research and development. Because of this important role and because of the large amount of time consumed in generating technical reports, particularly those containing formulas and graphics, an existing data-processing computer system has been adapted so as to provide text-processing of technical documents. Emphasis has been on accuracy, turnaround time, and time savings for staff and secretaries, for the types of reports normally produced in the reactor development program. The computer-assisted text-processing system, called TXT, has been implemented to benefit primarily the originator of technical reports. The system is of particular value tomore » professional staff, such as scientists and engineers, who have responsibility for generating much correspondence or lengthy, complex reports or manuscripts - especially if prompt turnaround and high accuracy are required. It can produce text that contains special Greek or mathematical symbols. Written in FORTRAN and MACRO, the program TXT operates on a PDP-11 minicomputer under the RSX-11M multitask multiuser monitor. Peripheral hardware includes videoterminals, electrostatic printers, and magnetic disks. Either data- or word-processing tasks may be performed at the terminals. The repertoire of operations has been restricted so as to minimize user training and memory burden. Spectarial staff may be readily trained to make corrections from annotated copy. Some examples of camera-ready copy are provided.« less
Detection of text strings from mixed text/graphics images
NASA Astrophysics Data System (ADS)
Tsai, Chien-Hua; Papachristou, Christos A.
2000-12-01
A robust system for text strings separation from mixed text/graphics images is presented. Based on a union-find (region growing) strategy the algorithm is thus able to classify the text from graphics and adapts to changes in document type, language category (e.g., English, Chinese and Japanese), text font style and size, and text string orientation within digital images. In addition, it allows for a document skew that usually occurs in documents, without skew correction prior to discrimination while these proposed methods such a projection profile or run length coding are not always suitable for the condition. The method has been tested with a variety of printed documents from different origins with one common set of parameters, and the experimental results of the performance of the algorithm in terms of computational efficiency are demonstrated by using several tested images from the evaluation.
Narayana, D B Anantha; Durg, Sharanbasappa; Manohar, P Ram; Mahapatra, Anita; Aramya, A R
2017-02-02
Chyawanprash (CP), a traditional immune booster recipe, has a long history of ethnic origin, development, household preparation and usage. There are even mythological stories about the origin of this recipe including its nomenclature. In the last six decades, CP, because of entrepreneurial actions of some research Vaidyas (traditional doctors) has grown to industrial production and marketing in packed forms to a large number of consumers/patients like any food or health care product. Currently, CP has acquired a large accepted user base in India and in a few countries out-side India. Authoritative texts, recognized by the Drugs and Cosmetics Act of India, describe CP as an immunity enhancer and strength giver meant for improving lung functions in diseases with compromised immunity. This review focuses on published clinical efficacy and safety studies of CP for correlation with health benefits as documented in the authoritative texts, and also briefs on its recipes and processes. Authoritative texts were searched for recipes, processes, and other technical details of CP. Labels of marketing CP products (Indian) were studied for the health claims. Electronic search for studies of CP on efficacy and safety data were performed in PubMed/MEDLINE and DHARA (Digital Helpline for Ayurveda Research Articles), and Ayurvedic books were also searched for clinical studies. The documented clinical studies from electronic databases and Ayurvedic books evidenced that individuals who consume CP regularly for a definite period of time showed improvement in overall health status and immunity. However, most of the clinical studies in this review are of smaller sample size and short duration. Further, limitation to access and review significant data on traditional products like CP in electronic databases was noted. Randomized controlled trials of high quality with larger sample size and longer follow-up are needed to have significant evidence on the clinical use of CP as immunity booster. Additional studies involving measurement of current biomarkers of immunity pre- and post-consumption of the product as well as benefits accruing with the use of CP as an adjuvant are suggested. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Graph-based layout analysis for PDF documents
NASA Astrophysics Data System (ADS)
Xu, Canhui; Tang, Zhi; Tao, Xin; Li, Yun; Shi, Cao
2013-03-01
To increase the flexibility and enrich the reading experience of e-book on small portable screens, a graph based method is proposed to perform layout analysis on Portable Document Format (PDF) documents. Digital born document has its inherent advantages like representing texts and fractional images in explicit form, which can be straightforwardly exploited. To integrate traditional image-based document analysis and the inherent meta-data provided by PDF parser, the page primitives including text, image and path elements are processed to produce text and non text layer for respective analysis. Graph-based method is developed in superpixel representation level, and page text elements corresponding to vertices are used to construct an undirected graph. Euclidean distance between adjacent vertices is applied in a top-down manner to cut the graph tree formed by Kruskal's algorithm. And edge orientation is then used in a bottom-up manner to extract text lines from each sub tree. On the other hand, non-textual objects are segmented by connected component analysis. For each segmented text and non-text composite, a 13-dimensional feature vector is extracted for labelling purpose. The experimental results on selected pages from PDF books are presented.
Telemetry Standards, RCC Standard 106-17, Annex A.1, Pulse Amplitude Modulation Standards
2017-07-01
conform to either Figure Error! No text of specified style in document.-1 or Figure Error! No text of specified style in document.-2. Figure Error...No text of specified style in document.-1. 50 percent duty cycle PAM with amplitude synchronization A 20-25 percent deviation reserved for pulse...synchronization is recommended. Telemetry Standards, RCC Standard 106-17 Annex A.1, July 2017 A.1.2 Figure Error! No text of specified style
Reading and Writing in the 21st Century.
ERIC Educational Resources Information Center
Soloway, Elliot; And Others
1993-01-01
Describes MediaText, a multimedia document processor developed at the University of Michigan that allows the incorporation of video, music, sound, animations, still images, and text into one document. Interactive documents are discussed, and the need for users to be able to write documents as well as read them is emphasized. (four references) (LRW)
Electronic Documentation Support Tools and Text Duplication in the Electronic Medical Record
ERIC Educational Resources Information Center
Wrenn, Jesse
2010-01-01
In order to ease the burden of electronic note entry on physicians, electronic documentation support tools have been developed to assist in note authoring. There is little evidence of the effects of these tools on attributes of clinical documentation, including document quality. Furthermore, the resultant abundance of duplicated text and…
Semi-Automated Methods for Refining a Domain-Specific Terminology Base
2011-02-01
only as a resource for written and oral translation, but also for Natural Language Processing ( NLP ) applications, text retrieval, document indexing...Natural Language Processing ( NLP ) applications, text retrieval, document indexing, and other knowledge management tasks. The objective of this...also for Natural Language Processing ( NLP ) applications, text retrieval (1), document indexing, and other knowledge management tasks. The National
Thematic clustering of text documents using an EM-based approach
2012-01-01
Clustering textual contents is an important step in mining useful information on the web or other text-based resources. The common task in text clustering is to handle text in a multi-dimensional space, and to partition documents into groups, where each group contains documents that are similar to each other. However, this strategy lacks a comprehensive view for humans in general since it cannot explain the main subject of each cluster. Utilizing semantic information can solve this problem, but it needs a well-defined ontology or pre-labeled gold standard set. In this paper, we present a thematic clustering algorithm for text documents. Given text, subject terms are extracted and used for clustering documents in a probabilistic framework. An EM approach is used to ensure documents are assigned to correct subjects, hence it converges to a locally optimal solution. The proposed method is distinctive because its results are sufficiently explanatory for human understanding as well as efficient for clustering performance. The experimental results show that the proposed method provides a competitive performance compared to other state-of-the-art approaches. We also show that the extracted themes from the MEDLINE® dataset represent the subjects of clusters reasonably well. PMID:23046528
Document Set Differentiability Analyzer v. 0.1
DOE Office of Scientific and Technical Information (OSTI.GOV)
Osborn, Thor D.
Software is a JMP Scripting Language (JSL) script designed to evaluate the differentiability of a set of documents that exhibit some conceptual commonalities but are expected to describe substantially different – thus differentiable – categories. The script imports the document set, a subset of which may be partitioned into an additions pool. The bulk of the documents form a basis pool. Text analysis is applied to the basis pool to extract a mathematical representation of its conceptual content, referred to as the document concept space. A bootstrapping approach is applied to that mathematical representation in order to generate a representationmore » of a large population of randomly designed documents that could be written within the concept space, notably without actually writing the text of those documents.The Kolmogorov-Smirnov test is applied to determine whether the basis pool document set exhibits superior differentiation relative to the randomly designed virtual documents produced by bootstrapping. If an additions pool exists, the documents are incrementally added to the basis pool, choosing the best differentiated remaining document at each step. In this manner the impact of additional categories to overall document set differentiability may be assessed.The software was developed to assess the differentiability of job description document sets. Differentiability is key to meaningful categorization. Poor job differentiation may have economic, ethical, and/or legal implications for an organization. Job categories are used in the assignment of market-based salaries; consequently, poor differentiation of job duties may set the stage for legal challenges if very similar jobs pay differently depending on title, a circumstance that also invites economic waste.The software can be applied to ensure job description set differentiability, reducing legal, economic, and ethical risks to an organization and its people. The extraction of the conceptual space to a mathematical representation enables identification of exceedingly similar documents. In the event of redundancy, two jobs may be collapsed into one. If in the judgment of the subject matter experts the jobs are truly different, the conceptual similarities are highlighted, inviting inclusion of appropriate descriptive content to explicitly characterize those differences. When additional job categories may be needed as the organization changes, the software enables evaluation of proposed additions to ensure that the resulting document set remains adequately differentiated.« less
Challenges and methodology for indexing the computerized patient record.
Ehrler, Frédéric; Ruch, Patrick; Geissbuhler, Antoine; Lovis, Christian
2007-01-01
Patient records contain most crucial documents for managing the treatments and healthcare of patients in the hospital. Retrieving information from these records in an easy, quick and safe way helps care providers to save time and find important facts about their patient's health. This paper presents the scalability issues induced by the indexing and the retrieval of the information contained in the patient records. For this study, EasyIR, an information retrieval tool performing full text queries and retrieving the related documents has been used. An evaluation of the performance reveals that the indexing process suffers from overhead consequence of the particular structure of the patient records. Most IR tools are designed to manage very large numbers of documents in a single index whereas in our hypothesis, one index per record, which usually implies few documents, has been imposed. As the number of modifications and creations of patient records are significant in a day, using a specialized and efficient indexation tool is required.
Fast words boundaries localization in text fields for low quality document images
NASA Astrophysics Data System (ADS)
Ilin, Dmitry; Novikov, Dmitriy; Polevoy, Dmitry; Nikolaev, Dmitry
2018-04-01
The paper examines the problem of word boundaries precise localization in document text zones. Document processing on a mobile device consists of document localization, perspective correction, localization of individual fields, finding words in separate zones, segmentation and recognition. While capturing an image with a mobile digital camera under uncontrolled capturing conditions, digital noise, perspective distortions or glares may occur. Further document processing gets complicated because of its specifics: layout elements, complex background, static text, document security elements, variety of text fonts. However, the problem of word boundaries localization has to be solved at runtime on mobile CPU with limited computing capabilities under specified restrictions. At the moment, there are several groups of methods optimized for different conditions. Methods for the scanned printed text are quick but limited only for images of high quality. Methods for text in the wild have an excessively high computational complexity, thus, are hardly suitable for running on mobile devices as part of the mobile document recognition system. The method presented in this paper solves a more specialized problem than the task of finding text on natural images. It uses local features, a sliding window and a lightweight neural network in order to achieve an optimal algorithm speed-precision ratio. The duration of the algorithm is 12 ms per field running on an ARM processor of a mobile device. The error rate for boundaries localization on a test sample of 8000 fields is 0.3
Means of storage and automated monitoring of versions of text technical documentation
NASA Astrophysics Data System (ADS)
Leonovets, S. A.; Shukalov, A. V.; Zharinov, I. O.
2018-03-01
The paper presents automation of the process of preparation, storage and monitoring of version control of a text designer, and program documentation by means of the specialized software is considered. Automation of preparation of documentation is based on processing of the engineering data which are contained in the specifications and technical documentation or in the specification. Data handling assumes existence of strictly structured electronic documents prepared in widespread formats according to templates on the basis of industry standards and generation by an automated method of the program or designer text document. Further life cycle of the document and engineering data entering it are controlled. At each stage of life cycle, archive data storage is carried out. Studies of high-speed performance of use of different widespread document formats in case of automated monitoring and storage are given. The new developed software and the work benches available to the developer of the instrumental equipment are described.
2012-01-01
Background To establish a common database on particle therapy for the evaluation of clinical studies integrating a large variety of voluminous datasets, different documentation styles, and various information systems, especially in the field of radiation oncology. Methods We developed a web-based documentation system for transnational and multicenter clinical studies in particle therapy. 560 patients have been treated from November 2009 to September 2011. Protons, carbon ions or a combination of both, as well as a combination with photons were applied. To date, 12 studies have been initiated and more are in preparation. Results It is possible to immediately access all patient information and exchange, store, process, and visualize text data, any DICOM images and multimedia data. Accessing the system and submitting clinical data is possible for internal and external users. Integrated into the hospital environment, data is imported both manually and automatically. Security and privacy protection as well as data validation and verification are ensured. Studies can be designed to fit individual needs. Conclusions The described database provides a basis for documentation of large patient groups with specific and specialized questions to be answered. Having recently begun electronic documentation, it has become apparent that the benefits lie in the user-friendly and timely workflow for documentation. The ultimate goal is a simplification of research work, better study analyses quality and eventually, the improvement of treatment concepts by evaluating the effectiveness of particle therapy. PMID:22828013
Identifying duplicate content using statistically improbable phrases
Errami, Mounir; Sun, Zhaohui; George, Angela C.; Long, Tara C.; Skinner, Michael A.; Wren, Jonathan D.; Garner, Harold R.
2010-01-01
Motivation: Document similarity metrics such as PubMed's ‘Find related articles’ feature, which have been primarily used to identify studies with similar topics, can now also be used to detect duplicated or potentially plagiarized papers within literature reference databases. However, the CPU-intensive nature of document comparison has limited MEDLINE text similarity studies to the comparison of abstracts, which constitute only a small fraction of a publication's total text. Extending searches to include text archived by online search engines would drastically increase comparison ability. For large-scale studies, submitting short phrases encased in direct quotes to search engines for exact matches would be optimal for both individual queries and programmatic interfaces. We have derived a method of analyzing statistically improbable phrases (SIPs) for assistance in identifying duplicate content. Results: When applied to MEDLINE citations, this method substantially improves upon previous algorithms in the detection of duplication citations, yielding a precision and recall of 78.9% (versus 50.3% for eTBLAST) and 99.6% (versus 99.8% for eTBLAST), respectively. Availability: Similar citations identified by this work are freely accessible in the Déjà vu database, under the SIP discovery method category at http://dejavu.vbi.vt.edu/dejavu/ Contact: merrami@collin.edu PMID:20472545
On the use of the singular value decomposition for text retrieval
DOE Office of Scientific and Technical Information (OSTI.GOV)
Husbands, P.; Simon, H.D.; Ding, C.
2000-12-04
The use of the Singular Value Decomposition (SVD) has been proposed for text retrieval in several recent works. This technique uses the SVD to project very high dimensional document and query vectors into a low dimensional space. In this new space it is hoped that the underlying structure of the collection is revealed thus enhancing retrieval performance. Theoretical results have provided some evidence for this claim and to some extent experiments have confirmed this. However, these studies have mostly used small test collections and simplified document models. In this work we investigate the use of the SVD on large documentmore » collections. We show that, if interpreted as a mechanism for representing the terms of the collection, this technique alone is insufficient for dealing with the variability in term occurrence. Section 2 introduces the text retrieval concepts necessary for our work. A short description of our experimental architecture is presented in Section 3. Section 4 describes how term occurrence variability affects the SVD and then shows how the decomposition influences retrieval performance. A possible way of improving SVD-based techniques is presented in Section 5 and concluded in Section 6.« less
ERIC Educational Resources Information Center
Giordano, Richard
1994-01-01
Describes the Text Encoding Initiative (TEI) project and the TEI header, which documents electronic text in a standard interchange format understandable to both librarian catalogers and nonlibrarian text encoders. The form and function of the TEI header is introduced, and its relationship to the MARC record is explained. (10 references) (KRN)
Handwritten text line segmentation by spectral clustering
NASA Astrophysics Data System (ADS)
Han, Xuecheng; Yao, Hui; Zhong, Guoqiang
2017-02-01
Since handwritten text lines are generally skewed and not obviously separated, text line segmentation of handwritten document images is still a challenging problem. In this paper, we propose a novel text line segmentation algorithm based on the spectral clustering. Given a handwritten document image, we convert it to a binary image first, and then compute the adjacent matrix of the pixel points. We apply spectral clustering on this similarity metric and use the orthogonal kmeans clustering algorithm to group the text lines. Experiments on Chinese handwritten documents database (HIT-MW) demonstrate the effectiveness of the proposed method.
Machine printed text and handwriting identification in noisy document images.
Zheng, Yefeng; Li, Huiping; Doermann, David
2004-03-01
In this paper, we address the problem of the identification of text in noisy document images. We are especially focused on segmenting and identifying between handwriting and machine printed text because: 1) Handwriting in a document often indicates corrections, additions, or other supplemental information that should be treated differently from the main content and 2) the segmentation and recognition techniques requested for machine printed and handwritten text are significantly different. A novel aspect of our approach is that we treat noise as a separate class and model noise based on selected features. Trained Fisher classifiers are used to identify machine printed text and handwriting from noise and we further exploit context to refine the classification. A Markov Random Field-based (MRF) approach is used to model the geometrical structure of the printed text, handwriting, and noise to rectify misclassifications. Experimental results show that our approach is robust and can significantly improve page segmentation in noisy document collections.
The Text Encoding Initiative: Flexible and Extensible Document Encoding.
ERIC Educational Resources Information Center
Barnard, David T.; Ide, Nancy M.
1997-01-01
The Text Encoding Initiative (TEI), an international collaboration aimed at producing a common encoding scheme for complex texts, examines the requirement for generality versus the requirement to handle specialized text types. Discusses how documents and users tax the limits of fixed schemes requiring flexible extensible encoding to support…
BioTextQuest: a web-based biomedical text mining suite for concept discovery.
Papanikolaou, Nikolas; Pafilis, Evangelos; Nikolaou, Stavros; Ouzounis, Christos A; Iliopoulos, Ioannis; Promponas, Vasilis J
2011-12-01
BioTextQuest combines automated discovery of significant terms in article clusters with structured knowledge annotation, via Named Entity Recognition services, offering interactive user-friendly visualization. A tag-cloud-based illustration of terms labeling each document cluster are semantically annotated according to the biological entity, and a list of document titles enable users to simultaneously compare terms and documents of each cluster, facilitating concept association and hypothesis generation. BioTextQuest allows customization of analysis parameters, e.g. clustering/stemming algorithms, exclusion of documents/significant terms, to better match the biological question addressed. http://biotextquest.biol.ucy.ac.cy vprobon@ucy.ac.cy; iliopj@med.uoc.gr Supplementary data are available at Bioinformatics online.
Yuan, Soe-Tsyr; Sun, Jerry
2005-10-01
Development of algorithms for automated text categorization in massive text document sets is an important research area of data mining and knowledge discovery. Most of the text-clustering methods were grounded in the term-based measurement of distance or similarity, ignoring the structure of the documents. In this paper, we present a novel method named structured cosine similarity (SCS) that furnishes document clustering with a new way of modeling on document summarization, considering the structure of the documents so as to improve the performance of document clustering in terms of quality, stability, and efficiency. This study was motivated by the problem of clustering speech documents (of no rich document features) attained from the wireless experience oral sharing conducted by mobile workforce of enterprises, fulfilling audio-based knowledge management. In other words, this problem aims to facilitate knowledge acquisition and sharing by speech. The evaluations also show fairly promising results on our method of structured cosine similarity.
Page layout analysis and classification for complex scanned documents
NASA Astrophysics Data System (ADS)
Erkilinc, M. Sezer; Jaber, Mustafa; Saber, Eli; Bauer, Peter; Depalov, Dejan
2011-09-01
A framework for region/zone classification in color and gray-scale scanned documents is proposed in this paper. The algorithm includes modules for extracting text, photo, and strong edge/line regions. Firstly, a text detection module which is based on wavelet analysis and Run Length Encoding (RLE) technique is employed. Local and global energy maps in high frequency bands of the wavelet domain are generated and used as initial text maps. Further analysis using RLE yields a final text map. The second module is developed to detect image/photo and pictorial regions in the input document. A block-based classifier using basis vector projections is employed to identify photo candidate regions. Then, a final photo map is obtained by applying probabilistic model based on Markov random field (MRF) based maximum a posteriori (MAP) optimization with iterated conditional mode (ICM). The final module detects lines and strong edges using Hough transform and edge-linkages analysis, respectively. The text, photo, and strong edge/line maps are combined to generate a page layout classification of the scanned target document. Experimental results and objective evaluation show that the proposed technique has a very effective performance on variety of simple and complex scanned document types obtained from MediaTeam Oulu document database. The proposed page layout classifier can be used in systems for efficient document storage, content based document retrieval, optical character recognition, mobile phone imagery, and augmented reality.
NASA Technical Reports Server (NTRS)
Driscoll, James N.
1994-01-01
The high-speed data search system developed for KSC incorporates existing and emerging information retrieval technology to help a user intelligently and rapidly locate information found in large textual databases. This technology includes: natural language input; statistical ranking of retrieved information; an artificial intelligence concept called semantics, where 'surface level' knowledge found in text is used to improve the ranking of retrieved information; and relevance feedback, where user judgements about viewed information are used to automatically modify the search for further information. Semantics and relevance feedback are features of the system which are not available commercially. The system further demonstrates focus on paragraphs of information to decide relevance; and it can be used (without modification) to intelligently search all kinds of document collections, such as collections of legal documents medical documents, news stories, patents, and so forth. The purpose of this paper is to demonstrate the usefulness of statistical ranking, our semantic improvement, and relevance feedback.
NASA Astrophysics Data System (ADS)
Erickson, T. A.; Granger, B.; Grout, J.; Corlay, S.
2017-12-01
The volume of Earth science data gathered from satellites, aircraft, drones, and field instruments continues to increase. For many scientific questions in the Earth sciences, managing this large volume of data is a barrier to progress, as it is difficult to explore and analyze large volumes of data using the traditional paradigm of downloading datasets to a local computer for analysis. Furthermore, methods for communicating Earth science algorithms that operate on large datasets in an easily understandable and reproducible way are needed. Here we describe a system for developing, interacting, and sharing well-documented Earth Science algorithms that combines existing software components: Jupyter Notebook: An open-source, web-based environment that supports documents that combine code and computational results with text narrative, mathematics, images, and other media. These notebooks provide an environment for interactive exploration of data and development of well documented algorithms. Jupyter Widgets / ipyleaflet: An architecture for creating interactive user interface controls (such as sliders, text boxes, etc.) in Jupyter Notebooks that communicate with Python code. This architecture includes a default set of UI controls (sliders, dropboxes, etc.) as well as APIs for building custom UI controls. The ipyleaflet project is one example that offers a custom interactive map control that allows a user to display and manipulate geographic data within the Jupyter Notebook. Google Earth Engine: A cloud-based geospatial analysis platform that provides access to petabytes of Earth science data via a Python API. The combination of Jupyter Notebooks, Jupyter Widgets, ipyleaflet, and Google Earth Engine makes it possible to explore and analyze massive Earth science datasets via a web browser, in an environment suitable for interactive exploration, teaching, and sharing. Using these environments can make Earth science analyses easier to understand and reproducible, which may increase the rate of scientific discoveries and the transition of discoveries into real-world impacts.
Representation-based user interfaces for the audiovisual library of the year 2000
NASA Astrophysics Data System (ADS)
Aigrain, Philippe; Joly, Philippe; Lepain, Philippe; Longueville, Veronique
1995-03-01
The audiovisual library of the future will be based on computerized access to digitized documents. In this communication, we address the user interface issues which will arise from this new situation. One cannot simply transfer a user interface designed for the piece by piece production of some audiovisual presentation and make it a tool for accessing full-length movies in an electronic library. One cannot take a digital sound editing tool and propose it as a means to listen to a musical recording. In our opinion, when computers are used as mediations to existing contents, document representation-based user interfaces are needed. With such user interfaces, a structured visual representation of the document contents is presented to the user, who can then manipulate it to control perception and analysis of these contents. In order to build such manipulable visual representations of audiovisual documents, one needs to automatically extract structural information from the documents contents. In this communication, we describe possible visual interfaces for various temporal media, and we propose methods for the economically feasible large scale processing of documents. The work presented is sponsored by the Bibliotheque Nationale de France: it is part of the program aiming at developing for image and sound documents an experimental counterpart to the digitized text reading workstation of this library.
Clustering of Farsi sub-word images for whole-book recognition
NASA Astrophysics Data System (ADS)
Soheili, Mohammad Reza; Kabir, Ehsanollah; Stricker, Didier
2015-01-01
Redundancy of word and sub-word occurrences in large documents can be effectively utilized in an OCR system to improve recognition results. Most OCR systems employ language modeling techniques as a post-processing step; however these techniques do not use important pictorial information that exist in the text image. In case of large-scale recognition of degraded documents, this information is even more valuable. In our previous work, we proposed a subword image clustering method for the applications dealing with large printed documents. In our clustering method, the ideal case is when all equivalent sub-word images lie in one cluster. To overcome the issues of low print quality, the clustering method uses an image matching algorithm for measuring the distance between two sub-word images. The measured distance with a set of simple shape features were used to cluster all sub-word images. In this paper, we analyze the effects of adding more shape features on processing time, purity of clustering, and the final recognition rate. Previously published experiments have shown the efficiency of our method on a book. Here we present extended experimental results and evaluate our method on another book with totally different font face. Also we show that the number of the new created clusters in a page can be used as a criteria for assessing the quality of print and evaluating preprocessing phases.
GeneView: a comprehensive semantic search engine for PubMed.
Thomas, Philippe; Starlinger, Johannes; Vowinkel, Alexander; Arzt, Sebastian; Leser, Ulf
2012-07-01
Research results are primarily published in scientific literature and curation efforts cannot keep up with the rapid growth of published literature. The plethora of knowledge remains hidden in large text repositories like MEDLINE. Consequently, life scientists have to spend a great amount of time searching for specific information. The enormous ambiguity among most names of biomedical objects such as genes, chemicals and diseases often produces too large and unspecific search results. We present GeneView, a semantic search engine for biomedical knowledge. GeneView is built upon a comprehensively annotated version of PubMed abstracts and openly available PubMed Central full texts. This semi-structured representation of biomedical texts enables a number of features extending classical search engines. For instance, users may search for entities using unique database identifiers or they may rank documents by the number of specific mentions they contain. Annotation is performed by a multitude of state-of-the-art text-mining tools for recognizing mentions from 10 entity classes and for identifying protein-protein interactions. GeneView currently contains annotations for >194 million entities from 10 classes for ∼21 million citations with 271,000 full text bodies. GeneView can be searched at http://bc3.informatik.hu-berlin.de/.
Probing the Topological Properties of Complex Networks Modeling Short Written Texts
Amancio, Diego R.
2015-01-01
In recent years, graph theory has been widely employed to probe several language properties. More specifically, the so-called word adjacency model has been proven useful for tackling several practical problems, especially those relying on textual stylistic analysis. The most common approach to treat texts as networks has simply considered either large pieces of texts or entire books. This approach has certainly worked well—many informative discoveries have been made this way—but it raises an uncomfortable question: could there be important topological patterns in small pieces of texts? To address this problem, the topological properties of subtexts sampled from entire books was probed. Statistical analyses performed on a dataset comprising 50 novels revealed that most of the traditional topological measurements are stable for short subtexts. When the performance of the authorship recognition task was analyzed, it was found that a proper sampling yields a discriminability similar to the one found with full texts. Surprisingly, the support vector machine classification based on the characterization of short texts outperformed the one performed with entire books. These findings suggest that a local topological analysis of large documents might improve its global characterization. Most importantly, it was verified, as a proof of principle, that short texts can be analyzed with the methods and concepts of complex networks. As a consequence, the techniques described here can be extended in a straightforward fashion to analyze texts as time-varying complex networks. PMID:25719799
Eyrolle, Hélène; Virbel, Jacques; Lemarié, Julie
2008-03-01
Based on previous research in the field of cognitive psychology, highlighting the facilitatory effects of titles on several text-related activities, this paper looks at the extent to which titles reflect text content. An exploratory study of real-life technical documents investigated the content of their Subject lines, which linguistic analyses had led us to regard as titles. The study showed that most of the titles supplied by the writers failed to represent the documents' contents and that most users failed to detect this lack of validity.
Semantic retrieval and navigation in clinical document collections.
Kreuzthaler, Markus; Daumke, Philipp; Schulz, Stefan
2015-01-01
Patients with chronic diseases undergo numerous in- and outpatient treatment periods, and therefore many documents accumulate in their electronic records. We report on an on-going project focussing on the semantic enrichment of medical texts, in order to support recall-oriented navigation across a patient's complete documentation. A document pool of 1,696 de-identified discharge summaries was used for prototyping. A natural language processing toolset for document annotation (based on the text-mining framework UIMA) and indexing (Solr) was used to support a browser-based platform for document import, search and navigation. The integrated search engine combines free text and concept-based querying, supported by dynamically generated facets (diagnoses, procedures, medications, lab values, and body parts). The prototype demonstrates the feasibility of semantic document enrichment within document collections of a single patient. Originally conceived as an add-on for the clinical workplace, this technology could also be adapted to support personalised health record platforms, as well as cross-patient search for cohort building and other secondary use scenarios.
Pedoinformatics Approach to Soil Text Analytics
NASA Astrophysics Data System (ADS)
Furey, J.; Seiter, J.; Davis, A.
2017-12-01
The several extant schema for the classification of soils rely on differing criteria, but the major soil science taxonomies, including the United States Department of Agriculture (USDA) and the international harmonized World Reference Base for Soil Resources systems, are based principally on inferred pedogenic properties. These taxonomies largely result from compiled individual observations of soil morphologies within soil profiles, and the vast majority of this pedologic information is contained in qualitative text descriptions. We present text mining analyses of hundreds of gigabytes of parsed text and other data in the digitally available USDA soil taxonomy documentation, the Soil Survey Geographic (SSURGO) database, and the National Cooperative Soil Survey (NCSS) soil characterization database. These analyses implemented iPython calls to Gensim modules for topic modelling, with latent semantic indexing completed down to the lowest taxon level (soil series) paragraphs. Via a custom extension of the Natural Language Toolkit (NLTK), approximately one percent of the USDA soil series descriptions were used to train a classifier for the remainder of the documents, essentially by treating soil science words as comprising a novel language. While location-specific descriptors at the soil series level are amenable to geomatics methods, unsupervised clustering of the occurrence of other soil science words did not closely follow the usual hierarchy of soil taxa. We present preliminary phrasal analyses that may account for some of these effects.
Text-line extraction in handwritten Chinese documents based on an energy minimization framework.
Koo, Hyung Il; Cho, Nam Ik
2012-03-01
Text-line extraction in unconstrained handwritten documents remains a challenging problem due to nonuniform character scale, spatially varying text orientation, and the interference between text lines. In order to address these problems, we propose a new cost function that considers the interactions between text lines and the curvilinearity of each text line. Precisely, we achieve this goal by introducing normalized measures for them, which are based on an estimated line spacing. We also present an optimization method that exploits the properties of our cost function. Experimental results on a database consisting of 853 handwritten Chinese document images have shown that our method achieves a detection rate of 99.52% and an error rate of 0.32%, which outperforms conventional methods.
Mapping annotations with textual evidence using an scLDA model.
Jin, Bo; Chen, Vicky; Chen, Lujia; Lu, Xinghua
2011-01-01
Most of the knowledge regarding genes and proteins is stored in biomedical literature as free text. Extracting information from complex biomedical texts demands techniques capable of inferring biological concepts from local text regions and mapping them to controlled vocabularies. To this end, we present a sentence-based correspondence latent Dirichlet allocation (scLDA) model which, when trained with a corpus of PubMed documents with known GO annotations, performs the following tasks: 1) learning major biological concepts from the corpus, 2) inferring the biological concepts existing within text regions (sentences), and 3) identifying the text regions in a document that provides evidence for the observed annotations. When applied to new gene-related documents, a trained scLDA model is capable of predicting GO annotations and identifying text regions as textual evidence supporting the predicted annotations. This study uses GO annotation data as a testbed; the approach can be generalized to other annotated data, such as MeSH and MEDLINE documents.
The design and implementation of GML data management information system based on PostgreSQL
NASA Astrophysics Data System (ADS)
Zhang, Aiguo; Wu, Qunyong; Xu, Qifeng
2008-10-01
GML expresses geographic information in text, and it provides an extensible and standard way of spatial information encoding. At the present time, the management of GML data is in terms of document. By this way, the inquiry and update of GML data is inefficient, and it demands high memory when the document is comparatively large. In this respect, the paper put forward a data management of GML based on PostgreSQL. It designs four kinds of inquiries, which are inquiry of metadata, inquiry of geometry based on property, inquiry of property based on spatial information, and inquiry of spatial data based on location. At the same time, it designs and implements the visualization of the inquired WKT data.
Offline Arabic handwriting recognition: a survey.
Lorigo, Liana M; Govindaraju, Venu
2006-05-01
The automatic recognition of text on scanned images has enabled many applications such as searching for words in large volumes of documents, automatic sorting of postal mail, and convenient editing of previously printed documents. The domain of handwriting in the Arabic script presents unique technical challenges and has been addressed more recently than other domains. Many different methods have been proposed and applied to various types of images. This paper provides a comprehensive review of these methods. It is the first survey to focus on Arabic handwriting recognition and the first Arabic character recognition survey to provide recognition rates and descriptions of test data for the approaches discussed. It includes background on the field, discussion of the methods, and future research directions.
Shatkay, Hagit; Pan, Fengxia; Rzhetsky, Andrey; Wilbur, W. John
2008-01-01
Motivation: Much current research in biomedical text mining is concerned with serving biologists by extracting certain information from scientific text. We note that there is no ‘average biologist’ client; different users have distinct needs. For instance, as noted in past evaluation efforts (BioCreative, TREC, KDD) database curators are often interested in sentences showing experimental evidence and methods. Conversely, lab scientists searching for known information about a protein may seek facts, typically stated with high confidence. Text-mining systems can target specific end-users and become more effective, if the system can first identify text regions rich in the type of scientific content that is of interest to the user, retrieve documents that have many such regions, and focus on fact extraction from these regions. Here, we study the ability to characterize and classify such text automatically. We have recently introduced a multi-dimensional categorization and annotation scheme, developed to be applicable to a wide variety of biomedical documents and scientific statements, while intended to support specific biomedical retrieval and extraction tasks. Results: The annotation scheme was applied to a large corpus in a controlled effort by eight independent annotators, where three individual annotators independently tagged each sentence. We then trained and tested machine learning classifiers to automatically categorize sentence fragments based on the annotation. We discuss here the issues involved in this task, and present an overview of the results. The latter strongly suggest that automatic annotation along most of the dimensions is highly feasible, and that this new framework for scientific sentence categorization is applicable in practice. Contact: shatkay@cs.queensu.ca PMID:18718948
How to use the WWW to distribute STI
NASA Technical Reports Server (NTRS)
Roper, Donna G.
1994-01-01
This presentation explains how to use the World Wide Web (WWW) to distribute scientific and technical information as hypermedia. WWW clients and servers use the HyperText Transfer Protocol (HTTP) to transfer documents containing links to other text, graphics, video, and sound. The standard language for these documents is the HyperText MarkUp Language (HTML). These are simply text files with formatting codes that contain layout information and hyperlinks. HTML documents can be created with any text editor or with one of the publicly available HTML editors or convertors. HTML can also include links to available image formats. This presentation is available online. The URL is http://sti.larc.nasa. (followed by) gov/demos/workshop/introtext.html.
Jackson MSc, Richard G.; Ball, Michael; Patel, Rashmi; Hayes, Richard D.; Dobson, Richard J.B.; Stewart, Robert
2014-01-01
Observational research using data from electronic health records (EHR) is a rapidly growing area, which promises both increased sample size and data richness - therefore unprecedented study power. However, in many medical domains, large amounts of potentially valuable data are contained within the free text clinical narrative. Manually reviewing free text to obtain desired information is an inefficient use of researcher time and skill. Previous work has demonstrated the feasibility of applying Natural Language Processing (NLP) to extract information. However, in real world research environments, the demand for NLP skills outweighs supply, creating a bottleneck in the secondary exploitation of the EHR. To address this, we present TextHunter, a tool for the creation of training data, construction of concept extraction machine learning models and their application to documents. Using confidence thresholds to ensure high precision (>90%), we achieved recall measurements as high as 99% in real world use cases. PMID:25954379
NASA Astrophysics Data System (ADS)
Kessel, Kerstin A.; Bougatf, Nina; Bohn, Christian; Engelmann, Uwe; Oetzel, Dieter; Bendl, Rolf; Debus, Jürgen; Combs, Stephanie E.
2012-02-01
Conducting clinical studies is rather difficult because of the large variety of voluminous datasets, different documentation styles, and various information systems, especially in radiation oncology. In this paper, we describe our development of a web-based documentation system with first approaches of automatic statistical analyses for transnational and multicenter clinical studies in particle therapy. It is possible to have immediate access to all patient information and exchange, store, process, and visualize text data, all types of DICOM images, especially DICOM RT, and any other multimedia data. Accessing the documentation system and submitting clinical data is possible for internal and external users (e.g. referring physicians from abroad, who are seeking the new technique of particle therapy for their patients). Thereby, security and privacy protection is ensured with the encrypted https protocol, client certificates, and an application gateway. Furthermore, all data can be pseudonymized. Integrated into the existing hospital environment, patient data is imported via various interfaces over HL7-messages and DICOM. Several further features replace manual input wherever possible and ensure data quality and entirety. With a form generator, studies can be individually designed to fit specific needs. By including all treated patients (also non-study patients), we gain the possibility for overall large-scale, retrospective analyses. Having recently begun documentation of our first six clinical studies, it has become apparent that the benefits lie in the simplification of research work, better study analyses quality and ultimately, the improvement of treatment concepts by evaluating the effectiveness of particle therapy.
Redundancy-Aware Topic Modeling for Patient Record Notes
Cohen, Raphael; Aviram, Iddo; Elhadad, Michael; Elhadad, Noémie
2014-01-01
The clinical notes in a given patient record contain much redundancy, in large part due to clinicians’ documentation habit of copying from previous notes in the record and pasting into a new note. Previous work has shown that this redundancy has a negative impact on the quality of text mining and topic modeling in particular. In this paper we describe a novel variant of Latent Dirichlet Allocation (LDA) topic modeling, Red-LDA, which takes into account the inherent redundancy of patient records when modeling content of clinical notes. To assess the value of Red-LDA, we experiment with three baselines and our novel redundancy-aware topic modeling method: given a large collection of patient records, (i) apply vanilla LDA to all documents in all input records; (ii) identify and remove all redundancy by chosing a single representative document for each record as input to LDA; (iii) identify and remove all redundant paragraphs in each record, leaving partial, non-redundant documents as input to LDA; and (iv) apply Red-LDA to all documents in all input records. Both quantitative evaluation carried out through log-likelihood on held-out data and topic coherence of produced topics and qualitative assessement of topics carried out by physicians show that Red-LDA produces superior models to all three baseline strategies. This research contributes to the emerging field of understanding the characteristics of the electronic health record and how to account for them in the framework of data mining. The code for the two redundancy-elimination baselines and Red-LDA is made publicly available to the community. PMID:24551060
Redundancy-aware topic modeling for patient record notes.
Cohen, Raphael; Aviram, Iddo; Elhadad, Michael; Elhadad, Noémie
2014-01-01
The clinical notes in a given patient record contain much redundancy, in large part due to clinicians' documentation habit of copying from previous notes in the record and pasting into a new note. Previous work has shown that this redundancy has a negative impact on the quality of text mining and topic modeling in particular. In this paper we describe a novel variant of Latent Dirichlet Allocation (LDA) topic modeling, Red-LDA, which takes into account the inherent redundancy of patient records when modeling content of clinical notes. To assess the value of Red-LDA, we experiment with three baselines and our novel redundancy-aware topic modeling method: given a large collection of patient records, (i) apply vanilla LDA to all documents in all input records; (ii) identify and remove all redundancy by chosing a single representative document for each record as input to LDA; (iii) identify and remove all redundant paragraphs in each record, leaving partial, non-redundant documents as input to LDA; and (iv) apply Red-LDA to all documents in all input records. Both quantitative evaluation carried out through log-likelihood on held-out data and topic coherence of produced topics and qualitative assessment of topics carried out by physicians show that Red-LDA produces superior models to all three baseline strategies. This research contributes to the emerging field of understanding the characteristics of the electronic health record and how to account for them in the framework of data mining. The code for the two redundancy-elimination baselines and Red-LDA is made publicly available to the community.
NASA Astrophysics Data System (ADS)
David, Peter; Hansen, Nichole; Nolan, James J.; Alcocer, Pedro
2015-05-01
The growth in text data available online is accompanied by a growth in the diversity of available documents. Corpora with extreme heterogeneity in terms of file formats, document organization, page layout, text style, and content are common. The absence of meaningful metadata describing the structure of online and open-source data leads to text extraction results that contain no information about document structure and are cluttered with page headers and footers, web navigation controls, advertisements, and other items that are typically considered noise. We describe an approach to document structure and metadata recovery that uses visual analysis of documents to infer the communicative intent of the author. Our algorithm identifies the components of documents such as titles, headings, and body content, based on their appearance. Because it operates on an image of a document, our technique can be applied to any type of document, including scanned images. Our approach to document structure recovery considers a finer-grained set of component types than prior approaches. In this initial work, we show that a machine learning approach to document structure recovery using a feature set based on the geometry and appearance of images of documents achieves a 60% greater F1- score than a baseline random classifier.
Use of General-purpose Negation Detection to Augment Concept Indexing of Medical Documents
Mutalik, Pradeep G.; Deshpande, Aniruddha; Nadkarni, Prakash M.
2001-01-01
Objectives: To test the hypothesis that most instances of negated concepts in dictated medical documents can be detected by a strategy that relies on tools developed for the parsing of formal (computer) languages—specifically, a lexical scanner (“lexer”) that uses regular expressions to generate a finite state machine, and a parser that relies on a restricted subset of context-free grammars, known as LALR(1) grammars. Methods: A diverse training set of 40 medical documents from a variety of specialties was manually inspected and used to develop a program (Negfinder) that contained rules to recognize a large set of negated patterns occurring in the text. Negfinder's lexer and parser were developed using tools normally used to generate programming language compilers. The input to Negfinder consisted of medical narrative that was preprocessed to recognize UMLS concepts: the text of a recognized concept had been replaced with a coded representation that included its UMLS concept ID. The program generated an index with one entry per instance of a concept in the document, where the presence or absence of negation of that concept was recorded. This information was used to mark up the text of each document by color-coding it to make it easier to inspect. The parser was then evaluated in two ways: 1) a test set of 60 documents (30 discharge summaries, 30 surgical notes) marked-up by Negfinder was inspected visually to quantify false-positive and false-negative results; and 2) a different test set of 10 documents was independently examined for negatives by a human observer and by Negfinder, and the results were compared. Results: In the first evaluation using marked-up documents, 8,358 instances of UMLS concepts were detected in the 60 documents, of which 544 were negations detected by the program and verified by human observation (true-positive results, or TPs). Thirteen instances were wrongly flagged as negated (false-positive results, or FPs), and the program missed 27 instances of negation (false-negative results, or FNs), yielding a sensitivity of 95.3 percent and a specificity of 97.7 percent. In the second evaluation using independent negation detection, 1,869 concepts were detected in 10 documents, with 135 TPs, 12 FPs, and 6 FNs, yielding a sensitivity of 95.7 percent and a specificity of 91.8 percent. One of the words “no,” “denies/denied,” “not,” or “without” was present in 92.5 percent of all negations. Conclusions: Negation of most concepts in medical narrative can be reliably detected by a simple strategy. The reliability of detection depends on several factors, the most important being the accuracy of concept matching. PMID:11687566
DOE Research and Development Accomplishments Help
be used to search, locate, access, and electronically download full-text research and development (R Browse Downloading, Viewing, and/or Searching Full-text Documents/Pages Searching the Database Search Features Search allows you to search the OCRed full-text document and bibliographic information, the
Semantic Theme Analysis of Pilot Incident Reports
NASA Technical Reports Server (NTRS)
Thirumalainambi, Rajkumar
2009-01-01
Pilots report accidents or incidents during take-off, on flight and landing to airline authorities and Federal aviation authority as well. The description of pilot reports for an incident contains technical terms related to Flight instruments and operations. Normal text mining approaches collect keywords from text documents and relate them among documents that are stored in database. Present approach will extract specific theme analysis of incident reports and semantically relate hierarchy of terms assigning weights of themes. Once the theme extraction has been performed for a given document, a unique key can be assigned to that document to cross linking the documents. Semantic linking will be used to categorize the documents based on specific rules that can help an end-user to analyze certain types of accidents. This presentation outlines the architecture of text mining for pilot incident reports for autonomous categorization of pilot incident reports using semantic theme analysis.
Mining protein function from text using term-based support vector machines
Rice, Simon B; Nenadic, Goran; Stapley, Benjamin J
2005-01-01
Background Text mining has spurred huge interest in the domain of biology. The goal of the BioCreAtIvE exercise was to evaluate the performance of current text mining systems. We participated in Task 2, which addressed assigning Gene Ontology terms to human proteins and selecting relevant evidence from full-text documents. We approached it as a modified form of the document classification task. We used a supervised machine-learning approach (based on support vector machines) to assign protein function and select passages that support the assignments. As classification features, we used a protein's co-occurring terms that were automatically extracted from documents. Results The results evaluated by curators were modest, and quite variable for different problems: in many cases we have relatively good assignment of GO terms to proteins, but the selected supporting text was typically non-relevant (precision spanning from 3% to 50%). The method appears to work best when a substantial set of relevant documents is obtained, while it works poorly on single documents and/or short passages. The initial results suggest that our approach can also mine annotations from text even when an explicit statement relating a protein to a GO term is absent. Conclusion A machine learning approach to mining protein function predictions from text can yield good performance only if sufficient training data is available, and significant amount of supporting data is used for prediction. The most promising results are for combined document retrieval and GO term assignment, which calls for the integration of methods developed in BioCreAtIvE Task 1 and Task 2. PMID:15960835
Aad, G; Abbott, B; Abdallah, J; Abdinov, O; Abeloos, B; Aben, R; Abolins, M; AbouZeid, O S; Abraham, N L; Abramowicz, H; Abreu, H; Abreu, R; Abulaiti, Y; Acharya, B S; Adamczyk, L; Adams, D L; Adelman, J; Adomeit, S; Adye, T; Affolder, A A; Agatonovic-Jovin, T; Agricola, J; Aguilar-Saavedra, J A; Ahlen, S P; Ahmadov, F; Aielli, G; Akerstedt, H; Åkesson, T P A; Akimov, A V; Alberghi, G L; Albert, J; Albrand, S; Alconada Verzini, M J; Aleksa, M; Aleksandrov, I N; Alexa, C; Alexander, G; Alexopoulos, T; Alhroob, M; Aliev, M; Alimonti, G; Alison, J; Alkire, S P; Allbrooke, B M M; Allen, B W; Allport, P P; Aloisio, A; Alonso, A; Alonso, F; Alpigiani, C; Alstaty, M; Alvarez Gonzalez, B; Álvarez Piqueras, D; Alviggi, M G; Amadio, B T; Amako, K; Amaral Coutinho, Y; Amelung, C; Amidei, D; Amor Dos Santos, S P; Amorim, A; Amoroso, S; Amundsen, G; Anastopoulos, C; Ancu, L S; Andari, N; Andeen, T; Anders, C F; Anders, G; Anders, J K; Anderson, K J; Andreazza, A; Andrei, V; Angelidakis, S; Angelozzi, I; Anger, P; Angerami, A; Anghinolfi, F; Anisenkov, A V; Anjos, N; Annovi, A; Antonelli, M; Antonov, A; Antos, J; Anulli, F; Aoki, M; Aperio Bella, L; Arabidze, G; Arai, Y; Araque, J P; Arce, A T H; Arduh, F A; Arguin, J-F; Argyropoulos, S; Arik, M; Armbruster, A J; Armitage, L J; Arnaez, O; Arnold, H; Arratia, M; Arslan, O; Artamonov, A; Artoni, G; Artz, S; Asai, S; Asbah, N; Ashkenazi, A; Åsman, B; Asquith, L; Assamagan, K; Astalos, R; Atkinson, M; Atlay, N B; Augsten, K; Avolio, G; Axen, B; Ayoub, M K; Azuelos, G; Baak, M A; Baas, A E; Baca, M J; Bachacou, H; Bachas, K; Backes, M; Backhaus, M; Bagiacchi, P; Bagnaia, P; Bai, Y; Baines, J T; Baker, O K; Baldin, E M; Balek, P; Balestri, T; Balli, F; Balunas, W K; Banas, E; Banerjee, Sw; Bannoura, A A E; Barak, L; Barberio, E L; Barberis, D; Barbero, M; Barillari, T; Barklow, T; Barlow, N; Barnes, S L; Barnett, B M; Barnett, R M; Barnovska, Z; Baroncelli, A; Barone, G; Barr, A J; Barranco Navarro, L; Barreiro, F; Barreiro Guimarães da Costa, J; Bartoldus, R; Barton, A E; Bartos, P; Basalaev, A; Bassalat, A; Bates, R L; Batista, S J; Batley, J R; Battaglia, M; Bauce, M; Bauer, F; Bawa, H S; Beacham, J B; Beattie, M D; Beau, T; Beauchemin, P H; Bechtle, P; Beck, H P; Becker, K; Becker, M; Beckingham, M; Becot, C; Beddall, A J; Beddall, A; Bednyakov, V A; Bedognetti, M; Bee, C P; Beemster, L J; Beermann, T A; Begel, M; Behr, J K; Belanger-Champagne, C; Bell, A S; Bella, G; Bellagamba, L; Bellerive, A; Bellomo, M; Belotskiy, K; Beltramello, O; Belyaev, N L; Benary, O; Benchekroun, D; Bender, M; Bendtz, K; Benekos, N; Benhammou, Y; Benhar Noccioli, E; Benitez, J; Benitez Garcia, J A; Benjamin, D P; Bensinger, J R; Bentvelsen, S; Beresford, L; Beretta, M; Berge, D; Bergeaas Kuutmann, E; Berger, N; Beringer, J; Berlendis, S; Bernard, N R; Bernius, C; Bernlochner, F U; Berry, T; Berta, P; Bertella, C; Bertoli, G; Bertolucci, F; Bertram, I A; Bertsche, C; Bertsche, D; Besjes, G J; Bessidskaia Bylund, O; Bessner, M; Besson, N; Betancourt, C; Bethke, S; Bevan, A J; Bhimji, W; Bianchi, R M; Bianchini, L; Bianco, M; Biebel, O; Biedermann, D; Bielski, R; Biesuz, N V; Biglietti, M; Bilbao De Mendizabal, J; Bilokon, H; Bindi, M; Binet, S; Bingul, A; Bini, C; Biondi, S; Bjergaard, D M; Black, C W; Black, J E; Black, K M; Blackburn, D; Blair, R E; Blanchard, J-B; Blanco, J E; Blazek, T; Bloch, I; Blocker, C; Blum, W; Blumenschein, U; Blunier, S; Bobbink, G J; Bobrovnikov, V S; Bocchetta, S S; Bocci, A; Bock, C; Boehler, M; Boerner, D; Bogaerts, J A; Bogavac, D; Bogdanchikov, A G; Bohm, C; Boisvert, V; Bold, T; Boldea, V; Boldyrev, A S; Bomben, M; Bona, M; Boonekamp, M; Borisov, A; Borissov, G; Bortfeldt, J; Bortoletto, D; Bortolotto, V; Bos, K; Boscherini, D; Bosman, M; Bossio Sola, J D; Boudreau, J; Bouffard, J; Bouhova-Thacker, E V; Boumediene, D; Bourdarios, C; Boutle, S K; Boveia, A; Boyd, J; Boyko, I R; Bracinik, J; Brandt, A; Brandt, G; Brandt, O; Bratzler, U; Brau, B; Brau, J E; Braun, H M; Breaden Madden, W D; Brendlinger, K; Brennan, A J; Brenner, L; Brenner, R; Bressler, S; Bristow, T M; Britton, D; Britzger, D; Brochu, F M; Brock, I; Brock, R; Brooijmans, G; Brooks, T; Brooks, W K; Brosamer, J; Brost, E; Broughton, J H; Bruckman de Renstrom, P A; Bruncko, D; Bruneliere, R; Bruni, A; Bruni, G; Brunt, B H; Bruschi, M; Bruscino, N; Bryant, P; Bryngemark, L; Buanes, T; Buat, Q; Buchholz, P; Buckley, A G; Budagov, I A; Buehrer, F; Bugge, M K; Bulekov, O; Bullock, D; Burckhart, H; Burdin, S; Burgard, C D; Burghgrave, B; Burka, K; Burke, S; Burmeister, I; Busato, E; Büscher, D; Büscher, V; Bussey, P; Butler, J M; Buttar, C M; Butterworth, J M; Butti, P; Buttinger, W; Buzatu, A; Buzykaev, A R; Cabrera Urbán, S; Caforio, D; Cairo, V M; Cakir, O; Calace, N; Calafiura, P; Calandri, A; Calderini, G; Calfayan, P; Caloba, L P; Calvet, D; Calvet, S; Calvet, T P; Camacho Toro, R; Camarda, S; Camarri, P; Cameron, D; Caminal Armadans, R; Camincher, C; Campana, S; Campanelli, M; Camplani, A; Campoverde, A; Canale, V; Canepa, A; Cano Bret, M; Cantero, J; Cantrill, R; Cao, T; Capeans Garrido, M D M; Caprini, I; Caprini, M; Capua, M; Caputo, R; Carbone, R M; Cardarelli, R; Cardillo, F; Carli, I; Carli, T; Carlino, G; Carminati, L; Caron, S; Carquin, E; Carrillo-Montoya, G D; Carter, J R; Carvalho, J; Casadei, D; Casado, M P; Casolino, M; Casper, D W; Castaneda-Miranda, E; Castelli, A; Castillo Gimenez, V; Castro, N F; Catinaccio, A; Catmore, J R; Cattai, A; Caudron, J; Cavaliere, V; Cavallaro, E; Cavalli, D; Cavalli-Sforza, M; Cavasinni, V; Ceradini, F; Cerda Alberich, L; Cerio, B C; Cerqueira, A S; Cerri, A; Cerrito, L; Cerutti, F; Cerv, M; Cervelli, A; Cetin, S A; Chafaq, A; Chakraborty, D; Chan, S K; Chan, Y L; Chang, P; Chapman, J D; Charlton, D G; Chatterjee, A; Chau, C C; Chavez Barajas, C A; Che, S; Cheatham, S; Chegwidden, A; Chekanov, S; Chekulaev, S V; Chelkov, G A; Chelstowska, M A; Chen, C; Chen, H; Chen, K; Chen, S; Chen, S; Chen, X; Chen, Y; Cheng, H C; Cheng, H J; Cheng, Y; Cheplakov, A; Cheremushkina, E; Cherkaoui El Moursli, R; Chernyatin, V; Cheu, E; Chevalier, L; Chiarella, V; Chiarelli, G; Chiodini, G; Chisholm, A S; Chitan, A; Chizhov, M V; Choi, K; Chomont, A R; Chouridou, S; Chow, B K B; Christodoulou, V; Chromek-Burckhart, D; Chudoba, J; Chuinard, A J; Chwastowski, J J; Chytka, L; Ciapetti, G; Ciftci, A K; Cinca, D; Cindro, V; Cioara, I A; Ciocio, A; Cirotto, F; Citron, Z H; Citterio, M; Ciubancan, M; Clark, A; Clark, B L; Clark, M R; Clark, P J; Clarke, R N; Clement, C; Coadou, Y; Cobal, M; Coccaro, A; Cochran, J; Coffey, L; Colasurdo, L; Cole, B; Cole, S; Colijn, A P; Collot, J; Colombo, T; Compostella, G; Conde Muiño, P; Coniavitis, E; Connell, S H; Connelly, I A; Consorti, V; Constantinescu, S; Conta, C; Conti, G; Conventi, F; Cooke, M; Cooper, B D; Cooper-Sarkar, A M; Cormier, K J R; Cornelissen, T; Corradi, M; Corriveau, F; Corso-Radu, A; Cortes-Gonzalez, A; Cortiana, G; Costa, G; Costa, M J; Costanzo, D; Cottin, G; Cowan, G; Cox, B E; Cranmer, K; Crawley, S J; Cree, G; Crépé-Renaudin, S; Crescioli, F; Cribbs, W A; Crispin Ortuzar, M; Cristinziani, M; Croft, V; Crosetti, G; Cuhadar Donszelmann, T; Cummings, J; Curatolo, M; Cúth, J; Cuthbert, C; Czirr, H; Czodrowski, P; D'Auria, S; D'Onofrio, M; Da Cunha Sargedas De Sousa, M J; Da Via, C; Dabrowski, W; Dado, T; Dai, T; Dale, O; Dallaire, F; Dallapiccola, C; Dam, M; Dandoy, J R; Dang, N P; Daniells, A C; Dann, N S; Danninger, M; Dano Hoffmann, M; Dao, V; Darbo, G; Darmora, S; Dassoulas, J; Dattagupta, A; Davey, W; David, C; Davidek, T; Davies, M; Davison, P; Dawe, E; Dawson, I; Daya-Ishmukhametova, R K; De, K; de Asmundis, R; De Benedetti, A; De Castro, S; De Cecco, S; De Groot, N; de Jong, P; De la Torre, H; De Lorenzi, F; De Pedis, D; De Salvo, A; De Sanctis, U; De Santo, A; De Vivie De Regie, J B; Dearnaley, W J; Debbe, R; Debenedetti, C; Dedovich, D V; Deigaard, I; Del Gaudio, M; Del Peso, J; Del Prete, T; Delgove, D; Deliot, F; Delitzsch, C M; Deliyergiyev, M; Dell'Acqua, A; Dell'Asta, L; Dell'Orso, M; Della Pietra, M; Della Volpe, D; Delmastro, M; Delsart, P A; Deluca, C; DeMarco, D A; Demers, S; Demichev, M; Demilly, A; Denisov, S P; Denysiuk, D; Derendarz, D; Derkaoui, J E; Derue, F; Dervan, P; Desch, K; Deterre, C; Dette, K; Deviveiros, P O; Dewhurst, A; Dhaliwal, S; Di Ciaccio, A; Di Ciaccio, L; Di Clemente, W K; Di Donato, C; Di Girolamo, A; Di Girolamo, B; Di Micco, B; Di Nardo, R; Di Simone, A; Di Sipio, R; Di Valentino, D; Diaconu, C; Diamond, M; Dias, F A; Diaz, M A; Diehl, E B; Dietrich, J; Diglio, S; Dimitrievska, A; Dingfelder, J; Dita, P; Dita, S; Dittus, F; Djama, F; Djobava, T; Djuvsland, J I; do Vale, M A B; Dobos, D; Dobre, M; Doglioni, C; Dohmae, T; Dolejsi, J; Dolezal, Z; Dolgoshein, B A; Donadelli, M; Donati, S; Dondero, P; Donini, J; Dopke, J; Doria, A; Dova, M T; Doyle, A T; Drechsler, E; Dris, M; Du, Y; Duarte-Campderros, J; Duchovni, E; Duckeck, G; Ducu, O A; Duda, D; Dudarev, A; Duflot, L; Duguid, L; Dührssen, M; Dumancic, M; Dunford, M; Duran Yildiz, H; Düren, M; Durglishvili, A; Duschinger, D; Dutta, B; Dyndal, M; Eckardt, C; Ecker, K M; Edgar, R C; Edwards, N C; Eifert, T; Eigen, G; Einsweiler, K; Ekelof, T; El Kacimi, M; Ellajosyula, V; Ellert, M; Elles, S; Ellinghaus, F; Elliot, A A; Ellis, N; Elmsheuser, J; Elsing, M; Emeliyanov, D; Enari, Y; Endner, O C; Endo, M; Ennis, J S; Erdmann, J; Ereditato, A; Ernis, G; Ernst, J; Ernst, M; Errede, S; Ertel, E; Escalier, M; Esch, H; Escobar, C; Esposito, B; Etienvre, A I; Etzion, E; Evans, H; Ezhilov, A; Fabbri, F; Fabbri, L; Facini, G; Fakhrutdinov, R M; Falciano, S; Falla, R J; Faltova, J; Fang, Y; Fanti, M; Farbin, A; Farilla, A; Farina, C; Farooque, T; Farrell, S; Farrington, S M; Farthouat, P; Fassi, F; Fassnacht, P; Fassouliotis, D; Faucci Giannelli, M; Favareto, A; Fawcett, W J; Fayard, L; Fedin, O L; Fedorko, W; Feigl, S; Feligioni, L; Feng, C; Feng, E J; Feng, H; Fenyuk, A B; Feremenga, L; Fernandez Martinez, P; Fernandez Perez, S; Ferrando, J; Ferrari, A; Ferrari, P; Ferrari, R; Ferreira de Lima, D E; Ferrer, A; Ferrere, D; Ferretti, C; Ferretto Parodi, A; Fiedler, F; Filipčič, A; Filipuzzi, M; Filthaut, F; Fincke-Keeler, M; Finelli, K D; Fiolhais, M C N; Fiorini, L; Firan, A; Fischer, A; Fischer, C; Fischer, J; Fisher, W C; Flaschel, N; Fleck, I; Fleischmann, P; Fletcher, G T; Fletcher, R R M; Flick, T; Floderus, A; Flores Castillo, L R; Flowerdew, M J; Forcolin, G T; Formica, A; Forti, A; Foster, A G; Fournier, D; Fox, H; Fracchia, S; Francavilla, P; Franchini, M; Francis, D; Franconi, L; Franklin, M; Frate, M; Fraternali, M; Freeborn, D; Fressard-Batraneanu, S M; Friedrich, F; Froidevaux, D; Frost, J A; Fukunaga, C; Fullana Torregrosa, E; Fusayasu, T; Fuster, J; Gabaldon, C; Gabizon, O; Gabrielli, A; Gabrielli, A; Gach, G P; Gadatsch, S; Gadomski, S; Gagliardi, G; Gagnon, L G; Gagnon, P; Galea, C; Galhardo, B; Gallas, E J; Gallop, B J; Gallus, P; Galster, G; Gan, K K; Gao, J; Gao, Y; Gao, Y S; Garay Walls, F M; García, C; García Navarro, J E; Garcia-Sciveres, M; Gardner, R W; Garelli, N; Garonne, V; Gascon Bravo, A; Gatti, C; Gaudiello, A; Gaudio, G; Gaur, B; Gauthier, L; Gavrilenko, I L; Gay, C; Gaycken, G; Gazis, E N; Gecse, Z; Gee, C N P; Geich-Gimbel, Ch; Geisler, M P; Gemme, C; Genest, M H; Geng, C; Gentile, S; George, S; Gerbaudo, D; Gershon, A; Ghasemi, S; Ghazlane, H; Ghneimat, M; Giacobbe, B; Giagu, S; Giannetti, P; Gibbard, B; Gibson, S M; Gignac, M; Gilchriese, M; Gillam, T P S; Gillberg, D; Gilles, G; Gingrich, D M; Giokaris, N; Giordani, M P; Giorgi, F M; Giorgi, F M; Giraud, P F; Giromini, P; Giugni, D; Giuli, F; Giuliani, C; Giulini, M; Gjelsten, B K; Gkaitatzis, S; Gkialas, I; Gkougkousis, E L; Gladilin, L K; Glasman, C; Glatzer, J; Glaysher, P C F; Glazov, A; Goblirsch-Kolb, M; Godlewski, J; Goldfarb, S; Golling, T; Golubkov, D; Gomes, A; Gonçalo, R; Goncalves Pinto Firmino Da Costa, J; Gonella, L; Gongadze, A; González de la Hoz, S; Gonzalez Parra, G; Gonzalez-Sevilla, S; Goossens, L; Gorbounov, P A; Gordon, H A; Gorelov, I; Gorini, B; Gorini, E; Gorišek, A; Gornicki, E; Goshaw, A T; Gössling, C; Gostkin, M I; Goudet, C R; Goujdami, D; Goussiou, A G; Govender, N; Gozani, E; Graber, L; Grabowska-Bold, I; Gradin, P O J; Grafström, P; Gramling, J; Gramstad, E; Grancagnolo, S; Gratchev, V; Gray, H M; Graziani, E; Greenwood, Z D; Grefe, C; Gregersen, K; Gregor, I M; Grenier, P; Grevtsov, K; Griffiths, J; Grillo, A A; Grimm, K; Grinstein, S; Gris, Ph; Grivaz, J-F; Groh, S; Grohs, J P; Gross, E; Grosse-Knetter, J; Grossi, G C; Grout, Z J; Guan, L; Guan, W; Guenther, J; Guescini, F; Guest, D; Gueta, O; Guido, E; Guillemin, T; Guindon, S; Gul, U; Gumpert, C; Guo, J; Guo, Y; Gupta, S; Gustavino, G; Gutierrez, P; Gutierrez Ortiz, N G; Gutschow, C; Guyot, C; Gwenlan, C; Gwilliam, C B; Haas, A; Haber, C; Hadavand, H K; Haddad, N; Hadef, A; Haefner, P; Hageböck, S; Hajduk, Z; Hakobyan, H; Haleem, M; Haley, J; Halladjian, G; Hallewell, G D; Hamacher, K; Hamal, P; Hamano, K; Hamilton, A; Hamity, G N; Hamnett, P G; Han, L; Hanagaki, K; Hanawa, K; Hance, M; Haney, B; Hanke, P; Hanna, R; Hansen, J B; Hansen, J D; Hansen, M C; Hansen, P H; Hara, K; Hard, A S; Harenberg, T; Hariri, F; Harkusha, S; Harrington, R D; Harrison, P F; Hartjes, F; Hasegawa, M; Hasegawa, Y; Hasib, A; Hassani, S; Haug, S; Hauser, R; Hauswald, L; Havranek, M; Hawkes, C M; Hawkings, R J; Hawkins, A D; Hayden, D; Hays, C P; Hays, J M; Hayward, H S; Haywood, S J; Head, S J; Heck, T; Hedberg, V; Heelan, L; Heim, S; Heim, T; Heinemann, B; Heinrich, J J; Heinrich, L; Heinz, C; Hejbal, J; Helary, L; Hellman, S; Helsens, C; Henderson, J; Henderson, R C W; Heng, Y; Henkelmann, S; Henriques Correia, A M; Henrot-Versille, S; Herbert, G H; Herde, H; Hernández Jiménez, Y; Herten, G; Hertenberger, R; Hervas, L; Hesketh, G G; Hessey, N P; Hetherly, J W; Hickling, R; Higón-Rodriguez, E; Hill, E; Hill, J C; Hiller, K H; Hillier, S J; Hinchliffe, I; Hines, E; Hinman, R R; Hirose, M; Hirschbuehl, D; Hobbs, J; Hod, N; Hodgkinson, M C; Hodgson, P; Hoecker, A; Hoeferkamp, M R; Hoenig, F; Hohlfeld, M; Hohn, D; Holmes, T R; Homann, M; Hong, T M; Hooberman, B H; Hopkins, W H; Horii, Y; Horton, A J; Hostachy, J-Y; Hou, S; Hoummada, A; Howarth, J; Hrabovsky, M; Hristova, I; Hrivnac, J; Hryn'ova, T; Hrynevich, A; Hsu, C; Hsu, P J; Hsu, S-C; Hu, D; Hu, Q; Huang, Y; Hubacek, Z; Hubaut, F; Huegging, F; Huffman, T B; Hughes, E W; Hughes, G; Huhtinen, M; Hülsing, T A; Huo, P; Huseynov, N; Huston, J; Huth, J; Iacobucci, G; Iakovidis, G; Ibragimov, I; Iconomidou-Fayard, L; Ideal, E; Idrissi, Z; Iengo, P; Igonkina, O; Iizawa, T; Ikegami, Y; Ikeno, M; Ilchenko, Y; Iliadis, D; Ilic, N; Ince, T; Introzzi, G; Ioannou, P; Iodice, M; Iordanidou, K; Ippolito, V; Ishino, M; Ishitsuka, M; Ishmukhametov, R; Issever, C; Istin, S; Ito, F; Iturbe Ponce, J M; Iuppa, R; Iwanski, W; Iwasaki, H; Izen, J M; Izzo, V; Jabbar, S; Jackson, B; Jackson, M; Jackson, P; Jain, V; Jakobi, K B; Jakobs, K; Jakobsen, S; Jakoubek, T; Jamin, D O; Jana, D K; Jansen, E; Jansky, R; Janssen, J; Janus, M; Jarlskog, G; Javadov, N; Javůrek, T; Jeanneau, F; Jeanty, L; Jejelava, J; Jeng, G-Y; Jennens, D; Jenni, P; Jentzsch, J; Jeske, C; Jézéquel, S; Ji, H; Jia, J; Jiang, H; Jiang, Y; Jiggins, S; Jimenez Pena, J; Jin, S; Jinaru, A; Jinnouchi, O; Johansson, P; Johns, K A; Johnson, W J; Jon-And, K; Jones, G; Jones, R W L; Jones, S; Jones, T J; Jongmanns, J; Jorge, P M; Jovicevic, J; Ju, X; Juste Rozas, A; Köhler, M K; Kaczmarska, A; Kado, M; Kagan, H; Kagan, M; Kahn, S J; Kajomovitz, E; Kalderon, C W; Kaluza, A; Kama, S; Kamenshchikov, A; Kanaya, N; Kaneti, S; Kanjir, L; Kantserov, V A; Kanzaki, J; Kaplan, B; Kaplan, L S; Kapliy, A; Kar, D; Karakostas, K; Karamaoun, A; Karastathis, N; Kareem, M J; Karentzos, E; Karnevskiy, M; Karpov, S N; Karpova, Z M; Karthik, K; Kartvelishvili, V; Karyukhin, A N; Kasahara, K; Kashif, L; Kass, R D; Kastanas, A; Kataoka, Y; Kato, C; Katre, A; Katzy, J; Kawagoe, K; Kawamoto, T; Kawamura, G; Kazama, S; Kazanin, V F; Keeler, R; Kehoe, R; Keller, J S; Kempster, J J; Kentaro, K; Keoshkerian, H; Kepka, O; Kerševan, B P; Kersten, S; Keyes, R A; Khalil-Zada, F; Khanov, A; Kharlamov, A G; Khoo, T J; Khovanskiy, V; Khramov, E; Khubua, J; Kido, S; Kim, H Y; Kim, S H; Kim, Y K; Kimura, N; Kind, O M; King, B T; King, M; King, S B; Kirk, J; Kiryunin, A E; Kishimoto, T; Kisielewska, D; Kiss, F; Kiuchi, K; Kivernyk, O; Kladiva, E; Klein, M H; Klein, M; Klein, U; Kleinknecht, K; Klimek, P; Klimentov, A; Klingenberg, R; Klinger, J A; Klioutchnikova, T; Kluge, E-E; Kluit, P; Kluth, S; Knapik, J; Kneringer, E; Knoops, E B F G; Knue, A; Kobayashi, A; Kobayashi, D; Kobayashi, T; Kobel, M; Kocian, M; Kodys, P; Koehler, N M; Koffas, T; Koffeman, E; Koi, T; Kolanoski, H; Kolb, M; Koletsou, I; Komar, A A; Komori, Y; Kondo, T; Kondrashova, N; Köneke, K; König, A C; Kono, T; Konoplich, R; Konstantinidis, N; Kopeliansky, R; Koperny, S; Köpke, L; Kopp, A K; Korcyl, K; Kordas, K; Korn, A; Korol, A A; Korolkov, I; Korolkova, E V; Kortner, O; Kortner, S; Kosek, T; Kostyukhin, V V; Kotwal, A; Kourkoumeli-Charalampidi, A; Kourkoumelis, C; Kouskoura, V; Kowalewska, A B; Kowalewski, R; Kowalski, T Z; Kozanecki, W; Kozhin, A S; Kramarenko, V A; Kramberger, G; Krasnopevtsev, D; Krasny, M W; Krasznahorkay, A; Kraus, J K; Kravchenko, A; Kretz, M; Kretzschmar, J; Kreutzfeldt, K; Krieger, P; Krizka, K; Kroeninger, K; Kroha, H; Kroll, J; Kroseberg, J; Krstic, J; Kruchonak, U; Krüger, H; Krumnack, N; Kruse, A; Kruse, M C; Kruskal, M; Kubota, T; Kucuk, H; Kuday, S; Kuechler, J T; Kuehn, S; Kugel, A; Kuger, F; Kuhl, A; Kuhl, T; Kukhtin, V; Kukla, R; Kulchitsky, Y; Kuleshov, S; Kuna, M; Kunigo, T; Kupco, A; Kurashige, H; Kurochkin, Y A; Kus, V; Kuwertz, E S; Kuze, M; Kvita, J; Kwan, T; Kyriazopoulos, D; La Rosa, A; La Rosa Navarro, J L; La Rotonda, L; Lacasta, C; Lacava, F; Lacey, J; Lacker, H; Lacour, D; Lacuesta, V R; Ladygin, E; Lafaye, R; Laforge, B; Lagouri, T; Lai, S; Lammers, S; Lampl, W; Lançon, E; Landgraf, U; Landon, M P J; Lang, V S; Lange, J C; Lankford, A J; Lanni, F; Lantzsch, K; Lanza, A; Laplace, S; Lapoire, C; Laporte, J F; Lari, T; Lasagni Manghi, F; Lassnig, M; Laurelli, P; Lavrijsen, W; Law, A T; Laycock, P; Lazovich, T; Lazzaroni, M; Le, B; Le Dortz, O; Le Guirriec, E; Le Menedeu, E; Le Quilleuc, E P; LeBlanc, M; LeCompte, T; Ledroit-Guillon, F; Lee, C A; Lee, S C; Lee, L; Lefebvre, G; Lefebvre, M; Legger, F; Leggett, C; Lehan, A; Lehmann Miotto, G; Lei, X; Leight, W A; Leisos, A; Leister, A G; Leite, M A L; Leitner, R; Lellouch, D; Lemmer, B; Leney, K J C; Lenz, T; Lenzi, B; Leone, R; Leone, S; Leonidopoulos, C; Leontsinis, S; Lerner, G; Leroy, C; Lesage, A A J; Lester, C G; Levchenko, M; Levêque, J; Levin, D; Levinson, L J; Levy, M; Leyko, A M; Leyton, M; Li, B; Li, H; Li, H L; Li, L; Li, L; Li, Q; Li, S; Li, X; Li, Y; Liang, Z; Liberti, B; Liblong, A; Lichard, P; Lie, K; Liebal, J; Liebig, W; Limosani, A; Lin, S C; Lin, T H; Lindquist, B E; Lipeles, E; Lipniacka, A; Lisovyi, M; Liss, T M; Lissauer, D; Lister, A; Litke, A M; Liu, B; Liu, D; Liu, H; Liu, H; Liu, J; Liu, J B; Liu, K; Liu, L; Liu, M; Liu, M; Liu, Y L; Liu, Y; Livan, M; Lleres, A; Llorente Merino, J; Lloyd, S L; Lo Sterzo, F; Lobodzinska, E; Loch, P; Lockman, W S; Loebinger, F K; Loevschall-Jensen, A E; Loew, K M; Loginov, A; Lohse, T; Lohwasser, K; Lokajicek, M; Long, B A; Long, J D; Long, R E; Longo, L; Looper, K A; Lopes, L; Lopez Mateos, D; Lopez Paredes, B; Lopez Paz, I; Lopez Solis, A; Lorenz, J; Lorenzo Martinez, N; Losada, M; Lösel, P J; Lou, X; Lounis, A; Love, J; Love, P A; Lu, H; Lu, N; Lubatti, H J; Luci, C; Lucotte, A; Luedtke, C; Luehring, F; Lukas, W; Luminari, L; Lundberg, O; Lund-Jensen, B; Lynn, D; Lysak, R; Lytken, E; Lyubushkin, V; Ma, H; Ma, L L; Ma, Y; Maccarrone, G; Macchiolo, A; Macdonald, C M; Maček, B; Machado Miguens, J; Madaffari, D; Madar, R; Maddocks, H J; Mader, W F; Madsen, A; Maeda, J; Maeland, S; Maeno, T; Maevskiy, A; Magradze, E; Mahlstedt, J; Maiani, C; Maidantchik, C; Maier, A A; Maier, T; Maio, A; Majewski, S; Makida, Y; Makovec, N; Malaescu, B; Malecki, Pa; Maleev, V P; Malek, F; Mallik, U; Malon, D; Malone, C; Maltezos, S; Malyukov, S; Mamuzic, J; Mancini, G; Mandelli, B; Mandelli, L; Mandić, I; Maneira, J; Manhaes de Andrade Filho, L; Manjarres Ramos, J; Mann, A; Mansoulie, B; Mantifel, R; Mantoani, M; Manzoni, S; Mapelli, L; Marceca, G; March, L; Marchiori, G; Marcisovsky, M; Marjanovic, M; Marley, D E; Marroquim, F; Marsden, S P; Marshall, Z; Marti-Garcia, S; Martin, B; Martin, T A; Martin, V J; Martin Dit Latour, B; Martinez, M; Martin-Haugh, S; Martoiu, V S; Martyniuk, A C; Marx, M; Marzin, A; Masetti, L; Mashimo, T; Mashinistov, R; Masik, J; Maslennikov, A L; Massa, I; Massa, L; Mastrandrea, P; Mastroberardino, A; Masubuchi, T; Mättig, P; Mattmann, J; Maurer, J; Maxfield, S J; Maximov, D A; Mazini, R; Mazza, S M; Mc Fadden, N C; Mc Goldrick, G; Mc Kee, S P; McCarn, A; McCarthy, R L; McCarthy, T G; McClymont, L I; McDonald, E F; McFarlane, K W; Mcfayden, J A; Mchedlidze, G; McMahon, S J; McPherson, R A; Medinnis, M; Meehan, S; Mehlhase, S; Mehta, A; Meier, K; Meineck, C; Meirose, B; Mellado Garcia, B R; Melo, M; Meloni, F; Mengarelli, A; Menke, S; Meoni, E; Mergelmeyer, S; Mermod, P; Merola, L; Meroni, C; Merritt, F S; Messina, A; Metcalfe, J; Mete, A S; Meyer, C; Meyer, C; Meyer, J-P; Meyer, J; Meyer Zu Theenhausen, H; Middleton, R P; Miglioranzi, S; Mijović, L; Mikenberg, G; Mikestikova, M; Mikuž, M; Milesi, M; Milic, A; Miller, D W; Mills, C; Milov, A; Milstead, D A; Minaenko, A A; Minami, Y; Minashvili, I A; Mincer, A I; Mindur, B; Mineev, M; Ming, Y; Mir, L M; Mistry, K P; Mitani, T; Mitrevski, J; Mitsou, V A; Miucci, A; Miyagawa, P S; Mjörnmark, J U; Moa, T; Mochizuki, K; Mohapatra, S; Mohr, W; Molander, S; Moles-Valls, R; Monden, R; Mondragon, M C; Mönig, K; Monk, J; Monnier, E; Montalbano, A; Montejo Berlingen, J; Monticelli, F; Monzani, S; Moore, R W; Morange, N; Moreno, D; Moreno Llácer, M; Morettini, P; Mori, D; Mori, T; Morii, M; Morinaga, M; Morisbak, V; Moritz, S; Morley, A K; Mornacchi, G; Morris, J D; Mortensen, S S; Morvaj, L; Mosidze, M; Moss, J; Motohashi, K; Mount, R; Mountricha, E; Mouraviev, S V; Moyse, E J W; Muanza, S; Mudd, R D; Mueller, F; Mueller, J; Mueller, R S P; Mueller, T; Muenstermann, D; Mullen, P; Mullier, G A; Munoz Sanchez, F J; Murillo Quijada, J A; Murray, W J; Musheghyan, H; Muškinja, M; Myagkov, A G; Myska, M; Nachman, B P; Nackenhorst, O; Nadal, J; Nagai, K; Nagai, R; Nagano, K; Nagasaka, Y; Nagata, K; Nagel, M; Nagy, E; Nairz, A M; Nakahama, Y; Nakamura, K; Nakamura, T; Nakano, I; Namasivayam, H; Naranjo Garcia, R F; Narayan, R; Narrias Villar, D I; Naryshkin, I; Naumann, T; Navarro, G; Nayyar, R; Neal, H A; Nechaeva, P Yu; Neep, T J; Nef, P D; Negri, A; Negrini, M; Nektarijevic, S; Nellist, C; Nelson, A; Nemecek, S; Nemethy, P; Nepomuceno, A A; Nessi, M; Neubauer, M S; Neumann, M; Neves, R M; Nevski, P; Newman, P R; Nguyen, D H; Nguyen Manh, T; Nickerson, R B; Nicolaidou, R; Nielsen, J; Nikiforov, A; Nikolaenko, V; Nikolic-Audit, I; Nikolopoulos, K; Nilsen, J K; Nilsson, P; Ninomiya, Y; Nisati, A; Nisius, R; Nobe, T; Nodulman, L; Nomachi, M; Nomidis, I; Nooney, T; Norberg, S; Nordberg, M; Norjoharuddeen, N; Novgorodova, O; Nowak, S; Nozaki, M; Nozka, L; Ntekas, K; Nurse, E; Nuti, F; O'grady, F; O'Neil, D C; O'Rourke, A A; O'Shea, V; Oakham, F G; Oberlack, H; Obermann, T; Ocariz, J; Ochi, A; Ochoa, I; Ochoa-Ricoux, J P; Oda, S; Odaka, S; Ogren, H; Oh, A; Oh, S H; Ohm, C C; Ohman, H; Oide, H; Okawa, H; Okumura, Y; Okuyama, T; Olariu, A; Oleiro Seabra, L F; Olivares Pino, S A; Oliveira Damazio, D; Olszewski, A; Olszowska, J; Onofre, A; Onogi, K; Onyisi, P U E; Oreglia, M J; Oren, Y; Orestano, D; Orlando, N; Orr, R S; Osculati, B; Ospanov, R; Otero Y Garzon, G; Otono, H; Ouchrif, M; Ould-Saada, F; Ouraou, A; Oussoren, K P; Ouyang, Q; Owen, M; Owen, R E; Ozcan, V E; Ozturk, N; Pachal, K; Pacheco Pages, A; Padilla Aranda, C; Pagáčová, M; Pagan Griso, S; Paige, F; Pais, P; Pajchel, K; Palacino, G; Palestini, S; Palka, M; Pallin, D; Palma, A; Panagiotopoulou, E St; Pandini, C E; Panduro Vazquez, J G; Pani, P; Panitkin, S; Pantea, D; Paolozzi, L; Papadopoulou, Th D; Papageorgiou, K; Paramonov, A; Paredes Hernandez, D; Parker, A J; Parker, M A; Parker, K A; Parodi, F; Parsons, J A; Parzefall, U; Pascuzzi, V R; Pasqualucci, E; Passaggio, S; Pastore, F; Pastore, Fr; Pásztor, G; Pataraia, S; Pater, J R; Pauly, T; Pearce, J; Pearson, B; Pedersen, L E; Pedersen, M; Pedraza Lopez, S; Pedro, R; Peleganchuk, S V; Pelikan, D; Penc, O; Peng, C; Peng, H; Penwell, J; Peralva, B S; Perego, M M; Perepelitsa, D V; Perez Codina, E; Perini, L; Pernegger, H; Perrella, S; Peschke, R; Peshekhonov, V D; Peters, K; Peters, R F Y; Petersen, B A; Petersen, T C; Petit, E; Petridis, A; Petridou, C; Petroff, P; Petrolo, E; Petrov, M; Petrucci, F; Pettersson, N E; Peyaud, A; Pezoa, R; Phillips, P W; Piacquadio, G; Pianori, E; Picazio, A; Piccaro, E; Piccinini, M; Pickering, M A; Piegaia, R; Pilcher, J E; Pilkington, A D; Pin, A W J; Pinamonti, M; Pinfold, J L; Pingel, A; Pires, S; Pirumov, H; Pitt, M; Plazak, L; Pleier, M-A; Pleskot, V; Plotnikova, E; Plucinski, P; Pluth, D; Poettgen, R; Poggioli, L; Pohl, D; Polesello, G; Poley, A; Policicchio, A; Polifka, R; Polini, A; Pollard, C S; Polychronakos, V; Pommès, K; Pontecorvo, L; Pope, B G; Popeneciu, G A; Popovic, D S; Poppleton, A; Pospisil, S; Potamianos, K; Potrap, I N; Potter, C J; Potter, C T; Poulard, G; Poveda, J; Pozdnyakov, V; Pozo Astigarraga, M E; Pralavorio, P; Pranko, A; Prell, S; Price, D; Price, L E; Primavera, M; Prince, S; Proissl, M; Prokofiev, K; Prokoshin, F; Protopopescu, S; Proudfoot, J; Przybycien, M; Puddu, D; Puldon, D; Purohit, M; Puzo, P; Qian, J; Qin, G; Qin, Y; Quadt, A; Quayle, W B; Queitsch-Maitland, M; Quilty, D; Raddum, S; Radeka, V; Radescu, V; Radhakrishnan, S K; Radloff, P; Rados, P; Ragusa, F; Rahal, G; Raine, J A; Rajagopalan, S; Rammensee, M; Rangel-Smith, C; Ratti, M G; Rauscher, F; Rave, S; Ravenscroft, T; Raymond, M; Read, A L; Readioff, N P; Rebuzzi, D M; Redelbach, A; Redlinger, G; Reece, R; Reeves, K; Rehnisch, L; Reichert, J; Reisin, H; Rembser, C; Ren, H; Rescigno, M; Resconi, S; Rezanova, O L; Reznicek, P; Rezvani, R; Richter, R; Richter, S; Richter-Was, E; Ricken, O; Ridel, M; Rieck, P; Riegel, C J; Rieger, J; Rifki, O; Rijssenbeek, M; Rimoldi, A; Rinaldi, L; Ristić, B; Ritsch, E; Riu, I; Rizatdinova, F; Rizvi, E; Rizzi, C; Robertson, S H; Robichaud-Veronneau, A; Robinson, D; Robinson, J E M; Robson, A; Roda, C; Rodina, Y; Rodriguez Perez, A; Rodriguez Rodriguez, D; Roe, S; Rogan, C S; Røhne, O; Romaniouk, A; Romano, M; Romano Saez, S M; Romero Adam, E; Rompotis, N; Ronzani, M; Roos, L; Ros, E; Rosati, S; Rosbach, K; Rose, P; Rosenthal, O; Rossetti, V; Rossi, E; Rossi, L P; Rosten, J H N; Rosten, R; Rotaru, M; Roth, I; Rothberg, J; Rousseau, D; Royon, C R; Rozanov, A; Rozen, Y; Ruan, X; Rubbo, F; Rud, V I; Rudolph, M S; Rühr, F; Ruiz-Martinez, A; Rurikova, Z; Rusakovich, N A; Ruschke, A; Russell, H L; Rutherfoord, J P; Ruthmann, N; Ryabov, Y F; Rybar, M; Rybkin, G; Ryu, S; Ryzhov, A; Rzehorz, G F; Saavedra, A F; Sabato, G; Sacerdoti, S; Sadrozinski, H F-W; Sadykov, R; Safai Tehrani, F; Saha, P; Sahinsoy, M; Saimpert, M; Saito, T; Sakamoto, H; Sakurai, Y; Salamanna, G; Salamon, A; Salazar Loyola, J E; Salek, D; Sales De Bruin, P H; Salihagic, D; Salnikov, A; Salt, J; Salvatore, D; Salvatore, F; Salvucci, A; Salzburger, A; Sammel, D; Sampsonidis, D; Sanchez, A; Sánchez, J; Sanchez Martinez, V; Sandaker, H; Sandbach, R L; Sander, H G; Sandhoff, M; Sandoval, C; Sandstroem, R; Sankey, D P C; Sannino, M; Sansoni, A; Santoni, C; Santonico, R; Santos, H; Santoyo Castillo, I; Sapp, K; Sapronov, A; Saraiva, J G; Sarrazin, B; Sasaki, O; Sasaki, Y; Sato, K; Sauvage, G; Sauvan, E; Savage, G; Savard, P; Sawyer, C; Sawyer, L; Saxon, J; Sbarra, C; Sbrizzi, A; Scanlon, T; Scannicchio, D A; Scarcella, M; Scarfone, V; Schaarschmidt, J; Schacht, P; Schaefer, D; Schaefer, R; Schaeffer, J; Schaepe, S; Schaetzel, S; Schäfer, U; Schaffer, A C; Schaile, D; Schamberger, R D; Scharf, V; Schegelsky, V A; Scheirich, D; Schernau, M; Schiavi, C; Schillo, C; Schioppa, M; Schlenker, S; Schmieden, K; Schmitt, C; Schmitt, S; Schmitz, S; Schneider, B; Schnoor, U; Schoeffel, L; Schoening, A; Schoenrock, B D; Schopf, E; Schorlemmer, A L S; Schott, M; Schovancova, J; Schramm, S; Schreyer, M; Schuh, N; Schultens, M J; Schultz-Coulon, H-C; Schulz, H; Schumacher, M; Schumm, B A; Schune, Ph; Schwanenberger, C; Schwartzman, A; Schwarz, T A; Schwegler, Ph; Schweiger, H; Schwemling, Ph; Schwienhorst, R; Schwindling, J; Schwindt, T; Sciolla, G; Scuri, F; Scutti, F; Searcy, J; Seema, P; Seidel, S C; Seiden, A; Seifert, F; Seixas, J M; Sekhniaidze, G; Sekhon, K; Sekula, S J; Seliverstov, D M; Semprini-Cesari, N; Serfon, C; Serin, L; Serkin, L; Sessa, M; Seuster, R; Severini, H; Sfiligoj, T; Sforza, F; Sfyrla, A; Shabalina, E; Shaikh, N W; Shan, L Y; Shang, R; Shank, J T; Shapiro, M; Shatalov, P B; Shaw, K; Shaw, S M; Shcherbakova, A; Shehu, C Y; Sherwood, P; Shi, L; Shimizu, S; Shimmin, C O; Shimojima, M; Shiyakova, M; Shmeleva, A; Shoaleh Saadi, D; Shochet, M J; Shojaii, S; Shrestha, S; Shulga, E; Shupe, M A; Sicho, P; Sidebo, P E; Sidiropoulou, O; Sidorov, D; Sidoti, A; Siegert, F; Sijacki, Dj; Silva, J; Silverstein, S B; Simak, V; Simard, O; Simic, Lj; Simion, S; Simioni, E; Simmons, B; Simon, D; Simon, M; Sinervo, P; Sinev, N B; Sioli, M; Siragusa, G; Sivoklokov, S Yu; Sjölin, J; Sjursen, T B; Skinner, M B; Skottowe, H P; Skubic, P; Slater, M; Slavicek, T; Slawinska, M; Sliwa, K; Slovak, R; Smakhtin, V; Smart, B H; Smestad, L; Smirnov, S Yu; Smirnov, Y; Smirnova, L N; Smirnova, O; Smith, M N K; Smith, R W; Smizanska, M; Smolek, K; Snesarev, A A; Snyder, S; Sobie, R; Socher, F; Soffer, A; Soh, D A; Sokhrannyi, G; Solans Sanchez, C A; Solar, M; Soldatov, E Yu; Soldevila, U; Solodkov, A A; Soloshenko, A; Solovyanov, O V; Solovyev, V; Sommer, P; Son, H; Song, H Y; Sood, A; Sopczak, A; Sopko, V; Sorin, V; Sosa, D; Sotiropoulou, C L; Soualah, R; Soukharev, A M; South, D; Sowden, B C; Spagnolo, S; Spalla, M; Spangenberg, M; Spanò, F; Sperlich, D; Spettel, F; Spighi, R; Spigo, G; Spiller, L A; Spousta, M; St Denis, R D; Stabile, A; Stamen, R; Stamm, S; Stanecka, E; Stanek, R W; Stanescu, C; Stanescu-Bellu, M; Stanitzki, M M; Stapnes, S; Starchenko, E A; Stark, G H; Stark, J; Staroba, P; Starovoitov, P; Stärz, S; Staszewski, R; Steinberg, P; Stelzer, B; Stelzer, H J; Stelzer-Chilton, O; Stenzel, H; Stewart, G A; Stillings, J A; Stockton, M C; Stoebe, M; Stoicea, G; Stolte, P; Stonjek, S; Stradling, A R; Straessner, A; Stramaglia, M E; Strandberg, J; Strandberg, S; Strandlie, A; Strauss, M; Strizenec, P; Ströhmer, R; Strom, D M; Stroynowski, R; Strubig, A; Stucci, S A; Stugu, B; Styles, N A; Su, D; Su, J; Subramaniam, R; Suchek, S; Sugaya, Y; Suk, M; Sulin, V V; Sultansoy, S; Sumida, T; Sun, S; Sun, X; Sundermann, J E; Suruliz, K; Susinno, G; Sutton, M R; Suzuki, S; Svatos, M; Swiatlowski, M; Sykora, I; Sykora, T; Ta, D; Taccini, C; Tackmann, K; Taenzer, J; Taffard, A; Tafirout, R; Taiblum, N; Takai, H; Takashima, R; Takeshita, T; Takubo, Y; Talby, M; Talyshev, A A; Tam, J Y C; Tan, K G; Tanaka, J; Tanaka, R; Tanaka, S; Tannenwald, B B; Tapia Araya, S; Tapprogge, S; Tarem, S; Tartarelli, G F; Tas, P; Tasevsky, M; Tashiro, T; Tassi, E; Tavares Delgado, A; Tayalati, Y; Taylor, A C; Taylor, G N; Taylor, P T E; Taylor, W; Teischinger, F A; Teixeira-Dias, P; Temming, K K; Temple, D; Ten Kate, H; Teng, P K; Teoh, J J; Tepel, F; Terada, S; Terashi, K; Terron, J; Terzo, S; Testa, M; Teuscher, R J; Theveneaux-Pelzer, T; Thomas, J P; Thomas-Wilsker, J; Thompson, E N; Thompson, P D; Thompson, A S; Thomsen, L A; Thomson, E; Thomson, M; Tibbetts, M J; Ticse Torres, R E; Tikhomirov, V O; Tikhonov, Yu A; Timoshenko, S; Tipton, P; Tisserant, S; Todome, K; Todorov, T; Todorova-Nova, S; Tojo, J; Tokár, S; Tokushuku, K; Tolley, E; Tomlinson, L; Tomoto, M; Tompkins, L; Toms, K; Tong, B; Torrence, E; Torres, H; Torró Pastor, E; Toth, J; Touchard, F; Tovey, D R; Trefzger, T; Tricoli, A; Trigger, I M; Trincaz-Duvoid, S; Tripiana, M F; Trischuk, W; Trocmé, B; Trofymov, A; Troncon, C; Trottier-McDonald, M; Trovatelli, M; Truong, L; Trzebinski, M; Trzupek, A; Tseng, J C-L; Tsiareshka, P V; Tsipolitis, G; Tsirintanis, N; Tsiskaridze, S; Tsiskaridze, V; Tskhadadze, E G; Tsui, K M; Tsukerman, I I; Tsulaia, V; Tsuno, S; Tsybychev, D; Tudorache, A; Tudorache, V; Tuna, A N; Tupputi, S A; Turchikhin, S; Turecek, D; Turgeman, D; Turra, R; Turvey, A J; Tuts, P M; Tyndel, M; Ucchielli, G; Ueda, I; Ueno, R; Ughetto, M; Ukegawa, F; Unal, G; Undrus, A; Unel, G; Ungaro, F C; Unno, Y; Unverdorben, C; Urban, J; Urquijo, P; Urrejola, P; Usai, G; Usanova, A; Vacavant, L; Vacek, V; Vachon, B; Valderanis, C; Valdes Santurio, E; Valencic, N; Valentinetti, S; Valero, A; Valery, L; Valkar, S; Vallecorsa, S; Valls Ferrer, J A; Van Den Wollenberg, W; Van Der Deijl, P C; van der Geer, R; van der Graaf, H; van Eldik, N; van Gemmeren, P; Van Nieuwkoop, J; van Vulpen, I; van Woerden, M C; Vanadia, M; Vandelli, W; Vanguri, R; Vaniachine, A; Vankov, P; Vardanyan, G; Vari, R; Varnes, E W; Varol, T; Varouchas, D; Vartapetian, A; Varvell, K E; Vasquez, J G; Vazeille, F; Vazquez Schroeder, T; Veatch, J; Veloce, L M; Veloso, F; Veneziano, S; Ventura, A; Venturi, M; Venturi, N; Venturini, A; Vercesi, V; Verducci, M; Verkerke, W; Vermeulen, J C; Vest, A; Vetterli, M C; Viazlo, O; Vichou, I; Vickey, T; Vickey Boeriu, O E; Viehhauser, G H A; Viel, S; Vigani, L; Vigne, R; Villa, M; Villaplana Perez, M; Vilucchi, E; Vincter, M G; Vinogradov, V B; Vittori, C; Vivarelli, I; Vlachos, S; Vlasak, M; Vogel, M; Vokac, P; Volpi, G; Volpi, M; von der Schmitt, H; von Toerne, E; Vorobel, V; Vorobev, K; Vos, M; Voss, R; Vossebeld, J H; Vranjes, N; Vranjes Milosavljevic, M; Vrba, V; Vreeswijk, M; Vuillermet, R; Vukotic, I; Vykydal, Z; Wagner, P; Wagner, W; Wahlberg, H; Wahrmund, S; Wakabayashi, J; Walder, J; Walker, R; Walkowiak, W; Wallangen, V; Wang, C; Wang, C; Wang, F; Wang, H; Wang, H; Wang, J; Wang, J; Wang, K; Wang, R; Wang, S M; Wang, T; Wang, T; Wang, X; Wanotayaroj, C; Warburton, A; Ward, C P; Wardrope, D R; Washbrook, A; Watkins, P M; Watson, A T; Watson, M F; Watts, G; Watts, S; Waugh, B M; Webb, S; Weber, M S; Weber, S W; Webster, J S; Weidberg, A R; Weinert, B; Weingarten, J; Weiser, C; Weits, H; Wells, P S; Wenaus, T; Wengler, T; Wenig, S; Wermes, N; Werner, M; Werner, P; Wessels, M; Wetter, J; Whalen, K; Whallon, N L; Wharton, A M; White, A; White, M J; White, R; White, S; Whiteson, D; Wickens, F J; Wiedenmann, W; Wielers, M; Wienemann, P; Wiglesworth, C; Wiik-Fuchs, L A M; Wildauer, A; Wilk, F; Wilkens, H G; Williams, H H; Williams, S; Willis, C; Willocq, S; Wilson, J A; Wingerter-Seez, I; Winklmeier, F; Winston, O J; Winter, B T; Wittgen, M; Wittkowski, J; Wollstadt, S J; Wolter, M W; Wolters, H; Wosiek, B K; Wotschack, J; Woudstra, M J; Wozniak, K W; Wu, M; Wu, M; Wu, S L; Wu, X; Wu, Y; Wyatt, T R; Wynne, B M; Xella, S; Xu, D; Xu, L; Yabsley, B; Yacoob, S; Yakabe, R; Yamaguchi, D; Yamaguchi, Y; Yamamoto, A; Yamamoto, S; Yamanaka, T; Yamauchi, K; Yamazaki, Y; Yan, Z; Yang, H; Yang, H; Yang, Y; Yang, Z; Yao, W-M; Yap, Y C; Yasu, Y; Yatsenko, E; Yau Wong, K H; Ye, J; Ye, S; Yeletskikh, I; Yen, A L; Yildirim, E; Yorita, K; Yoshida, R; Yoshihara, K; Young, C; Young, C J S; Youssef, S; Yu, D R; Yu, J; Yu, J M; Yu, J; Yuan, L; Yuen, S P Y; Yusuff, I; Zabinski, B; Zaidan, R; Zaitsev, A M; Zakharchuk, N; Zalieckas, J; Zaman, A; Zambito, S; Zanello, L; Zanzi, D; Zeitnitz, C; Zeman, M; Zemla, A; Zeng, J C; Zeng, Q; Zengel, K; Zenin, O; Ženiš, T; Zerwas, D; Zhang, D; Zhang, F; Zhang, G; Zhang, H; Zhang, J; Zhang, L; Zhang, R; Zhang, R; Zhang, X; Zhang, Z; Zhao, X; Zhao, Y; Zhao, Z; Zhemchugov, A; Zhong, J; Zhou, B; Zhou, C; Zhou, L; Zhou, L; Zhou, M; Zhou, N; Zhu, C G; Zhu, H; Zhu, J; Zhu, Y; Zhuang, X; Zhukov, K; Zibell, A; Zieminska, D; Zimine, N I; Zimmermann, C; Zimmermann, S; Zinonos, Z; Zinser, M; Ziolkowski, M; Živković, L; Zobernig, G; Zoccoli, A; Zur Nedden, M; Zurzolo, G; Zwalinski, L
2016-01-01
This article documents the performance of the ATLAS muon identification and reconstruction using the LHC dataset recorded at [Formula: see text] TeV in 2015. Using a large sample of [Formula: see text] and [Formula: see text] decays from 3.2 fb[Formula: see text] of pp collision data, measurements of the reconstruction efficiency, as well as of the momentum scale and resolution, are presented and compared to Monte Carlo simulations. The reconstruction efficiency is measured to be close to [Formula: see text] over most of the covered phase space ([Formula: see text] and [Formula: see text] GeV). The isolation efficiency varies between 93 and [Formula: see text] depending on the selection applied and on the momentum of the muon. Both efficiencies are well reproduced in simulation. In the central region of the detector, the momentum resolution is measured to be [Formula: see text] ([Formula: see text]) for muons from [Formula: see text] ([Formula: see text]) decays, and the momentum scale is known with an uncertainty of [Formula: see text]. In the region [Formula: see text], the [Formula: see text] resolution for muons from [Formula: see text] decays is [Formula: see text] while the precision of the momentum scale for low-[Formula: see text] muons from [Formula: see text] decays is about [Formula: see text].
The Electronic Documentation Project in the NASA mission control center environment
NASA Technical Reports Server (NTRS)
Wang, Lui; Leigh, Albert
1994-01-01
NASA's space programs like many other technical programs of its magnitude is supported by a large volume of technical documents. These documents are not only diverse but also abundant. Management, maintenance, and retrieval of these documents is a challenging problem by itself; but, relating and cross-referencing this wealth of information when it is all on a medium of paper is an even greater challenge. The Electronic Documentation Project (EDP) is to provide an electronic system capable of developing, distributing and controlling changes for crew/ground controller procedures and related documents. There are two primary motives for the solution. The first motive is to reduce the cost of maintaining the current paper based method of operations by replacing paper documents with electronic information storage and retrieval. And, the other is to improve the efficiency and provide enhanced flexibility in document usage. Initially, the current paper based system will be faithfully reproduced in an electronic format to be used in the document viewing system. In addition, this metaphor will have hypertext extensions. Hypertext features support basic functions such as full text searches, key word searches, data retrieval, and traversal between nodes of information as well as speeding up the data access rate. They enable related but separate documents to have relationships, and allow the user to explore information naturally through non-linear link traversals. The basic operational requirements of the document viewing system are to: provide an electronic corollary to the current method of paper based document usage; supplement and ultimately replace paper-based documents; maintain focused toward control center operations such as Flight Data File, Flight Rules and Console Handbook viewing; and be available NASA wide.
An automated system for generating program documentation
NASA Technical Reports Server (NTRS)
Hanney, R. J.
1970-01-01
A documentation program was developed in which the emphasis is placed on text content rather than flowcharting. It is keyword oriented, with 26 keywords that control the program. Seventeen of those keywords are recognized by the flowchart generator, three are related to text generation, and three have to do with control card and deck displays. The strongest advantage offered by the documentation program is that it produces the entire document. The document is prepared on 35mm microfilm, which is easy to store, and letter-size reproductions can be made inexpensively on bond paper.
A Feature Mining Based Approach for the Classification of Text Documents into Disjoint Classes.
ERIC Educational Resources Information Center
Nieto Sanchez, Salvador; Triantaphyllou, Evangelos; Kraft, Donald
2002-01-01
Proposes a new approach for classifying text documents into two disjoint classes. Highlights include a brief overview of document clustering; a data mining approach called the One Clause at a Time (OCAT) algorithm which is based on mathematical logic; vector space model (VSM); and comparing the OCAT to the VSM. (Author/LRW)
Text Categorization for Multi-Page Documents: A Hybrid Naive Bayes HMM Approach.
ERIC Educational Resources Information Center
Frasconi, Paolo; Soda, Giovanni; Vullo, Alessandro
Text categorization is typically formulated as a concept learning problem where each instance is a single isolated document. This paper is interested in a more general formulation where documents are organized as page sequences, as naturally occurring in digital libraries of scanned books and magazines. The paper describes a method for classifying…
DOE Office of Scientific and Technical Information (OSTI.GOV)
Burchard, Ross L.; Pierson, Kathleen P.; Trumbo, Derek
Tarjetas is used to generate requirements from source documents. These source documents are in a hierarchical XML format that have been produced from PDF documents processed through the “Reframe” software package. The software includes the ability to create Topics and associate text Snippets with those topics. Requirements are then generated and text Snippets with their associated Topics are referenced to the requirement. The software maintains traceability from the requirement ultimately to the source document that produced the snippet
75 FR 55267 - Airspace Designations; Incorporation By Reference
Federal Register 2010, 2011, 2012, 2013, 2014
2010-09-10
... airspace listings in FAA Order 7400.9T in full text as proposed rule documents in the Federal Register. Likewise, all amendments of these listings were published in full text as final rules in the Federal... Order 7400.9U in full text as proposed rule documents in the Federal Register. Likewise, all amendments...
76 FR 53328 - Airspace Designations; Incorporation by Reference
Federal Register 2010, 2011, 2012, 2013, 2014
2011-08-26
... proposed changes of the airspace listings in FAA Order 7400.9U in full text as proposed rule documents in the Federal Register. Likewise, all amendments of these listings were published in full text as final... the airspace listings in FAA Order 7400.9V in full text as proposed rule documents in the Federal...
ERIC Educational Resources Information Center
Beghtol, Clare
1986-01-01
Explicates a definition and theory of "aboutness" and aboutness analysis developed by text linguist van Dijk; explores implications of text linguistics for bibliographic classification theory; suggests the elements that a theory of the cognitive process of classifying documents needs to encompass; and delineates how people identify…
An Efficiency Comparison of Document Preparation Systems Used in Academic Research and Development
Knauff, Markus; Nejasmic, Jelica
2014-01-01
The choice of an efficient document preparation system is an important decision for any academic researcher. To assist the research community, we report a software usability study in which 40 researchers across different disciplines prepared scholarly texts with either Microsoft Word or LaTeX. The probe texts included simple continuous text, text with tables and subheadings, and complex text with several mathematical equations. We show that LaTeX users were slower than Word users, wrote less text in the same amount of time, and produced more typesetting, orthographical, grammatical, and formatting errors. On most measures, expert LaTeX users performed even worse than novice Word users. LaTeX users, however, more often report enjoying using their respective software. We conclude that even experienced LaTeX users may suffer a loss in productivity when LaTeX is used, relative to other document preparation systems. Individuals, institutions, and journals should carefully consider the ramifications of this finding when choosing document preparation strategies, or requiring them of authors. PMID:25526083
Multioriented and curved text lines extraction from Indian documents.
Pal, U; Roy, Partha Pratim
2004-08-01
There are printed artistic documents where text lines of a single page may not be parallel to each other. These text lines may have different orientations or the text lines may be curved shapes. For the optical character recognition (OCR) of these documents, we need to extract such lines properly. In this paper, we propose a novel scheme, mainly based on the concept of water reservoir analogy, to extract individual text lines from printed Indian documents containing multioriented and/or curve text lines. A reservoir is a metaphor to illustrate the cavity region of a character where water can be stored. In the proposed scheme, at first, connected components are labeled and identified either as isolated or touching. Next, each touching component is classified either straight type (S-type) or curve type (C-type), depending on the reservoir base-area and envelope points of the component. Based on the type (S-type or C-type) of a component two candidate points are computed from each touching component. Finally, candidate regions (neighborhoods of the candidate points) of the candidate points of each component are detected and after analyzing these candidate regions, components are grouped to get individual text lines.
A bioinformatics knowledge discovery in text application for grid computing
Castellano, Marcello; Mastronardi, Giuseppe; Bellotti, Roberto; Tarricone, Gianfranco
2009-01-01
Background A fundamental activity in biomedical research is Knowledge Discovery which has the ability to search through large amounts of biomedical information such as documents and data. High performance computational infrastructures, such as Grid technologies, are emerging as a possible infrastructure to tackle the intensive use of Information and Communication resources in life science. The goal of this work was to develop a software middleware solution in order to exploit the many knowledge discovery applications on scalable and distributed computing systems to achieve intensive use of ICT resources. Methods The development of a grid application for Knowledge Discovery in Text using a middleware solution based methodology is presented. The system must be able to: perform a user application model, process the jobs with the aim of creating many parallel jobs to distribute on the computational nodes. Finally, the system must be aware of the computational resources available, their status and must be able to monitor the execution of parallel jobs. These operative requirements lead to design a middleware to be specialized using user application modules. It included a graphical user interface in order to access to a node search system, a load balancing system and a transfer optimizer to reduce communication costs. Results A middleware solution prototype and the performance evaluation of it in terms of the speed-up factor is shown. It was written in JAVA on Globus Toolkit 4 to build the grid infrastructure based on GNU/Linux computer grid nodes. A test was carried out and the results are shown for the named entity recognition search of symptoms and pathologies. The search was applied to a collection of 5,000 scientific documents taken from PubMed. Conclusion In this paper we discuss the development of a grid application based on a middleware solution. It has been tested on a knowledge discovery in text process to extract new and useful information about symptoms and pathologies from a large collection of unstructured scientific documents. As an example a computation of Knowledge Discovery in Database was applied on the output produced by the KDT user module to extract new knowledge about symptom and pathology bio-entities. PMID:19534749
A bioinformatics knowledge discovery in text application for grid computing.
Castellano, Marcello; Mastronardi, Giuseppe; Bellotti, Roberto; Tarricone, Gianfranco
2009-06-16
A fundamental activity in biomedical research is Knowledge Discovery which has the ability to search through large amounts of biomedical information such as documents and data. High performance computational infrastructures, such as Grid technologies, are emerging as a possible infrastructure to tackle the intensive use of Information and Communication resources in life science. The goal of this work was to develop a software middleware solution in order to exploit the many knowledge discovery applications on scalable and distributed computing systems to achieve intensive use of ICT resources. The development of a grid application for Knowledge Discovery in Text using a middleware solution based methodology is presented. The system must be able to: perform a user application model, process the jobs with the aim of creating many parallel jobs to distribute on the computational nodes. Finally, the system must be aware of the computational resources available, their status and must be able to monitor the execution of parallel jobs. These operative requirements lead to design a middleware to be specialized using user application modules. It included a graphical user interface in order to access to a node search system, a load balancing system and a transfer optimizer to reduce communication costs. A middleware solution prototype and the performance evaluation of it in terms of the speed-up factor is shown. It was written in JAVA on Globus Toolkit 4 to build the grid infrastructure based on GNU/Linux computer grid nodes. A test was carried out and the results are shown for the named entity recognition search of symptoms and pathologies. The search was applied to a collection of 5,000 scientific documents taken from PubMed. In this paper we discuss the development of a grid application based on a middleware solution. It has been tested on a knowledge discovery in text process to extract new and useful information about symptoms and pathologies from a large collection of unstructured scientific documents. As an example a computation of Knowledge Discovery in Database was applied on the output produced by the KDT user module to extract new knowledge about symptom and pathology bio-entities.
Handling a Large Collection of PDF Documents
You have several options for making a large collection of PDF documents more accessible to your audience: avoid uploading altogether, use multiple document pages, and use document IDs as anchors for direct links within a document page.
On the role of words in the network structure of texts: Application to authorship attribution
NASA Astrophysics Data System (ADS)
Akimushkin, Camilo; Amancio, Diego R.; Oliveira, Osvaldo N.
2018-04-01
Well-established automatic analyses of texts mainly consider frequencies of linguistic units, e.g. letters, words, and bigrams. In a recent, alternative approach, medium and large-scale text structures were used in opposition to the belief that text structure is dominated by the language features. In this paper, we introduce a generalized similarity measure to compare texts which accounts for both the network structure of texts and the role of individual words in the networks. The similarity measure is used for authorship attribution of three collections of books, each composed of 8 authors and 10 books per author. High accuracy rates were obtained with typical values between 90% and 98 . 75%, much higher than with the traditional term frequency-inverse document frequency (tf-idf) approach for the same collections. These accuracies are also higher than those obtained solely with the topology of networks. We conclude that the different properties of specific words on the macroscopic scale structure of a whole text are as relevant as their frequency of appearance; conversely, considering the identity of nodes brings further knowledge about a piece of text represented as a network.
Tsuji, Shintarou; Nishimoto, Naoki; Ogasawara, Katsuhiko
2008-07-20
Although large medical texts are stored in electronic format, they are seldom reused because of the difficulty of processing narrative texts by computer. Morphological analysis is a key technology for extracting medical terms correctly and automatically. This process parses a sentence into its smallest unit, the morpheme. Phrases consisting of two or more technical terms, however, cause morphological analysis software to fail in parsing the sentence and output unprocessed terms as "unknown words." The purpose of this study was to reduce the number of unknown words in medical narrative text processing. The results of parsing the text with additional dictionaries were compared with the analysis of the number of unknown words in the national examination for radiologists. The ratio of unknown words was reduced 1.0% to 0.36% by adding terminologies of radiological technology, MeSH, and ICD-10 labels. The terminology of radiological technology was the most effective resource, being reduced by 0.62%. This result clearly showed the necessity of additional dictionary selection and trends in unknown words. The potential for this investigation is to make available a large body of clinical information that would otherwise be inaccessible for applications other than manual health care review by personnel.
Benge, James; Beach, Thomas; Gladding, Connie; Maestas, Gail
2008-01-01
The Military Health System (MHS) deployed its electronic health record (EHR), AHLTA to Military Treatment Facilities (MTFs) around the world. This paper focuses on the approach and barriers to using structured text in AHLTA to document care encounters and illustrates the direct correlation between the use of structured text and achievement of expected benefits. AHLTA uses commercially available products, a health data dictionary and standardized medical terminology, enabling the capture of structured computable data. With structured text stored in the AHLTA Clinical Data Repository (CDR), the MHS has seen a return on its EHR investment with improvements in the accuracy and completeness of coding and the documentation of care provided. Determining the aspects of documentation where structured text is most beneficial, as well as the degree of structured text needed has been a significant challenge. This paper describes how the economic value framework aligns the enterprise strategic objectives with the EHR investment features, performance metrics and expected benefits. The framework analyses focus on return on investment calculations, baseline assessment and post-implementation benefits validation. Cost avoidance, revenue enhancements and operational improvements, such as evidence-based medicine and medical surveillance can be directly attributed to use structured text.
Global and Local Features Based Classification for Bleed-Through Removal
NASA Astrophysics Data System (ADS)
Hu, Xiangyu; Lin, Hui; Li, Shutao; Sun, Bin
2016-12-01
The text on one side of historical documents often seeps through and appears on the other side, so the bleed-through is a common problem in historical document images. It makes the document images hard to read and the text difficult to recognize. To improve the image quality and readability, the bleed-through has to be removed. This paper proposes a global and local features extraction based bleed-through removal method. The Gaussian mixture model is used to get the global features of the images. Local features are extracted by the patch around each pixel. Then, the extreme learning machine classifier is utilized to classify the scanned images into the foreground text and the bleed-through component. Experimental results on real document image datasets show that the proposed method outperforms the state-of-the-art bleed-through removal methods and preserves the text strokes well.
Yoo, Illhoi; Hu, Xiaohua; Song, Il-Yeol
2007-11-27
A huge amount of biomedical textual information has been produced and collected in MEDLINE for decades. In order to easily utilize biomedical information in the free text, document clustering and text summarization together are used as a solution for text information overload problem. In this paper, we introduce a coherent graph-based semantic clustering and summarization approach for biomedical literature. Our extensive experimental results show the approach shows 45% cluster quality improvement and 72% clustering reliability improvement, in terms of misclassification index, over Bisecting K-means as a leading document clustering approach. In addition, our approach provides concise but rich text summary in key concepts and sentences. Our coherent biomedical literature clustering and summarization approach that takes advantage of ontology-enriched graphical representations significantly improves the quality of document clusters and understandability of documents through summaries.
Yoo, Illhoi; Hu, Xiaohua; Song, Il-Yeol
2007-01-01
Background A huge amount of biomedical textual information has been produced and collected in MEDLINE for decades. In order to easily utilize biomedical information in the free text, document clustering and text summarization together are used as a solution for text information overload problem. In this paper, we introduce a coherent graph-based semantic clustering and summarization approach for biomedical literature. Results Our extensive experimental results show the approach shows 45% cluster quality improvement and 72% clustering reliability improvement, in terms of misclassification index, over Bisecting K-means as a leading document clustering approach. In addition, our approach provides concise but rich text summary in key concepts and sentences. Conclusion Our coherent biomedical literature clustering and summarization approach that takes advantage of ontology-enriched graphical representations significantly improves the quality of document clusters and understandability of documents through summaries. PMID:18047705
Text Generation: The State of the Art and the Literature.
ERIC Educational Resources Information Center
Mann, William C.; And Others
This report comprises two documents which describe the state of the art of computer generation of natural language text. Both were prepared by a panel of individuals who are active in research on text generation. The first document assesses the techniques now available for use in systems design, covering all of the technical methods by which…
DOCU-TEXT: A tool before the data dictionary
NASA Technical Reports Server (NTRS)
Carter, B.
1983-01-01
DOCU-TEXT, a proprietary software package that aids in the production of documentation for a data processing organization and can be installed and operated only on IBM computers is discussed. In organizing information that ultimately will reside in a data dictionary, DOCU-TEXT proved to be a useful documentation tool in extracting information from existing production jobs, procedure libraries, system catalogs, control data sets and related files. DOCU-TEXT reads these files to derive data that is useful at the system level. The output of DOCU-TEXT is a series of user selectable reports. These reports can reflect the interactions within a single job stream, a complete system, or all the systems in an installation. Any single report, or group of reports, can be generated in an independent documentation pass.
Data mining of text as a tool in authorship attribution
NASA Astrophysics Data System (ADS)
Visa, Ari J. E.; Toivonen, Jarmo; Autio, Sami; Maekinen, Jarno; Back, Barbro; Vanharanta, Hannu
2001-03-01
It is common that text documents are characterized and classified by keywords that the authors use to give them. Visa et al. have developed a new methodology based on prototype matching. The prototype is an interesting document or a part of an extracted, interesting text. This prototype is matched with the document database of the monitored document flow. The new methodology is capable of extracting the meaning of the document in a certain degree. Our claim is that the new methodology is also capable of authenticating the authorship. To verify this claim two tests were designed. The test hypothesis was that the words and the word order in the sentences could authenticate the author. In the first test three authors were selected. The selected authors were William Shakespeare, Edgar Allan Poe, and George Bernard Shaw. Three texts from each author were examined. Every text was one by one used as a prototype. The two nearest matches with the prototype were noted. The second test uses the Reuters-21578 financial news database. A group of 25 short financial news reports from five different authors are examined. Our new methodology and the interesting results from the two tests are reported in this paper. In the first test, for Shakespeare and for Poe all cases were successful. For Shaw one text was confused with Poe. In the second test the Reuters-21578 financial news were identified by the author relatively well. The resolution is that our text mining methodology seems to be capable of authorship attribution.
Using color management in color document processing
NASA Astrophysics Data System (ADS)
Nehab, Smadar
1995-04-01
Color Management Systems have been used for several years in Desktop Publishing (DTP) environments. While this development hasn't matured yet, we are already experiencing the next generation of the color imaging revolution-Device Independent Color for the small office/home office (SOHO) environment. Though there are still open technical issues with device independent color matching, they are not the focal point of this paper. This paper discusses two new and crucial aspects in using color management in color document processing: the management of color objects and their associated color rendering methods; a proposal for a precedence order and handshaking protocol among the various software components involved in color document processing. As color peripherals become affordable to the SOHO market, color management also becomes a prerequisite for common document authoring applications such as word processors. The first color management solutions were oriented towards DTP environments whose requirements were largely different. For example, DTP documents are image-centric, as opposed to SOHO documents that are text and charts centric. To achieve optimal reproduction on low-cost SOHO peripherals, it is critical that different color rendering methods are used for the different document object types. The first challenge in using color management of color document processing is the association of rendering methods with object types. As a result of an evolutionary process, color matching solutions are now available as application software, as driver embedded software and as operating system extensions. Consequently, document processing faces a new challenge, the correct selection of the color matching solution while avoiding duplicate color corrections.
Document Exploration and Automatic Knowledge Extraction for Unstructured Biomedical Text
NASA Astrophysics Data System (ADS)
Chu, S.; Totaro, G.; Doshi, N.; Thapar, S.; Mattmann, C. A.; Ramirez, P.
2015-12-01
We describe our work on building a web-browser based document reader with built-in exploration tool and automatic concept extraction of medical entities for biomedical text. Vast amounts of biomedical information are offered in unstructured text form through scientific publications and R&D reports. Utilizing text mining can help us to mine information and extract relevant knowledge from a plethora of biomedical text. The ability to employ such technologies to aid researchers in coping with information overload is greatly desirable. In recent years, there has been an increased interest in automatic biomedical concept extraction [1, 2] and intelligent PDF reader tools with the ability to search on content and find related articles [3]. Such reader tools are typically desktop applications and are limited to specific platforms. Our goal is to provide researchers with a simple tool to aid them in finding, reading, and exploring documents. Thus, we propose a web-based document explorer, which we called Shangri-Docs, which combines a document reader with automatic concept extraction and highlighting of relevant terms. Shangri-Docsalso provides the ability to evaluate a wide variety of document formats (e.g. PDF, Words, PPT, text, etc.) and to exploit the linked nature of the Web and personal content by performing searches on content from public sites (e.g. Wikipedia, PubMed) and private cataloged databases simultaneously. Shangri-Docsutilizes Apache cTAKES (clinical Text Analysis and Knowledge Extraction System) [4] and Unified Medical Language System (UMLS) to automatically identify and highlight terms and concepts, such as specific symptoms, diseases, drugs, and anatomical sites, mentioned in the text. cTAKES was originally designed specially to extract information from clinical medical records. Our investigation leads us to extend the automatic knowledge extraction process of cTAKES for biomedical research domain by improving the ontology guided information extraction process. We will describe our experience and implementation of our system and share lessons learned from our development. We will also discuss ways in which this could be adapted to other science fields. [1] Funk et al., 2014. [2] Kang et al., 2014. [3] Utopia Documents, http://utopiadocs.com [4] Apache cTAKES, http://ctakes.apache.org
PDF text classification to leverage information extraction from publication reports.
Bui, Duy Duc An; Del Fiol, Guilherme; Jonnalagadda, Siddhartha
2016-06-01
Data extraction from original study reports is a time-consuming, error-prone process in systematic review development. Information extraction (IE) systems have the potential to assist humans in the extraction task, however majority of IE systems were not designed to work on Portable Document Format (PDF) document, an important and common extraction source for systematic review. In a PDF document, narrative content is often mixed with publication metadata or semi-structured text, which add challenges to the underlining natural language processing algorithm. Our goal is to categorize PDF texts for strategic use by IE systems. We used an open-source tool to extract raw texts from a PDF document and developed a text classification algorithm that follows a multi-pass sieve framework to automatically classify PDF text snippets (for brevity, texts) into TITLE, ABSTRACT, BODYTEXT, SEMISTRUCTURE, and METADATA categories. To validate the algorithm, we developed a gold standard of PDF reports that were included in the development of previous systematic reviews by the Cochrane Collaboration. In a two-step procedure, we evaluated (1) classification performance, and compared it with machine learning classifier, and (2) the effects of the algorithm on an IE system that extracts clinical outcome mentions. The multi-pass sieve algorithm achieved an accuracy of 92.6%, which was 9.7% (p<0.001) higher than the best performing machine learning classifier that used a logistic regression algorithm. F-measure improvements were observed in the classification of TITLE (+15.6%), ABSTRACT (+54.2%), BODYTEXT (+3.7%), SEMISTRUCTURE (+34%), and MEDADATA (+14.2%). In addition, use of the algorithm to filter semi-structured texts and publication metadata improved performance of the outcome extraction system (F-measure +4.1%, p=0.002). It also reduced of number of sentences to be processed by 44.9% (p<0.001), which corresponds to a processing time reduction of 50% (p=0.005). The rule-based multi-pass sieve framework can be used effectively in categorizing texts extracted from PDF documents. Text classification is an important prerequisite step to leverage information extraction from PDF documents. Copyright © 2016 Elsevier Inc. All rights reserved.
Tank waste remediation system functions and requirements document
DOE Office of Scientific and Technical Information (OSTI.GOV)
Carpenter, K.E
1996-10-03
This is the Tank Waste Remediation System (TWRS) Functions and Requirements Document derived from the TWRS Technical Baseline. The document consists of several text sections that provide the purpose, scope, background information, and an explanation of how this document assists the application of Systems Engineering to the TWRS. The primary functions identified in the TWRS Functions and Requirements Document are identified in Figure 4.1 (Section 4.0) Currently, this document is part of the overall effort to develop the TWRS Functional Requirements Baseline, and contains the functions and requirements needed to properly define the top three TWRS function levels. TWRS Technicalmore » Baseline information (RDD-100 database) included in the appendices of the attached document contain the TWRS functions, requirements, and architecture necessary to define the TWRS Functional Requirements Baseline. Document organization and user directions are provided in the introductory text. This document will continue to be modified during the TWRS life-cycle.« less
It’s about This and That: A Description of Anaphoric Expressions in Clinical Text
Wang, Yan; Melton, Genevieve B.; Pakhomov, Serguei
2011-01-01
Although anaphoric expressions are very common in biomedical and clinical documents, little work has been done to systematically characterize their use in clinical text. Samples of ‘it’, ‘this’, and ‘that’ expressions occurring in inpatient clinical notes from four metropolitan hospitals were analyzed using a combination of semi-automated and manual annotation techniques. We developed a rule-based approach to filter potential non-referential expressions. A physician then manually annotated 1000 potential referential instances to determine referent status and the antecedent of each referent expression. A distributional analysis of the three referring expressions in the entire corpus of notes demonstrates a high prevalence of anaphora and large variance in distributions of referential expressions with different notes. Our results confirm that anaphoric expressions are common in clinical texts. Effective co-reference resolution with anaphoric expressions remains an important challenge in medical natural language processing research. PMID:22195211
NASA Astrophysics Data System (ADS)
Fume, Kosei; Ishitani, Yasuto
2008-01-01
We propose a document categorization method based on a document model that can be defined externally for each task and that categorizes Web content or business documents into a target category in accordance with the similarity of the model. The main feature of the proposed method consists of two aspects of semantics extraction from an input document. The semantics of terms are extracted by the semantic pattern analysis and implicit meanings of document substructure are specified by a bottom-up text clustering technique focusing on the similarity of text line attributes. We have constructed a system based on the proposed method for trial purposes. The experimental results show that the system achieves more than 80% classification accuracy in categorizing Web content and business documents into 15 or 70 categories.
39 CFR 3001.10 - Form and number of copies of documents.
Code of Federal Regulations, 2010 CFR
2010-07-01
... service must be printed from a text-based pdf version of the document, where possible. Otherwise, they may... generated in either Acrobat (pdf), Word, or WordPerfect, or Rich Text Format (rtf). [67 FR 67559, Nov. 6...
Use of speech-to-text technology for documentation by healthcare providers.
Ajami, Sima
2016-01-01
Medical records are a critical component of a patient's treatment. However, documentation of patient-related information is considered a secondary activity in the provision of healthcare services, often leading to incomplete medical records and patient data of low quality. Advances in information technology (IT) in the health system and registration of information in electronic health records (EHR) using speechto- text conversion software have facilitated service delivery. This narrative review is a literature search with the help of libraries, books, conference proceedings, databases of Science Direct, PubMed, Proquest, Springer, SID (Scientific Information Database), and search engines such as Yahoo, and Google. I used the following keywords and their combinations: speech recognition, automatic report documentation, voice to text software, healthcare, information, and voice recognition. Due to lack of knowledge of other languages, I searched all texts in English or Persian with no time limits. Of a total of 70, only 42 articles were selected. Speech-to-text conversion technology offers opportunities to improve the documentation process of medical records, reduce cost and time of recording information, enhance the quality of documentation, improve the quality of services provided to patients, and support healthcare providers in legal matters. Healthcare providers should recognize the impact of this technology on service delivery.
Annotating Socio-Cultural Structures in Text
2012-10-31
parts of speech (POS) within text, using the Stanford Part of Speech Tagger (Stanford Log-Linear, 2011). The ERDC-CERL taxonomy is then used to...annotated NP/VP Pane: Shows the sentence parsed using the Parts of Speech tagger Document View Pane: Specifies the document (being annotated) in three...first parsed using the Stanford Parts of Speech tagger and converted to an XML document both components which are done through the Import function
Text mining for the biocuration workflow
Hirschman, Lynette; Burns, Gully A. P. C; Krallinger, Martin; Arighi, Cecilia; Cohen, K. Bretonnel; Valencia, Alfonso; Wu, Cathy H.; Chatr-Aryamontri, Andrew; Dowell, Karen G.; Huala, Eva; Lourenço, Anália; Nash, Robert; Veuthey, Anne-Lise; Wiegers, Thomas; Winter, Andrew G.
2012-01-01
Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations took place for the workshop on ‘Text Mining for the BioCuration Workflow’ at the third International Biocuration Conference (Berlin, 2009). We interviewed biocurators to obtain workflows from eight biological databases. This initial study revealed high-level commonalities, including (i) selection of documents for curation; (ii) indexing of documents with biologically relevant entities (e.g. genes); and (iii) detailed curation of specific relations (e.g. interactions); however, the detailed workflows also showed many variabilities. Following the workshop, we conducted a survey of biocurators. The survey identified biocurator priorities, including the handling of full text indexed with biological entities and support for the identification and prioritization of documents for curation. It also indicated that two-thirds of the biocuration teams had experimented with text mining and almost half were using text mining at that time. Analysis of our interviews and survey provide a set of requirements for the integration of text mining into the biocuration workflow. These can guide the identification of common needs across curated databases and encourage joint experimentation involving biocurators, text mining developers and the larger biomedical research community. PMID:22513129
Text mining for the biocuration workflow.
Hirschman, Lynette; Burns, Gully A P C; Krallinger, Martin; Arighi, Cecilia; Cohen, K Bretonnel; Valencia, Alfonso; Wu, Cathy H; Chatr-Aryamontri, Andrew; Dowell, Karen G; Huala, Eva; Lourenço, Anália; Nash, Robert; Veuthey, Anne-Lise; Wiegers, Thomas; Winter, Andrew G
2012-01-01
Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations took place for the workshop on 'Text Mining for the BioCuration Workflow' at the third International Biocuration Conference (Berlin, 2009). We interviewed biocurators to obtain workflows from eight biological databases. This initial study revealed high-level commonalities, including (i) selection of documents for curation; (ii) indexing of documents with biologically relevant entities (e.g. genes); and (iii) detailed curation of specific relations (e.g. interactions); however, the detailed workflows also showed many variabilities. Following the workshop, we conducted a survey of biocurators. The survey identified biocurator priorities, including the handling of full text indexed with biological entities and support for the identification and prioritization of documents for curation. It also indicated that two-thirds of the biocuration teams had experimented with text mining and almost half were using text mining at that time. Analysis of our interviews and survey provide a set of requirements for the integration of text mining into the biocuration workflow. These can guide the identification of common needs across curated databases and encourage joint experimentation involving biocurators, text mining developers and the larger biomedical research community.
Replacement Attack: A New Zero Text Watermarking Attack
NASA Astrophysics Data System (ADS)
Bashardoost, Morteza; Mohd Rahim, Mohd Shafry; Saba, Tanzila; Rehman, Amjad
2017-03-01
The main objective of zero watermarking methods that are suggested for the authentication of textual properties is to increase the fragility of produced watermarks against tampering attacks. On the other hand, zero watermarking attacks intend to alter the contents of document without changing the watermark. In this paper, the Replacement attack is proposed, which focuses on maintaining the location of the words in the document. The proposed text watermarking attack is specifically effective on watermarking approaches that exploit words' transition in the document. The evaluation outcomes prove that tested word-based method are unable to detect the existence of replacement attack in the document. Moreover, the comparison results show that the size of Replacement attack is estimated less accurate than other common types of zero text watermarking attacks.
C4ISR Architecture Working Group (AWG), Architecture Framework Version 2.0.
1997-12-18
Vision Name Name/identifier of document that contains doctrine, goals, or vision Type Doctrine, goals, or vision Description Text summary description...e.g., organization, directive, order) Description Text summary of tasking •Rules, Criteria, or Conventions Name Name/identifier of document that...contains rules, criteria, or conventions Type One of: rules, criteria, or conventions Description Text summary description of contents or
A multi-ontology approach to annotate scientific documents based on a modularization technique.
Gomes, Priscilla Corrêa E Castro; Moura, Ana Maria de Carvalho; Cavalcanti, Maria Cláudia
2015-12-01
Scientific text annotation has become an important task for biomedical scientists. Nowadays, there is an increasing need for the development of intelligent systems to support new scientific findings. Public databases available on the Web provide useful data, but much more useful information is only accessible in scientific texts. Text annotation may help as it relies on the use of ontologies to maintain annotations based on a uniform vocabulary. However, it is difficult to use an ontology, especially those that cover a large domain. In addition, since scientific texts explore multiple domains, which are covered by distinct ontologies, it becomes even more difficult to deal with such task. Moreover, there are dozens of ontologies in the biomedical area, and they are usually big in terms of the number of concepts. It is in this context that ontology modularization can be useful. This work presents an approach to annotate scientific documents using modules of different ontologies, which are built according to a module extraction technique. The main idea is to analyze a set of single-ontology annotations on a text to find out the user interests. Based on these annotations a set of modules are extracted from a set of distinct ontologies, and are made available for the user, for complementary annotation. The reduced size and focus of the extracted modules tend to facilitate the annotation task. An experiment was conducted to evaluate this approach, with the participation of a bioinformatician specialist of the Laboratory of Peptides and Proteins of the IOC/Fiocruz, who was interested in discovering new drug targets aiming at the combat of tropical diseases. Copyright © 2015 Elsevier Inc. All rights reserved.
Extracting and connecting chemical structures from text sources using chemicalize.org.
Southan, Christopher; Stracz, Andras
2013-04-23
Exploring bioactive chemistry requires navigating between structures and data from a variety of text-based sources. While PubChem currently includes approximately 16 million document-extracted structures (15 million from patents) the extent of public inter-document and document-to-database links is still well below any estimated total, especially for journal articles. A major expansion in access to text-entombed chemistry is enabled by chemicalize.org. This on-line resource can process IUPAC names, SMILES, InChI strings, CAS numbers and drug names from pasted text, PDFs or URLs to generate structures, calculate properties and launch searches. Here, we explore its utility for answering questions related to chemical structures in documents and where these overlap with database records. These aspects are illustrated using a common theme of Dipeptidyl Peptidase 4 (DPPIV) inhibitors. Full-text open URL sources facilitated the download of over 1400 structures from a DPPIV patent and the alignment of specific examples with IC50 data. Uploading the SMILES to PubChem revealed extensive linking to patents and papers, including prior submissions from chemicalize.org as submitting source. A DPPIV medicinal chemistry paper was completely extracted and structures were aligned to the activity results table, as well as linked to other documents via PubChem. In both cases, key structures with data were partitioned from common chemistry by dividing them into individual new PDFs for conversion. Over 500 structures were also extracted from a batch of PubMed abstracts related to DPPIV inhibition. The drug structures could be stepped through each text occurrence and included some converted MeSH-only IUPAC names not linked in PubChem. Performing set intersections proved effective for detecting compounds-in-common between documents and merged extractions. This work demonstrates the utility of chemicalize.org for the exploration of chemical structure connectivity between documents and databases, including structure searches in PubChem, InChIKey searches in Google and the chemicalize.org archive. It has the flexibility to extract text from any internal, external or Web source. It synergizes with other open tools and the application is undergoing continued development. It should thus facilitate progress in medicinal chemistry, chemical biology and other bioactive chemistry domains.
Neurolinguistic approach to natural language processing with applications to medical text analysis.
Duch, Włodzisław; Matykiewicz, Paweł; Pestian, John
2008-12-01
Understanding written or spoken language presumably involves spreading neural activation in the brain. This process may be approximated by spreading activation in semantic networks, providing enhanced representations that involve concepts not found directly in the text. The approximation of this process is of great practical and theoretical interest. Although activations of neural circuits involved in representation of words rapidly change in time snapshots of these activations spreading through associative networks may be captured in a vector model. Concepts of similar type activate larger clusters of neurons, priming areas in the left and right hemisphere. Analysis of recent brain imaging experiments shows the importance of the right hemisphere non-verbal clusterization. Medical ontologies enable development of a large-scale practical algorithm to re-create pathways of spreading neural activations. First concepts of specific semantic type are identified in the text, and then all related concepts of the same type are added to the text, providing expanded representations. To avoid rapid growth of the extended feature space after each step only the most useful features that increase document clusterization are retained. Short hospital discharge summaries are used to illustrate how this process works on a real, very noisy data. Expanded texts show significantly improved clustering and may be classified with much higher accuracy. Although better approximations to the spreading of neural activations may be devised a practical approach presented in this paper helps to discover pathways used by the brain to process specific concepts, and may be used in large-scale applications.
One-click scanning of large-size documents using mobile phone camera
NASA Astrophysics Data System (ADS)
Liu, Sijiang; Jiang, Bo; Yang, Yuanjie
2016-07-01
Currently mobile apps for document scanning do not provide convenient operations to tackle large-size documents. In this paper, we present a one-click scanning approach for large-size documents using mobile phone camera. After capturing a continuous video of documents, our approach automatically extracts several key frames by optical flow analysis. Then based on key frames, a mobile GPU based image stitching method is adopted to generate a completed document image with high details. There are no extra manual intervention in the process and experimental results show that our app performs well, showing convenience and practicability for daily life.
Contemporary issues in HIM. The application layer--III.
Wear, L L; Pinkert, J R
1993-07-01
We have seen document preparation systems evolve from basic line editors through powerful, sophisticated desktop publishing programs. This component of the application layer is probably one of the most used, and most readily identifiable. Ask grade school children nowadays, and many will tell you that they have written a paper on a computer. Next month will be a "fun" tour through a number of other application programs we find useful. They will range from a simple notebook reminder to a sophisticated photograph processor. Application layer: Software targeted for the end user, focusing on a specific application area, and typically residing in the computer system as distinct components on top of the OS. Desktop publishing: A document preparation program that begins with the text features of a word processor, then adds the ability for a user to incorporate outputs from a variety of graphic programs, spreadsheets, and other applications. Line editor: A document preparation program that manipulates text in a file on the basis of numbered lines. Word processor: A document preparation program that can, among other things, reformat sections of documents, move and replace blocks of text, use multiple character fonts, automatically create a table of contents and index, create complex tables, and combine text and graphics.
Probabilistic topic modeling for the analysis and classification of genomic sequences
2015-01-01
Background Studies on genomic sequences for classification and taxonomic identification have a leading role in the biomedical field and in the analysis of biodiversity. These studies are focusing on the so-called barcode genes, representing a well defined region of the whole genome. Recently, alignment-free techniques are gaining more importance because they are able to overcome the drawbacks of sequence alignment techniques. In this paper a new alignment-free method for DNA sequences clustering and classification is proposed. The method is based on k-mers representation and text mining techniques. Methods The presented method is based on Probabilistic Topic Modeling, a statistical technique originally proposed for text documents. Probabilistic topic models are able to find in a document corpus the topics (recurrent themes) characterizing classes of documents. This technique, applied on DNA sequences representing the documents, exploits the frequency of fixed-length k-mers and builds a generative model for a training group of sequences. This generative model, obtained through the Latent Dirichlet Allocation (LDA) algorithm, is then used to classify a large set of genomic sequences. Results and conclusions We performed classification of over 7000 16S DNA barcode sequences taken from Ribosomal Database Project (RDP) repository, training probabilistic topic models. The proposed method is compared to the RDP tool and Support Vector Machine (SVM) classification algorithm in a extensive set of trials using both complete sequences and short sequence snippets (from 400 bp to 25 bp). Our method reaches very similar results to RDP classifier and SVM for complete sequences. The most interesting results are obtained when short sequence snippets are considered. In these conditions the proposed method outperforms RDP and SVM with ultra short sequences and it exhibits a smooth decrease of performance, at every taxonomic level, when the sequence length is decreased. PMID:25916734
A semantic graph-based approach to biomedical summarisation.
Plaza, Laura; Díaz, Alberto; Gervás, Pablo
2011-09-01
Access to the vast body of research literature that is available in biomedicine and related fields may be improved by automatic summarisation. This paper presents a method for summarising biomedical scientific literature that takes into consideration the characteristics of the domain and the type of documents. To address the problem of identifying salient sentences in biomedical texts, concepts and relations derived from the Unified Medical Language System (UMLS) are arranged to construct a semantic graph that represents the document. A degree-based clustering algorithm is then used to identify different themes or topics within the text. Different heuristics for sentence selection, intended to generate different types of summaries, are tested. A real document case is drawn up to illustrate how the method works. A large-scale evaluation is performed using the recall-oriented understudy for gisting-evaluation (ROUGE) metrics. The results are compared with those achieved by three well-known summarisers (two research prototypes and a commercial application) and two baselines. Our method significantly outperforms all summarisers and baselines. The best of our heuristics achieves an improvement in performance of almost 7.7 percentage units in the ROUGE-1 score over the LexRank summariser (0.7862 versus 0.7302). A qualitative analysis of the summaries also shows that our method succeeds in identifying sentences that cover the main topic of the document and also considers other secondary or "satellite" information that might be relevant to the user. The method proposed is proved to be an efficient approach to biomedical literature summarisation, which confirms that the use of concepts rather than terms can be very useful in automatic summarisation, especially when dealing with highly specialised domains. Copyright © 2011 Elsevier B.V. All rights reserved.
Segmentation-driven compound document coding based on H.264/AVC-INTRA.
Zaghetto, Alexandre; de Queiroz, Ricardo L
2007-07-01
In this paper, we explore H.264/AVC operating in intraframe mode to compress a mixed image, i.e., composed of text, graphics, and pictures. Even though mixed contents (compound) documents usually require the use of multiple compressors, we apply a single compressor for both text and pictures. For that, distortion is taken into account differently between text and picture regions. Our approach is to use a segmentation-driven adaptation strategy to change the H.264/AVC quantization parameter on a macroblock by macroblock basis, i.e., we deviate bits from pictorial regions to text in order to keep text edges sharp. We show results of a segmentation driven quantizer adaptation method applied to compress documents. Our reconstructed images have better text sharpness compared to straight unadapted coding, at negligible visual losses on pictorial regions. Our results also highlight the fact that H.264/AVC-INTRA outperforms coders such as JPEG-2000 as a single coder for compound images.
ERIC Educational Resources Information Center
Thomas, Georgelle; Fishburne, Robert P.
Part of the Anthropology Curriculum Project, the document contains a programmed text on evolution and a vocabulary pronunciation guide. The unit is intended for use by students in social studies and science courses in the 5th, 6th, and 7th grades. The bulk of the document, the programmed text, is organized in a question answer format. Students are…
NASA Astrophysics Data System (ADS)
Suzuki, Izumi; Mikami, Yoshiki; Ohsato, Ario
A technique that acquires documents in the same category with a given short text is introduced. Regarding the given text as a training document, the system marks up the most similar document, or sufficiently similar documents, from among the document domain (or entire Web). The system then adds the marked documents to the training set to learn the set, and this process is repeated until no more documents are marked. Setting a monotone increasing property to the similarity as it learns enables the system to 1) detect the correct timing so that no more documents remain to be marked and to 2) decide the threshold value that the classifier uses. In addition, under the condition that the normalization process is limited to what term weights are divided by a p-norm of the weights, the linear classifier in which training documents are indexed in a binary manner is the only instance that satisfies the monotone increasing property. The feasibility of the proposed technique was confirmed through an examination of binary similarity and using English and German documents randomly selected from the Web.
Mujtaba, Ghulam; Shuib, Liyana; Raj, Ram Gopal; Rajandram, Retnagowri; Shaikh, Khairunisa; Al-Garadi, Mohammed Ali
2018-06-01
Text categorization has been used extensively in recent years to classify plain-text clinical reports. This study employs text categorization techniques for the classification of open narrative forensic autopsy reports. One of the key steps in text classification is document representation. In document representation, a clinical report is transformed into a format that is suitable for classification. The traditional document representation technique for text categorization is the bag-of-words (BoW) technique. In this study, the traditional BoW technique is ineffective in classifying forensic autopsy reports because it merely extracts frequent but discriminative features from clinical reports. Moreover, this technique fails to capture word inversion, as well as word-level synonymy and polysemy, when classifying autopsy reports. Hence, the BoW technique suffers from low accuracy and low robustness unless it is improved with contextual and application-specific information. To overcome the aforementioned limitations of the BoW technique, this research aims to develop an effective conceptual graph-based document representation (CGDR) technique to classify 1500 forensic autopsy reports from four (4) manners of death (MoD) and sixteen (16) causes of death (CoD). Term-based and Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) based conceptual features were extracted and represented through graphs. These features were then used to train a two-level text classifier. The first level classifier was responsible for predicting MoD. In addition, the second level classifier was responsible for predicting CoD using the proposed conceptual graph-based document representation technique. To demonstrate the significance of the proposed technique, its results were compared with those of six (6) state-of-the-art document representation techniques. Lastly, this study compared the effects of one-level classification and two-level classification on the experimental results. The experimental results indicated that the CGDR technique achieved 12% to 15% improvement in accuracy compared with fully automated document representation baseline techniques. Moreover, two-level classification obtained better results compared with one-level classification. The promising results of the proposed conceptual graph-based document representation technique suggest that pathologists can adopt the proposed system as their basis for second opinion, thereby supporting them in effectively determining CoD. Copyright © 2018 Elsevier Inc. All rights reserved.
Unapparent Information Revelation: Text Mining for Counterterrorism
NASA Astrophysics Data System (ADS)
Srihari, Rohini K.
Unapparent information revelation (UIR) is a special case of text mining that focuses on detecting possible links between concepts across multiple text documents by generating an evidence trail explaining the connection. A traditional search involving, for example, two or more person names will attempt to find documents mentioning both these individuals. This research focuses on a different interpretation of such a query: what is the best evidence trail across documents that explains a connection between these individuals? For example, all may be good golfers. A generalization of this task involves query terms representing general concepts (e.g. indictment, foreign policy). Previous approaches to this problem have focused on graph mining involving hyperlinked documents, and link analysis exploiting named entities. A new robust framework is presented, based on (i) generating concept chain graphs, a hybrid content representation, (ii) performing graph matching to select candidate subgraphs, and (iii) subsequently using graphical models to validate hypotheses using ranked evidence trails. We adapt the DUC data set for cross-document summarization to evaluate evidence trails generated by this approach
Maria Sibylla Merian and the metamorphosis of natural history.
Etheridge, Kay
2011-03-01
Known primarily for creating beautiful images of butterflies and flowers, Maria Sibylla Merian (German, 1647-1717) has remained largely unappreciated for her seminal contribution to early modern natural history. Merian was indeed a talented artist, but she clearly thought of herself as a naturalist, and employed both text and images to depict lepidopteran metamorphosis and behavior with unprecedented accuracy and detail. Merian documented larvae and adult insects feeding not only on plants, but also on other animals, and she depicted other creatures preying on insects. An image of battling spiders and ants and the accompanying text in her 1705 Metamorphosis insectorum surinamensium illuminated the world of tropical arthropods in a way that was groundbreaking, and set the stage for a new way to envision nature. Copyright © 2010 Elsevier Ltd. All rights reserved.
Real-time text extraction based on the page layout analysis system
NASA Astrophysics Data System (ADS)
Soua, M.; Benchekroun, A.; Kachouri, R.; Akil, M.
2017-05-01
Several approaches were proposed in order to extract text from scanned documents. However, text extraction in heterogeneous documents stills a real challenge. Indeed, text extraction in this context is a difficult task because of the variation of the text due to the differences of sizes, styles and orientations, as well as to the complexity of the document region background. Recently, we have proposed the improved hybrid binarization based on Kmeans method (I-HBK)5 to extract suitably the text from heterogeneous documents. In this method, the Page Layout Analysis (PLA), part of the Tesseract OCR engine, is used to identify text and image regions. Afterwards our hybrid binarization is applied separately on each kind of regions. In one side, gamma correction is employed before to process image regions. In the other side, binarization is performed directly on text regions. Then, a foreground and background color study is performed to correct inverted region colors. Finally, characters are located from the binarized regions based on the PLA algorithm. In this work, we extend the integration of the PLA algorithm within the I-HBK method. In addition, to speed up the separation of text and image step, we employ an efficient GPU acceleration. Through the performed experiments, we demonstrate the high F-measure accuracy of the PLA algorithm reaching 95% on the LRDE dataset. In addition, we illustrate the sequential and the parallel compared PLA versions. The obtained results give a speedup of 3.7x when comparing the parallel PLA implementation on GPU GTX 660 to the CPU version.
27 CFR 19.677 - Large plant applications-organizational documents.
Code of Federal Regulations, 2011 CFR
2011-04-01
... 27 Alcohol, Tobacco Products and Firearms 1 2011-04-01 2011-04-01 false Large plant applications-organizational documents. 19.677 Section 19.677 Alcohol, Tobacco Products and Firearms ALCOHOL AND TOBACCO TAX... Fuel Use Obtaining A Permit § 19.677 Large plant applications—organizational documents. In addition to...
Improving Text Recall with Multiple Summaries
ERIC Educational Resources Information Center
van der Meij, Hans; van der Meij, Jan
2012-01-01
Background. QuikScan (QS) is an innovative design that aims to improve accessibility, comprehensibility, and subsequent recall of expository text by means of frequent within-document summaries that are formatted as numbered list items. The numbers in the QS summaries correspond to numbers placed in the body of the document where the summarized…
Analysis of line structure in handwritten documents using the Hough transform
NASA Astrophysics Data System (ADS)
Ball, Gregory R.; Kasiviswanathan, Harish; Srihari, Sargur N.; Narayanan, Aswin
2010-01-01
In the analysis of handwriting in documents a central task is that of determining line structure of the text, e.g., number of text lines, location of their starting and end-points, line-width, etc. While simple methods can handle ideal images, real world documents have complexities such as overlapping line structure, variable line spacing, line skew, document skew, noisy or degraded images etc. This paper explores the application of the Hough transform method to handwritten documents with the goal of automatically determining global document line structure in a top-down manner which can then be used in conjunction with a bottom-up method such as connected component analysis. The performance is significantly better than other top-down methods, such as the projection profile method. In addition, we evaluate the performance of skew analysis by the Hough transform on handwritten documents.
Document image cleanup and binarization
NASA Astrophysics Data System (ADS)
Wu, Victor; Manmatha, Raghaven
1998-04-01
Image binarization is a difficult task for documents with text over textured or shaded backgrounds, poor contrast, and/or considerable noise. Current optical character recognition (OCR) and document analysis technology do not handle such documents well. We have developed a simple yet effective algorithm for document image clean-up and binarization. The algorithm consists of two basic steps. In the first step, the input image is smoothed using a low-pass filter. The smoothing operation enhances the text relative to any background texture. This is because background texture normally has higher frequency than text does. The smoothing operation also removes speckle noise. In the second step, the intensity histogram of the smoothed image is computed and a threshold automatically selected as follows. For black text, the first peak of the histogram corresponds to text. Thresholding the image at the value of the valley between the first and second peaks of the histogram binarizes the image well. In order to reliably identify the valley, the histogram is smoothed by a low-pass filter before the threshold is computed. The algorithm has been applied to some 50 images from a wide variety of source: digitized video frames, photos, newspapers, advertisements in magazines or sales flyers, personal checks, etc. There are 21820 characters and 4406 words in these images. 91 percent of the characters and 86 percent of the words are successfully cleaned up and binarized. A commercial OCR was applied to the binarized text when it consisted of fonts which were OCR recognizable. The recognition rate was 84 percent for the characters and 77 percent for the words.
Leveraging Text Content for Management of Construction Project Documents
ERIC Educational Resources Information Center
Alqady, Mohammed
2012-01-01
The construction industry is a knowledge intensive industry. Thousands of documents are generated by construction projects. Documents, as information carriers, must be managed effectively to ensure successful project management. The fact that a single project can produce thousands of documents and that a lot of the documents are generated in a…
The Islamic State Battle Plan: Press Release Natural Language Processing
2016-06-01
Processing, text mining , corpus, generalized linear model, cascade, R Shiny, leaflet, data visualization 15. NUMBER OF PAGES 83 16. PRICE CODE...Terrorism and Responses to Terrorism TDM Term Document Matrix TF Term Frequency TF-IDF Term Frequency-Inverse Document Frequency tm text mining (R...package=leaflet. Feinerer I, Hornik K (2015) Text Mining Package “tm,” Version 0.6-2. (Jul 3) https://cran.r-project.org/web/packages/tm/tm.pdf
A Semi-supervised Heat Kernel Pagerank MBO Algorithm for Data Classification
2016-07-01
financial predictions, etc. and is finding growing use in text mining studies. In this paper, we present an efficient algorithm for classification of high...video data, set of images, hyperspectral data, medical data, text data, etc. Moreover, the framework provides a way to analyze data whose different...also be incorporated. For text classification, one can use tfidf (term frequency inverse document frequency) to form feature vectors for each document
ERIC Educational Resources Information Center
Stahl, Steven A.; And Others
To examine the effects of students reading multiple documents on their perceptions of a historical event, in this case the "discovery" of America by Christopher Columbus, 85 high school freshmen read 3 of 4 different texts (or sets of texts) dealing with Columbus. One text was an encyclopedia article, one a set of articles from…
M68000 RNF text formatter user's manual
NASA Technical Reports Server (NTRS)
Will, R. W.; Grantham, C.
1985-01-01
A powerful, flexible text formatting program, RNF, is described. It is designed to automate many of the tedious elements of typing, including breaking a document into pages with titles and page numbers, formatting chapter and section headings, keeping track of page numbers for use in a table of contents, justifying lines by inserting blanks to give an even right margin, and inserting figures and footnotes at appropriate places on the page. The RNF program greatly facilitates both preparing and modifying a document because it allows you to concentrate your efforts on the content of the document instead of its appearance and because it removes the necessity of retyping text that has not changed.
Inferring Group Processes from Computer-Mediated Affective Text Analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schryver, Jack C; Begoli, Edmon; Jose, Ajith
2011-02-01
Political communications in the form of unstructured text convey rich connotative meaning that can reveal underlying group social processes. Previous research has focused on sentiment analysis at the document level, but we extend this analysis to sub-document levels through a detailed analysis of affective relationships between entities extracted from a document. Instead of pure sentiment analysis, which is just positive or negative, we explore nuances of affective meaning in 22 affect categories. Our affect propagation algorithm automatically calculates and displays extracted affective relationships among entities in graphical form in our prototype (TEAMSTER), starting with seed lists of affect terms. Severalmore » useful metrics are defined to infer underlying group processes by aggregating affective relationships discovered in a text. Our approach has been validated with annotated documents from the MPQA corpus, achieving a performance gain of 74% over comparable random guessers.« less
Using ontology network structure in text mining.
Berndt, Donald J; McCart, James A; Luther, Stephen L
2010-11-13
Statistical text mining treats documents as bags of words, with a focus on term frequencies within documents and across document collections. Unlike natural language processing (NLP) techniques that rely on an engineered vocabulary or a full-featured ontology, statistical approaches do not make use of domain-specific knowledge. The freedom from biases can be an advantage, but at the cost of ignoring potentially valuable knowledge. The approach proposed here investigates a hybrid strategy based on computing graph measures of term importance over an entire ontology and injecting the measures into the statistical text mining process. As a starting point, we adapt existing search engine algorithms such as PageRank and HITS to determine term importance within an ontology graph. The graph-theoretic approach is evaluated using a smoking data set from the i2b2 National Center for Biomedical Computing, cast as a simple binary classification task for categorizing smoking-related documents, demonstrating consistent improvements in accuracy.
Evaluation of PHI Hunter in Natural Language Processing Research.
Redd, Andrew; Pickard, Steve; Meystre, Stephane; Scehnet, Jeffrey; Bolton, Dan; Heavirland, Julia; Weaver, Allison Lynn; Hope, Carol; Garvin, Jennifer Hornung
2015-01-01
We introduce and evaluate a new, easily accessible tool using a common statistical analysis and business analytics software suite, SAS, which can be programmed to remove specific protected health information (PHI) from a text document. Removal of PHI is important because the quantity of text documents used for research with natural language processing (NLP) is increasing. When using existing data for research, an investigator must remove all PHI not needed for the research to comply with human subjects' right to privacy. This process is similar, but not identical, to de-identification of a given set of documents. PHI Hunter removes PHI from free-form text. It is a set of rules to identify and remove patterns in text. PHI Hunter was applied to 473 Department of Veterans Affairs (VA) text documents randomly drawn from a research corpus stored as unstructured text in VA files. PHI Hunter performed well with PHI in the form of identification numbers such as Social Security numbers, phone numbers, and medical record numbers. The most commonly missed PHI items were names and locations. Incorrect removal of information occurred with text that looked like identification numbers. PHI Hunter fills a niche role that is related to but not equal to the role of de-identification tools. It gives research staff a tool to reasonably increase patient privacy. It performs well for highly sensitive PHI categories that are rarely used in research, but still shows possible areas for improvement. More development for patterns of text and linked demographic tables from electronic health records (EHRs) would improve the program so that more precise identifiable information can be removed. PHI Hunter is an accessible tool that can flexibly remove PHI not needed for research. If it can be tailored to the specific data set via linked demographic tables, its performance will improve in each new document set.
Evaluation of PHI Hunter in Natural Language Processing Research
Redd, Andrew; Pickard, Steve; Meystre, Stephane; Scehnet, Jeffrey; Bolton, Dan; Heavirland, Julia; Weaver, Allison Lynn; Hope, Carol; Garvin, Jennifer Hornung
2015-01-01
Objectives We introduce and evaluate a new, easily accessible tool using a common statistical analysis and business analytics software suite, SAS, which can be programmed to remove specific protected health information (PHI) from a text document. Removal of PHI is important because the quantity of text documents used for research with natural language processing (NLP) is increasing. When using existing data for research, an investigator must remove all PHI not needed for the research to comply with human subjects’ right to privacy. This process is similar, but not identical, to de-identification of a given set of documents. Materials and methods PHI Hunter removes PHI from free-form text. It is a set of rules to identify and remove patterns in text. PHI Hunter was applied to 473 Department of Veterans Affairs (VA) text documents randomly drawn from a research corpus stored as unstructured text in VA files. Results PHI Hunter performed well with PHI in the form of identification numbers such as Social Security numbers, phone numbers, and medical record numbers. The most commonly missed PHI items were names and locations. Incorrect removal of information occurred with text that looked like identification numbers. Discussion PHI Hunter fills a niche role that is related to but not equal to the role of de-identification tools. It gives research staff a tool to reasonably increase patient privacy. It performs well for highly sensitive PHI categories that are rarely used in research, but still shows possible areas for improvement. More development for patterns of text and linked demographic tables from electronic health records (EHRs) would improve the program so that more precise identifiable information can be removed. Conclusions PHI Hunter is an accessible tool that can flexibly remove PHI not needed for research. If it can be tailored to the specific data set via linked demographic tables, its performance will improve in each new document set. PMID:26807078
An Introduction to the Extensible Markup Language (XML).
ERIC Educational Resources Information Center
Bryan, Martin
1998-01-01
Describes Extensible Markup Language (XML), a subset of the Standard Generalized Markup Language (SGML) that is designed to make it easy to interchange structured documents over the Internet. Topics include Document Type Definition (DTD), components of XML, the use of XML, text and non-text elements, and uses for XML-coded files. (LRW)
Medical record management systems: criticisms and new perspectives.
Frénot, S; Laforest, F
1999-06-01
The first generation of computerized medical records stored the data as text, but these records did not bring any improvement in information manipulation. The use of a relational database management system (DBMS) has largely solved this problem as it allows for data requests by using SQL. However, this requires data structuring which is not very appropriate to medicine. Moreover, the use of templates and icon user interfaces has introduced a deviation from the paper-based record (still existing). The arrival of hypertext user interfaces has proven to be of interest to fill the gap between the paper-based medical record and its electronic version. We think that further improvement can be accomplished by using a fully document-based system. We present the architecture, advantages and disadvantages of classical DBMS-based and Web/DBMS-based solutions. We also present a document-based solution and explain its advantages, which include communication, security, flexibility and genericity.
Improving PHENIX search with Solr, Nutch and Drupal.
NASA Astrophysics Data System (ADS)
Morrison, Dave; Sourikova, Irina
2012-12-01
During its 20 years of R&D, construction and operation the PHENIX experiment at the Relativistic Heavy Ion Collider (RHIC) has accumulated large amounts of proprietary collaboration data that is hosted on many servers around the world and is not open for commercial search engines for indexing and searching. The legacy search infrastructure did not scale well with the fast growing PHENIX document base and produced results inadequate in both precision and recall. After considering the possible alternatives that would provide an aggregated, fast, full text search of a variety of data sources and file formats we decided to use Nutch [1] as a web crawler and Solr [2] as a search engine. To present XML-based Solr search results in a user-friendly format we use Drupal [3] as a web interface to Solr. We describe the experience of building a federated search for a heterogeneous collection of 10 million PHENIX documents with Nutch, Solr and Drupal.
Extracting biomedical events from pairs of text entities
2015-01-01
Background Huge amounts of electronic biomedical documents, such as molecular biology reports or genomic papers are generated daily. Nowadays, these documents are mainly available in the form of unstructured free texts, which require heavy processing for their registration into organized databases. This organization is instrumental for information retrieval, enabling to answer the advanced queries of researchers and practitioners in biology, medicine, and related fields. Hence, the massive data flow calls for efficient automatic methods of text-mining that extract high-level information, such as biomedical events, from biomedical text. The usual computational tools of Natural Language Processing cannot be readily applied to extract these biomedical events, due to the peculiarities of the domain. Indeed, biomedical documents contain highly domain-specific jargon and syntax. These documents also describe distinctive dependencies, making text-mining in molecular biology a specific discipline. Results We address biomedical event extraction as the classification of pairs of text entities into the classes corresponding to event types. The candidate pairs of text entities are recursively provided to a multiclass classifier relying on Support Vector Machines. This recursive process extracts events involving other events as arguments. Compared to joint models based on Markov Random Fields, our model simplifies inference and hence requires shorter training and prediction times along with lower memory capacity. Compared to usual pipeline approaches, our model passes over a complex intermediate problem, while making a more extensive usage of sophisticated joint features between text entities. Our method focuses on the core event extraction of the Genia task of BioNLP challenges yielding the best result reported so far on the 2013 edition. PMID:26201478
Text processing through Web services: calling Whatizit.
Rebholz-Schuhmann, Dietrich; Arregui, Miguel; Gaudan, Sylvain; Kirsch, Harald; Jimeno, Antonio
2008-01-15
Text-mining (TM) solutions are developing into efficient services to researchers in the biomedical research community. Such solutions have to scale with the growing number and size of resources (e.g. available controlled vocabularies), with the amount of literature to be processed (e.g. about 17 million documents in PubMed) and with the demands of the user community (e.g. different methods for fact extraction). These demands motivated the development of a server-based solution for literature analysis. Whatizit is a suite of modules that analyse text for contained information, e.g. any scientific publication or Medline abstracts. Special modules identify terms and then link them to the corresponding entries in bioinformatics databases such as UniProtKb/Swiss-Prot data entries and gene ontology concepts. Other modules identify a set of selected annotation types like the set produced by the EBIMed analysis pipeline for proteins. In the case of Medline abstracts, Whatizit offers access to EBI's in-house installation via PMID or term query. For large quantities of the user's own text, the server can be operated in a streaming mode (http://www.ebi.ac.uk/webservices/whatizit).
What Can Pictures Tell Us About Web Pages? Improving Document Search Using Images.
Rodriguez-Vaamonde, Sergio; Torresani, Lorenzo; Fitzgibbon, Andrew W
2015-06-01
Traditional Web search engines do not use the images in the HTML pages to find relevant documents for a given query. Instead, they typically operate by computing a measure of agreement between the keywords provided by the user and only the text portion of each page. In this paper we study whether the content of the pictures appearing in a Web page can be used to enrich the semantic description of an HTML document and consequently boost the performance of a keyword-based search engine. We present a Web-scalable system that exploits a pure text-based search engine to find an initial set of candidate documents for a given query. Then, the candidate set is reranked using visual information extracted from the images contained in the pages. The resulting system retains the computational efficiency of traditional text-based search engines with only a small additional storage cost needed to encode the visual information. We test our approach on one of the TREC Million Query Track benchmarks where we show that the exploitation of visual content yields improvement in accuracies for two distinct text-based search engines, including the system with the best reported performance on this benchmark. We further validate our approach by collecting document relevance judgements on our search results using Amazon Mechanical Turk. The results of this experiment confirm the improvement in accuracy produced by our image-based reranker over a pure text-based system.
Web Prep: How to Prepare NAS Reports For Publication on the Web
NASA Technical Reports Server (NTRS)
Walatka, Pamela; Balakrishnan, Prithika; Clucas, Jean; McCabe, R. Kevin; Felchle, Gail; Brickell, Cristy
1996-01-01
This document contains specific advice and requirements for NASA Ames Code IN authors of NAS reports. Much of the information may be of interest to other authors writing for the Web. WebPrep has a graphic Table of Contents in the form of a WebToon, which simulates a discussion between a scientist and a Web publishing consultant. In the WebToon, Frequently Asked Questions about preparing reports for the Web are linked to relevant text in the body of this document. We also provide a text-only Table of Contents. The text for this document is divided into chapters: each chapter corresponds to one frame of the WebToons. The chapter topics are: converting text to HTML, converting 2D graphic images to gif, creating imagemaps and tables, converting movie and audio files to Web formats, supplying 3D interactive data, and (briefly) JAVA capabilities. The last chapter is specifically for NAS staff authors. The Glossary-Index lists web related words and links to topics covered in the main text.
Text-interpreter language for flexible generation of patient notes and instructions.
Forker, T S
1992-01-01
An interpreted computer language has been developed along with a windowed user interface and multi-printer-support formatter to allow preparation of documentation of patient visits, including progress notes, prescriptions, excuses for work/school, outpatient laboratory requisitions, and patient instructions. Input is by trackball or mouse with little or no keyboard skill required. For clinical problems with specific protocols, the clinician can be prompted with problem-specific items of history, exam, and lab data to be gathered and documented. The language implements a number of text-related commands as well as branching logic and arithmetic commands. In addition to generating text, it is simple to implement arithmetic calculations such as weight-specific drug dosages; multiple branching decision-support protocols for paramedical personnel (or physicians); and calculation of clinical scores (e.g., coma or trauma scores) while simultaneously documenting the status of each component of the score. ASCII text files produced by the interpreter are available for computerized quality audit. Interpreter instructions are contained in text files users can customize with any text editor.
How tobacco companies have used package quantity for consumer targeting.
Persoskie, Alexander; Donaldson, Elisabeth A; Ryant, Chase
2018-05-31
Package quantity refers to the number of cigarettes or amount of other tobacco product in a package. Many countries restrict minimum cigarette package quantities to avoid low-cost packs that may lower barriers to youth smoking. We reviewed Truth Tobacco Industry Documents to understand tobacco companies' rationales for introducing new package quantities, including companies' expectations and research regarding how package quantity may influence consumer behaviour. A snowball sampling method (phase 1), a static search string (phase 2) and a follow-up snowball search (phase 3) identified 216 documents, mostly from the 1980s and 1990s, concerning cigarettes (200), roll-your-own tobacco (9), smokeless tobacco (6) and 'smokeless cigarettes' (1). Companies introduced small and large packages to motivate brand-switching and continued use among current users when faced with low market share or threats such as tax-induced price increases or competitors' use of price promotions. Companies developed and evaluated package quantities for specific brands and consumer segments. Large packages offered value-for-money and matched long-term, heavy users' consumption rates. Small packages were cheaper, matched consumption rates of newer and lighter users, and increased products' novelty, ease of carrying and perceived freshness. Some users also preferred small packages as a way to try to limit consumption or quit. Industry documents speculated about many potential effects of package quantity on appeal and use, depending on brand and consumer segment. The search was non-exhaustive, and we could not assess the quality of much of the research or other information on which the documents relied. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2018. All rights reserved. No commercial use is permitted unless otherwise expressly granted.
Domain-independent information extraction in unstructured text
DOE Office of Scientific and Technical Information (OSTI.GOV)
Irwin, N.H.
Extracting information from unstructured text has become an important research area in recent years due to the large amount of text now electronically available. This status report describes the findings and work done during the second year of a two-year Laboratory Directed Research and Development Project. Building on the first-year`s work of identifying important entities, this report details techniques used to group words into semantic categories and to output templates containing selective document content. Using word profiles and category clustering derived during a training run, the time-consuming knowledge-building task can be avoided. Though the output still lacks in completeness whenmore » compared to systems with domain-specific knowledge bases, the results do look promising. The two approaches are compatible and could complement each other within the same system. Domain-independent approaches retain appeal as a system that adapts and learns will soon outpace a system with any amount of a priori knowledge.« less
10 CFR 2.1013 - Use of the electronic docket during the proceeding.
Code of Federal Regulations, 2010 CFR
2010-01-01
... bi-tonal documents. (v) Electronic submissions must be generated in the appropriate PDF output format by using: (A) PDF—Formatted Text and Graphics for textual documents converted from native applications; (B) PDF—Searchable Image (Exact) for textual documents converted from scanned documents; and (C...
VisualUrText: A Text Analytics Tool for Unstructured Textual Data
NASA Astrophysics Data System (ADS)
Zainol, Zuraini; Jaymes, Mohd T. H.; Nohuddin, Puteri N. E.
2018-05-01
The growing amount of unstructured text over Internet is tremendous. Text repositories come from Web 2.0, business intelligence and social networking applications. It is also believed that 80-90% of future growth data is available in the form of unstructured text databases that may potentially contain interesting patterns and trends. Text Mining is well known technique for discovering interesting patterns and trends which are non-trivial knowledge from massive unstructured text data. Text Mining covers multidisciplinary fields involving information retrieval (IR), text analysis, natural language processing (NLP), data mining, machine learning statistics and computational linguistics. This paper discusses the development of text analytics tool that is proficient in extracting, processing, analyzing the unstructured text data and visualizing cleaned text data into multiple forms such as Document Term Matrix (DTM), Frequency Graph, Network Analysis Graph, Word Cloud and Dendogram. This tool, VisualUrText, is developed to assist students and researchers for extracting interesting patterns and trends in document analyses.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dimmick, Ross
This document contains updates to the Supplemental Information Sandia National Laboratories/New Mexico Site-Wide Environmental Impact Statement Source Documents that were developed in 2010. In general, this addendum provides calendar year 2010 data, along with changes or additions to text in the original documents.
Neurolinguistic Approach to Natural Language Processing with Applications to Medical Text Analysis
Matykiewicz, Paweł; Pestian, John
2008-01-01
Understanding written or spoken language presumably involves spreading neural activation in the brain. This process may be approximated by spreading activation in semantic networks, providing enhanced representations that involve concepts that are not found directly in the text. Approximation of this process is of great practical and theoretical interest. Although activations of neural circuits involved in representation of words rapidly change in time snapshots of these activations spreading through associative networks may be captured in a vector model. Concepts of similar type activate larger clusters of neurons, priming areas in the left and right hemisphere. Analysis of recent brain imaging experiments shows the importance of the right hemisphere non-verbal clusterization. Medical ontologies enable development of a large-scale practical algorithm to re-create pathways of spreading neural activations. First concepts of specific semantic type are identified in the text, and then all related concepts of the same type are added to the text, providing expanded representations. To avoid rapid growth of the extended feature space after each step only the most useful features that increase document clusterization are retained. Short hospital discharge summaries are used to illustrate how this process works on a real, very noisy data. Expanded texts show significantly improved clustering and may be classified with much higher accuracy. Although better approximations to the spreading of neural activations may be devised a practical approach presented in this paper helps to discover pathways used by the brain to process specific concepts, and may be used in large-scale applications. PMID:18614334
ERIC Educational Resources Information Center
Göçer, Ali
2014-01-01
In this study, Turkish text-based written examination questions posed to students in secondary schools were examined. In this research, document analysis method within the framework of the qualitative research approach was used. The data obtained from the documents consisting of written examination papers were analyzed with content analysis…
ERIC Educational Resources Information Center
Stromso, Helge I.; Braten, Ivar; Britt, M. Anne
2010-01-01
In many situations, readers are asked to learn from multiple documents. Many studies have found that evaluating the trustworthiness and usefulness of document sources is an important skill in such learning situations. There has been, however, no direct evidence that attending to source information helps readers learn from and interpret a…
García-Remesal, Miguel; Maojo, Victor; Crespo, José
2010-01-01
In this paper we present a knowledge engineering approach to automatically recognize and extract genetic sequences from scientific articles. To carry out this task, we use a preliminary recognizer based on a finite state machine to extract all candidate DNA/RNA sequences. The latter are then fed into a knowledge-based system that automatically discards false positives and refines noisy and incorrectly merged sequences. We created the knowledge base by manually analyzing different manuscripts containing genetic sequences. Our approach was evaluated using a test set of 211 full-text articles in PDF format containing 3134 genetic sequences. For such set, we achieved 87.76% precision and 97.70% recall respectively. This method can facilitate different research tasks. These include text mining, information extraction, and information retrieval research dealing with large collections of documents containing genetic sequences.
King, Gary; Pan, Jennifer; Roberts, Margaret E
2014-08-22
Existing research on the extensive Chinese censorship organization uses observational methods with well-known limitations. We conducted the first large-scale experimental study of censorship by creating accounts on numerous social media sites, randomly submitting different texts, and observing from a worldwide network of computers which texts were censored and which were not. We also supplemented interviews with confidential sources by creating our own social media site, contracting with Chinese firms to install the same censoring technologies as existing sites, and--with their software, documentation, and even customer support--reverse-engineering how it all works. Our results offer rigorous support for the recent hypothesis that criticisms of the state, its leaders, and their policies are published, whereas posts about real-world events with collective action potential are censored. Copyright © 2014, American Association for the Advancement of Science.
Contribution to terminology internationalization by word alignment in parallel corpora.
Deléger, Louise; Merkel, Magnus; Zweigenbaum, Pierre
2006-01-01
Creating a complete translation of a large vocabulary is a time-consuming task, which requires skilled and knowledgeable medical translators. Our goal is to examine to which extent such a task can be alleviated by a specific natural language processing technique, word alignment in parallel corpora. We experiment with translation from English to French. Build a large corpus of parallel, English-French documents, and automatically align it at the document, sentence and word levels using state-of-the-art alignment methods and tools. Then project English terms from existing controlled vocabularies to the aligned word pairs, and examine the number and quality of the putative French translations obtained thereby. We considered three American vocabularies present in the UMLS with three different translation statuses: the MeSH, SNOMED CT, and the MedlinePlus Health Topics. We obtained several thousand new translations of our input terms, this number being closely linked to the number of terms in the input vocabularies. Our study shows that alignment methods can extract a number of new term translations from large bodies of text with a moderate human reviewing effort, and thus contribute to help a human translator obtain better translation coverage of an input vocabulary. Short-term perspectives include their application to a corpus 20 times larger than that used here, together with more focused methods for term extraction.
Contribution to Terminology Internationalization by Word Alignment in Parallel Corpora
Deléger, Louise; Merkel, Magnus; Zweigenbaum, Pierre
2006-01-01
Background and objectives Creating a complete translation of a large vocabulary is a time-consuming task, which requires skilled and knowledgeable medical translators. Our goal is to examine to which extent such a task can be alleviated by a specific natural language processing technique, word alignment in parallel corpora. We experiment with translation from English to French. Methods Build a large corpus of parallel, English-French documents, and automatically align it at the document, sentence and word levels using state-of-the-art alignment methods and tools. Then project English terms from existing controlled vocabularies to the aligned word pairs, and examine the number and quality of the putative French translations obtained thereby. We considered three American vocabularies present in the UMLS with three different translation statuses: the MeSH, SNOMED CT, and the MedlinePlus Health Topics. Results We obtained several thousand new translations of our input terms, this number being closely linked to the number of terms in the input vocabularies. Conclusion Our study shows that alignment methods can extract a number of new term translations from large bodies of text with a moderate human reviewing effort, and thus contribute to help a human translator obtain better translation coverage of an input vocabulary. Short-term perspectives include their application to a corpus 20 times larger than that used here, together with more focused methods for term extraction. PMID:17238328
Automatic Identification of Topic Tags from Texts Based on Expansion-Extraction Approach
ERIC Educational Resources Information Center
Yang, Seungwon
2013-01-01
Identifying topics of a textual document is useful for many purposes. We can organize the documents by topics in digital libraries. Then, we could browse and search for the documents with specific topics. By examining the topics of a document, we can quickly understand what the document is about. To augment the traditional manual way of topic…
Document reconstruction by layout analysis of snippets
NASA Astrophysics Data System (ADS)
Kleber, Florian; Diem, Markus; Sablatnig, Robert
2010-02-01
Document analysis is done to analyze entire forms (e.g. intelligent form analysis, table detection) or to describe the layout/structure of a document. Also skew detection of scanned documents is performed to support OCR algorithms that are sensitive to skew. In this paper document analysis is applied to snippets of torn documents to calculate features for the reconstruction. Documents can either be destroyed by the intention to make the printed content unavailable (e.g. tax fraud investigation, business crime) or due to time induced degeneration of ancient documents (e.g. bad storage conditions). Current reconstruction methods for manually torn documents deal with the shape, inpainting and texture synthesis techniques. In this paper the possibility of document analysis techniques of snippets to support the matching algorithm by considering additional features are shown. This implies a rotational analysis, a color analysis and a line detection. As a future work it is planned to extend the feature set with the paper type (blank, checked, lined), the type of the writing (handwritten vs. machine printed) and the text layout of a snippet (text size, line spacing). Preliminary results show that these pre-processing steps can be performed reliably on a real dataset consisting of 690 snippets.
Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks
Akimushkin, Camilo; Amancio, Diego Raphael; Oliveira, Osvaldo Novais
2017-01-01
Automatic identification of authorship in disputed documents has benefited from complex network theory as this approach does not require human expertise or detailed semantic knowledge. Networks modeling entire books can be used to discriminate texts from different sources and understand network growth mechanisms, but only a few studies have probed the suitability of networks in modeling small chunks of text to grasp stylistic features. In this study, we introduce a methodology based on the dynamics of word co-occurrence networks representing written texts to classify a corpus of 80 texts by 8 authors. The texts were divided into sections with equal number of linguistic tokens, from which time series were created for 12 topological metrics. Since 73% of all series were stationary (ARIMA(p, 0, q)) and the remaining were integrable of first order (ARIMA(p, 1, q)), probability distributions could be obtained for the global network metrics. The metrics exhibit bell-shaped non-Gaussian distributions, and therefore distribution moments were used as learning attributes. With an optimized supervised learning procedure based on a nonlinear transformation performed by Isomap, 71 out of 80 texts were correctly classified using the K-nearest neighbors algorithm, i.e. a remarkable 88.75% author matching success rate was achieved. Hence, purely dynamic fluctuations in network metrics can characterize authorship, thus paving the way for a robust description of large texts in terms of small evolving networks. PMID:28125703
Asymmetric author-topic model for knowledge discovering of big data in toxicogenomics.
Chung, Ming-Hua; Wang, Yuping; Tang, Hailin; Zou, Wen; Basinger, John; Xu, Xiaowei; Tong, Weida
2015-01-01
The advancement of high-throughput screening technologies facilitates the generation of massive amount of biological data, a big data phenomena in biomedical science. Yet, researchers still heavily rely on keyword search and/or literature review to navigate the databases and analyses are often done in rather small-scale. As a result, the rich information of a database has not been fully utilized, particularly for the information embedded in the interactive nature between data points that are largely ignored and buried. For the past 10 years, probabilistic topic modeling has been recognized as an effective machine learning algorithm to annotate the hidden thematic structure of massive collection of documents. The analogy between text corpus and large-scale genomic data enables the application of text mining tools, like probabilistic topic models, to explore hidden patterns of genomic data and to the extension of altered biological functions. In this paper, we developed a generalized probabilistic topic model to analyze a toxicogenomics dataset that consists of a large number of gene expression data from the rat livers treated with drugs in multiple dose and time-points. We discovered the hidden patterns in gene expression associated with the effect of doses and time-points of treatment. Finally, we illustrated the ability of our model to identify the evidence of potential reduction of animal use.
Miwa, Makoto; Ohta, Tomoko; Rak, Rafal; Rowley, Andrew; Kell, Douglas B.; Pyysalo, Sampo; Ananiadou, Sophia
2013-01-01
Motivation: To create, verify and maintain pathway models, curators must discover and assess knowledge distributed over the vast body of biological literature. Methods supporting these tasks must understand both the pathway model representations and the natural language in the literature. These methods should identify and order documents by relevance to any given pathway reaction. No existing system has addressed all aspects of this challenge. Method: We present novel methods for associating pathway model reactions with relevant publications. Our approach extracts the reactions directly from the models and then turns them into queries for three text mining-based MEDLINE literature search systems. These queries are executed, and the resulting documents are combined and ranked according to their relevance to the reactions of interest. We manually annotate document-reaction pairs with the relevance of the document to the reaction and use this annotation to study several ranking methods, using various heuristic and machine-learning approaches. Results: Our evaluation shows that the annotated document-reaction pairs can be used to create a rule-based document ranking system, and that machine learning can be used to rank documents by their relevance to pathway reactions. We find that a Support Vector Machine-based system outperforms several baselines and matches the performance of the rule-based system. The success of the query extraction and ranking methods are used to update our existing pathway search system, PathText. Availability: An online demonstration of PathText 2 and the annotated corpus are available for research purposes at http://www.nactem.ac.uk/pathtext2/. Contact: makoto.miwa@manchester.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23813008
tESA: a distributional measure for calculating semantic relatedness.
Rybinski, Maciej; Aldana-Montes, José Francisco
2016-12-28
Semantic relatedness is a measure that quantifies the strength of a semantic link between two concepts. Often, it can be efficiently approximated with methods that operate on words, which represent these concepts. Approximating semantic relatedness between texts and concepts represented by these texts is an important part of many text and knowledge processing tasks of crucial importance in the ever growing domain of biomedical informatics. The problem of most state-of-the-art methods for calculating semantic relatedness is their dependence on highly specialized, structured knowledge resources, which makes these methods poorly adaptable for many usage scenarios. On the other hand, the domain knowledge in the Life Sciences has become more and more accessible, but mostly in its unstructured form - as texts in large document collections, which makes its use more challenging for automated processing. In this paper we present tESA, an extension to a well known Explicit Semantic Relatedness (ESA) method. In our extension we use two separate sets of vectors, corresponding to different sections of the articles from the underlying corpus of documents, as opposed to the original method, which only uses a single vector space. We present an evaluation of Life Sciences domain-focused applicability of both tESA and domain-adapted Explicit Semantic Analysis. The methods are tested against a set of standard benchmarks established for the evaluation of biomedical semantic relatedness quality. Our experiments show that the propsed method achieves results comparable with or superior to the current state-of-the-art methods. Additionally, a comparative discussion of the results obtained with tESA and ESA is presented, together with a study of the adaptability of the methods to different corpora and their performance with different input parameters. Our findings suggest that combined use of the semantics from different sections (i.e. extending the original ESA methodology with the use of title vectors) of the documents of scientific corpora may be used to enhance the performance of a distributional semantic relatedness measures, which can be observed in the largest reference datasets. We also present the impact of the proposed extension on the size of distributional representations.
Generating descriptive visual words and visual phrases for large-scale image applications.
Zhang, Shiliang; Tian, Qi; Hua, Gang; Huang, Qingming; Gao, Wen
2011-09-01
Bag-of-visual Words (BoWs) representation has been applied for various problems in the fields of multimedia and computer vision. The basic idea is to represent images as visual documents composed of repeatable and distinctive visual elements, which are comparable to the text words. Notwithstanding its great success and wide adoption, visual vocabulary created from single-image local descriptors is often shown to be not as effective as desired. In this paper, descriptive visual words (DVWs) and descriptive visual phrases (DVPs) are proposed as the visual correspondences to text words and phrases, where visual phrases refer to the frequently co-occurring visual word pairs. Since images are the carriers of visual objects and scenes, a descriptive visual element set can be composed by the visual words and their combinations which are effective in representing certain visual objects or scenes. Based on this idea, a general framework is proposed for generating DVWs and DVPs for image applications. In a large-scale image database containing 1506 object and scene categories, the visual words and visual word pairs descriptive to certain objects or scenes are identified and collected as the DVWs and DVPs. Experiments show that the DVWs and DVPs are informative and descriptive and, thus, are more comparable with the text words than the classic visual words. We apply the identified DVWs and DVPs in several applications including large-scale near-duplicated image retrieval, image search re-ranking, and object recognition. The combination of DVW and DVP performs better than the state of the art in large-scale near-duplicated image retrieval in terms of accuracy, efficiency and memory consumption. The proposed image search re-ranking algorithm: DWPRank outperforms the state-of-the-art algorithm by 12.4% in mean average precision and about 11 times faster in efficiency.
Rectification of curved document images based on single view three-dimensional reconstruction.
Kang, Lai; Wei, Yingmei; Jiang, Jie; Bai, Liang; Lao, Songyang
2016-10-01
Since distortions in camera-captured document images significantly affect the accuracy of optical character recognition (OCR), distortion removal plays a critical role for document digitalization systems using a camera for image capturing. This paper proposes a novel framework that performs three-dimensional (3D) reconstruction and rectification of camera-captured document images. While most existing methods rely on additional calibrated hardware or multiple images to recover the 3D shape of a document page, or make a simple but not always valid assumption on the corresponding 3D shape, our framework is more flexible and practical since it only requires a single input image and is able to handle a general locally smooth document surface. The main contributions of this paper include a new iterative refinement scheme for baseline fitting from connected components of text line, an efficient discrete vertical text direction estimation algorithm based on convex hull projection profile analysis, and a 2D distortion grid construction method based on text direction function estimation using 3D regularization. In order to examine the performance of our proposed method, both qualitative and quantitative evaluation and comparison with several recent methods are conducted in our experiments. The experimental results demonstrate that the proposed method outperforms relevant approaches for camera-captured document image rectification, in terms of improvements on both visual distortion removal and OCR accuracy.
Reminder Cards Improve Physician Documentation of Obesity But Not Obesity Counseling.
Shungu, Nicholas; Miller, Marshal N; Mills, Geoffrey; Patel, Neesha; de la Paz, Amanda; Rose, Victoria; Kropa, Jill; Edi, Rina; Levy, Emily; Crenshaw, Margaret; Hwang, Chris
2015-01-01
Physicians frequently fail to document obesity and obesity-related counseling. We sought to determine whether attaching a physical reminder card to patient encounter forms would increase electronic medical record (EMR) assessment of and documentation of obesity and dietary counseling. Reminder cards for obesity documentation were attached to encounter forms for patient encounters over a 2-week intervention period. For visits in the intervention period, the EMR was retrospectively reviewed for BMI, assessment of "obesity" or "morbid obesity" as an active problem, free-text dietary counseling within physician notes, and assessment of "dietary counseling" as an active problem. These data were compared to those collected through a retrospective chart review during a 2-week pre-intervention period. We also compared physician self-report of documentation via reminder cards with EMR documentation. We found significant improvement in the primary endpoint of assessment of "obesity" or "morbid obesity" as an active problem (42.5% versus 28%) compared to the pre-intervention period. There was no significant difference in the primary endpoints of free-text dietary counseling or assessment of "dietary counseling" as an active problem between the groups. Physician self-reporting of assessment of "obesity" or "morbid obesity" as an active problem (77.7% versus 42.5%), free-text dietary counseling on obesity (69.1% versus 35.4%) and assessment of "dietary counseling" as an active problem (54.3% versus 25.2%) were all significantly higher than those reflected in EMR documentation. This study demonstrates that physical reminder cards are a successful means of increasing obesity documentation rates among providers but do not necessarily increase rates of obesity-related counseling or documentation of counseling. Our study suggests that even with such interventions, physicians are likely under-documenting obesity and counseling compared to self-reported rates.
Benchmarking infrastructure for mutation text mining
2014-01-01
Background Experimental research on the automatic extraction of information about mutations from texts is greatly hindered by the lack of consensus evaluation infrastructure for the testing and benchmarking of mutation text mining systems. Results We propose a community-oriented annotation and benchmarking infrastructure to support development, testing, benchmarking, and comparison of mutation text mining systems. The design is based on semantic standards, where RDF is used to represent annotations, an OWL ontology provides an extensible schema for the data and SPARQL is used to compute various performance metrics, so that in many cases no programming is needed to analyze results from a text mining system. While large benchmark corpora for biological entity and relation extraction are focused mostly on genes, proteins, diseases, and species, our benchmarking infrastructure fills the gap for mutation information. The core infrastructure comprises (1) an ontology for modelling annotations, (2) SPARQL queries for computing performance metrics, and (3) a sizeable collection of manually curated documents, that can support mutation grounding and mutation impact extraction experiments. Conclusion We have developed the principal infrastructure for the benchmarking of mutation text mining tasks. The use of RDF and OWL as the representation for corpora ensures extensibility. The infrastructure is suitable for out-of-the-box use in several important scenarios and is ready, in its current state, for initial community adoption. PMID:24568600
Benchmarking infrastructure for mutation text mining.
Klein, Artjom; Riazanov, Alexandre; Hindle, Matthew M; Baker, Christopher Jo
2014-02-25
Experimental research on the automatic extraction of information about mutations from texts is greatly hindered by the lack of consensus evaluation infrastructure for the testing and benchmarking of mutation text mining systems. We propose a community-oriented annotation and benchmarking infrastructure to support development, testing, benchmarking, and comparison of mutation text mining systems. The design is based on semantic standards, where RDF is used to represent annotations, an OWL ontology provides an extensible schema for the data and SPARQL is used to compute various performance metrics, so that in many cases no programming is needed to analyze results from a text mining system. While large benchmark corpora for biological entity and relation extraction are focused mostly on genes, proteins, diseases, and species, our benchmarking infrastructure fills the gap for mutation information. The core infrastructure comprises (1) an ontology for modelling annotations, (2) SPARQL queries for computing performance metrics, and (3) a sizeable collection of manually curated documents, that can support mutation grounding and mutation impact extraction experiments. We have developed the principal infrastructure for the benchmarking of mutation text mining tasks. The use of RDF and OWL as the representation for corpora ensures extensibility. The infrastructure is suitable for out-of-the-box use in several important scenarios and is ready, in its current state, for initial community adoption.
Raynor, David K; Dickinson, David
2009-04-01
Effective written consumer medicines information is essential to support safe and effective medicine taking, but the wording and layout of currently provided materials do not meet patients' needs. To identify principles from the wider discipline of information design for use by health professionals when developing or assessing written drug information for patients. Six experts in information design nominated texts on best practice in information design applicable to consumer medicines information. A content analysis identified key principles that were tabulated to bring out key themes. Six texts that met the inclusion criteria, were identified, and content analysis indentified 4 themes: words, type, lines, and layout. Within these main themes, there were 24 subthemes. Selected principles relating to these subthemes were: use short familiar words, short sentences, and short headings that stand out from the text; use a conversational tone of voice, addressing the reader as "you"; use a large type size while retaining sufficient white space; use bullet points to organize lists; use unjustified text (ragged right) and bold, lower-case text for emphasis. Pictures or graphics do not necessarily improve a document. Applying the good information design principles identified to written consumer medicines information could support health professionals when developing and assessing drug information for patients.
Let Documents Talk to Each Other: A Computer Model for Connection of Short Documents.
ERIC Educational Resources Information Center
Chen, Z.
1993-01-01
Discusses the integration of scientific texts through the connection of documents and describes a computer model that can connect short documents. Information retrieval and artificial intelligence are discussed; a prototype system of the model is explained; and the model is compared to other computer models. (17 references) (LRW)
30 CFR 285.115 - Documents incorporated by reference.
Code of Federal Regulations, 2011 CFR
2011-07-01
... incorporating by reference the documents listed in the table in paragraph (e) of this section. The Director of...: ER29AP09.104 (e) This paragraph lists documents incorporated by reference. To easily reference text of the... 30 Mineral Resources 2 2011-07-01 2011-07-01 false Documents incorporated by reference. 285.115...
75 FR 28594 - Ready-to-Learn Television Program
Federal Register 2010, 2011, 2012, 2013, 2014
2010-05-21
... Federal Register. Free Internet access to the official edition of the Federal Register and the Code of... Access to This Document: You can view this document, as well as all other documents of this Department published in the Federal Register, in text or Adobe Portable Document Format (PDF) on the Internet at the...
BoB, a best-of-breed automated text de-identification system for VHA clinical documents.
Ferrández, Oscar; South, Brett R; Shen, Shuying; Friedlin, F Jeffrey; Samore, Matthew H; Meystre, Stéphane M
2013-01-01
De-identification allows faster and more collaborative clinical research while protecting patient confidentiality. Clinical narrative de-identification is a tedious process that can be alleviated by automated natural language processing methods. The goal of this research is the development of an automated text de-identification system for Veterans Health Administration (VHA) clinical documents. We devised a novel stepwise hybrid approach designed to improve the current strategies used for text de-identification. The proposed system is based on a previous study on the best de-identification methods for VHA documents. This best-of-breed automated clinical text de-identification system (aka BoB) tackles the problem as two separate tasks: (1) maximize patient confidentiality by redacting as much protected health information (PHI) as possible; and (2) leave de-identified documents in a usable state preserving as much clinical information as possible. We evaluated BoB with a manually annotated corpus of a variety of VHA clinical notes, as well as with the 2006 i2b2 de-identification challenge corpus. We present evaluations at the instance- and token-level, with detailed results for BoB's main components. Moreover, an existing text de-identification system was also included in our evaluation. BoB's design efficiently takes advantage of the methods implemented in its pipeline, resulting in high sensitivity values (especially for sensitive PHI categories) and a limited number of false positives. Our system successfully addressed VHA clinical document de-identification, and its hybrid stepwise design demonstrates robustness and efficiency, prioritizing patient confidentiality while leaving most clinical information intact.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Choo, Jaegul; Kim, Hannah; Clarkson, Edward
In this paper, we present an interactive visual information retrieval and recommendation system, called VisIRR, for large-scale document discovery. VisIRR effectively combines the paradigms of (1) a passive pull through query processes for retrieval and (2) an active push that recommends items of potential interest to users based on their preferences. Equipped with an efficient dynamic query interface against a large-scale corpus, VisIRR organizes the retrieved documents into high-level topics and visualizes them in a 2D space, representing the relationships among the topics along with their keyword summary. In addition, based on interactive personalized preference feedback with regard to documents,more » VisIRR provides document recommendations from the entire corpus, which are beyond the retrieved sets. Such recommended documents are visualized in the same space as the retrieved documents, so that users can seamlessly analyze both existing and newly recommended ones. This article presents novel computational methods, which make these integrated representations and fast interactions possible for a large-scale document corpus. We illustrate how the system works by providing detailed usage scenarios. Finally, we present preliminary user study results for evaluating the effectiveness of the system.« less
Choo, Jaegul; Kim, Hannah; Clarkson, Edward; ...
2018-01-31
In this paper, we present an interactive visual information retrieval and recommendation system, called VisIRR, for large-scale document discovery. VisIRR effectively combines the paradigms of (1) a passive pull through query processes for retrieval and (2) an active push that recommends items of potential interest to users based on their preferences. Equipped with an efficient dynamic query interface against a large-scale corpus, VisIRR organizes the retrieved documents into high-level topics and visualizes them in a 2D space, representing the relationships among the topics along with their keyword summary. In addition, based on interactive personalized preference feedback with regard to documents,more » VisIRR provides document recommendations from the entire corpus, which are beyond the retrieved sets. Such recommended documents are visualized in the same space as the retrieved documents, so that users can seamlessly analyze both existing and newly recommended ones. This article presents novel computational methods, which make these integrated representations and fast interactions possible for a large-scale document corpus. We illustrate how the system works by providing detailed usage scenarios. Finally, we present preliminary user study results for evaluating the effectiveness of the system.« less
On the Creation of Hypertext Links in Full-Text Documents: Measurement of Inter-Linker Consistency.
ERIC Educational Resources Information Center
Ellis, David; And Others
1994-01-01
Describes a study in which several different sets of hypertext links are inserted by different people in full-text documents. The degree of similarity between the sets is measured using coefficients and topological indices. As in comparable studies of inter-indexer consistency, the sets of links used by different people showed little similarity.…
Robust keyword retrieval method for OCRed text
NASA Astrophysics Data System (ADS)
Fujii, Yusaku; Takebe, Hiroaki; Tanaka, Hiroshi; Hotta, Yoshinobu
2011-01-01
Document management systems have become important because of the growing popularity of electronic filing of documents and scanning of books, magazines, manuals, etc., through a scanner or a digital camera, for storage or reading on a PC or an electronic book. Text information acquired by optical character recognition (OCR) is usually added to the electronic documents for document retrieval. Since texts generated by OCR generally include character recognition errors, robust retrieval methods have been introduced to overcome this problem. In this paper, we propose a retrieval method that is robust against both character segmentation and recognition errors. In the proposed method, the insertion of noise characters and dropping of characters in the keyword retrieval enables robustness against character segmentation errors, and character substitution in the keyword of the recognition candidate for each character in OCR or any other character enables robustness against character recognition errors. The recall rate of the proposed method was 15% higher than that of the conventional method. However, the precision rate was 64% lower.
Text-image alignment for historical handwritten documents
NASA Astrophysics Data System (ADS)
Zinger, S.; Nerbonne, J.; Schomaker, L.
2009-01-01
We describe our work on text-image alignment in context of building a historical document retrieval system. We aim at aligning images of words in handwritten lines with their text transcriptions. The images of handwritten lines are automatically segmented from the scanned pages of historical documents and then manually transcribed. To train automatic routines to detect words in an image of handwritten text, we need a training set - images of words with their transcriptions. We present our results on aligning words from the images of handwritten lines and their corresponding text transcriptions. Alignment based on the longest spaces between portions of handwriting is a baseline. We then show that relative lengths, i.e. proportions of words in their lines, can be used to improve the alignment results considerably. To take into account the relative word length, we define the expressions for the cost function that has to be minimized for aligning text words with their images. We apply right to left alignment as well as alignment based on exhaustive search. The quality assessment of these alignments shows correct results for 69% of words from 100 lines, or 90% of partially correct and correct alignments combined.
Vogel, Markus; Kaisers, Wolfgang; Wassmuth, Ralf; Mayatepek, Ertan
2015-11-03
Clinical documentation has undergone a change due to the usage of electronic health records. The core element is to capture clinical findings and document therapy electronically. Health care personnel spend a significant portion of their time on the computer. Alternatives to self-typing, such as speech recognition, are currently believed to increase documentation efficiency and quality, as well as satisfaction of health professionals while accomplishing clinical documentation, but few studies in this area have been published to date. This study describes the effects of using a Web-based medical speech recognition system for clinical documentation in a university hospital on (1) documentation speed, (2) document length, and (3) physician satisfaction. Reports of 28 physicians were randomized to be created with (intervention) or without (control) the assistance of a Web-based system of medical automatic speech recognition (ASR) in the German language. The documentation was entered into a browser's text area and the time to complete the documentation including all necessary corrections, correction effort, number of characters, and mood of participant were stored in a database. The underlying time comprised text entering, text correction, and finalization of the documentation event. Participants self-assessed their moods on a scale of 1-3 (1=good, 2=moderate, 3=bad). Statistical analysis was done using permutation tests. The number of clinical reports eligible for further analysis stood at 1455. Out of 1455 reports, 718 (49.35%) were assisted by ASR and 737 (50.65%) were not assisted by ASR. Average documentation speed without ASR was 173 (SD 101) characters per minute, while it was 217 (SD 120) characters per minute using ASR. The overall increase in documentation speed through Web-based ASR assistance was 26% (P=.04). Participants documented an average of 356 (SD 388) characters per report when not assisted by ASR and 649 (SD 561) characters per report when assisted by ASR. Participants' average mood rating was 1.3 (SD 0.6) using ASR assistance compared to 1.6 (SD 0.7) without ASR assistance (P<.001). We conclude that medical documentation with the assistance of Web-based speech recognition leads to an increase in documentation speed, document length, and participant mood when compared to self-typing. Speech recognition is a meaningful and effective tool for the clinical documentation process.
Mining the Text: 34 Text Features that Can Ease or Obstruct Text Comprehension and Use
ERIC Educational Resources Information Center
White, Sheida
2012-01-01
This article presents 34 characteristics of texts and tasks ("text features") that can make continuous (prose), noncontinuous (document), and quantitative texts easier or more difficult for adolescents and adults to comprehend and use. The text features were identified by examining the assessment tasks and associated texts in the national…
Assessing semantic similarity of texts - Methods and algorithms
NASA Astrophysics Data System (ADS)
Rozeva, Anna; Zerkova, Silvia
2017-12-01
Assessing the semantic similarity of texts is an important part of different text-related applications like educational systems, information retrieval, text summarization, etc. This task is performed by sophisticated analysis, which implements text-mining techniques. Text mining involves several pre-processing steps, which provide for obtaining structured representative model of the documents in a corpus by means of extracting and selecting the features, characterizing their content. Generally the model is vector-based and enables further analysis with knowledge discovery approaches. Algorithms and measures are used for assessing texts at syntactical and semantic level. An important text-mining method and similarity measure is latent semantic analysis (LSA). It provides for reducing the dimensionality of the document vector space and better capturing the text semantics. The mathematical background of LSA for deriving the meaning of the words in a given text by exploring their co-occurrence is examined. The algorithm for obtaining the vector representation of words and their corresponding latent concepts in a reduced multidimensional space as well as similarity calculation are presented.
Relevance popularity: A term event model based feature selection scheme for text classification.
Feng, Guozhong; An, Baiguo; Yang, Fengqin; Wang, Han; Zhang, Libiao
2017-01-01
Feature selection is a practical approach for improving the performance of text classification methods by optimizing the feature subsets input to classifiers. In traditional feature selection methods such as information gain and chi-square, the number of documents that contain a particular term (i.e. the document frequency) is often used. However, the frequency of a given term appearing in each document has not been fully investigated, even though it is a promising feature to produce accurate classifications. In this paper, we propose a new feature selection scheme based on a term event Multinomial naive Bayes probabilistic model. According to the model assumptions, the matching score function, which is based on the prediction probability ratio, can be factorized. Finally, we derive a feature selection measurement for each term after replacing inner parameters by their estimators. On a benchmark English text datasets (20 Newsgroups) and a Chinese text dataset (MPH-20), our numerical experiment results obtained from using two widely used text classifiers (naive Bayes and support vector machine) demonstrate that our method outperformed the representative feature selection methods.
Hoelzer, S; Schweiger, R K; Boettcher, H A; Tafazzoli, A G; Dudeck, J
2001-01-01
The purpose of guidelines in clinical practice is to improve the effectiveness and efficiency of clinical care. It is known that nationally or internationally produced guidelines which, in particular, do not involve medical processes at the time of consultation, do not take local factors into account, and have no consistent implementation strategy, have limited impact in changing either the behaviour of physicians, or patterns of care. The literature provides evidence for the effectiveness of computerization of CPGs for increasing compliance and improving patient outcomes. Probably the most effective concepts are knowledge-based functions for decision support or monitoring that are integrated in clinical information systems. This approach is mostly restricted by the effort required for development and maintenance of the information systems and the limited number of implemented medical rules. Most of the guidelines are text-based, and are primarily published in medical journals and posted on the internet. However, internet-published guidelines have little impact on the behaviour of physicians. It can be difficult and time-consuming to browse the internet to find (a) the correct guidelines to an existing diagnosis and (b) and adequate recommendation for a specific clinical problem. Our objective is to provide a web-based guideline service that takes as input clinical data on a particular patient and returns as output a customizable set of recommendations regarding diagnosis and treatment. Information in healthcare is to a very large extent transmitted and stored as unstructured or slightly structured text such as discharge letters, reports, forms, etc. The same applies for facilities containing medical information resources for clinical purposes and research such as text books, articles, guidelines, etc. Physicians are used to obtaining information from text-based sources. Since most guidelines are text-based, it would be practical to use a document-based solution that preserves the original cohesiveness. The lack of structure limits the automatic identification and extraction of the information contained in these resources. For this reason, we have chosen a document-based approach using eXtensible Markup Language (XML) with its schema definition and related technologies. XML empowers the applications for in-context searching. In addition it allows the same content to be represented in different ways. Our XML reference clinical data model for guidelines has been realized with the XML schema definition. The schema is used for structuring new text-based guidelines and updating existing documents. It is also used to establish search strategies on the document base. We hypothesize that enabling the physicians to query the available CPGs easily, and to get access to selected and specific information at the point of care will foster increased use. Based on current evidence we are confident that it will have substantial impact on the care provided, and will improve health outcomes.
Federal Register 2010, 2011, 2012, 2013, 2014
2010-07-02
... published in the Federal Register. Free Internet access to the official edition of the Federal Register and.... Electronic Access to This Document: You can view this document, as well as all other documents of this Department published in the Federal Register, in text or Adobe Portable Document Format (PDF) on the Internet...
Text-mining analysis of mHealth research.
Ozaydin, Bunyamin; Zengul, Ferhat; Oner, Nurettin; Delen, Dursun
2017-01-01
In recent years, because of the advancements in communication and networking technologies, mobile technologies have been developing at an unprecedented rate. mHealth, the use of mobile technologies in medicine, and the related research has also surged parallel to these technological advancements. Although there have been several attempts to review mHealth research through manual processes such as systematic reviews, the sheer magnitude of the number of studies published in recent years makes this task very challenging. The most recent developments in machine learning and text mining offer some potential solutions to address this challenge by allowing analyses of large volumes of texts through semi-automated processes. The objective of this study is to analyze the evolution of mHealth research by utilizing text-mining and natural language processing (NLP) analyses. The study sample included abstracts of 5,644 mHealth research articles, which were gathered from five academic search engines by using search terms such as mobile health, and mHealth. The analysis used the Text Explorer module of JMP Pro 13 and an iterative semi-automated process involving tokenizing, phrasing, and terming. After developing the document term matrix (DTM) analyses such as single value decomposition (SVD), topic, and hierarchical document clustering were performed, along with the topic-informed document clustering approach. The results were presented in the form of word-clouds and trend analyses. There were several major findings regarding research clusters and trends. First, our results confirmed time-dependent nature of terminology use in mHealth research. For example, in earlier versus recent years the use of terminology changed from "mobile phone" to "smartphone" and from "applications" to "apps". Second, ten clusters for mHealth research were identified including (I) Clinical Research on Lifestyle Management, (II) Community Health, (III) Literature Review, (IV) Medical Interventions, (V) Research Design, (VI) Infrastructure, (VII) Applications, (VIII) Research and Innovation in Health Technologies, (IX) Sensor-based Devices and Measurement Algorithms, (X) Survey-based Research. Third, the trend analyses indicated the infrastructure cluster as the highest percentage researched area until 2014. The Research and Innovation in Health Technologies cluster experienced the largest increase in numbers of publications in recent years, especially after 2014. This study is unique because it is the only known study utilizing text-mining analyses to reveal the streams and trends for mHealth research. The fast growth in mobile technologies is expected to lead to higher numbers of studies focusing on mHealth and its implications for various healthcare outcomes. Findings of this study can be utilized by researchers in identifying areas for future studies.
Text-mining analysis of mHealth research
Zengul, Ferhat; Oner, Nurettin; Delen, Dursun
2017-01-01
In recent years, because of the advancements in communication and networking technologies, mobile technologies have been developing at an unprecedented rate. mHealth, the use of mobile technologies in medicine, and the related research has also surged parallel to these technological advancements. Although there have been several attempts to review mHealth research through manual processes such as systematic reviews, the sheer magnitude of the number of studies published in recent years makes this task very challenging. The most recent developments in machine learning and text mining offer some potential solutions to address this challenge by allowing analyses of large volumes of texts through semi-automated processes. The objective of this study is to analyze the evolution of mHealth research by utilizing text-mining and natural language processing (NLP) analyses. The study sample included abstracts of 5,644 mHealth research articles, which were gathered from five academic search engines by using search terms such as mobile health, and mHealth. The analysis used the Text Explorer module of JMP Pro 13 and an iterative semi-automated process involving tokenizing, phrasing, and terming. After developing the document term matrix (DTM) analyses such as single value decomposition (SVD), topic, and hierarchical document clustering were performed, along with the topic-informed document clustering approach. The results were presented in the form of word-clouds and trend analyses. There were several major findings regarding research clusters and trends. First, our results confirmed time-dependent nature of terminology use in mHealth research. For example, in earlier versus recent years the use of terminology changed from “mobile phone” to “smartphone” and from “applications” to “apps”. Second, ten clusters for mHealth research were identified including (I) Clinical Research on Lifestyle Management, (II) Community Health, (III) Literature Review, (IV) Medical Interventions, (V) Research Design, (VI) Infrastructure, (VII) Applications, (VIII) Research and Innovation in Health Technologies, (IX) Sensor-based Devices and Measurement Algorithms, (X) Survey-based Research. Third, the trend analyses indicated the infrastructure cluster as the highest percentage researched area until 2014. The Research and Innovation in Health Technologies cluster experienced the largest increase in numbers of publications in recent years, especially after 2014. This study is unique because it is the only known study utilizing text-mining analyses to reveal the streams and trends for mHealth research. The fast growth in mobile technologies is expected to lead to higher numbers of studies focusing on mHealth and its implications for various healthcare outcomes. Findings of this study can be utilized by researchers in identifying areas for future studies. PMID:29430456
Food entries in a large allergy data repository
Plasek, Joseph M.; Goss, Foster R.; Lai, Kenneth H.; Lau, Jason J.; Seger,, Diane L.; Blumenthal, Kimberly G.; Wickner, Paige G.; Slight, Sarah P.; Chang, Frank Y.; Topaz, Maxim; Bates, David W.
2016-01-01
Objective Accurate food adverse sensitivity documentation in electronic health records (EHRs) is crucial to patient safety. This study examined, encoded, and grouped foods that caused any adverse sensitivity in a large allergy repository using natural language processing and standard terminologies. Methods Using the Medical Text Extraction, Reasoning, and Mapping System (MTERMS), we processed both structured and free-text entries stored in an enterprise-wide allergy repository (Partners’ Enterprise-wide Allergy Repository), normalized diverse food allergen terms into concepts, and encoded these concepts using the Systematized Nomenclature of Medicine – Clinical Terms (SNOMED-CT) and Unique Ingredient Identifiers (UNII) terminologies. Concept coverage also was assessed for these two terminologies. We further categorized allergen concepts into groups and calculated the frequencies of these concepts by group. Finally, we conducted an external validation of MTERMS’s performance when identifying food allergen terms, using a randomized sample from a different institution. Results We identified 158 552 food allergen records (2140 unique terms) in the Partners repository, corresponding to 672 food allergen concepts. High-frequency groups included shellfish (19.3%), fruits or vegetables (18.4%), dairy (9.0%), peanuts (8.5%), tree nuts (8.5%), eggs (6.0%), grains (5.1%), and additives (4.7%). Ambiguous, generic concepts such as “nuts” and “seafood” accounted for 8.8% of the records. SNOMED-CT covered more concepts than UNII in terms of exact (81.7% vs 68.0%) and partial (14.3% vs 9.7%) matches. Discussion Adverse sensitivities to food are diverse, and existing standard terminologies have gaps in their coverage of the breadth of allergy concepts. Conclusion New strategies are needed to represent and standardize food adverse sensitivity concepts, to improve documentation in EHRs. PMID:26384406
Food entries in a large allergy data repository.
Plasek, Joseph M; Goss, Foster R; Lai, Kenneth H; Lau, Jason J; Seger, Diane L; Blumenthal, Kimberly G; Wickner, Paige G; Slight, Sarah P; Chang, Frank Y; Topaz, Maxim; Bates, David W; Zhou, Li
2016-04-01
Accurate food adverse sensitivity documentation in electronic health records (EHRs) is crucial to patient safety. This study examined, encoded, and grouped foods that caused any adverse sensitivity in a large allergy repository using natural language processing and standard terminologies. Using the Medical Text Extraction, Reasoning, and Mapping System (MTERMS), we processed both structured and free-text entries stored in an enterprise-wide allergy repository (Partners' Enterprise-wide Allergy Repository), normalized diverse food allergen terms into concepts, and encoded these concepts using the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT) and Unique Ingredient Identifiers (UNII) terminologies. Concept coverage also was assessed for these two terminologies. We further categorized allergen concepts into groups and calculated the frequencies of these concepts by group. Finally, we conducted an external validation of MTERMS's performance when identifying food allergen terms, using a randomized sample from a different institution. We identified 158 552 food allergen records (2140 unique terms) in the Partners repository, corresponding to 672 food allergen concepts. High-frequency groups included shellfish (19.3%), fruits or vegetables (18.4%), dairy (9.0%), peanuts (8.5%), tree nuts (8.5%), eggs (6.0%), grains (5.1%), and additives (4.7%). Ambiguous, generic concepts such as "nuts" and "seafood" accounted for 8.8% of the records. SNOMED-CT covered more concepts than UNII in terms of exact (81.7% vs 68.0%) and partial (14.3% vs 9.7%) matches. Adverse sensitivities to food are diverse, and existing standard terminologies have gaps in their coverage of the breadth of allergy concepts. New strategies are needed to represent and standardize food adverse sensitivity concepts, to improve documentation in EHRs. © The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
ERIC Educational Resources Information Center
Stadtler, Marc; Scharrer, Lisa; Brummernhenrich, Benjamin; Bromme, Rainer
2013-01-01
Past research has shown that readers often fail to notice conflicts in text. In our present study we investigated whether accessing information from multiple documents instead of a single document might alleviate this problem by motivating readers to integrate information. We further tested whether this effect would be moderated by source…
Automatic document classification of biological literature
Chen, David; Müller, Hans-Michael; Sternberg, Paul W
2006-01-01
Background Document classification is a wide-spread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature, which marks up full text according to a shallow ontology that includes terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis elegans literature. Results We present a two-step text categorization algorithm to classify a corpus of C. elegans papers. Our classification method first uses a support vector machine-trained classifier, followed by a novel, phrase-based clustering algorithm. This clustering step autonomously creates cluster labels that are descriptive and understandable by humans. This clustering engine performed better on a standard test-set (Reuters 21578) compared to previously published results (F-value of 0.55 vs. 0.49), while producing cluster descriptions that appear more useful. A web interface allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept. Conclusion We have demonstrated a simple method to classify biological documents that embodies an improvement over current methods. While the classification results are currently optimized for Caenorhabditis elegans papers by human-created rules, the classification engine can be adapted to different types of documents. We have demonstrated this by presenting a web interface that allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept. PMID:16893465
Mental Mechanisms for Topics Identification
2014-01-01
Topics identification (TI) is the process that consists in determining the main themes present in natural language documents. The current TI modeling paradigm aims at acquiring semantic information from statistic properties of large text datasets. We investigate the mental mechanisms responsible for the identification of topics in a single document given existing knowledge. Our main hypothesis is that topics are the result of accumulated neural activation of loosely organized information stored in long-term memory (LTM). We experimentally tested our hypothesis with a computational model that simulates LTM activation. The model assumes activation decay as an unavoidable phenomenon originating from the bioelectric nature of neural systems. Since decay should negatively affect the quality of topics, the model predicts the presence of short-term memory (STM) to keep the focus of attention on a few words, with the expected outcome of restoring quality to a baseline level. Our experiments measured topics quality of over 300 documents with various decay rates and STM capacity. Our results showed that accumulated activation of loosely organized information was an effective mental computational commodity to identify topics. It was furthermore confirmed that rapid decay is detrimental to topics quality but that limited capacity STM restores quality to a baseline level, even exceeding it slightly. PMID:24744775
Text grouping in patent analysis using adaptive K-means clustering algorithm
NASA Astrophysics Data System (ADS)
Shanie, Tiara; Suprijadi, Jadi; Zulhanif
2017-03-01
Patents are one of the Intellectual Property. Analyzing patent is one requirement in knowing well the development of technology in each country and in the world now. This study uses the patent document coming from the Espacenet server about Green Tea. Patent documents related to the technology in the field of tea is still widespread, so it will be difficult for users to information retrieval (IR). Therefore, it is necessary efforts to categorize documents in a specific group of related terms contained therein. This study uses titles patent text data with the proposed Green Tea in Statistical Text Mining methods consists of two phases: data preparation and data analysis stage. The data preparation phase uses Text Mining methods and data analysis stage is done by statistics. Statistical analysis in this study using a cluster analysis algorithm, the Adaptive K-Means Clustering Algorithm. Results from this study showed that based on the maximum value Silhouette, generate 87 clusters associated fifteen terms therein that can be utilized in the process of information retrieval needs.
Building Background Knowledge through Reading: Rethinking Text Sets
ERIC Educational Resources Information Center
Lupo, Sarah M.; Strong, John Z.; Lewis, William; Walpole, Sharon; McKenna, Michael C.
2018-01-01
To increase reading volume and help students access challenging texts, the authors propose a four-dimensional framework for text sets. The quad text set framework is designed around a target text: a challenging content area text, such as a canonical literary work, research article, or historical primary source document. The three remaining…
Adaptive removal of background and white space from document images using seam categorization
NASA Astrophysics Data System (ADS)
Fillion, Claude; Fan, Zhigang; Monga, Vishal
2011-03-01
Document images are obtained regularly by rasterization of document content and as scans of printed documents. Resizing via background and white space removal is often desired for better consumption of these images, whether on displays or in print. While white space and background are easy to identify in images, existing methods such as naïve removal and content aware resizing (seam carving) each have limitations that can lead to undesirable artifacts, such as uneven spacing between lines of text or poor arrangement of content. An adaptive method based on image content is hence needed. In this paper we propose an adaptive method to intelligently remove white space and background content from document images. Document images are different from pictorial images in structure. They typically contain objects (text letters, pictures and graphics) separated by uniform background, which include both white paper space and other uniform color background. Pixels in uniform background regions are excellent candidates for deletion if resizing is required, as they introduce less change in document content and style, compared with deletion of object pixels. We propose a background deletion method that exploits both local and global context. The method aims to retain the document structural information and image quality.
49 CFR 1104.2 - Document specifications.
Code of Federal Regulations, 2014 CFR
2014-10-01
... to facilitate automated processing in document sheet feeders, original documents of more than one... textual submissions. Use of color in filings is limited to images such as graphs, maps and photographs. To facilitate automated processing of color pages, color pages may not be inserted among pages containing text...
49 CFR 1104.2 - Document specifications.
Code of Federal Regulations, 2010 CFR
2010-10-01
... to facilitate automated processing in document sheet feeders, original documents of more than one... textual submissions. Use of color in filings is limited to images such as graphs, maps and photographs. To facilitate automated processing of color pages, color pages may not be inserted among pages containing text...
49 CFR 1104.2 - Document specifications.
Code of Federal Regulations, 2012 CFR
2012-10-01
... to facilitate automated processing in document sheet feeders, original documents of more than one... textual submissions. Use of color in filings is limited to images such as graphs, maps and photographs. To facilitate automated processing of color pages, color pages may not be inserted among pages containing text...
49 CFR 1104.2 - Document specifications.
Code of Federal Regulations, 2011 CFR
2011-10-01
... to facilitate automated processing in document sheet feeders, original documents of more than one... textual submissions. Use of color in filings is limited to images such as graphs, maps and photographs. To facilitate automated processing of color pages, color pages may not be inserted among pages containing text...
49 CFR 1104.2 - Document specifications.
Code of Federal Regulations, 2013 CFR
2013-10-01
... to facilitate automated processing in document sheet feeders, original documents of more than one... textual submissions. Use of color in filings is limited to images such as graphs, maps and photographs. To facilitate automated processing of color pages, color pages may not be inserted among pages containing text...
Enhancement of Text Representations Using Related Document Titles.
ERIC Educational Resources Information Center
Salton, G.; Zhang, Y.
1986-01-01
Briefly reviews various methodologies for constructing enhanced document representations, discusses their general lack of usefulness, and describes a method of document indexing which uses title words taken from bibliographically related items. Evaluation of this process indicates that it is not sufficiently reliable to warrant incorporation into…
How Hierarchical Topics Evolve in Large Text Corpora.
Cui, Weiwei; Liu, Shixia; Wu, Zhuofeng; Wei, Hao
2014-12-01
Using a sequence of topic trees to organize documents is a popular way to represent hierarchical and evolving topics in text corpora. However, following evolving topics in the context of topic trees remains difficult for users. To address this issue, we present an interactive visual text analysis approach to allow users to progressively explore and analyze the complex evolutionary patterns of hierarchical topics. The key idea behind our approach is to exploit a tree cut to approximate each tree and allow users to interactively modify the tree cuts based on their interests. In particular, we propose an incremental evolutionary tree cut algorithm with the goal of balancing 1) the fitness of each tree cut and the smoothness between adjacent tree cuts; 2) the historical and new information related to user interests. A time-based visualization is designed to illustrate the evolving topics over time. To preserve the mental map, we develop a stable layout algorithm. As a result, our approach can quickly guide users to progressively gain profound insights into evolving hierarchical topics. We evaluate the effectiveness of the proposed method on Amazon's Mechanical Turk and real-world news data. The results show that users are able to successfully analyze evolving topics in text data.
Tuarob, Suppawong; Tucker, Conrad S; Salathe, Marcel; Ram, Nilam
2014-06-01
The role of social media as a source of timely and massive information has become more apparent since the era of Web 2.0.Multiple studies illustrated the use of information in social media to discover biomedical and health-related knowledge.Most methods proposed in the literature employ traditional document classification techniques that represent a document as a bag of words.These techniques work well when documents are rich in text and conform to standard English; however, they are not optimal for social media data where sparsity and noise are norms.This paper aims to address the limitations posed by the traditional bag-of-word based methods and propose to use heterogeneous features in combination with ensemble machine learning techniques to discover health-related information, which could prove to be useful to multiple biomedical applications, especially those needing to discover health-related knowledge in large scale social media data.Furthermore, the proposed methodology could be generalized to discover different types of information in various kinds of textual data. Social media data is characterized by an abundance of short social-oriented messages that do not conform to standard languages, both grammatically and syntactically.The problem of discovering health-related knowledge in social media data streams is then transformed into a text classification problem, where a text is identified as positive if it is health-related and negative otherwise.We first identify the limitations of the traditional methods which train machines with N-gram word features, then propose to overcome such limitations by utilizing the collaboration of machine learning based classifiers, each of which is trained to learn a semantically different aspect of the data.The parameter analysis for tuning each classifier is also reported. Three data sets are used in this research.The first data set comprises of approximately 5000 hand-labeled tweets, and is used for cross validation of the classification models in the small scale experiment, and for training the classifiers in the real-world large scale experiment.The second data set is a random sample of real-world Twitter data in the US.The third data set is a random sample of real-world Facebook Timeline posts. Two sets of evaluations are conducted to investigate the proposed model's ability to discover health-related information in the social media domain: small scale and large scale evaluations.The small scale evaluation employs 10-fold cross validation on the labeled data, and aims to tune parameters of the proposed models, and to compare with the stage-of-the-art method.The large scale evaluation tests the trained classification models on the native, real-world data sets, and is needed to verify the ability of the proposed model to handle the massive heterogeneity in real-world social media. The small scale experiment reveals that the proposed method is able to mitigate the limitations in the well established techniques existing in the literature, resulting in performance improvement of 18.61% (F-measure).The large scale experiment further reveals that the baseline fails to perform well on larger data with higher degrees of heterogeneity, while the proposed method is able to yield reasonably good performance and outperform the baseline by 46.62% (F-Measure) on average. Copyright © 2014 Elsevier Inc. All rights reserved.
Simple-random-sampling-based multiclass text classification algorithm.
Liu, Wuying; Wang, Lin; Yi, Mianzhu
2014-01-01
Multiclass text classification (MTC) is a challenging issue and the corresponding MTC algorithms can be used in many applications. The space-time overhead of the algorithms must be concerned about the era of big data. Through the investigation of the token frequency distribution in a Chinese web document collection, this paper reexamines the power law and proposes a simple-random-sampling-based MTC (SRSMTC) algorithm. Supported by a token level memory to store labeled documents, the SRSMTC algorithm uses a text retrieval approach to solve text classification problems. The experimental results on the TanCorp data set show that SRSMTC algorithm can achieve the state-of-the-art performance at greatly reduced space-time requirements.
Basic test framework for the evaluation of text line segmentation and text parameter extraction.
Brodić, Darko; Milivojević, Dragan R; Milivojević, Zoran
2010-01-01
Text line segmentation is an essential stage in off-line optical character recognition (OCR) systems. It is a key because inaccurately segmented text lines will lead to OCR failure. Text line segmentation of handwritten documents is a complex and diverse problem, complicated by the nature of handwriting. Hence, text line segmentation is a leading challenge in handwritten document image processing. Due to inconsistencies in measurement and evaluation of text segmentation algorithm quality, some basic set of measurement methods is required. Currently, there is no commonly accepted one and all algorithm evaluation is custom oriented. In this paper, a basic test framework for the evaluation of text feature extraction algorithms is proposed. This test framework consists of a few experiments primarily linked to text line segmentation, skew rate and reference text line evaluation. Although they are mutually independent, the results obtained are strongly cross linked. In the end, its suitability for different types of letters and languages as well as its adaptability are its main advantages. Thus, the paper presents an efficient evaluation method for text analysis algorithms.
Basic Test Framework for the Evaluation of Text Line Segmentation and Text Parameter Extraction
Brodić, Darko; Milivojević, Dragan R.; Milivojević, Zoran
2010-01-01
Text line segmentation is an essential stage in off-line optical character recognition (OCR) systems. It is a key because inaccurately segmented text lines will lead to OCR failure. Text line segmentation of handwritten documents is a complex and diverse problem, complicated by the nature of handwriting. Hence, text line segmentation is a leading challenge in handwritten document image processing. Due to inconsistencies in measurement and evaluation of text segmentation algorithm quality, some basic set of measurement methods is required. Currently, there is no commonly accepted one and all algorithm evaluation is custom oriented. In this paper, a basic test framework for the evaluation of text feature extraction algorithms is proposed. This test framework consists of a few experiments primarily linked to text line segmentation, skew rate and reference text line evaluation. Although they are mutually independent, the results obtained are strongly cross linked. In the end, its suitability for different types of letters and languages as well as its adaptability are its main advantages. Thus, the paper presents an efficient evaluation method for text analysis algorithms. PMID:22399932
BioC: a minimalist approach to interoperability for biomedical text processing
Comeau, Donald C.; Islamaj Doğan, Rezarta; Ciccarese, Paolo; Cohen, Kevin Bretonnel; Krallinger, Martin; Leitner, Florian; Lu, Zhiyong; Peng, Yifan; Rinaldi, Fabio; Torii, Manabu; Valencia, Alfonso; Verspoor, Karin; Wiegers, Thomas C.; Wu, Cathy H.; Wilbur, W. John
2013-01-01
A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/. Database URL: http://bioc.sourceforge.net/ PMID:24048470
Extending information retrieval methods to personalized genomic-based studies of disease.
Ye, Shuyun; Dawson, John A; Kendziorski, Christina
2014-01-01
Genomic-based studies of disease now involve diverse types of data collected on large groups of patients. A major challenge facing statistical scientists is how best to combine the data, extract important features, and comprehensively characterize the ways in which they affect an individual's disease course and likelihood of response to treatment. We have developed a survival-supervised latent Dirichlet allocation (survLDA) modeling framework to address these challenges. Latent Dirichlet allocation (LDA) models have proven extremely effective at identifying themes common across large collections of text, but applications to genomics have been limited. Our framework extends LDA to the genome by considering each patient as a "document" with "text" detailing his/her clinical events and genomic state. We then further extend the framework to allow for supervision by a time-to-event response. The model enables the efficient identification of collections of clinical and genomic features that co-occur within patient subgroups, and then characterizes each patient by those features. An application of survLDA to The Cancer Genome Atlas ovarian project identifies informative patient subgroups showing differential response to treatment, and validation in an independent cohort demonstrates the potential for patient-specific inference.
JEnsembl: a version-aware Java API to Ensembl data systems.
Paterson, Trevor; Law, Andy
2012-11-01
The Ensembl Project provides release-specific Perl APIs for efficient high-level programmatic access to data stored in various Ensembl database schema. Although Perl scripts are perfectly suited for processing large volumes of text-based data, Perl is not ideal for developing large-scale software applications nor embedding in graphical interfaces. The provision of a novel Java API would facilitate type-safe, modular, object-orientated development of new Bioinformatics tools with which to access, analyse and visualize Ensembl data. The JEnsembl API implementation provides basic data retrieval and manipulation functionality from the Core, Compara and Variation databases for all species in Ensembl and EnsemblGenomes and is a platform for the development of a richer API to Ensembl datasources. The JEnsembl architecture uses a text-based configuration module to provide evolving, versioned mappings from database schema to code objects. A single installation of the JEnsembl API can therefore simultaneously and transparently connect to current and previous database instances (such as those in the public archive) thus facilitating better analysis repeatability and allowing 'through time' comparative analyses to be performed. Project development, released code libraries, Maven repository and documentation are hosted at SourceForge (http://jensembl.sourceforge.net).
Handling a Collection of PDF Documents
You have several options for making a large collection of PDF documents more accessible to your audience: avoid uploading altogether, use multiple document pages, and use document IDs as anchors for direct links within a document page.
Collaborative filtering to improve navigation of large radiology knowledge resources.
Kahn, Charles E
2005-06-01
Collaborative filtering is a knowledge-discovery technique that can help guide readers to items of potential interest based on the experience of prior users. This study sought to determine the impact of collaborative filtering on navigation of a large, Web-based radiology knowledge resource. Collaborative filtering was applied to a collection of 1,168 radiology hypertext documents available via the Internet. An item-based collaborative filtering algorithm identified each document's six most closely related documents based on 248,304 page views in an 18-day period. Documents were amended to include links to their related documents, and use was analyzed over the next 5 days. The mean number of documents viewed per visit increased from 1.57 to 1.74 (P < 0.0001). Collaborative filtering can increase a radiology information resource's utilization and can improve its usefulness and ease of navigation. The technique holds promise for improving navigation of large Internet-based radiology knowledge resources.
Literature evidence in open targets - a target validation platform.
Kafkas, Şenay; Dunham, Ian; McEntyre, Johanna
2017-06-06
We present the Europe PMC literature component of Open Targets - a target validation platform that integrates various evidence to aid drug target identification and validation. The component identifies target-disease associations in documents and ranks the documents based on their confidence from the Europe PMC literature database, by using rules utilising expert-provided heuristic information. The confidence score of a given document represents how valuable the document is in the scope of target validation for a given target-disease association by taking into account the credibility of the association based on the properties of the text. The component serves the platform regularly with the up-to-date data since December, 2015. Currently, there are a total number of 1168365 distinct target-disease associations text mined from >26 million PubMed abstracts and >1.2 million Open Access full text articles. Our comparative analyses on the current available evidence data in the platform revealed that 850179 of these associations are exclusively identified by literature mining. This component helps the platform's users by providing the most relevant literature hits for a given target and disease. The text mining evidence along with the other types of evidence can be explored visually through https://www.targetvalidation.org and all the evidence data is available for download in json format from https://www.targetvalidation.org/downloads/data .
40 CFR Appendix A to Part 66 - Technical Support Document
Code of Federal Regulations, 2010 CFR
2010-07-01
... 40 Protection of Environment 15 2010-07-01 2010-07-01 false Technical Support Document A Appendix A to Part 66 Protection of Environment ENVIRONMENTAL PROTECTION AGENCY (CONTINUED) AIR PROGRAMS...—Technical Support Document Note: For text of appendix A see appendix A to part 67. ...
Implementation of the common phrase index method on the phrase query for information retrieval
NASA Astrophysics Data System (ADS)
Fatmawati, Triyah; Zaman, Badrus; Werdiningsih, Indah
2017-08-01
As the development of technology, the process of finding information on the news text is easy, because the text of the news is not only distributed in print media, such as newspapers, but also in electronic media that can be accessed using the search engine. In the process of finding relevant documents on the search engine, a phrase often used as a query. The number of words that make up the phrase query and their position obviously affect the relevance of the document produced. As a result, the accuracy of the information obtained will be affected. Based on the outlined problem, the purpose of this research was to analyze the implementation of the common phrase index method on information retrieval. This research will be conducted in English news text and implemented on a prototype to determine the relevance level of the documents produced. The system is built with the stages of pre-processing, indexing, term weighting calculation, and cosine similarity calculation. Then the system will display the document search results in a sequence, based on the cosine similarity. Furthermore, system testing will be conducted using 100 documents and 20 queries. That result is then used for the evaluation stage. First, determine the relevant documents using kappa statistic calculation. Second, determine the system success rate using precision, recall, and F-measure calculation. In this research, the result of kappa statistic calculation was 0.71, so that the relevant documents are eligible for the system evaluation. Then the calculation of precision, recall, and F-measure produces precision of 0.37, recall of 0.50, and F-measure of 0.43. From this result can be said that the success rate of the system to produce relevant documents is low.
Khan, Imtiaz A; Fraser, Adam; Bray, Mark-Anthony; Smith, Paul J; White, Nick S; Carpenter, Anne E; Errington, Rachel J
2014-12-01
Experimental reproducibility is fundamental to the progress of science. Irreproducible research decreases the efficiency of basic biological research and drug discovery and impedes experimental data reuse. A major contributing factor to irreproducibility is difficulty in interpreting complex experimental methodologies and designs from written text and in assessing variations among different experiments. Current bioinformatics initiatives either are focused on computational research reproducibility (i.e. data analysis) or laboratory information management systems. Here, we present a software tool, ProtocolNavigator, which addresses the largely overlooked challenges of interpretation and assessment. It provides a biologist-friendly open-source emulation-based tool for designing, documenting and reproducing biological experiments. ProtocolNavigator was implemented in Python 2.7, using the wx module to build the graphical user interface. It is a platform-independent software and freely available from http://protocolnavigator.org/index.html under the GPL v2 license. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
NASA Astrophysics Data System (ADS)
Munro, Rosemary; Lang, Rüdiger; Klaes, Dieter; Poli, Gabriele; Retscher, Christian; Lindstrot, Rasmus; Huckle, Roger; Lacan, Antoine; Grzegorski, Michael; Holdak, Andriy; Kokhanovsky, Alexander; Livschitz, Jakob; Eisinger, Michael
2016-03-01
The Global Ozone Monitoring Experiment-2 (GOME-2) flies on the Metop series of satellites, the space component of the EUMETSAT Polar System. In this paper we will provide an overview of the instrument design, the on-ground calibration and characterization activities, in-flight calibration, and level 0 to 1 data processing. The current status of the level 1 data is presented and points of specific relevance to users are highlighted. Long-term level 1 data consistency is also discussed and plans for future work are outlined. The information contained in this paper summarizes a large number of technical reports and related documents containing information that is not currently available in the published literature. These reports and documents are however made available on the EUMETSAT web pages and readers requiring more details than can be provided in this overview paper will find appropriate references at relevant points in the text.
NASA Astrophysics Data System (ADS)
Munro, R.; Lang, R.; Klaes, D.; Poli, G.; Retscher, C.; Lindstrot, R.; Huckle, R.; Lacan, A.; Grzegorski, M.; Holdak, A.; Kokhanovsky, A.; Livschitz, J.; Eisinger, M.
2015-08-01
The Global Ozone Monitoring Experiment-2 (GOME-2) flies on the Metop series of satellites, the space component of the EUMETSAT Polar System. In this paper we will provide an overview of the instrument design, the on-ground calibration and characterisation activities, in-flight calibration, and level 0 to 1 data processing. The quality of the level 1 data is presented and points of specific relevance to users are highlighted. Long-term level 1 data consistency is also discussed and plans for future work are outlined. The information contained in this paper summarises a large number of technical reports and related documents containing information that is not currently available in the published literature. These reports and documents are however made available on the EUMETSAT web pages (http://www.eumetsat.int) and readers requiring more details than can be provided in this overview paper will find appropriate references at relevant points in the text.
Design and realization of the compound text-based test questions library management system
NASA Astrophysics Data System (ADS)
Shi, Lei; Feng, Lin; Zhao, Xin
2011-12-01
The test questions library management system is the essential part of the on-line examination system. The basic demand for which is to deal with compound text including information like images, formulae and create the corresponding Word documents. Having compared with the two current solutions of creating documents, this paper presents a design proposal of Word Automation mechanism based on OLE/COM technology, and discusses the way of Word Automation application in detail and at last provides the operating results of the system which have high reference value in improving the generated efficiency of project documents and report forms.
Introducing Text Analytics as a Graduate Business School Course
ERIC Educational Resources Information Center
Edgington, Theresa M.
2011-01-01
Text analytics refers to the process of analyzing unstructured data from documented sources, including open-ended surveys, blogs, and other types of web dialog. Text analytics has enveloped the concept of text mining, an analysis approach influenced heavily from data mining. While text mining has been covered extensively in various computer…
Discovery of novel biomarkers and phenotypes by semantic technologies
2013-01-01
Background Biomarkers and target-specific phenotypes are important to targeted drug design and individualized medicine, thus constituting an important aspect of modern pharmaceutical research and development. More and more, the discovery of relevant biomarkers is aided by in silico techniques based on applying data mining and computational chemistry on large molecular databases. However, there is an even larger source of valuable information available that can potentially be tapped for such discoveries: repositories constituted by research documents. Results This paper reports on a pilot experiment to discover potential novel biomarkers and phenotypes for diabetes and obesity by self-organized text mining of about 120,000 PubMed abstracts, public clinical trial summaries, and internal Merck research documents. These documents were directly analyzed by the InfoCodex semantic engine, without prior human manipulations such as parsing. Recall and precision against established, but different benchmarks lie in ranges up to 30% and 50% respectively. Retrieval of known entities missed by other traditional approaches could be demonstrated. Finally, the InfoCodex semantic engine was shown to discover new diabetes and obesity biomarkers and phenotypes. Amongst these were many interesting candidates with a high potential, although noticeable noise (uninteresting or obvious terms) was generated. Conclusions The reported approach of employing autonomous self-organising semantic engines to aid biomarker discovery, supplemented by appropriate manual curation processes, shows promise and has potential to impact, conservatively, a faster alternative to vocabulary processes dependent on humans having to read and analyze all the texts. More optimistically, it could impact pharmaceutical research, for example to shorten time-to-market of novel drugs, or speed up early recognition of dead ends and adverse reactions. PMID:23402646
Detection of figure and caption pairs based on disorder measurements
NASA Astrophysics Data System (ADS)
Faure, Claudie; Vincent, Nicole
2010-01-01
Figures inserted in documents mediate a kind of information for which the visual modality is more appropriate than the text. A complete understanding of a figure often necessitates the reading of its caption or to establish a relationship with the main text using a numbered figure identifier which is replicated in the caption and in the main text. A figure and its caption are closely related; they constitute single multimodal components (FC-pair) that Document Image Analysis cannot extract with text and graphics segmentation. We propose a method to go further than the graphics and text segmentation in order to extract FC-pairs without performing a full labelling of the page components. Horizontal and vertical text lines are detected in the pages. The graphics are associated with selected text lines to initiate the detector of FC-pairs. Spatial and visual disorders are introduced to define a layout model in terms of properties. It enables to cope with most of the numerous spatial arrangements of graphics and text lines. The detector of FC-pairs performs operations in order to eliminate the layout disorder and assigns a quality value to each FC-pair. The processed documents were collected in medic@, the digital historical collection of the BIUM (Bibliothèque InterUniversitaire Médicale). A first set of 98 pages constitutes the design set. Then 298 pages were collected to evaluate the system. The performances are the result of a full process, from the binarisation of the digital images to the detection of FC-pairs.
Cohen, Raphael; Elhadad, Michael; Elhadad, Noémie
2013-01-16
The increasing availability of Electronic Health Record (EHR) data and specifically free-text patient notes presents opportunities for phenotype extraction. Text-mining methods in particular can help disease modeling by mapping named-entities mentions to terminologies and clustering semantically related terms. EHR corpora, however, exhibit specific statistical and linguistic characteristics when compared with corpora in the biomedical literature domain. We focus on copy-and-paste redundancy: clinicians typically copy and paste information from previous notes when documenting a current patient encounter. Thus, within a longitudinal patient record, one expects to observe heavy redundancy. In this paper, we ask three research questions: (i) How can redundancy be quantified in large-scale text corpora? (ii) Conventional wisdom is that larger corpora yield better results in text mining. But how does the observed EHR redundancy affect text mining? Does such redundancy introduce a bias that distorts learned models? Or does the redundancy introduce benefits by highlighting stable and important subsets of the corpus? (iii) How can one mitigate the impact of redundancy on text mining? We analyze a large-scale EHR corpus and quantify redundancy both in terms of word and semantic concept repetition. We observe redundancy levels of about 30% and non-standard distribution of both words and concepts. We measure the impact of redundancy on two standard text-mining applications: collocation identification and topic modeling. We compare the results of these methods on synthetic data with controlled levels of redundancy and observe significant performance variation. Finally, we compare two mitigation strategies to avoid redundancy-induced bias: (i) a baseline strategy, keeping only the last note for each patient in the corpus; (ii) removing redundant notes with an efficient fingerprinting-based algorithm. (a)For text mining, preprocessing the EHR corpus with fingerprinting yields significantly better results. Before applying text-mining techniques, one must pay careful attention to the structure of the analyzed corpora. While the importance of data cleaning has been known for low-level text characteristics (e.g., encoding and spelling), high-level and difficult-to-quantify corpus characteristics, such as naturally occurring redundancy, can also hurt text mining. Fingerprinting enables text-mining techniques to leverage available data in the EHR corpus, while avoiding the bias introduced by redundancy.
New Framework for Cross-Domain Document Classification
2011-03-01
classification. The following paragraphs will introduce these related works in more detail. Wang et al . attempted to improve the accuracy of text document...of using Wikipedia to develop a thesaurus [20]. Gabrilovich et al . had an approach that is more elaborate in its use of Wikipedia text [21]. The...did show a modest improvement when it is performed using the Wikipedia information. Wang et al . improved on the results of co-clustering algorithm [24
Enhancement of the Shared Graphics Workspace.
1987-12-31
participants to share videodisc images and computer graphics displayed in color and text and facsimile information displayed in black on amber. They...could annotate the information in up to five * colors and print the annotated version at both sites, using a standard fax machine. The SGWS also used a fax...system to display a document, whether text or photo, the camera scans the document, digitizes the data, and sends it via direct memory access (DMA) to
Essie: A Concept-based Search Engine for Structured Biomedical Text
Ide, Nicholas C.; Loane, Russell F.; Demner-Fushman, Dina
2007-01-01
This article describes the algorithms implemented in the Essie search engine that is currently serving several Web sites at the National Library of Medicine. Essie is a phrase-based search engine with term and concept query expansion and probabilistic relevancy ranking. Essie’s design is motivated by an observation that query terms are often conceptually related to terms in a document, without actually occurring in the document text. Essie’s performance was evaluated using data and standard evaluation methods from the 2003 and 2006 Text REtrieval Conference (TREC) Genomics track. Essie was the best-performing search engine in the 2003 TREC Genomics track and achieved results comparable to those of the highest-ranking systems on the 2006 TREC Genomics track task. Essie shows that a judicious combination of exploiting document structure, phrase searching, and concept based query expansion is a useful approach for information retrieval in the biomedical domain. PMID:17329729
An IR-Based Approach Utilizing Query Expansion for Plagiarism Detection in MEDLINE.
Nawab, Rao Muhammad Adeel; Stevenson, Mark; Clough, Paul
2017-01-01
The identification of duplicated and plagiarized passages of text has become an increasingly active area of research. In this paper, we investigate methods for plagiarism detection that aim to identify potential sources of plagiarism from MEDLINE, particularly when the original text has been modified through the replacement of words or phrases. A scalable approach based on Information Retrieval is used to perform candidate document selection-the identification of a subset of potential source documents given a suspicious text-from MEDLINE. Query expansion is performed using the ULMS Metathesaurus to deal with situations in which original documents are obfuscated. Various approaches to Word Sense Disambiguation are investigated to deal with cases where there are multiple Concept Unique Identifiers (CUIs) for a given term. Results using the proposed IR-based approach outperform a state-of-the-art baseline based on Kullback-Leibler Distance.
Automatic indexing of scanned documents: a layout-based approach
NASA Astrophysics Data System (ADS)
Esser, Daniel; Schuster, Daniel; Muthmann, Klemens; Berger, Michael; Schill, Alexander
2012-01-01
Archiving official written documents such as invoices, reminders and account statements in business and private area gets more and more important. Creating appropriate index entries for document archives like sender's name, creation date or document number is a tedious manual work. We present a novel approach to handle automatic indexing of documents based on generic positional extraction of index terms. For this purpose we apply the knowledge of document templates stored in a common full text search index to find index positions that were successfully extracted in the past.
ERIC Educational Resources Information Center
Sheehan, Kathleen M.
2015-01-01
The "TextEvaluator"® text analysis tool is a fully automated text complexity evaluation tool designed to help teachers, curriculum specialists, textbook publishers, and test developers select texts that are consistent with the text complexity guidelines specified in the Common Core State Standards.This paper documents the procedure used…
NASA Technical Reports Server (NTRS)
Genuardi, Michael T.
1993-01-01
One strategy for machine-aided indexing (MAI) is to provide a concept-level analysis of the textual elements of documents or document abstracts. In such systems, natural-language phrases are analyzed in order to identify and classify concepts related to a particular subject domain. The overall performance of these MAI systems is largely dependent on the quality and comprehensiveness of their knowledge bases. These knowledge bases function to (1) define the relations between a controlled indexing vocabulary and natural language expressions; (2) provide a simple mechanism for disambiguation and the determination of relevancy; and (3) allow the extension of concept-hierarchical structure to all elements of the knowledge file. After a brief description of the NASA Machine-Aided Indexing system, concerns related to the development and maintenance of MAI knowledge bases are discussed. Particular emphasis is given to statistically-based text analysis tools designed to aid the knowledge base developer. One such tool, the Knowledge Base Building (KBB) program, presents the domain expert with a well-filtered list of synonyms and conceptually-related phrases for each thesaurus concept. Another tool, the Knowledge Base Maintenance (KBM) program, functions to identify areas of the knowledge base affected by changes in the conceptual domain (for example, the addition of a new thesaurus term). An alternate use of the KBM as an aid in thesaurus construction is also discussed.
76 FR 27309 - Committee on Measures of Student Success
Federal Register 2010, 2011, 2012, 2013, 2014
2011-05-11
... version of this document is the document published in the Federal Register. Free Internet access to the... text or Adobe Portable Document Format (PDF) on the Internet at the following site: http://www.ed.gov/news/fed-register/index.html . To use PDF you must have Adobe Acrobat Reader, which is available free...
76 FR 50198 - Committee on Measures of Student Success
Federal Register 2010, 2011, 2012, 2013, 2014
2011-08-12
...: The official version of this document is the document published in the Federal Register. Free Internet... Federal Register, in text or Adobe Portable Document Format (PDF) on the Internet at the following site... is available free at this site. If you have questions about using PDF, call the U.S. Government...
10 CFR 2.304 - Formal requirements for documents; signatures; acceptance for filing.
Code of Federal Regulations, 2010 CFR
2010-01-01
... documents. In addition to the requirements in this part, paper documents must be stapled or bound on the left side; typewritten, printed, or otherwise reproduced in permanent form on good unglazed paper of... not less than one inch. Text must be double-spaced, except that quotations may be single-spaced and...
Evaluating Combinations of Ranked Lists and Visualizations of Inter-Document Similarity.
ERIC Educational Resources Information Center
Allan, James; Leuski, Anton; Swan, Russell; Byrd, Donald
2001-01-01
Considers how ideas from document clustering can be used to improve retrieval accuracy of ranked lists in interactive systems and how to evaluate system effectiveness. Describes a TREC (Text Retrieval Conference) study that constructed and evaluated systems that present the user with ranked lists and a visualization of inter-document similarities.…
Combining approaches to on-line handwriting information retrieval
NASA Astrophysics Data System (ADS)
Peña Saldarriaga, Sebastián; Viard-Gaudin, Christian; Morin, Emmanuel
2010-01-01
In this work, we propose to combine two quite different approaches for retrieving handwritten documents. Our hypothesis is that different retrieval algorithms should retrieve different sets of documents for the same query. Therefore, significant improvements in retrieval performances can be expected. The first approach is based on information retrieval techniques carried out on the noisy texts obtained through handwriting recognition, while the second approach is recognition-free using a word spotting algorithm. Results shows that for texts having a word error rate (WER) lower than 23%, the performances obtained with the combined system are close to the performances obtained on clean digital texts. In addition, for poorly recognized texts (WER > 52%), an improvement of nearly 17% can be observed with respect to the best available baseline method.
National Survey of Patients’ Bill of Rights Statutes
Jacob, Dan M.; Hochhauser, Mark; Parker, Ruth M.
2009-01-01
BACKGROUND Despite vigorous national debate between 1999–2001 the federal patients’ bill of rights (PBOR) was not enacted. However, states have enacted legislation and the Joint Commission defined an accreditation standard to present patients with their rights. Because such initiatives can be undermined by overly complex language, we surveyed the readability of hospital PBOR documents as well as texts mandated by state law. METHODS State Web sites and codes were searched to identify PBOR statutes for general patient populations. The rights addressed were compared with the 12 themes presented in the American Hospital Association’s (AHA) PBOR text of 2002. In addition, we obtained PBOR texts from a sample of hospitals in each state. Readability was evaluated using Prose, a software program which reports an average of eight readability formulas. RESULTS Of 23 states with a PBOR statute for the general public, all establish a grievance policy, four protect a private right of action, and one stipulates fines for violations. These laws address an average of 7.4 of the 12 AHA themes. Nine states’ statutes specify PBOR text for distribution to patients. These documents have an average readability of 15th grade (range, 11.6, New York, to 17.0, Minnesota). PBOR documents from 240 US hospitals have an average readability of 14th grade (range, 8.2 to 17.0). CONCLUSIONS While the average U.S. adult reads at an 8th grade reading level, an advanced college reading level is routinely required to read PBOR documents. Patients are not likely to learn about their rights from documents they cannot read. PMID:19189192
The Ecological Approach to Text Visualization.
ERIC Educational Resources Information Center
Wise, James A.
1999-01-01
Presents both theoretical and technical bases on which to build a "science of text visualization." The Spatial Paradigm for Information Retrieval and Exploration (SPIRE) text-visualization system, which images information from free-text documents as natural terrains, serves as an example of the "ecological approach" in its visual metaphor, its…
Academic Journal Embargoes and Full Text Databases.
ERIC Educational Resources Information Center
Brooks, Sam
2003-01-01
Documents the reasons for embargoes of academic journals in full text databases (i.e., publisher-imposed delays on the availability of full text content) and provides insight regarding common misconceptions. Tables present data on selected journals covering a cross-section of subjects and publishers and comparing two full text business databases.…
Methods of Sparse Modeling and Dimensionality Reduction to Deal with Big Data
2015-04-01
supervised learning (c). Our framework consists of two separate phases: (a) first find an initial space in an unsupervised manner; then (b) utilize label...model that can learn thousands of topics from a large set of documents and infer the topic mixture of each document, 2) a supervised dimension reduction...model that can learn thousands of topics from a large set of documents and infer the topic mixture of each document, (i) a method of supervised
Text Classification for Organizational Researchers
Kobayashi, Vladimer B.; Mol, Stefan T.; Berkers, Hannah A.; Kismihók, Gábor; Den Hartog, Deanne N.
2017-01-01
Organizations are increasingly interested in classifying texts or parts thereof into categories, as this enables more effective use of their information. Manual procedures for text classification work well for up to a few hundred documents. However, when the number of documents is larger, manual procedures become laborious, time-consuming, and potentially unreliable. Techniques from text mining facilitate the automatic assignment of text strings to categories, making classification expedient, fast, and reliable, which creates potential for its application in organizational research. The purpose of this article is to familiarize organizational researchers with text mining techniques from machine learning and statistics. We describe the text classification process in several roughly sequential steps, namely training data preparation, preprocessing, transformation, application of classification techniques, and validation, and provide concrete recommendations at each step. To help researchers develop their own text classifiers, the R code associated with each step is presented in a tutorial. The tutorial draws from our own work on job vacancy mining. We end the article by discussing how researchers can validate a text classification model and the associated output. PMID:29881249
Using complex networks for text classification: Discriminating informative and imaginative documents
NASA Astrophysics Data System (ADS)
de Arruda, Henrique F.; Costa, Luciano da F.; Amancio, Diego R.
2016-01-01
Statistical methods have been widely employed in recent years to grasp many language properties. The application of such techniques have allowed an improvement of several linguistic applications, such as machine translation and document classification. In the latter, many approaches have emphasised the semantical content of texts, as is the case of bag-of-word language models. These approaches have certainly yielded reasonable performance. However, some potential features such as the structural organization of texts have been used only in a few studies. In this context, we probe how features derived from textual structure analysis can be effectively employed in a classification task. More specifically, we performed a supervised classification aiming at discriminating informative from imaginative documents. Using a networked model that describes the local topological/dynamical properties of function words, we achieved an accuracy rate of up to 95%, which is much higher than similar networked approaches. A systematic analysis of feature relevance revealed that symmetry and accessibility measurements are among the most prominent network measurements. Our results suggest that these measurements could be used in related language applications, as they play a complementary role in characterising texts.
Waste management CDM projects barriers NVivo 10® qualitative dataset.
Bufoni, André Luiz; de Sousa Ferreira, Aracéli Cristina; Oliveira, Luciano Basto
2017-12-01
This article contains one NVivo 10® file with the complete 432 projects design documents (PDD) of seven waste management sector industries registered as Clean Development Mechanism (CDM) under United Nations Framework Convention on Climate Change (UNFCCC) Kyoto Protocol Initiative from 2004 to 2014. All data analyses and sample statistics made during the research remain in the file. We coded PDDs in 890 fragments of text, classified in five categories of barriers (nodes): technological, financial, human resources, regulatory, socio-political. The data supports the findings of author thesis [1] and other two indexed publication in Waste Management Journal: "The financial attractiveness assessment of large waste management projects registered as clean development mechanism" and "The declared barriers of the large developing countries waste management projects: The STAR model" [2], [3]. The data allows any computer assisted qualitative content analysis (CAQCA) on the sector and it is available at Mendeley [4].
Native Language Processing using Exegy Text Miner
DOE Office of Scientific and Technical Information (OSTI.GOV)
Compton, J
2007-10-18
Lawrence Livermore National Laboratory's New Architectures Testbed recently evaluated Exegy's Text Miner appliance to assess its applicability to high-performance, automated native language analysis. The evaluation was performed with support from the Computing Applications and Research Department in close collaboration with Global Security programs, and institutional activities in native language analysis. The Exegy Text Miner is a special-purpose device for detecting and flagging user-supplied patterns of characters, whether in streaming text or in collections of documents at very high rates. Patterns may consist of simple lists of words or complex expressions with sub-patterns linked by logical operators. These searches are accomplishedmore » through a combination of specialized hardware (i.e., one or more field-programmable gates arrays in addition to general-purpose processors) and proprietary software that exploits these individual components in an optimal manner (through parallelism and pipelining). For this application the Text Miner has performed accurately and reproducibly at high speeds approaching those documented by Exegy in its technical specifications. The Exegy Text Miner is primarily intended for the single-byte ASCII characters used in English, but at a technical level its capabilities are language-neutral and can be applied to multi-byte character sets such as those found in Arabic and Chinese. The system is used for searching databases or tracking streaming text with respect to one or more lexicons. In a real operational environment it is likely that data would need to be processed separately for each lexicon or search technique. However, the searches would be so fast that multiple passes should not be considered as a limitation a priori. Indeed, it is conceivable that large databases could be searched as often as necessary if new queries were deemed worthwhile. This project is concerned with evaluating the Exegy Text Miner installed in the New Architectures Testbed running under software version 2.0. The concrete goals of the evaluation were to test the speed and accuracy of the Exegy and explore ways that it could be employed in current or future text-processing projects at Lawrence Livermore National Laboratory (LLNL). This study extended beyond this to evaluate its suitability for processing foreign language sources. The scope of this study was limited to the capabilities of the Exegy Text Miner in the file search mode and does not attempt simulating the streaming mode. Since the capabilities of the machine are invariant to the choice of input mode and since timing should not depend on this choice, it was felt that the added effort was not necessary for this restricted study.« less
Ensemble LUT classification for degraded document enhancement
NASA Astrophysics Data System (ADS)
Obafemi-Ajayi, Tayo; Agam, Gady; Frieder, Ophir
2008-01-01
The fast evolution of scanning and computing technologies have led to the creation of large collections of scanned paper documents. Examples of such collections include historical collections, legal depositories, medical archives, and business archives. Moreover, in many situations such as legal litigation and security investigations scanned collections are being used to facilitate systematic exploration of the data. It is almost always the case that scanned documents suffer from some form of degradation. Large degradations make documents hard to read and substantially deteriorate the performance of automated document processing systems. Enhancement of degraded document images is normally performed assuming global degradation models. When the degradation is large, global degradation models do not perform well. In contrast, we propose to estimate local degradation models and use them in enhancing degraded document images. Using a semi-automated enhancement system we have labeled a subset of the Frieder diaries collection.1 This labeled subset was then used to train an ensemble classifier. The component classifiers are based on lookup tables (LUT) in conjunction with the approximated nearest neighbor algorithm. The resulting algorithm is highly effcient. Experimental evaluation results are provided using the Frieder diaries collection.1
NASA Astrophysics Data System (ADS)
Zhang, Hui; Wang, Deqing; Wu, Wenjun; Hu, Hongping
2012-11-01
In today's business environment, enterprises are increasingly under pressure to process the vast amount of data produced everyday within enterprises. One method is to focus on the business intelligence (BI) applications and increasing the commercial added-value through such business analytics activities. Term weighting scheme, which has been used to convert the documents as vectors in the term space, is a vital task in enterprise Information Retrieval (IR), text categorisation, text analytics, etc. When determining term weight in a document, the traditional TF-IDF scheme sets weight value for the term considering only its occurrence frequency within the document and in the entire set of documents, which leads to some meaningful terms that cannot get the appropriate weight. In this article, we propose a new term weighting scheme called Term Frequency - Function of Document Frequency (TF-FDF) to address this issue. Instead of using monotonically decreasing function such as Inverse Document Frequency, FDF presents a convex function that dynamically adjusts weights according to the significance of the words in a document set. This function can be manually tuned based on the distribution of the most meaningful words which semantically represent the document set. Our experiments show that the TF-FDF can achieve higher value of Normalised Discounted Cumulative Gain in IR than that of TF-IDF and its variants, and improving the accuracy of relevance ranking of the IR results.
ERIC Educational Resources Information Center
Armbruster, Bonnie B.; Anderson, Thomas H.
Idea-mapping (i-mapping), a way of representing ideas from a text in the form of a diagram, is defined and illustrated in this document as a way to help students "see" how the ideas they read are linked to each other. The first portion of the document discusses the fundamental relationships found in texts (A is a characteristic of B, A…
Language Classification using N-grams Accelerated by FPGA-based Bloom Filters
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jacob, A; Gokhale, M
N-Gram (n-character sequences in text documents) counting is a well-established technique used in classifying the language of text in a document. In this paper, n-gram processing is accelerated through the use of reconfigurable hardware on the XtremeData XD1000 system. Our design employs parallelism at multiple levels, with parallel Bloom Filters accessing on-chip RAM, parallel language classifiers, and parallel document processing. In contrast to another hardware implementation (HAIL algorithm) that uses off-chip SRAM for lookup, our highly scalable implementation uses only on-chip memory blocks. Our implementation of end-to-end language classification runs at 85x comparable software and 1.45x the competing hardware design.
32 CFR 1801.44 - Action by appeals authority.
Code of Federal Regulations, 2010 CFR
2010-07-01
... CENTER PUBLIC RIGHTS UNDER THE PRIVACY ACT OF 1974 Action On Privacy Act Administrative Appeals § 1801.44... request, the document(s) (sanitized and full text) at issue, and the findings of any concerned office...
Text, photo, and line extraction in scanned documents
NASA Astrophysics Data System (ADS)
Erkilinc, M. Sezer; Jaber, Mustafa; Saber, Eli; Bauer, Peter; Depalov, Dejan
2012-07-01
We propose a page layout analysis algorithm to classify a scanned document into different regions such as text, photo, or strong lines. The proposed scheme consists of five modules. The first module performs several image preprocessing techniques such as image scaling, filtering, color space conversion, and gamma correction to enhance the scanned image quality and reduce the computation time in later stages. Text detection is applied in the second module wherein wavelet transform and run-length encoding are employed to generate and validate text regions, respectively. The third module uses a Markov random field based block-wise segmentation that employs a basis vector projection technique with maximum a posteriori probability optimization to detect photo regions. In the fourth module, methods for edge detection, edge linking, line-segment fitting, and Hough transform are utilized to detect strong edges and lines. In the last module, the resultant text, photo, and edge maps are combined to generate a page layout map using K-Means clustering. The proposed algorithm has been tested on several hundred documents that contain simple and complex page layout structures and contents such as articles, magazines, business cards, dictionaries, and newsletters, and compared against state-of-the-art page-segmentation techniques with benchmark performance. The results indicate that our methodology achieves an average of ˜89% classification accuracy in text, photo, and background regions.
Support Vector Machines: Relevance Feedback and Information Retrieval.
ERIC Educational Resources Information Center
Drucker, Harris; Shahrary, Behzad; Gibbon, David C.
2002-01-01
Compares support vector machines (SVMs) to Rocchio, Ide regular and Ide dec-hi algorithms in information retrieval (IR) of text documents using relevancy feedback. If the preliminary search is so poor that one has to search through many documents to find at least one relevant document, then SVM is preferred. Includes nine tables. (Contains 24…
17 CFR 4.1 - Requirements as to form.
Code of Federal Regulations, 2013 CFR
2013-04-01
... table of contents is required, the electronic document must either include page numbers in the text or... as to form. (a) Each document distributed pursuant to this part 4 must be: (1) Clear and legible; (2...” disclosed under this part 4 must be displayed in capital letters and in boldface type. (c) Where a document...
17 CFR 4.1 - Requirements as to form.
Code of Federal Regulations, 2010 CFR
2010-04-01
... table of contents is required, the electronic document must either include page numbers in the text or... as to form. (a) Each document distributed pursuant to this part 4 must be: (1) Clear and legible; (2...” disclosed under this part 4 must be displayed in capital letters and in boldface type. (c) Where a document...
17 CFR 4.1 - Requirements as to form.
Code of Federal Regulations, 2012 CFR
2012-04-01
... table of contents is required, the electronic document must either include page numbers in the text or... as to form. (a) Each document distributed pursuant to this part 4 must be: (1) Clear and legible; (2...” disclosed under this part 4 must be displayed in capital letters and in boldface type. (c) Where a document...
17 CFR 4.1 - Requirements as to form.
Code of Federal Regulations, 2011 CFR
2011-04-01
... table of contents is required, the electronic document must either include page numbers in the text or... as to form. (a) Each document distributed pursuant to this part 4 must be: (1) Clear and legible; (2...” disclosed under this part 4 must be displayed in capital letters and in boldface type. (c) Where a document...
17 CFR 4.1 - Requirements as to form.
Code of Federal Regulations, 2014 CFR
2014-04-01
... table of contents is required, the electronic document must either include page numbers in the text or... as to form. (a) Each document distributed pursuant to this part 4 must be: (1) Clear and legible; (2...” disclosed under this part 4 must be displayed in capital letters and in boldface type. (c) Where a document...
EFL Learners' Multiple Documents Literacy: Effects of a Strategy-Directed Intervention Program
ERIC Educational Resources Information Center
Karimi, Mohammad Nabi
2015-01-01
There is a substantial body of L2 research documenting the central role of strategy instruction in reading comprehension. However, this line of research has been conducted mostly within the single text paradigm of reading research. With reading literacy undergoing a marked shift from single source reading to multiple documents literacy, little is…
American Catholic Higher Education. Essential Documents, 1967-1990.
ERIC Educational Resources Information Center
Gallin, Alice, Ed.
This reference volume contains the texts of documents pertinent to the development of Catholic higher education during the years from 1967 to 1990. The documents reveal church officials' and university presidents' collaborative efforts to address the questions of what it means it mean to be a university or college and what it means for such an…
Business Documents Don't Have to Be Boring
ERIC Educational Resources Information Center
Schultz, Benjamin
2006-01-01
With business documents, visuals can serve to enhance the written word in conveying the message. Images can be especially effective when used subtly, on part of the page, on successive pages to provide continuity, or even set as watermarks over the entire page. A main reason given for traditional text-only business documents is that they are…
Federal Register 2010, 2011, 2012, 2013, 2014
2011-10-12
... 11-134] Facilitating the Deployment of Text-to-911 and Other Next Generation 911 Applications... 911 Public Safety Answering Points (PSAPs) via text, photos, videos, and data and enhance the... 22, 2011. The full text of this document is available for public inspection during regular business...
Different Words for the Same Concept: Learning Collaboratively from Multiple Documents
ERIC Educational Resources Information Center
Jucks, Regina; Paus, Elisabeth
2013-01-01
This study investigated how varying the lexical encodings of technical terms in multiple texts influences learners' dyadic processing of scientific-related information. Fifty-seven pairs of college students read journalistic texts on depression. Each partner in a dyad received one text; for half of the dyads the partner's text contained different…
Banerjee, Arindam; Ghosh, Joydeep
2004-05-01
Competitive learning mechanisms for clustering, in general, suffer from poor performance for very high-dimensional (>1000) data because of "curse of dimensionality" effects. In applications such as document clustering, it is customary to normalize the high-dimensional input vectors to unit length, and it is sometimes also desirable to obtain balanced clusters, i.e., clusters of comparable sizes. The spherical kmeans (spkmeans) algorithm, which normalizes the cluster centers as well as the inputs, has been successfully used to cluster normalized text documents in 2000+ dimensional space. Unfortunately, like regular kmeans and its soft expectation-maximization-based version, spkmeans tends to generate extremely imbalanced clusters in high-dimensional spaces when the desired number of clusters is large (tens or more). This paper first shows that the spkmeans algorithm can be derived from a certain maximum likelihood formulation using a mixture of von Mises-Fisher distributions as the generative model, and in fact, it can be considered as a batch-mode version of (normalized) competitive learning. The proposed generative model is then adapted in a principled way to yield three frequency-sensitive competitive learning variants that are applicable to static data and produced high-quality and well-balanced clusters for high-dimensional data. Like kmeans, each iteration is linear in the number of data points and in the number of clusters for all the three algorithms. A frequency-sensitive algorithm to cluster streaming data is also proposed. Experimental results on clustering of high-dimensional text data sets are provided to show the effectiveness and applicability of the proposed techniques. Index Terms-Balanced clustering, expectation maximization (EM), frequency-sensitive competitive learning (FSCL), high-dimensional clustering, kmeans, normalized data, scalable clustering, streaming data, text clustering.
Automatic system for computer program documentation
NASA Technical Reports Server (NTRS)
Simmons, D. B.; Elliott, R. W.; Arseven, S.; Colunga, D.
1972-01-01
Work done on a project to design an automatic system for computer program documentation aids was made to determine what existing programs could be used effectively to document computer programs. Results of the study are included in the form of an extensive bibliography and working papers on appropriate operating systems, text editors, program editors, data structures, standards, decision tables, flowchart systems, and proprietary documentation aids. The preliminary design for an automated documentation system is also included. An actual program has been documented in detail to demonstrate the types of output that can be produced by the proposed system.
2010-01-01
Background In the United States, the Health Insurance Portability and Accountability Act (HIPAA) protects the confidentiality of patient data and requires the informed consent of the patient and approval of the Internal Review Board to use data for research purposes, but these requirements can be waived if data is de-identified. For clinical data to be considered de-identified, the HIPAA "Safe Harbor" technique requires 18 data elements (called PHI: Protected Health Information) to be removed. The de-identification of narrative text documents is often realized manually, and requires significant resources. Well aware of these issues, several authors have investigated automated de-identification of narrative text documents from the electronic health record, and a review of recent research in this domain is presented here. Methods This review focuses on recently published research (after 1995), and includes relevant publications from bibliographic queries in PubMed, conference proceedings, the ACM Digital Library, and interesting publications referenced in already included papers. Results The literature search returned more than 200 publications. The majority focused only on structured data de-identification instead of narrative text, on image de-identification, or described manual de-identification, and were therefore excluded. Finally, 18 publications describing automated text de-identification were selected for detailed analysis of the architecture and methods used, the types of PHI detected and removed, the external resources used, and the types of clinical documents targeted. All text de-identification systems aimed to identify and remove person names, and many included other types of PHI. Most systems used only one or two specific clinical document types, and were mostly based on two different groups of methodologies: pattern matching and machine learning. Many systems combined both approaches for different types of PHI, but the majority relied only on pattern matching, rules, and dictionaries. Conclusions In general, methods based on dictionaries performed better with PHI that is rarely mentioned in clinical text, but are more difficult to generalize. Methods based on machine learning tend to perform better, especially with PHI that is not mentioned in the dictionaries used. Finally, the issues of anonymization, sufficient performance, and "over-scrubbing" are discussed in this publication. PMID:20678228
Meystre, Stephane M; Friedlin, F Jeffrey; South, Brett R; Shen, Shuying; Samore, Matthew H
2010-08-02
In the United States, the Health Insurance Portability and Accountability Act (HIPAA) protects the confidentiality of patient data and requires the informed consent of the patient and approval of the Internal Review Board to use data for research purposes, but these requirements can be waived if data is de-identified. For clinical data to be considered de-identified, the HIPAA "Safe Harbor" technique requires 18 data elements (called PHI: Protected Health Information) to be removed. The de-identification of narrative text documents is often realized manually, and requires significant resources. Well aware of these issues, several authors have investigated automated de-identification of narrative text documents from the electronic health record, and a review of recent research in this domain is presented here. This review focuses on recently published research (after 1995), and includes relevant publications from bibliographic queries in PubMed, conference proceedings, the ACM Digital Library, and interesting publications referenced in already included papers. The literature search returned more than 200 publications. The majority focused only on structured data de-identification instead of narrative text, on image de-identification, or described manual de-identification, and were therefore excluded. Finally, 18 publications describing automated text de-identification were selected for detailed analysis of the architecture and methods used, the types of PHI detected and removed, the external resources used, and the types of clinical documents targeted. All text de-identification systems aimed to identify and remove person names, and many included other types of PHI. Most systems used only one or two specific clinical document types, and were mostly based on two different groups of methodologies: pattern matching and machine learning. Many systems combined both approaches for different types of PHI, but the majority relied only on pattern matching, rules, and dictionaries. In general, methods based on dictionaries performed better with PHI that is rarely mentioned in clinical text, but are more difficult to generalize. Methods based on machine learning tend to perform better, especially with PHI that is not mentioned in the dictionaries used. Finally, the issues of anonymization, sufficient performance, and "over-scrubbing" are discussed in this publication.
Onboard shuttle on-line software requirements system: Prototype
NASA Technical Reports Server (NTRS)
Kolkhorst, Barbara; Ogletree, Barry
1989-01-01
The prototype discussed here was developed as proof of a concept for a system which could support high volumes of requirements documents with integrated text and graphics; the solution proposed here could be extended to other projects whose goal is to place paper documents in an electronic system for viewing and printing purposes. The technical problems (such as conversion of documentation between word processors, management of a variety of graphics file formats, and difficulties involved in scanning integrated text and graphics) would be very similar for other systems of this type. Indeed, technological advances in areas such as scanning hardware and software and display terminals insure that some of the problems encountered here will be solved in the near-term (less than five years). Examples of these solvable problems include automated input of integrated text and graphics, errors in the recognition process, and the loss of image information which results from the digitization process. The solution developed for the Online Software Requirements System is modular and allows hardware and software components to be upgraded or replaced as industry solutions mature. The extensive commercial software content allows the NASA customer to apply resources to solving the problem and maintaining documents.
Comparative Analysis of Document level Text Classification Algorithms using R
NASA Astrophysics Data System (ADS)
Syamala, Maganti; Nalini, N. J., Dr; Maguluri, Lakshamanaphaneendra; Ragupathy, R., Dr.
2017-08-01
From the past few decades there has been tremendous volumes of data available in Internet either in structured or unstructured form. Also, there is an exponential growth of information on Internet, so there is an emergent need of text classifiers. Text mining is an interdisciplinary field which draws attention on information retrieval, data mining, machine learning, statistics and computational linguistics. And to handle this situation, a wide range of supervised learning algorithms has been introduced. Among all these K-Nearest Neighbor(KNN) is efficient and simplest classifier in text classification family. But KNN suffers from imbalanced class distribution and noisy term features. So, to cope up with this challenge we use document based centroid dimensionality reduction(CentroidDR) using R Programming. By combining these two text classification techniques, KNN and Centroid classifiers, we propose a scalable and effective flat classifier, called MCenKNN which works well substantially better than CenKNN.
Tsatsaronis, George; Balikas, Georgios; Malakasiotis, Prodromos; Partalas, Ioannis; Zschunke, Matthias; Alvers, Michael R; Weissenborn, Dirk; Krithara, Anastasia; Petridis, Sergios; Polychronopoulos, Dimitris; Almirantis, Yannis; Pavlopoulos, John; Baskiotis, Nicolas; Gallinari, Patrick; Artiéres, Thierry; Ngomo, Axel-Cyrille Ngonga; Heino, Norman; Gaussier, Eric; Barrio-Alvers, Liliana; Schroeder, Michael; Androutsopoulos, Ion; Paliouras, Georgios
2015-04-30
This article provides an overview of the first BIOASQ challenge, a competition on large-scale biomedical semantic indexing and question answering (QA), which took place between March and September 2013. BIOASQ assesses the ability of systems to semantically index very large numbers of biomedical scientific articles, and to return concise and user-understandable answers to given natural language questions by combining information from biomedical articles and ontologies. The 2013 BIOASQ competition comprised two tasks, Task 1a and Task 1b. In Task 1a participants were asked to automatically annotate new PUBMED documents with MESH headings. Twelve teams participated in Task 1a, with a total of 46 system runs submitted, and one of the teams performing consistently better than the MTI indexer used by NLM to suggest MESH headings to curators. Task 1b used benchmark datasets containing 29 development and 282 test English questions, along with gold standard (reference) answers, prepared by a team of biomedical experts from around Europe and participants had to automatically produce answers. Three teams participated in Task 1b, with 11 system runs. The BIOASQ infrastructure, including benchmark datasets, evaluation mechanisms, and the results of the participants and baseline methods, is publicly available. A publicly available evaluation infrastructure for biomedical semantic indexing and QA has been developed, which includes benchmark datasets, and can be used to evaluate systems that: assign MESH headings to published articles or to English questions; retrieve relevant RDF triples from ontologies, relevant articles and snippets from PUBMED Central; produce "exact" and paragraph-sized "ideal" answers (summaries). The results of the systems that participated in the 2013 BIOASQ competition are promising. In Task 1a one of the systems performed consistently better from the NLM's MTI indexer. In Task 1b the systems received high scores in the manual evaluation of the "ideal" answers; hence, they produced high quality summaries as answers. Overall, BIOASQ helped obtain a unified view of how techniques from text classification, semantic indexing, document and passage retrieval, question answering, and text summarization can be combined to allow biomedical experts to obtain concise, user-understandable answers to questions reflecting their real information needs.
PuReD-MCL: a graph-based PubMed document clustering methodology.
Theodosiou, T; Darzentas, N; Angelis, L; Ouzounis, C A
2008-09-01
Biomedical literature is the principal repository of biomedical knowledge, with PubMed being the most complete database collecting, organizing and analyzing such textual knowledge. There are numerous efforts that attempt to exploit this information by using text mining and machine learning techniques. We developed a novel approach, called PuReD-MCL (Pubmed Related Documents-MCL), which is based on the graph clustering algorithm MCL and relevant resources from PubMed. PuReD-MCL avoids using natural language processing (NLP) techniques directly; instead, it takes advantage of existing resources, available from PubMed. PuReD-MCL then clusters documents efficiently using the MCL graph clustering algorithm, which is based on graph flow simulation. This process allows users to analyse the results by highlighting important clues, and finally to visualize the clusters and all relevant information using an interactive graph layout algorithm, for instance BioLayout Express 3D. The methodology was applied to two different datasets, previously used for the validation of the document clustering tool TextQuest. The first dataset involves the organisms Escherichia coli and yeast, whereas the second is related to Drosophila development. PuReD-MCL successfully reproduces the annotated results obtained from TextQuest, while at the same time provides additional insights into the clusters and the corresponding documents. Source code in perl and R are available from http://tartara.csd.auth.gr/~theodos/
Automated compound classification using a chemical ontology.
Bobach, Claudia; Böhme, Timo; Laube, Ulf; Püschel, Anett; Weber, Lutz
2012-12-29
Classification of chemical compounds into compound classes by using structure derived descriptors is a well-established method to aid the evaluation and abstraction of compound properties in chemical compound databases. MeSH and recently ChEBI are examples of chemical ontologies that provide a hierarchical classification of compounds into general compound classes of biological interest based on their structural as well as property or use features. In these ontologies, compounds have been assigned manually to their respective classes. However, with the ever increasing possibilities to extract new compounds from text documents using name-to-structure tools and considering the large number of compounds deposited in databases, automated and comprehensive chemical classification methods are needed to avoid the error prone and time consuming manual classification of compounds. In the present work we implement principles and methods to construct a chemical ontology of classes that shall support the automated, high-quality compound classification in chemical databases or text documents. While SMARTS expressions have already been used to define chemical structure class concepts, in the present work we have extended the expressive power of such class definitions by expanding their structure-based reasoning logic. Thus, to achieve the required precision and granularity of chemical class definitions, sets of SMARTS class definitions are connected by OR and NOT logical operators. In addition, AND logic has been implemented to allow the concomitant use of flexible atom lists and stereochemistry definitions. The resulting chemical ontology is a multi-hierarchical taxonomy of concept nodes connected by directed, transitive relationships. A proposal for a rule based definition of chemical classes has been made that allows to define chemical compound classes more precisely than before. The proposed structure-based reasoning logic allows to translate chemistry expert knowledge into a computer interpretable form, preventing erroneous compound assignments and allowing automatic compound classification. The automated assignment of compounds in databases, compound structure files or text documents to their related ontology classes is possible through the integration with a chemical structure search engine. As an application example, the annotation of chemical structure files with a prototypic ontology is demonstrated.
Automated compound classification using a chemical ontology
2012-01-01
Background Classification of chemical compounds into compound classes by using structure derived descriptors is a well-established method to aid the evaluation and abstraction of compound properties in chemical compound databases. MeSH and recently ChEBI are examples of chemical ontologies that provide a hierarchical classification of compounds into general compound classes of biological interest based on their structural as well as property or use features. In these ontologies, compounds have been assigned manually to their respective classes. However, with the ever increasing possibilities to extract new compounds from text documents using name-to-structure tools and considering the large number of compounds deposited in databases, automated and comprehensive chemical classification methods are needed to avoid the error prone and time consuming manual classification of compounds. Results In the present work we implement principles and methods to construct a chemical ontology of classes that shall support the automated, high-quality compound classification in chemical databases or text documents. While SMARTS expressions have already been used to define chemical structure class concepts, in the present work we have extended the expressive power of such class definitions by expanding their structure-based reasoning logic. Thus, to achieve the required precision and granularity of chemical class definitions, sets of SMARTS class definitions are connected by OR and NOT logical operators. In addition, AND logic has been implemented to allow the concomitant use of flexible atom lists and stereochemistry definitions. The resulting chemical ontology is a multi-hierarchical taxonomy of concept nodes connected by directed, transitive relationships. Conclusions A proposal for a rule based definition of chemical classes has been made that allows to define chemical compound classes more precisely than before. The proposed structure-based reasoning logic allows to translate chemistry expert knowledge into a computer interpretable form, preventing erroneous compound assignments and allowing automatic compound classification. The automated assignment of compounds in databases, compound structure files or text documents to their related ontology classes is possible through the integration with a chemical structure search engine. As an application example, the annotation of chemical structure files with a prototypic ontology is demonstrated. PMID:23273256
Predicate Argument Structure Analysis for Use Case Description Modeling
NASA Astrophysics Data System (ADS)
Takeuchi, Hironori; Nakamura, Taiga; Yamaguchi, Takahira
In a large software system development project, many documents are prepared and updated frequently. In such a situation, support is needed for looking through these documents easily to identify inconsistencies and to maintain traceability. In this research, we focus on the requirements documents such as use cases and consider how to create models from the use case descriptions in unformatted text. In the model construction, we propose a few semantic constraints based on the features of the use cases and use them for a predicate argument structure analysis to assign semantic labels to actors and actions. With this approach, we show that we can assign semantic labels without enhancing any existing general lexical resources such as case frame dictionaries and design a less language-dependent model construction architecture. By using the constructed model, we consider a system for quality analysis of the use cases and automated test case generation to keep the traceability between document sets. We evaluated the reuse of the existing use cases and generated test case steps automatically with the proposed prototype system from real-world use cases in the development of a system using a packaged application. Based on the evaluation, we show how to construct models with high precision from English and Japanese use case data. Also, we could generate good test cases for about 90% of the real use cases through the manual improvement of the descriptions based on the feedback from the quality analysis system.
ERIC Educational Resources Information Center
Congress of the U.S., Washington, DC. House Committee on the Judiciary.
This document contains witnesses' testimonies and prepared statements from the Congressional hearing called to consider enactment of H.R. 2673, a bill to facilitate implementation of the 1980 Hague Convention on the Civil Aspects of International Child Abduction. The text of H.R. 2673 is included in the document as is the text of H.R. 3971, a bill…
Graphics-based intelligent search and abstracting using Data Modeling
NASA Astrophysics Data System (ADS)
Jaenisch, Holger M.; Handley, James W.; Case, Carl T.; Songy, Claude G.
2002-11-01
This paper presents an autonomous text and context-mining algorithm that converts text documents into point clouds for visual search cues. This algorithm is applied to the task of data-mining a scriptural database comprised of the Old and New Testaments from the Bible and the Book of Mormon, Doctrine and Covenants, and the Pearl of Great Price. Results are generated which graphically show the scripture that represents the average concept of the database and the mining of the documents down to the verse level.
Abdulla, Ahmed AbdoAziz Ahmed; Lin, Hongfei; Xu, Bo; Banbhrani, Santosh Kumar
2016-07-25
Biomedical literature retrieval is becoming increasingly complex, and there is a fundamental need for advanced information retrieval systems. Information Retrieval (IR) programs scour unstructured materials such as text documents in large reserves of data that are usually stored on computers. IR is related to the representation, storage, and organization of information items, as well as to access. In IR one of the main problems is to determine which documents are relevant and which are not to the user's needs. Under the current regime, users cannot precisely construct queries in an accurate way to retrieve particular pieces of data from large reserves of data. Basic information retrieval systems are producing low-quality search results. In our proposed system for this paper we present a new technique to refine Information Retrieval searches to better represent the user's information need in order to enhance the performance of information retrieval by using different query expansion techniques and apply a linear combinations between them, where the combinations was linearly between two expansion results at one time. Query expansions expand the search query, for example, by finding synonyms and reweighting original terms. They provide significantly more focused, particularized search results than do basic search queries. The retrieval performance is measured by some variants of MAP (Mean Average Precision) and according to our experimental results, the combination of best results of query expansion is enhanced the retrieved documents and outperforms our baseline by 21.06 %, even it outperforms a previous study by 7.12 %. We propose several query expansion techniques and their combinations (linearly) to make user queries more cognizable to search engines and to produce higher-quality search results.
Automatic generation of reports at the TELECOM SCC
NASA Astrophysics Data System (ADS)
Beltan, Thierry; Jalbaud, Myriam; Fronton, Jean Francois
In-orbit satellite follow-up produces a certain amount of reports on a regular basis (daily, weekly, quarterly, annually). Most of these documents use the information of former issues with the increments of the last period of time. They are made up of text, tables, graphs or pictures. The system presented here is the SGMT (Systeme de Gestion de la Memoire Technique), which means Technical Memory Mangement System. It provides the system operators with tools to generate the greatest part of these reports, as automatically as possible. It gives an easy access to the reports and the large amount of available memory enables the user to consult data on the complete lifetime of a satellite family.
NASA Technical Reports Server (NTRS)
Srivastava, Ashok; McIntosh, Dawn; Castle, Pat; Pontikakis, Manos; Diev, Vesselin; Zane-Ulman, Brett; Turkov, Eugene; Akella, Ram; Xu, Zuobing; Kumaresan, Sakthi Preethi
2006-01-01
This viewgraph document describes the data mining system developed at NASA Ames. Many NASA programs have large numbers (and types) of problem reports.These free text reports are written by a number of different people, thus the emphasis and wording vary considerably With so much data to sift through, analysts (subject experts) need help identifying any possible safety issues or concerns and help them confirm that they haven't missed important problems. Unsupervised clustering is the initial step to accomplish this; We think we can go much farther, specifically, identify possible recurring anomalies. Recurring anomalies may be indicators of larger systemic problems. The requirement to identify these anomalies has led to the development of Recurring Anomaly Discovery System (ReADS).
Linguistic measures of chemical diversity and the "keywords" of molecular collections.
Woźniak, Michał; Wołos, Agnieszka; Modrzyk, Urszula; Górski, Rafał L; Winkowski, Jan; Bajczyk, Michał; Szymkuć, Sara; Grzybowski, Bartosz A; Eder, Maciej
2018-05-15
Computerized linguistic analyses have proven of immense value in comparing and searching through large text collections ("corpora"), including those deposited on the Internet - indeed, it would nowadays be hard to imagine browsing the Web without, for instance, search algorithms extracting most appropriate keywords from documents. This paper describes how such corpus-linguistic concepts can be extended to chemistry based on characteristic "chemical words" that span more than traditional functional groups and, instead, look at common structural fragments molecules share. Using these words, it is possible to quantify the diversity of chemical collections/databases in new ways and to define molecular "keywords" by which such collections are best characterized and annotated.
Term Familiarity to indicate Perceived and Actual Difficulty of Text in Medical Digital Libraries.
Leroy, Gondy; Endicott, James E
2011-10-01
With increasing text digitization, digital libraries can personalize materials for individuals with different education levels and language skills. To this end, documents need meta-information describing their difficulty level. Previous attempts at such labeling used readability formulas but the formulas have not been validated with modern texts and their outcome is seldom associated with actual difficulty. We focus on medical texts and are developing new, evidence-based meta-tags that are associated with perceived and actual text difficulty. This work describes a first tag, term familiarity , which is based on term frequency in the Google corpus. We evaluated its feasibility to serve as a tag by looking at a document corpus (N=1,073) and found that terms in blogs or journal articles displayed unexpected but significantly different scores. Term familiarity was then applied to texts and results from a previous user study (N=86) and could better explain differences for perceived and actual difficulty.
NASA Astrophysics Data System (ADS)
de Andrade Lopes, Alneu; Minghim, Rosane; Melo, Vinícius; Paulovich, Fernando V.
2006-01-01
The current availability of information many times impair the tasks of searching, browsing and analyzing information pertinent to a topic of interest. This paper presents a methodology to create a meaningful graphical representation of documents corpora targeted at supporting exploration of correlated documents. The purpose of such an approach is to produce a map from a document body on a research topic or field based on the analysis of their contents, and similarities amongst articles. The document map is generated, after text pre-processing, by projecting the data in two dimensions using Latent Semantic Indexing. The projection is followed by hierarchical clustering to support sub-area identification. The map can be interactively explored, helping to narrow down the search for relevant articles. Tests were performed using a collection of documents pre-classified into three research subject classes: Case-Based Reasoning, Information Retrieval, and Inductive Logic Programming. The map produced was capable of separating the main areas and approaching documents by their similarity, revealing possible topics, and identifying boundaries between them. The tool can deal with the exploration of inter-topics and intra-topic relationship and is useful in many contexts that need deciding on relevant articles to read, such as scientific research, education, and training.
ERIC Educational Resources Information Center
Larsen, Kent S., Ed.
Materials in this resource document were compiled for use in a Washington seminar directed to the interests of state and local government to develop strategies for privacy protection. Included are the texts of issue papers and supporting documents in the following subject areas: (1) criminal justice information; (2) public employee records; (3)…
AFT-QuEST Consortium Yearbook. Proceedings of the AFT-QuEST Consortium (April 22-26, 1973).
ERIC Educational Resources Information Center
American Federation of Teachers, Washington, DC.
This document is a report on the proceedings of the 1973 American Federation of Teachers-Quality Educational Standards in Teaching (AFT-QuEST) consortium sponsored by the AFT. Included in this document are the texts of speeches and outlines of workshops and iscussions. The document is divided into the following sections: goals, major proposals,…
A Comparison of Product Realization Frameworks
1993-10-01
software (integrated FrameMaker ). Also included are BOLD for on-line documentation delivery, printer/plotter support, and 18 network licensing support. AMPLE...are built with DSS. Documentation tools include an on-line information system (BOLD), text editing (Notepad), word processing (integrated FrameMaker ...within an application. FrameMaker is fully integrated with the Falcon Framework to provide consistent documentation capabilities within engineering
Conjunctive Cohesion in English Language EU Documents--A Corpus-Based Analysis and Its Implications
ERIC Educational Resources Information Center
Trebits, Anna
2009-01-01
This paper reports the findings of a study which forms part of a larger-scale research project investigating the use of English in the documents of the European Union (EU). The documents of the EU show various features of texts written for legal, business and other specific purposes. Moreover, the translation services of the EU institutions often…
Research notes : information at your fingertips!
DOT National Transportation Integrated Search
2000-03-01
TRIS Online includes full-text reports or links to publishers or suppliers of the original documents. You will find titles, publication dates, authors, abstracts, and document sources. : Each year over 20,000 new records are added to TRIS. The databa...
The Council of Europe and Sport, 1966-1998. Volume III: Texts of the Anti-Doping Convention.
ERIC Educational Resources Information Center
Council of Europe, Strasbourg (France).
This document presents texts in the field of sports and doping that were adopted by various committees of the Council of Europe. The seven sections present: (1) "Texts Adopted by the Committee of Ministers, 1996-1988"; (2) "Texts Adopted at the Conferences of European Ministers Responsible for Sport Since 1978" and…
A Study of Readability of Texts in Bangla through Machine Learning Approaches
ERIC Educational Resources Information Center
Sinha, Manjira; Basu, Anupam
2016-01-01
In this work, we have investigated text readability in Bangla language. Text readability is an indicator of the suitability of a given document with respect to a target reader group. Therefore, text readability has huge impact on educational content preparation. The advances in the field of natural language processing have enabled the automatic…
A Survey of Text Materials Used in Aviation Maintenance Technician Schools. Final Report.
ERIC Educational Resources Information Center
Allen, David; Bowers, William K.
The report documents the results of a national survey of book publishing firms and aviation maintenance technician schools to (1) identify the text materials used in the training of aviation mechanics; (2) appraise the suitability and availability of identified text materials; and (3) determine the adequacy of the text materials in meeting the…
Automation for System Safety Analysis
NASA Technical Reports Server (NTRS)
Malin, Jane T.; Fleming, Land; Throop, David; Thronesbery, Carroll; Flores, Joshua; Bennett, Ted; Wennberg, Paul
2009-01-01
This presentation describes work to integrate a set of tools to support early model-based analysis of failures and hazards due to system-software interactions. The tools perform and assist analysts in the following tasks: 1) extract model parts from text for architecture and safety/hazard models; 2) combine the parts with library information to develop the models for visualization and analysis; 3) perform graph analysis and simulation to identify and evaluate possible paths from hazard sources to vulnerable entities and functions, in nominal and anomalous system-software configurations and scenarios; and 4) identify resulting candidate scenarios for software integration testing. There has been significant technical progress in model extraction from Orion program text sources, architecture model derivation (components and connections) and documentation of extraction sources. Models have been derived from Internal Interface Requirements Documents (IIRDs) and FMEA documents. Linguistic text processing is used to extract model parts and relationships, and the Aerospace Ontology also aids automated model development from the extracted information. Visualizations of these models assist analysts in requirements overview and in checking consistency and completeness.
Correcting geometric and photometric distortion of document images on a smartphone
NASA Astrophysics Data System (ADS)
Simon, Christian; Williem; Park, In Kyu
2015-01-01
A set of document image processing algorithms for improving the optical character recognition (OCR) capability of smartphone applications is presented. The scope of the problem covers the geometric and photometric distortion correction of document images. The proposed framework was developed to satisfy industrial requirements. It is implemented on an off-the-shelf smartphone with limited resources in terms of speed and memory. Geometric distortions, i.e., skew and perspective distortion, are corrected by sending horizontal and vertical vanishing points toward infinity in a downsampled image. Photometric distortion includes image degradation from moiré pattern noise and specular highlights. Moiré pattern noise is removed using low-pass filters with different sizes independently applied to the background and text region. The contrast of the text in a specular highlighted area is enhanced by locally enlarging the intensity difference between the background and text while the noise is suppressed. Intensive experiments indicate that the proposed methods show a consistent and robust performance on a smartphone with a runtime of less than 1 s.
Boost OCR accuracy using iVector based system combination approach
NASA Astrophysics Data System (ADS)
Peng, Xujun; Cao, Huaigu; Natarajan, Prem
2015-01-01
Optical character recognition (OCR) is a challenging task because most existing preprocessing approaches are sensitive to writing style, writing material, noises and image resolution. Thus, a single recognition system cannot address all factors of real document images. In this paper, we describe an approach to combine diverse recognition systems by using iVector based features, which is a newly developed method in the field of speaker verification. Prior to system combination, document images are preprocessed and text line images are extracted with different approaches for each system, where iVector is transformed from a high-dimensional supervector of each text line and is used to predict the accuracy of OCR. We merge hypotheses from multiple recognition systems according to the overlap ratio and the predicted OCR score of text line images. We present evaluation results on an Arabic document database where the proposed method is compared against the single best OCR system using word error rate (WER) metric.
JEnsembl: a version-aware Java API to Ensembl data systems
Paterson, Trevor; Law, Andy
2012-01-01
Motivation: The Ensembl Project provides release-specific Perl APIs for efficient high-level programmatic access to data stored in various Ensembl database schema. Although Perl scripts are perfectly suited for processing large volumes of text-based data, Perl is not ideal for developing large-scale software applications nor embedding in graphical interfaces. The provision of a novel Java API would facilitate type-safe, modular, object-orientated development of new Bioinformatics tools with which to access, analyse and visualize Ensembl data. Results: The JEnsembl API implementation provides basic data retrieval and manipulation functionality from the Core, Compara and Variation databases for all species in Ensembl and EnsemblGenomes and is a platform for the development of a richer API to Ensembl datasources. The JEnsembl architecture uses a text-based configuration module to provide evolving, versioned mappings from database schema to code objects. A single installation of the JEnsembl API can therefore simultaneously and transparently connect to current and previous database instances (such as those in the public archive) thus facilitating better analysis repeatability and allowing ‘through time’ comparative analyses to be performed. Availability: Project development, released code libraries, Maven repository and documentation are hosted at SourceForge (http://jensembl.sourceforge.net). Contact: jensembl-develop@lists.sf.net, andy.law@roslin.ed.ac.uk, trevor.paterson@roslin.ed.ac.uk PMID:22945789
A Simple Test Identifies Selection on Complex Traits.
Beissinger, Tim; Kruppa, Jochen; Cavero, David; Ha, Ngoc-Thuy; Erbe, Malena; Simianer, Henner
2018-05-01
Important traits in agricultural, natural, and human populations are increasingly being shown to be under the control of many genes that individually contribute only a small proportion of genetic variation. However, the majority of modern tools in quantitative and population genetics, including genome-wide association studies and selection-mapping protocols, are designed to identify individual genes with large effects. We have developed an approach to identify traits that have been under selection and are controlled by large numbers of loci. In contrast to existing methods, our technique uses additive-effects estimates from all available markers, and relates these estimates to allele-frequency change over time. Using this information, we generate a composite statistic, denoted [Formula: see text] which can be used to test for significant evidence of selection on a trait. Our test requires pre- and postselection genotypic data but only a single time point with phenotypic information. Simulations demonstrate that [Formula: see text] is powerful for identifying selection, particularly in situations where the trait being tested is controlled by many genes, which is precisely the scenario where classical approaches for selection mapping are least powerful. We apply this test to breeding populations of maize and chickens, where we demonstrate the successful identification of selection on traits that are documented to have been under selection. Copyright © 2018 Beissinger et al.
Liu, Yuanchao; Liu, Ming; Wang, Xin
2015-01-01
The objective of text clustering is to divide document collections into clusters based on the similarity between documents. In this paper, an extension-based feature modeling approach towards semantically sensitive text clustering is proposed along with the corresponding feature space construction and similarity computation method. By combining the similarity in traditional feature space and that in extension space, the adverse effects of the complexity and diversity of natural language can be addressed and clustering semantic sensitivity can be improved correspondingly. The generated clusters can be organized using different granularities. The experimental evaluations on well-known clustering algorithms and datasets have verified the effectiveness of our approach.
Liu, Yuanchao; Liu, Ming; Wang, Xin
2015-01-01
The objective of text clustering is to divide document collections into clusters based on the similarity between documents. In this paper, an extension-based feature modeling approach towards semantically sensitive text clustering is proposed along with the corresponding feature space construction and similarity computation method. By combining the similarity in traditional feature space and that in extension space, the adverse effects of the complexity and diversity of natural language can be addressed and clustering semantic sensitivity can be improved correspondingly. The generated clusters can be organized using different granularities. The experimental evaluations on well-known clustering algorithms and datasets have verified the effectiveness of our approach. PMID:25794172
Use of Co-occurrences for Temporal Expressions Annotation
NASA Astrophysics Data System (ADS)
Craveiro, Olga; Macedo, Joaquim; Madeira, Henrique
The annotation or extraction of temporal information from text documents is becoming increasingly important in many natural language processing applications such as text summarization, information retrieval, question answering, etc.. This paper presents an original method for easy recognition of temporal expressions in text documents. The method creates semantically classified temporal patterns, using word co-occurrences obtained from training corpora and a pre-defined seed keywords set, derived from the used language temporal references. A participation on a Portuguese named entity evaluation contest showed promising effectiveness and efficiency results. This approach can be adapted to recognize other type of expressions or languages, within other contexts, by defining the suitable word sets and training corpora.
Classifying clinical notes with pain assessment using machine learning.
Fodeh, Samah Jamal; Finch, Dezon; Bouayad, Lina; Luther, Stephen L; Ling, Han; Kerns, Robert D; Brandt, Cynthia
2017-12-26
Pain is a significant public health problem, affecting millions of people in the USA. Evidence has highlighted that patients with chronic pain often suffer from deficits in pain care quality (PCQ) including pain assessment, treatment, and reassessment. Currently, there is no intelligent and reliable approach to identify PCQ indicators inelectronic health records (EHR). Hereby, we used unstructured text narratives in the EHR to derive pain assessment in clinical notes for patients with chronic pain. Our dataset includes patients with documented pain intensity rating ratings > = 4 and initial musculoskeletal diagnoses (MSD) captured by (ICD-9-CM codes) in fiscal year 2011 and a minimal 1 year of follow-up (follow-up period is 3-yr maximum); with complete data on key demographic variables. A total of 92 patients with 1058 notes was used. First, we manually annotated qualifiers and descriptors of pain assessment using the annotation schema that we previously developed. Second, we developed a reliable classifier for indicators of pain assessment in clinical note. Based on our annotation schema, we found variations in documenting the subclasses of pain assessment. In positive notes, providers mostly documented assessment of pain site (67%) and intensity of pain (57%), followed by persistence (32%). In only 27% of positive notes, did providers document a presumed etiology for the pain complaint or diagnosis. Documentation of patients' reports of factors that aggravate pain was only present in 11% of positive notes. Random forest classifier achieved the best performance labeling clinical notes with pain assessment information, compared to other classifiers; 94, 95, 94, and 94% was observed in terms of accuracy, PPV, F1-score, and AUC, respectively. Despite the wide spectrum of research that utilizes machine learning in many clinical applications, none explored using these methods for pain assessment research. In addition, previous studies using large datasets to detect and analyze characteristics of patients with various types of pain have relied exclusively on billing and coded data as the main source of information. This study, in contrast, harnessed unstructured narrative text data from the EHR to detect pain assessment clinical notes. We developed a Random forest classifier to identify clinical notes with pain assessment information. Compared to other classifiers, ours achieved the best results in most of the reported metrics. Graphical abstract Framework for detecting pain assessment in clinical notes.
Automatic target validation based on neuroscientific literature mining for tractography
Vasques, Xavier; Richardet, Renaud; Hill, Sean L.; Slater, David; Chappelier, Jean-Cedric; Pralong, Etienne; Bloch, Jocelyne; Draganski, Bogdan; Cif, Laura
2015-01-01
Target identification for tractography studies requires solid anatomical knowledge validated by an extensive literature review across species for each seed structure to be studied. Manual literature review to identify targets for a given seed region is tedious and potentially subjective. Therefore, complementary approaches would be useful. We propose to use text-mining models to automatically suggest potential targets from the neuroscientific literature, full-text articles and abstracts, so that they can be used for anatomical connection studies and more specifically for tractography. We applied text-mining models to three structures: two well-studied structures, since validated deep brain stimulation targets, the internal globus pallidus and the subthalamic nucleus and, the nucleus accumbens, an exploratory target for treating psychiatric disorders. We performed a systematic review of the literature to document the projections of the three selected structures and compared it with the targets proposed by text-mining models, both in rat and primate (including human). We ran probabilistic tractography on the nucleus accumbens and compared the output with the results of the text-mining models and literature review. Overall, text-mining the literature could find three times as many targets as two man-weeks of curation could. The overall efficiency of the text-mining against literature review in our study was 98% recall (at 36% precision), meaning that over all the targets for the three selected seeds, only one target has been missed by text-mining. We demonstrate that connectivity for a structure of interest can be extracted from a very large amount of publications and abstracts. We believe this tool will be useful in helping the neuroscience community to facilitate connectivity studies of particular brain regions. The text mining tools used for the study are part of the HBP Neuroinformatics Platform, publicly available at http://connectivity-brainer.rhcloud.com/. PMID:26074781
ASM Based Synthesis of Handwritten Arabic Text Pages
Al-Hamadi, Ayoub; Elzobi, Moftah; El-etriby, Sherif; Ghoneim, Ahmed
2015-01-01
Document analysis tasks, as text recognition, word spotting, or segmentation, are highly dependent on comprehensive and suitable databases for training and validation. However their generation is expensive in sense of labor and time. As a matter of fact, there is a lack of such databases, which complicates research and development. This is especially true for the case of Arabic handwriting recognition, that involves different preprocessing, segmentation, and recognition methods, which have individual demands on samples and ground truth. To bypass this problem, we present an efficient system that automatically turns Arabic Unicode text into synthetic images of handwritten documents and detailed ground truth. Active Shape Models (ASMs) based on 28046 online samples were used for character synthesis and statistical properties were extracted from the IESK-arDB database to simulate baselines and word slant or skew. In the synthesis step ASM based representations are composed to words and text pages, smoothed by B-Spline interpolation and rendered considering writing speed and pen characteristics. Finally, we use the synthetic data to validate a segmentation method. An experimental comparison with the IESK-arDB database encourages to train and test document analysis related methods on synthetic samples, whenever no sufficient natural ground truthed data is available. PMID:26295059
ASM Based Synthesis of Handwritten Arabic Text Pages.
Dinges, Laslo; Al-Hamadi, Ayoub; Elzobi, Moftah; El-Etriby, Sherif; Ghoneim, Ahmed
2015-01-01
Document analysis tasks, as text recognition, word spotting, or segmentation, are highly dependent on comprehensive and suitable databases for training and validation. However their generation is expensive in sense of labor and time. As a matter of fact, there is a lack of such databases, which complicates research and development. This is especially true for the case of Arabic handwriting recognition, that involves different preprocessing, segmentation, and recognition methods, which have individual demands on samples and ground truth. To bypass this problem, we present an efficient system that automatically turns Arabic Unicode text into synthetic images of handwritten documents and detailed ground truth. Active Shape Models (ASMs) based on 28046 online samples were used for character synthesis and statistical properties were extracted from the IESK-arDB database to simulate baselines and word slant or skew. In the synthesis step ASM based representations are composed to words and text pages, smoothed by B-Spline interpolation and rendered considering writing speed and pen characteristics. Finally, we use the synthetic data to validate a segmentation method. An experimental comparison with the IESK-arDB database encourages to train and test document analysis related methods on synthetic samples, whenever no sufficient natural ground truthed data is available.
Yang, Jianji J; Cohen, Aaron M; Cohen, Aaron; McDonagh, Marian S
2008-11-06
Automatic document classification can be valuable in increasing the efficiency in updating systematic reviews (SR). In order for the machine learning process to work well, it is critical to create and maintain high-quality training datasets consisting of expert SR inclusion/exclusion decisions. This task can be laborious, especially when the number of topics is large and source data format is inconsistent.To approach this problem, we build an automated system to streamline the required steps, from initial notification of update in source annotation files to loading the data warehouse, along with a web interface to monitor the status of each topic. In our current collection of 26 SR topics, we were able to standardize almost all of the relevance judgments and recovered PMIDs for over 80% of all articles. Of those PMIDs, over 99% were correct in a manual random sample study. Our system performs an essential function in creating training and evaluation data sets for SR text mining research.
Yang, Jianji J.; Cohen, Aaron M.; McDonagh, Marian S.
2008-01-01
Automatic document classification can be valuable in increasing the efficiency in updating systematic reviews (SR). In order for the machine learning process to work well, it is critical to create and maintain high-quality training datasets consisting of expert SR inclusion/exclusion decisions. This task can be laborious, especially when the number of topics is large and source data format is inconsistent. To approach this problem, we build an automated system to streamline the required steps, from initial notification of update in source annotation files to loading the data warehouse, along with a web interface to monitor the status of each topic. In our current collection of 26 SR topics, we were able to standardize almost all of the relevance judgments and recovered PMIDs for over 80% of all articles. Of those PMIDs, over 99% were correct in a manual random sample study. Our system performs an essential function in creating training and evaluation datasets for SR text mining research. PMID:18999194
Task-Driven Comparison of Topic Models.
Alexander, Eric; Gleicher, Michael
2016-01-01
Topic modeling, a method of statistically extracting thematic content from a large collection of texts, is used for a wide variety of tasks within text analysis. Though there are a growing number of tools and techniques for exploring single models, comparisons between models are generally reduced to a small set of numerical metrics. These metrics may or may not reflect a model's performance on the analyst's intended task, and can therefore be insufficient to diagnose what causes differences between models. In this paper, we explore task-centric topic model comparison, considering how we can both provide detail for a more nuanced understanding of differences and address the wealth of tasks for which topic models are used. We derive comparison tasks from single-model uses of topic models, which predominantly fall into the categories of understanding topics, understanding similarity, and understanding change. Finally, we provide several visualization techniques that facilitate these tasks, including buddy plots, which combine color and position encodings to allow analysts to readily view changes in document similarity.
The use of the liquid crystal display (LCD) panel as a teaching aid in medical lectures.
Wong, K T
1992-01-01
The liquid crystal display (LCD) panel is designed to project on-screen information of a microcomputer onto a larger screen with the aid of a standard overhead projector, so that large audiences may view on-screen information without having to crowd around the TV monitor. As little has been written about its use as a visual aid in medical teaching, the present report documents its use in a series of pathology lectures delivered, over a 2-year period, to two classes of about 150 medical students each. Some advantages of the LCD panel over the 35mm slide include the flexibility of last-minute text changes and less lead time needed for text preparation. It eliminates the problems of messy last-minute changes in, and improves legibility of, handwritten overhead projector transparencies. The disadvantages of using an LCD panel include the relatively bulky equipment which may pose transport problems, image clarity that is inferior to the 35mm slide, and equipment costs.
Raptor: An Enterprise Knowledge Discovery Engine Version 2.0
DOE Office of Scientific and Technical Information (OSTI.GOV)
2011-08-31
The Raptor Version 2.0 computer code uses a set of documents as seed documents to recommend documents of interest from a large, target set of documents. The computer code provides results that show the recommended documents with the highest similarity to the seed documents. Version 2.0 was specifically developed to work with SharePoint 2007 and MS SQL server.
Goldacre, Ben; Gray, Jonathan
2016-04-08
OpenTrials is a collaborative and open database for all available structured data and documents on all clinical trials, threaded together by individual trial. With a versatile and expandable data schema, it is initially designed to host and match the following documents and data for each trial: registry entries; links, abstracts, or texts of academic journal papers; portions of regulatory documents describing individual trials; structured data on methods and results extracted by systematic reviewers or other researchers; clinical study reports; and additional documents such as blank consent forms, blank case report forms, and protocols. The intention is to create an open, freely re-usable index of all such information and to increase discoverability, facilitate research, identify inconsistent data, enable audits on the availability and completeness of this information, support advocacy for better data and drive up standards around open data in evidence-based medicine. The project has phase I funding. This will allow us to create a practical data schema and populate the database initially through web-scraping, basic record linkage techniques, crowd-sourced curation around selected drug areas, and import of existing sources of structured and documents. It will also allow us to create user-friendly web interfaces onto the data and conduct user engagement workshops to optimise the database and interface designs. Where other projects have set out to manually and perfectly curate a narrow range of information on a smaller number of trials, we aim to use a broader range of techniques and attempt to match a very large quantity of information on all trials. We are currently seeking feedback and additional sources of structured data.
NASA Astrophysics Data System (ADS)
Wilson, B. D.; McGibbney, L. J.; Mattmann, C. A.; Ramirez, P.; Joyce, M.; Whitehall, K. D.
2015-12-01
Quantifying scientific relevancy is of increasing importance to NASA and the research community. Scientific relevancy may be defined by mapping the impacts of a particular NASA mission, instrument, and/or retrieved variables to disciplines such as climate predictions, natural hazards detection and mitigation processes, education, and scientific discoveries. Related to relevancy, is the ability to expose data with similar attributes. This in turn depends upon the ability for us to extract latent, implicit document features from scientific data and resources and make them explicit, accessible and useable for search activities amongst others. This paper presents MemexGATE; a server side application, command line interface and computing environment for running large scale metadata extraction, general architecture text engineering, document classification and indexing tasks over document resources such as social media streams, scientific literature archives, legal documentation, etc. This work builds on existing experiences using MemexGATE (funded, developed and validated through the DARPA Memex Progrjam PI Mattmann) for extracting and leveraging latent content features from document resources within the Materials Research domain. We extend the software functionality capability to the domain of scientific literature with emphasis on the expansion of gazetteer lists, named entity rules, natural language construct labeling (e.g. synonym, antonym, hyponym, etc.) efforts to enable extraction of latent content features from data hosted by wide variety of scientific literature vendors (AGU Meeting Abstract Database, Springer, Wiley Online, Elsevier, etc.) hosting earth science literature. Such literature makes both implicit and explicit references to NASA datasets and relationships between such concepts stored across EOSDIS DAAC's hence we envisage that a significant part of this effort will also include development and understanding of relevancy signals which can ultimately be utilized for improved search and relevancy ranking across scientific literature.
Pankau, Thomas; Wichmann, Gunnar; Neumuth, Thomas; Preim, Bernhard; Dietz, Andreas; Stumpp, Patrick; Boehm, Andreas
2015-10-01
Many treatment approaches are available for head and neck cancer (HNC), leading to challenges for a multidisciplinary medical team in matching each patient with an appropriate regimen. In this effort, primary diagnostics and its reliable documentation are indispensable. A three-dimensional (3D) documentation system was developed and tested to determine its influence on interpretation of these data, especially for TNM classification. A total of 42 HNC patient data sets were available, including primary diagnostics such as panendoscopy, performed and evaluated by an experienced head and neck surgeon. In addition to the conventional panendoscopy form and report, a 3D representation was generated with the "Tumor Therapy Manager" (TTM) software. These cases were randomly re-evaluated by 11 experienced otolaryngologists from five hospitals, half with and half without the TTM data. The accuracy of tumor staging was assessed by pre-post comparison of the TNM classification. TNM staging showed no significant differences in tumor classification (T) with and without 3D from TTM. However, there was a significant decrease in standard deviation from 0.86 to 0.63 via TTM ([Formula: see text]). In nodal staging without TTM, the lymph nodes (N) were significantly underestimated with [Formula: see text] classes compared with [Formula: see text] with TTM ([Formula: see text]). Likewise, the standard deviation was reduced from 0.79 to 0.69 ([Formula: see text]). There was no influence of TTM results on the evaluation of distant metastases (M). TNM staging was more reproducible and nodal staging more accurate when 3D documentation of HNC primary data was available to experienced otolaryngologists. The more precise assessment of the tumor classification with TTM should provide improved decision-making concerning therapy, especially within the interdisciplinary tumor board.
NASA Astrophysics Data System (ADS)
Boling, M. E.
1989-09-01
Prototypes were assembled pursuant to recommendations made in report K/DSRD-96, Issues and Approaches for Electronic Document Approval and Transmittal Using Digital Signatures and Text Authentication, and to examine and discover the possibilities for integrating available hardware and software to provide cost effective systems for digital signatures and text authentication. These prototypes show that on a LAN, a multitasking, windowed, mouse/keyboard menu-driven interface can be assembled to provide easy and quick access to bit-mapped images of documents, electronic forms and electronic mail messages with a means to sign, encrypt, deliver, receive or retrieve and authenticate text and signatures. In addition they show that some of this same software may be used in a classified environment using host to terminal transactions to accomplish these same operations. Finally, a prototype was developed demonstrating that binary files may be signed electronically and sent by point to point communication and over ARPANET to remote locations where the authenticity of the code and signature may be verified. Related studies on the subject of electronic signatures and text authentication using public key encryption were done within the Department of Energy. These studies include timing studies of public key encryption software and hardware and testing of experimental user-generated host resident software for public key encryption. This software used commercially available command-line source code. These studies are responsive to an initiative within the Office of the Secretary of Defense (OSD) for the protection of unclassified but sensitive data. It is notable that these related studies are all built around the same commercially available public key encryption products from the private sector and that the software selection was made independently by each study group.
New directions in biomedical text annotation: definitions, guidelines and corpus construction
Wilbur, W John; Rzhetsky, Andrey; Shatkay, Hagit
2006-01-01
Background While biomedical text mining is emerging as an important research area, practical results have proven difficult to achieve. We believe that an important first step towards more accurate text-mining lies in the ability to identify and characterize text that satisfies various types of information needs. We report here the results of our inquiry into properties of scientific text that have sufficient generality to transcend the confines of a narrow subject area, while supporting practical mining of text for factual information. Our ultimate goal is to annotate a significant corpus of biomedical text and train machine learning methods to automatically categorize such text along certain dimensions that we have defined. Results We have identified five qualitative dimensions that we believe characterize a broad range of scientific sentences, and are therefore useful for supporting a general approach to text-mining: focus, polarity, certainty, evidence, and directionality. We define these dimensions and describe the guidelines we have developed for annotating text with regard to them. To examine the effectiveness of the guidelines, twelve annotators independently annotated the same set of 101 sentences that were randomly selected from current biomedical periodicals. Analysis of these annotations shows 70–80% inter-annotator agreement, suggesting that our guidelines indeed present a well-defined, executable and reproducible task. Conclusion We present our guidelines defining a text annotation task, along with annotation results from multiple independently produced annotations, demonstrating the feasibility of the task. The annotation of a very large corpus of documents along these guidelines is currently ongoing. These annotations form the basis for the categorization of text along multiple dimensions, to support viable text mining for experimental results, methodology statements, and other forms of information. We are currently developing machine learning methods, to be trained and tested on the annotated corpus, that would allow for the automatic categorization of biomedical text along the general dimensions that we have presented. The guidelines in full detail, along with annotated examples, are publicly available. PMID:16867190
Enriching a document collection by integrating information extraction and PDF annotation
NASA Astrophysics Data System (ADS)
Powley, Brett; Dale, Robert; Anisimoff, Ilya
2009-01-01
Modern digital libraries offer all the hyperlinking possibilities of the World Wide Web: when a reader finds a citation of interest, in many cases she can now click on a link to be taken to the cited work. This paper presents work aimed at providing the same ease of navigation for legacy PDF document collections that were created before the possibility of integrating hyperlinks into documents was ever considered. To achieve our goal, we need to carry out two tasks: first, we need to identify and link citations and references in the text with high reliability; and second, we need the ability to determine physical PDF page locations for these elements. We demonstrate the use of a high-accuracy citation extraction algorithm which significantly improves on earlier reported techniques, and a technique for integrating PDF processing with a conventional text-stream based information extraction pipeline. We demonstrate these techniques in the context of a particular document collection, this being the ACL Anthology; but the same approach can be applied to other document sets.
Federal Register 2010, 2011, 2012, 2013, 2014
2012-08-01
... restrictions has been revised. The Designated List and the regulatory text in that document contain language which is inadvertently not consistent with the rest of the document as to the historical period that the...
10 CFR 961.11 - Text of the contract.
Code of Federal Regulations, 2014 CFR
2014-01-01
... program including information on cost projections, project plans and progress reports. 5. (a) Beginning on...-type documents or computer software (including computer programs, computer software data bases, and computer software documentation). Examples of technical data include research and engineering data...
10 CFR 961.11 - Text of the contract.
Code of Federal Regulations, 2013 CFR
2013-01-01
... program including information on cost projections, project plans and progress reports. 5. (a) Beginning on...-type documents or computer software (including computer programs, computer software data bases, and computer software documentation). Examples of technical data include research and engineering data...
76 FR 17353 - Aviation Communications
Federal Register 2010, 2011, 2012, 2013, 2014
2011-03-29
... publication). The full text of this document is available for public inspection and copying during regular... Printing, Inc., 445 12th Street, SW., Room CY-B402, Washington, DC 20554. The full text may also be...
Creating a Gold Standard for the Readability Measurement of Health Texts
Kandula, Sasikiran; Zeng-Treitler, Qing
2008-01-01
Developing easy-to-read health texts for consumers continues to be a challenge in health communication. Though readability formulae such as Flesch-Kincaid Grade Level have been used in many studies, they were found to be inadequate to estimate the difficulty of some types of health texts. One impediment to the development of new readability assessment techniques is the absence of a gold standard that can be used to validate them. To overcome this deficiency, we have compiled a corpus of 324 health documents consisting of six different types of texts. These documents were manually reviewed and assigned a readability level (1-7 Likert scale) by a panel of five health literacy experts. The expert assigned ratings were found to be highly correlated with a patient representative’s readability ratings (r = 0.81, p<0.0001). PMID:18999150
TES: A Text Extraction System.
ERIC Educational Resources Information Center
Goh, A.; Hui, S. C.
1996-01-01
Describes how TES, a text extraction system, is able to electronically retrieve a set of sentences from a document to form an indicative abstract. Discusses various text abstraction techniques and related work in the area, provides an overview of the TES system, and compares system results against manually produced abstracts. (LAM)
The Relative Ease of Writing Narrative Text.
ERIC Educational Resources Information Center
Kellogg, Ronald T.; And Others
A study investigated whether the narrative writing task is more compatible with the structure of conscious thought than are other writing tasks. If so, composing a narrative text should demand less cognitive effort, occur more fluently, and yield a more coherent document than composing persuasive or descriptive texts. Sixteen college students were…
Death as Insight into Life: Adolescents' Gothic Text Encounters
ERIC Educational Resources Information Center
Del Nero, Jennifer
2017-01-01
This qualitative case study explores adolescents' responses to texts containing death and destruction, a seminal trope of the Gothic literary genre. Participants read both classic and popular culture texts featuring characters grappling with death in their seventh grade reading classroom. Observations, interviews, and documents were collected and…
Analysing Representations of Otherness Using Different Text-Types.
ERIC Educational Resources Information Center
Murphy-LeJeune, Elizabeth; And Others
1996-01-01
Demonstrates how the teacher can use texts to confront learners with cultural representations. Four texts are used to represent a literary extract, a student essay, an advertising document, and a newspaper article. The article illustrates approaches that borrow from stylistics, linguistics, and discourse analysis. (21 references) (Author/CK)
An Overall Perspective of Machine Translation with Its Shortcomings
ERIC Educational Resources Information Center
Akbari, Alireza
2014-01-01
The petition for language translation has strikingly augmented recently due to cross-cultural communication and exchange of information. In order to communicate well, text should be translated correctly and completely in each field such as legal documents, technical texts, scientific texts, publicity leaflets, and instructional materials. In this…
Semantic Metadata for Heterogeneous Spatial Planning Documents
NASA Astrophysics Data System (ADS)
Iwaniak, A.; Kaczmarek, I.; Łukowicz, J.; Strzelecki, M.; Coetzee, S.; Paluszyński, W.
2016-09-01
Spatial planning documents contain information about the principles and rights of land use in different zones of a local authority. They are the basis for administrative decision making in support of sustainable development. In Poland these documents are published on the Web according to a prescribed non-extendable XML schema, designed for optimum presentation to humans in HTML web pages. There is no document standard, and limited functionality exists for adding references to external resources. The text in these documents is discoverable and searchable by general-purpose web search engines, but the semantics of the content cannot be discovered or queried. The spatial information in these documents is geographically referenced but not machine-readable. Major manual efforts are required to integrate such heterogeneous spatial planning documents from various local authorities for analysis, scenario planning and decision support. This article presents results of an implementation using machine-readable semantic metadata to identify relationships among regulations in the text, spatial objects in the drawings and links to external resources. A spatial planning ontology was used to annotate different sections of spatial planning documents with semantic metadata in the Resource Description Framework in Attributes (RDFa). The semantic interpretation of the content, links between document elements and links to external resources were embedded in XHTML pages. An example and use case from the spatial planning domain in Poland is presented to evaluate its efficiency and applicability. The solution enables the automated integration of spatial planning documents from multiple local authorities to assist decision makers with understanding and interpreting spatial planning information. The approach is equally applicable to legal documents from other countries and domains, such as cultural heritage and environmental management.
Recent Experiments with INQUERY
1995-11-01
were conducted with version of the INQUERY information retrieval system INQUERY is based on the Bayesian inference network retrieval model It is...corpus based query expansion For TREC a subset of of the adhoc document set was used to build the InFinder database None of the...experiments that showed signi cant improvements in retrieval eectiveness when document rankings based on the entire document text are combined with
The soil physics contributions of Edgar Buckingham
Nimmo, J.R.; Landa, E.R.
2005-01-01
During 1902 to 1906 as a soil physicist at the USDA Bureau of Soils (BOS), Edgar Buckingham originated the concepts of matric potential, soil-water retention curves, specific water capacity, and unsaturated hydraulic conductivity (K) as a distinct property of a soil. He applied a formula equivalent to Darcy's law (though without specific mention of Darcy's work) to unsaturated flow. He also contributed significant research on quasi-empirical formulas for K as a function of water content, water flow in capillary crevices and in thin films, and scaling. Buckingham's work on gas flow in soils produced paradigms that are consistent with our current understanding. His work on evaporation elucidated the concept of self-mulching and produced sound and sometimes paradoxical generalizations concerning conditions that favor or retard evaporation. Largely overshadowing those achievements, however, is that he launched a theory, still accepted today, that could predict transient water content as a function of time and space. Recently discovered documents reveal some of the arguments Buckingham had with BOS officials, including the text of a two-paragraph conclusion of his famous 1907 report on soil water, and the official letter documenting rejection of that text. Strained interpersonal relations motivated the departure of Buckingham and other brilliant physicists (N.E. Dorsey, F.H. King, and Lyman Briggs) from the BOS during 1903 to 1906. Given that Buckingham and his BOS colleagues had been rapidly developing the means of quantifying unsaturated flow, these strained relations probably slowed the advancement of unsaturated flow theory. ?? Soil Science Society of America.
Analysis of Nature of Science Included in Recent Popular Writing Using Text Mining Techniques
NASA Astrophysics Data System (ADS)
Jiang, Feng; McComas, William F.
2014-09-01
This study examined the inclusion of nature of science (NOS) in popular science writing to determine whether it could serve supplementary resource for teaching NOS and to evaluate the accuracy of text mining and classification as a viable research tool in science education research. Four groups of documents published from 2001 to 2010 were analyzed: Scientific American, Discover magazine, winners of the Royal Society Winton Prize for Science Books, and books from NSTA's list of Outstanding Science Trade Books. Computer analysis categorized passages in the selected documents based on their inclusions of NOS. Human analysis assessed the frequency, context, coverage, and accuracy of the inclusions of NOS within computer identified NOS passages. NOS was rarely addressed in selected document sets but somewhat more frequently addressed in the letters section of the two magazines. This result suggests that readers seem interested in the discussion of NOS-related themes. In the popular science books analyzed, NOS presentations were found more likely to be aggregated in the beginning and the end of the book, rather than scattered throughout. The most commonly addressed NOS elements in the analyzed documents are science and society and empiricism in science. Only one inaccurate presentation of NOS were identified in all analyzed documents. The text mining technique demonstrated exciting performance, which invites more applications of the technique to analyze other aspects of science textbooks, popular science writing, or other materials involved in science teaching and learning.
Query-oriented evidence extraction to support evidence-based medicine practice.
Sarker, Abeed; Mollá, Diego; Paris, Cecile
2016-02-01
Evidence-based medicine practice requires medical practitioners to rely on the best available evidence, in addition to their expertise, when making clinical decisions. The medical domain boasts a large amount of published medical research data, indexed in various medical databases such as MEDLINE. As the size of this data grows, practitioners increasingly face the problem of information overload, and past research has established the time-associated obstacles faced by evidence-based medicine practitioners. In this paper, we focus on the problem of automatic text summarisation to help practitioners quickly find query-focused information from relevant documents. We utilise an annotated corpus that is specialised for the task of evidence-based summarisation of text. In contrast to past summarisation approaches, which mostly rely on surface level features to identify salient pieces of texts that form the summaries, our approach focuses on the use of corpus-based statistics, and domain-specific lexical knowledge for the identification of summary contents. We also apply a target-sentence-specific summarisation technique that reduces the problem of underfitting that persists in generic summarisation models. In automatic evaluations run over a large number of annotated summaries, our extractive summarisation technique statistically outperforms various baseline and benchmark summarisation models with a percentile rank of 96.8%. A manual evaluation shows that our extractive summarisation approach is capable of selecting content with high recall and precision, and may thus be used to generate bottom-line answers to practitioners' queries. Our research shows that the incorporation of specialised data and domain-specific knowledge can significantly improve text summarisation performance in the medical domain. Due to the vast amounts of medical text available, and the high growth of this form of data, we suspect that such summarisation techniques will address the time-related obstacles associated with evidence-based medicine. Copyright © 2015 Elsevier Inc. All rights reserved.
ERIC Educational Resources Information Center
Happel, Sue; Loeb, Joyce
Although the activities in this unit are designed primarily for students in the intermediate grades, the document's text, illustrations, and bibliographic references are suitable for anyone interested in learning about Africa. Following a brief introduction and map work, the document is arranged into six sections. Section 1 traces Africa's history…
10 CFR 961.11 - Text of the contract.
Code of Federal Regulations, 2012 CFR
2012-01-01
... characteristic, of a specific or technical nature. It may, for example, document research, experimental... computer software documentation). Examples of technical data include research and engineering data... repository, to take title to the spent nuclear fuel or high-level radioactive waste involved as expeditiously...
10 CFR 961.11 - Text of the contract.
Code of Federal Regulations, 2011 CFR
2011-01-01
... characteristic, of a specific or technical nature. It may, for example, document research, experimental... computer software documentation). Examples of technical data include research and engineering data... repository, to take title to the spent nuclear fuel or high-level radioactive waste involved as expeditiously...
48 CFR 752.7003 - Documentation for payment.
Code of Federal Regulations, 2011 CFR
2011-10-01
.... 752.7003 Section 752.7003 Federal Acquisition Regulations System AGENCY FOR INTERNATIONAL DEVELOPMENT CLAUSES AND FORMS SOLICITATION PROVISIONS AND CONTRACT CLAUSES Texts of USAID Contract Clauses 752.7003 Documentation for payment. The following clause is required in all USAID direct contracts, excluding fixed price...
Facilitating access to information in large documents with an intelligent hypertext system
NASA Technical Reports Server (NTRS)
Mathe, Nathalie
1993-01-01
Retrieving specific information from large amounts of documentation is not an easy task. It could be facilitated if information relevant in the current problem solving context could be automatically supplied to the user. As a first step towards this goal, we have developed an intelligent hypertext system called CID (Computer Integrated Documentation) and tested it on the Space Station Freedom requirement documents. The CID system enables integration of various technical documents in a hypertext framework and includes an intelligent context-sensitive indexing and retrieval mechanism. This mechanism utilizes on-line user information requirements and relevance feedback either to reinforce current indexing in case of success or to generate new knowledge in case of failure. This allows the CID system to provide helpful responses, based on previous usage of the documentation, and to improve its performance over time.
ERIC Educational Resources Information Center
Kamin, Leon; Green, Winifred
This document comprises three presentations made on March 23 at a symposium sponsored by the Southern Regional Council focusing on Human Intelligence, Social Science and Social Policy. The first of the three parts of the document is the text of the principal presentation, made by Dr. Leon Kamin, Chariman of the Department of Psychology at…
Mouriño García, Marcos Antonio; Pérez Rodríguez, Roberto; Anido Rifón, Luis E
2015-01-01
Automatic classification of text documents into a set of categories has a lot of applications. Among those applications, the automatic classification of biomedical literature stands out as an important application for automatic document classification strategies. Biomedical staff and researchers have to deal with a lot of literature in their daily activities, so it would be useful a system that allows for accessing to documents of interest in a simple and effective way; thus, it is necessary that these documents are sorted based on some criteria-that is to say, they have to be classified. Documents to classify are usually represented following the bag-of-words (BoW) paradigm. Features are words in the text-thus suffering from synonymy and polysemy-and their weights are just based on their frequency of occurrence. This paper presents an empirical study of the efficiency of a classifier that leverages encyclopedic background knowledge-concretely Wikipedia-in order to create bag-of-concepts (BoC) representations of documents, understanding concept as "unit of meaning", and thus tackling synonymy and polysemy. Besides, the weighting of concepts is based on their semantic relevance in the text. For the evaluation of the proposal, empirical experiments have been conducted with one of the commonly used corpora for evaluating classification and retrieval of biomedical information, OHSUMED, and also with a purpose-built corpus of MEDLINE biomedical abstracts, UVigoMED. Results obtained show that the Wikipedia-based bag-of-concepts representation outperforms the classical bag-of-words representation up to 157% in the single-label classification problem and up to 100% in the multi-label problem for OHSUMED corpus, and up to 122% in the single-label classification problem and up to 155% in the multi-label problem for UVigoMED corpus.
Reinert, Christiane; Kremmler, Lukas; Burock, Susen; Bogdahn, Ulrich; Wick, Wolfgang; Gleiter, Christoph H; Koller, Michael; Hau, Peter
2014-01-01
In randomised controlled trials (RCTs), patient informed consent documents are an essential cornerstone of the study flow. However, these documents are often oversized in format and content. Clinical experience suggests that study information sheets are often not used as an aid to decision-making due to their complexity. We analysed nine patient informed consent documents from clinical neuro-oncological phase III-studies running at a German Brain Tumour Centre with the objective to investigate the quality of these documents. Text length, formal layout, readability, application of ethical and legal requirements, scientific evidence and social aspects were used as rating categories. Results were assessed quantitatively by two independents investigators and were depicted using net diagrams. All patient informed consent documents were of insufficient quality in all categories except that ethical and legal requirements were fulfilled. Notably, graduate levels were required to read and understand five of nine consent documents. Quality deficits were consistent between the individual study information texts. Irrespective of formal aspects, a document that is intended to inform and motivate patients to participate in a study needs to be well-structured and understandable. We therefore strongly mandate to re-design patient informed consent documents in a patient-friendly way. Specifically, standardised components with a scientific foundation should be provided that could be retrieved at various times, adapted to the mode of treatment and the patient's knowledge, and could weigh information dependent of the stage of treatment decision. Copyright © 2013 Elsevier Ltd. All rights reserved.
Categorizing document by fuzzy C-Means and K-nearest neighbors approach
NASA Astrophysics Data System (ADS)
Priandini, Novita; Zaman, Badrus; Purwanti, Endah
2017-08-01
Increasing of technology had made categorizing documents become important. It caused by increasing of number of documents itself. Managing some documents by categorizing is one of Information Retrieval application, because it involve text mining on its process. Whereas, categorization technique could be done both Fuzzy C-Means (FCM) and K-Nearest Neighbors (KNN) method. This experiment would consolidate both methods. The aim of the experiment is increasing performance of document categorize. First, FCM is in order to clustering training documents. Second, KNN is in order to categorize testing document until the output of categorization is shown. Result of the experiment is 14 testing documents retrieve relevantly to its category. Meanwhile 6 of 20 testing documents retrieve irrelevant to its category. Result of system evaluation shows that both precision and recall are 0,7.
Garcia Castro, Leyla Jael; Berlanga, Rafael; Garcia, Alexander
2015-10-01
Although full-text articles are provided by the publishers in electronic formats, it remains a challenge to find related work beyond the title and abstract context. Identifying related articles based on their abstract is indeed a good starting point; this process is straightforward and does not consume as many resources as full-text based similarity would require. However, further analyses may require in-depth understanding of the full content. Two articles with highly related abstracts can be substantially different regarding the full content. How similarity differs when considering title-and-abstract versus full-text and which semantic similarity metric provides better results when dealing with full-text articles are the main issues addressed in this manuscript. We have benchmarked three similarity metrics - BM25, PMRA, and Cosine, in order to determine which one performs best when using concept-based annotations on full-text documents. We also evaluated variations in similarity values based on title-and-abstract against those relying on full-text. Our test dataset comprises the Genomics track article collection from the 2005 Text Retrieval Conference. Initially, we used an entity recognition software to semantically annotate titles and abstracts as well as full-text with concepts defined in the Unified Medical Language System (UMLS®). For each article, we created a document profile, i.e., a set of identified concepts, term frequency, and inverse document frequency; we then applied various similarity metrics to those document profiles. We considered correlation, precision, recall, and F1 in order to determine which similarity metric performs best with concept-based annotations. For those full-text articles available in PubMed Central Open Access (PMC-OA), we also performed dispersion analyses in order to understand how similarity varies when considering full-text articles. We have found that the PubMed Related Articles similarity metric is the most suitable for full-text articles annotated with UMLS concepts. For similarity values above 0.8, all metrics exhibited an F1 around 0.2 and a recall around 0.1; BM25 showed the highest precision close to 1; in all cases the concept-based metrics performed better than the word-stem-based one. Our experiments show that similarity values vary when considering only title-and-abstract versus full-text similarity. Therefore, analyses based on full-text become useful when a given research requires going beyond title and abstract, particularly regarding connectivity across articles. Visualization available at ljgarcia.github.io/semsim.benchmark/, data available at http://dx.doi.org/10.5281/zenodo.13323. Copyright © 2015 Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Manzella, Giuseppe M. R.; Bartolini, Andrea; Bustaffa, Franco; D'Angelo, Paolo; De Mattei, Maurizio; Frontini, Francesca; Maltese, Maurizio; Medone, Daniele; Monachini, Monica; Novellino, Antonio; Spada, Andrea
2016-04-01
The MAPS (Marine Planning and Service Platform) project is aiming at building a computer platform supporting a Marine Information and Knowledge System. One of the main objective of the project is to develop a repository that should gather, classify and structure marine scientific literature and data thus guaranteeing their accessibility to researchers and institutions by means of standard protocols. In oceanography the cost related to data collection is very high and the new paradigm is based on the concept to collect once and re-use many times (for re-analysis, marine environment assessment, studies on trends, etc). This concept requires the access to quality controlled data and to information that is provided in reports (grey literature) and/or in relevant scientific literature. Hence, creation of new technology is needed by integrating several disciplines such as data management, information systems, knowledge management. In one of the most important EC projects on data management, namely SeaDataNet (www.seadatanet.org), an initial example of knowledge management is provided through the Common Data Index, that is providing links to data and (eventually) to papers. There are efforts to develop search engines to find author's contributions to scientific literature or publications. This implies the use of persistent identifiers (such as DOI), as is done in ORCID. However very few efforts are dedicated to link publications to the data cited or used or that can be of importance for the published studies. This is the objective of MAPS. Full-text technologies are often unsuccessful since they assume the presence of specific keywords in the text; in order to fix this problem, the MAPS project suggests to use different semantic technologies for retrieving the text and data and thus getting much more complying results. The main parts of our design of the search engine are: • Syntactic parser - This module is responsible for the extraction of "rich words" from the text: the whole document gets parsed to extract the words which are more meaningful for the main argument of the document, and applies the extraction in the form of N-grams (mono-grams, bi-grams, tri-grams). • MAPS database - This module is a simple database which contains all the N-grams used by MAPS (physical parameters from SeaDataNet vocabularies) to define our marine "ontology". • Relation identifier - This module performs the most important task of identifying relationships between the N-gram extracted from the text by the parser and the provided oceanographic terminology. It checks N-grams supplied by the Syntactic parser and then matches them with the terms stored in the MAPS database. Found matches are returned back to the parser with flexed form appearing in the source text. • A "relaxed" extractor - This option can be activated when the search engine is launched. It was introduced to give the user a chance to create new N-grams combining existing mono-grams and bi-grams in the database with rich-words found within the source text. The innovation of a semantic engine lies in the fact that the process is not just about the retrieval of already known documents by means of a simple term query but rather the retrieval of a population of documents whose existence was unknown. The system answers by showing a screenshot of results ordered according to the following criteria: • Relevance - of the document with respect to the concept that is searched • Date - of publication of the paper • Source - data provider as defined in the SeaDataNet Common Data Index • Matrix - environmental matrices as defined in the oceanographic field • Geographic area - area specified in the text • Clustering - the process of organizing objects into groups whose members are similar The clustering returns as the output the related documents. For each document the MAPS visualization provides: • Title, author, source/provider of data, web address • Tagging of key terms or concepts • Summary of the document • Visualization of the whole document The possibility of inserting the number of citations for each document among the criteria of the advanced search is currently undergoing; in this case the engine should be able to connect to any of the existing bibliographic citation systems (such as Google Scholar, Scopus, etc.).
IRIS Toxicological Review of Naphthalene (2004, External Review Draft, Update)
[Update Jun 2004] This document contains revision of the inhalation cancer assessment and other selected text from the 1998 draft as indicated: Sections of this document pertaining to the inhalation cancer assessment are presented as draft for external peer review purposes only ...
40 CFR Appendix V to Part 51 - Criteria for Determining the Completeness of Plan Submissions
Code of Federal Regulations, 2010 CFR
2010-07-01
... an exact duplicate of the hard copy with changes indicated, signed documents need to be in portable document format, rules need to be in text format and files need to be submitted in manageable amounts (e.g...
77 FR 76606 - Community Development Financial Institutions Fund
Federal Register 2010, 2011, 2012, 2013, 2014
2012-12-28
... form, with pre-set text limits and font size restrictions. Applicants must submit their narrative responses by using the FY 2013 CDFI Program Application narrative template document. This Word document...) A-133 Narrative Report; (iv) Institution Level Report; (v) Transaction Level Report (for Awardees...
A Survey in Indexing and Searching XML Documents.
ERIC Educational Resources Information Center
Luk, Robert W. P.; Leong, H. V.; Dillon, Tharam S.; Chan, Alvin T. S.; Croft, W. Bruce; Allan, James
2002-01-01
Discussion of XML focuses on indexing techniques for XML documents, grouping them into flat-file, semistructured, and structured indexing paradigms. Highlights include searching techniques, including full text search and multistage search; search result presentations; database and information retrieval system integration; XML query languages; and…
1 CFR 18.16 - Reinstatement of expired regulations.
Code of Federal Regulations, 2011 CFR
2011-01-01
... Section 18.16 General Provisions ADMINISTRATIVE COMMITTEE OF THE FEDERAL REGISTER PREPARATION, TRANSMITTAL, AND PROCESSING OF DOCUMENTS PREPARATION AND TRANSMITTAL OF DOCUMENTS GENERALLY § 18.16 Reinstatement... Regulations data base which have expired by their own terms only by republishing the regulations in full text...
1 CFR 18.16 - Reinstatement of expired regulations.
Code of Federal Regulations, 2010 CFR
2010-01-01
... Section 18.16 General Provisions ADMINISTRATIVE COMMITTEE OF THE FEDERAL REGISTER PREPARATION, TRANSMITTAL, AND PROCESSING OF DOCUMENTS PREPARATION AND TRANSMITTAL OF DOCUMENTS GENERALLY § 18.16 Reinstatement... Regulations data base which have expired by their own terms only by republishing the regulations in full text...
Section 3. The SPARROW Surface Water-Quality Model: Theory, Application and User Documentation
Schwarz, G.E.; Hoos, A.B.; Alexander, R.B.; Smith, R.A.
2006-01-01
SPARROW (SPAtially Referenced Regressions On Watershed attributes) is a watershed modeling technique for relating water-quality measurements made at a network of monitoring stations to attributes of the watersheds containing the stations. The core of the model consists of a nonlinear regression equation describing the non-conservative transport of contaminants from point and diffuse sources on land to rivers and through the stream and river network. The model predicts contaminant flux, concentration, and yield in streams and has been used to evaluate alternative hypotheses about the important contaminant sources and watershed properties that control transport over large spatial scales. This report provides documentation for the SPARROW modeling technique and computer software to guide users in constructing and applying basic SPARROW models. The documentation gives details of the SPARROW software, including the input data and installation requirements, and guidance in the specification, calibration, and application of basic SPARROW models, as well as descriptions of the model output and its interpretation. The documentation is intended for both researchers and water-resource managers with interest in using the results of existing models and developing and applying new SPARROW models. The documentation of the model is presented in two parts. Part 1 provides a theoretical and practical introduction to SPARROW modeling techniques, which includes a discussion of the objectives, conceptual attributes, and model infrastructure of SPARROW. Part 1 also includes background on the commonly used model specifications and the methods for estimating and evaluating parameters, evaluating model fit, and generating water-quality predictions and measures of uncertainty. Part 2 provides a user's guide to SPARROW, which includes a discussion of the software architecture and details of the model input requirements and output files, graphs, and maps. The text documentation and computer software are available on the Web at http://usgs.er.gov/sparrow/sparrow-mod/.
Restoration of recto-verso colour documents using correlated component analysis
NASA Astrophysics Data System (ADS)
Tonazzini, Anna; Bedini, Luigi
2013-12-01
In this article, we consider the problem of removing see-through interferences from pairs of recto-verso documents acquired either in grayscale or RGB modality. The see-through effect is a typical degradation of historical and archival documents or manuscripts, and is caused by transparency or seeping of ink from the reverse side of the page. We formulate the problem as one of separating two individual texts, overlapped in the recto and verso maps of the colour channels through a linear convolutional mixing operator, where the mixing coefficients are unknown, while the blur kernels are assumed known a priori or estimated off-line. We exploit statistical techniques of blind source separation to estimate both the unknown model parameters and the ideal, uncorrupted images of the two document sides. We show that recently proposed correlated component analysis techniques overcome the already satisfactory performance of independent component analysis techniques and colour decorrelation, when the two texts are even sensibly correlated.
Tagline: Information Extraction for Semi-Structured Text Elements in Medical Progress Notes
ERIC Educational Resources Information Center
Finch, Dezon Kile
2012-01-01
Text analysis has become an important research activity in the Department of Veterans Affairs (VA). Statistical text mining and natural language processing have been shown to be very effective for extracting useful information from medical documents. However, neither of these techniques is effective at extracting the information stored in…
Detection and Evaluation of Cheating on College Exams Using Supervised Classification
ERIC Educational Resources Information Center
Cavalcanti, Elmano Ramalho; Pires, Carlos Eduardo; Cavalcanti, Elmano Pontes; Pires, Vládia Freire
2012-01-01
Text mining has been used for various purposes, such as document classification and extraction of domain-specific information from text. In this paper we present a study in which text mining methodology and algorithms were properly employed for academic dishonesty (cheating) detection and evaluation on open-ended college exams, based on document…
Image/text automatic indexing and retrieval system using context vector approach
NASA Astrophysics Data System (ADS)
Qing, Kent P.; Caid, William R.; Ren, Clara Z.; McCabe, Patrick
1995-11-01
Thousands of documents and images are generated daily both on and off line on the information superhighway and other media. Storage technology has improved rapidly to handle these data but indexing this information is becoming very costly. HNC Software Inc. has developed a technology for automatic indexing and retrieval of free text and images. This technique is demonstrated and is based on the concept of `context vectors' which encode a succinct representation of the associated text and features of sub-image. In this paper, we will describe the Automated Librarian System which was designed for free text indexing and the Image Content Addressable Retrieval System (ICARS) which extends the technique from the text domain into the image domain. Both systems have the ability to automatically assign indices for a new document and/or image based on the content similarities in the database. ICARS also has the capability to retrieve images based on similarity of content using index terms, text description, and user-generated images as a query without performing segmentation or object recognition.
Lagassé, Lisa P; Rimal, Rajiv N; Smith, Katherine C; Storey, J Douglas; Rhoades, Elizabeth; Barnett, Daniel J; Omer, Saad B; Links, Jonathan
2011-01-01
We assessed the literacy level and readability of online communications about H1N1/09 influenza issued by the Centers for Disease Control and Prevention (CDC) during the first month of outbreak. Documents were classified as targeting one of six audiences ranging in technical expertise. Flesch-Kincaid (FK) measure assessed literacy level for each group of documents. ANOVA models tested for differences in FK scores across target audiences and over time. Readability was assessed for documents targeting non-technical audiences using the Suitability Assessment of Materials (SAM). Overall, there was a main-effect by audience, F(5, 82) = 29.72, P<.001, but FK scores did not vary over time, F(2, 82) = .34, P>.05. A time-by-audience interaction was significant, F(10, 82) = 2.11, P<.05. Documents targeting non-technical audiences were found to be text-heavy and densely-formatted. The vocabulary and writing style were found to adequately reflect audience needs. The reading level of CDC guidance documents about H1N1/09 influenza varied appropriately according to the intended audience; sub-optimal formatting and layout may have rendered some text difficult to comprehend.
76 FR 45666 - Provisions Common to Registered Entities; Correction
Federal Register 2010, 2011, 2012, 2013, 2014
2011-08-01
...: This document corrects incorrect text published in the Federal Register of July 27, 2011, regarding... 44794, in the right column, in Sec. 40.6(a), the text ``other than a rule delisting or withdrawing the...
78 FR 46310 - Petition for Reconsideration of Action in Rulemaking Proceeding
Federal Register 2010, 2011, 2012, 2013, 2014
2013-07-31
... document, Report No. 2985, released June11, 2013. The full text of Report No. 2985 is available for viewing... the Deployment of Text-to-911 and Other Next Generation 911 Applications; Framework for Next...
Integrating query of relational and textual data in clinical databases: a case study.
Fisk, John M; Mutalik, Pradeep; Levin, Forrest W; Erdos, Joseph; Taylor, Caroline; Nadkarni, Prakash
2003-01-01
The authors designed and implemented a clinical data mart composed of an integrated information retrieval (IR) and relational database management system (RDBMS). Using commodity software, which supports interactive, attribute-centric text and relational searches, the mart houses 2.8 million documents that span a five-year period and supports basic IR features such as Boolean searches, stemming, and proximity and fuzzy searching. Results are relevance-ranked using either "total documents per patient" or "report type weighting." Non-curated medical text has a significant degree of malformation with respect to spelling and punctuation, which creates difficulties for text indexing and searching. Presently, the IR facilities of RDBMS packages lack the features necessary to handle such malformed text adequately. A robust IR+RDBMS system can be developed, but it requires integrating RDBMSs with third-party IR software. RDBMS vendors need to make their IR offerings more accessible to non-programmers.
Making automated computer program documentation a feature of total system design
NASA Technical Reports Server (NTRS)
Wolf, A. W.
1970-01-01
It is pointed out that in large-scale computer software systems, program documents are too often fraught with errors, out of date, poorly written, and sometimes nonexistent in whole or in part. The means are described by which many of these typical system documentation problems were overcome in a large and dynamic software project. A systems approach was employed which encompassed such items as: (1) configuration management; (2) standards and conventions; (3) collection of program information into central data banks; (4) interaction among executive, compiler, central data banks, and configuration management; and (5) automatic documentation. A complete description of the overall system is given.
Text-mining and information-retrieval services for molecular biology
Krallinger, Martin; Valencia, Alfonso
2005-01-01
Text-mining in molecular biology - defined as the automatic extraction of information about genes, proteins and their functional relationships from text documents - has emerged as a hybrid discipline on the edges of the fields of information science, bioinformatics and computational linguistics. A range of text-mining applications have been developed recently that will improve access to knowledge for biologists and database annotators. PMID:15998455
Mujtaba, Ghulam; Shuib, Liyana; Raj, Ram Gopal; Rajandram, Retnagowri; Shaikh, Khairunisa
2018-07-01
Automatic text classification techniques are useful for classifying plaintext medical documents. This study aims to automatically predict the cause of death from free text forensic autopsy reports by comparing various schemes for feature extraction, term weighing or feature value representation, text classification, and feature reduction. For experiments, the autopsy reports belonging to eight different causes of death were collected, preprocessed and converted into 43 master feature vectors using various schemes for feature extraction, representation, and reduction. The six different text classification techniques were applied on these 43 master feature vectors to construct a classification model that can predict the cause of death. Finally, classification model performance was evaluated using four performance measures i.e. overall accuracy, macro precision, macro-F-measure, and macro recall. From experiments, it was found that that unigram features obtained the highest performance compared to bigram, trigram, and hybrid-gram features. Furthermore, in feature representation schemes, term frequency, and term frequency with inverse document frequency obtained similar and better results when compared with binary frequency, and normalized term frequency with inverse document frequency. Furthermore, the chi-square feature reduction approach outperformed Pearson correlation, and information gain approaches. Finally, in text classification algorithms, support vector machine classifier outperforms random forest, Naive Bayes, k-nearest neighbor, decision tree, and ensemble-voted classifier. Our results and comparisons hold practical importance and serve as references for future works. Moreover, the comparison outputs will act as state-of-art techniques to compare future proposals with existing automated text classification techniques. Copyright © 2017 Elsevier Ltd and Faculty of Forensic and Legal Medicine. All rights reserved.
Document control and Conduct of Operations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Collins, S.K.; Meltzer, F.L.
1993-01-01
Department of Energy (DOE) Order 5480.19, Conduct of operations, places stringent requirements on a wide range of activities at DOE facilities. These requirements directly affect personnel at the Advanced Test Reactor (ATR), which is located in the Test Reactor Area of the Idaho National Engineering Laboratory and operated for DOE by EG G Idaho, Inc. In order for the ATR to comply with 5480.19, the very personality of the reactor facility's document control unit has had to undergo a major change. The Facility and Administrative Support Unit (FAS) is charged with nudntenance of ATR's controlled ddcuments-diousands of operating and administrativemore » procedures. Prior to the imposition of 5480.19, FAS was content to operate in a clerical support mode, seldom questioning or seeking to improve. This numer of doing business is inappropriate within the framework of DOE 5480.19 and is also at odds with the approach to Total Quality Management (TQM) promulgated by EG G Idaho.To comply with the requirements of 5480.19, FAS has Actively applied TQM principles. Empowered personnel to unprove operations through the establishment of a teatn approach. Begun an automation process that is already paying large dividends in terms of improved procedure accuracy and compliance. A state-of-the-art text retrival system is already in place. We are vigorously pursuing fully automated document tmcidng and document management. This paper describes in detail the steps taken to date, the improvements and the lessons learned. It aLw discusses plans for the future that will enable FAS to support the ATR in inccreasing its responsiveness to the Conduct of Operations Order.« less
Developing a system to track meaningful outcome measures in head and neck cancer treatment.
Walters, Ronald S; Albright, Heidi W; Weber, Randal S; Feeley, Thomas W; Hanna, Ehab Y; Cantor, Scott B; Lewis, Carol M; Burke, Thomas W
2014-02-01
The health care industry, including consumers, providers, and payers of health care, recognize the importance of developing meaningful, patient-centered measures. This article describes our experience using an existing electronic medical record largely based on free text formats without structured documentation, in conjunction with tumor registry abstraction techniques, to obtain and analyze data for use in clinical improvement and public reporting. We performed a retrospective analysis of 2467 previously untreated patients treated with curative intent who presented with laryngeal, pharyngeal, or oral cavity cancer in order to develop a system to monitor and report meaningful outcome metrics of head and neck cancer treatment. Patients treated between 1995 and 2006 were analyzed for the primary outcomes of survival at 1 and 2 years, the ability to speak at 1 year posttreatment, and the ability to swallow at 1 year posttreatment. We encountered significant limitations in clinical documentation because of the lack of standardization of meaningful measures, as well limitations with data abstraction using a retrospective approach to reporting measures. Almost 5000 person-hours were required for data abstraction, quality review, and reporting, at a cost of approximately $134,000. Our multidisciplinary teams document extensive patient information; however, data is not stored in easily accessible formats for measurement, comparison, and reporting. We recommend identifying measures meaningful to patients, providers, and payers to be documented throughout the patients' entire treatment cycle, and significant investment in the improvements to electronic medical records and tumor registry reporting in order to provide meaningful quality measures for the future. Copyright © 2013 Wiley Periodicals, Inc.
Emplacing India's "medicities".
Murray, Susan F; Bisht, Ramila; Pitchforth, Emma
2016-11-01
Plans for 'medicities', announced in the Indian press from 2007 onwards, were to provide large scale 'one-stop-shops' of super-speciality medical services supplemented by diagnostics, education, research facilities, and other aspects of healthcare and lifestyle consumption. Placing this phenomenon within the recent domestic and global political economy of health, we then draw on recent research literatures on place and health to offer an analysis of the narration of these new healthcare places given in promotional texts from press media, official documents and marketing materials. We consider the implications of such analytic undertakings for the understanding of the evolving landscapes of contemporary health care in middle-income countries, and end with some reflections on the tensions now appearing in the medicity model. Copyright © 2016. Published by Elsevier Ltd.
Tanihara, Shinichi
2015-01-01
Uncoded diagnoses in health insurance claims (HICs) may introduce bias into Japanese health statistics dependent on computerized HICs. This study's aim was to identify the causes and characteristics of uncoded diagnoses. Uncoded diagnoses from computerized HICs (outpatient, inpatient, and the diagnosis procedure-combination per-diem payment system [DPC/PDPS]) submitted to the National Health Insurance Organization of Kumamoto Prefecture in May 2010 were analyzed. The text documentation accompanying the uncoded diagnoses was used to classify diagnoses in accordance with the International Classification of Diseases-10 (ICD-10). The text documentation was also classified into four categories using the standard descriptions of diagnoses defined in the master files of the computerized HIC system: 1) standard descriptions of diagnoses, 2) standard descriptions with a modifier, 3) non-standard descriptions of diagnoses, and 4) unclassifiable text documentation. Using these classifications, the proportions of uncoded diagnoses by ICD-10 disease category were calculated. Of the uncoded diagnoses analyzed (n = 363 753), non-standard descriptions of diagnoses for outpatient, inpatient, and DPC/PDPS HICs comprised 12.1%, 14.6%, and 1.0% of uncoded diagnoses, respectively. The proportion of uncoded diagnoses with standard descriptions with a modifier for Diseases of the eye and adnexa was significantly higher than the overall proportion of uncoded diagnoses among every HIC type. The pattern of uncoded diagnoses differed by HIC type and disease category. Evaluating the proportion of uncoded diagnoses in all medical facilities and developing effective coding methods for diagnoses with modifiers, prefixes, and suffixes should reduce number of uncoded diagnoses in computerized HICs and improve the quality of HIC databases.
Ferrández, Oscar; South, Brett R; Shen, Shuying; Friedlin, F Jeffrey; Samore, Matthew H; Meystre, Stéphane M
2012-07-27
The increased use and adoption of Electronic Health Records (EHR) causes a tremendous growth in digital information useful for clinicians, researchers and many other operational purposes. However, this information is rich in Protected Health Information (PHI), which severely restricts its access and possible uses. A number of investigators have developed methods for automatically de-identifying EHR documents by removing PHI, as specified in the Health Insurance Portability and Accountability Act "Safe Harbor" method.This study focuses on the evaluation of existing automated text de-identification methods and tools, as applied to Veterans Health Administration (VHA) clinical documents, to assess which methods perform better with each category of PHI found in our clinical notes; and when new methods are needed to improve performance. We installed and evaluated five text de-identification systems "out-of-the-box" using a corpus of VHA clinical documents. The systems based on machine learning methods were trained with the 2006 i2b2 de-identification corpora and evaluated with our VHA corpus, and also evaluated with a ten-fold cross-validation experiment using our VHA corpus. We counted exact, partial, and fully contained matches with reference annotations, considering each PHI type separately, or only one unique 'PHI' category. Performance of the systems was assessed using recall (equivalent to sensitivity) and precision (equivalent to positive predictive value) metrics, as well as the F(2)-measure. Overall, systems based on rules and pattern matching achieved better recall, and precision was always better with systems based on machine learning approaches. The highest "out-of-the-box" F(2)-measure was 67% for partial matches; the best precision and recall were 95% and 78%, respectively. Finally, the ten-fold cross validation experiment allowed for an increase of the F(2)-measure to 79% with partial matches. The "out-of-the-box" evaluation of text de-identification systems provided us with compelling insight about the best methods for de-identification of VHA clinical documents. The errors analysis demonstrated an important need for customization to PHI formats specific to VHA documents. This study informed the planning and development of a "best-of-breed" automatic de-identification application for VHA clinical text.
Storing and Viewing Electronic Documents.
ERIC Educational Resources Information Center
Falk, Howard
1999-01-01
Discusses the conversion of fragile library materials to computer storage and retrieval to extend the life of the items and to improve accessibility through the World Wide Web. Highlights include entering the images, including scanning; optical character recognition; full text and manual indexing; and available document- and image-management…
A Digital Library in the Mid-Nineties, Ahead or On Schedule?
ERIC Educational Resources Information Center
Dijkstra, Joost
1994-01-01
Discussion of the future possibilities of digital library systems highlights digital projects developed at Tilburg University (Netherlands). Topics addressed include online access to databases; electronic document delivery; agreements between libraries and Elsevier Science publishers to provide journal articles; full text document delivery; and…
Code of Federal Regulations, 2011 CFR
2011-01-01
... white paper. (ii) On paper not larger than 81/2 by 11 inches. (iii) In black ink. (iv) Text double... 1 inch on each side. (vii) The original not bound or hole-punched, only held together with removable...
Code of Federal Regulations, 2010 CFR
2010-01-01
... white paper. (ii) On paper not larger than 81/2 by 11 inches. (iii) In black ink. (iv) Text double... 1 inch on each side. (vii) The original not bound or hole-punched, only held together with removable...
Some utilities to help produce Rich Text Files from Stata.
Gillman, Matthew S
Producing RTF files from Stata can be difficult and somewhat cryptic. Utilities are introduced to simplify this process; one builds up a table row-by-row, another inserts a PNG image file into an RTF document, and the others start and finish the RTF document.
Federal Register 2010, 2011, 2012, 2013, 2014
2011-12-01
...: This is a summary of the Commission's document, Report No. 2937, released November 15, 2011. The full text of this document is available for viewing and copying in Room CY-B402, 445 12th Street SW...
Line Segmentation in Handwritten Assamese and Meetei Mayek Script Using Seam Carving Based Algorithm
NASA Astrophysics Data System (ADS)
Kumar, Chandan Jyoti; Kalita, Sanjib Kr.
Line segmentation is a key stage in an Optical Character Recognition system. This paper primarily concerns the problem of text line extraction on color and grayscale manuscript pages of two major North-east Indian regional Scripts, Assamese and Meetei Mayek. Line segmentation of handwritten text in Assamese and Meetei Mayek scripts is an uphill task primarily because of the structural features of both the scripts and varied writing styles. Line segmentation of a document image is been achieved by using the Seam carving technique, in this paper. Researchers from various regions used this approach for content aware resizing of an image. However currently many researchers are implementing Seam Carving for line segmentation phase of OCR. Although it is a language independent technique, mostly experiments are done over Arabic, Greek, German and Chinese scripts. Two types of seams are generated, medial seams approximate the orientation of each text line, and separating seams separated one line of text from another. Experiments are performed extensively over various types of documents and detailed analysis of the evaluations reflects that the algorithm performs well for even documents with multiple scripts. In this paper, we present a comparative study of accuracy of this method over different types of data.
Example Elaboration as a Neglected Instructional Strategy
DOE Office of Scientific and Technical Information (OSTI.GOV)
Girill, T R
Over the last decade an unfolding cognitive-psychology research program on how learners use examples to develop effective problem solving expertise has yielded well-established empirical findings. Chi et al., Renkl, Reimann, and Neubert (in various papers) have confirmed statistically significant differences in how good and poor learners inferentially elaborate (self explain) example steps as they study. Such example elaboration is highly relevant to software documentation and training, yet largely neglected in the current literature. This paper summarizes the neglected research on example use and puts its neglect in a disciplinary perspective. The author then shows that differences in support for examplemore » elaboration in commercial software documentation reveal previously over looked usability issues. These issues involve example summaries, using goals and goal structures to reinforce example elaborations, and prompting readers to recognize the role of example parts. Secondly, I show how these same example elaboration techniques can build cognitive maturity among underperforming high school students who study technical writing. Principle based elaborations, condition elaborations, and role recognition of example steps all have their place in innovative, high school level, technical writing exercises, and all promote far transfer problem solving. Finally, I use these studies to clarify the constructivist debate over what writers and readers contribute to text meaning. I argue that writers can influence how readers elaborate on examples, and that because of the great empirical differences in example study effectiveness (and reader choices) writers should do what they can (through within text design features) to encourage readers to elaborate examples in the most successful ways. Example elaboration is a uniquely effective way to learn from worked technical examples. This paper summarizes years of research that clarifies example elaboration. I then show how example elaboration can make complex software documentation more useful, improve the benefits of technical writing exercises for underperforming students, and enlighten the general discussion of how writers can and should help their readers.« less
Information Extraction for System-Software Safety Analysis: Calendar Year 2008 Year-End Report
NASA Technical Reports Server (NTRS)
Malin, Jane T.
2009-01-01
This annual report describes work to integrate a set of tools to support early model-based analysis of failures and hazards due to system-software interactions. The tools perform and assist analysts in the following tasks: 1) extract model parts from text for architecture and safety/hazard models; 2) combine the parts with library information to develop the models for visualization and analysis; 3) perform graph analysis and simulation to identify and evaluate possible paths from hazard sources to vulnerable entities and functions, in nominal and anomalous system-software configurations and scenarios; and 4) identify resulting candidate scenarios for software integration testing. There has been significant technical progress in model extraction from Orion program text sources, architecture model derivation (components and connections) and documentation of extraction sources. Models have been derived from Internal Interface Requirements Documents (IIRDs) and FMEA documents. Linguistic text processing is used to extract model parts and relationships, and the Aerospace Ontology also aids automated model development from the extracted information. Visualizations of these models assist analysts in requirements overview and in checking consistency and completeness.
DOE Office of Scientific and Technical Information (OSTI.GOV)
NONE
This large document provides a catalog of the location of large numbers of reports pertaining to the charge of the Presidential Advisory Committee on Human Radiation Research and is arranged as a series of appendices. Titles of the appendices are Appendix A- Records at the Washington National Records Center Reviewed in Whole or Part by DoD Personnel or Advisory Committee Staff; Appendix B- Brief Descriptions of Records Accessions in the Advisory Committee on Human Radiation Experiments (ACHRE) Research Document Collection; Appendix C- Bibliography of Secondary Sources Used by ACHRE; Appendix D- Brief Descriptions of Human Radiation Experiments Identified by ACHRE,more » and Indexes; Appendix E- Documents Cited in the ACHRE Final Report and other Separately Described Materials from the ACHRE Document Collection; Appendix F- Schedule of Advisory Committee Meetings and Meeting Documentation; and Appendix G- Technology Note.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wynne, Adam S.
2011-05-05
In many application domains in science and engineering, data produced by sensors, instruments and networks is naturally processed by software applications structured as a pipeline . Pipelines comprise a sequence of software components that progressively process discrete units of data to produce a desired outcome. For example, in a Web crawler that is extracting semantics from text on Web sites, the first stage in the pipeline might be to remove all HTML tags to leave only the raw text of the document. The second step may parse the raw text to break it down into its constituent grammatical parts, suchmore » as nouns, verbs and so on. Subsequent steps may look for names of people or places, interesting events or times so documents can be sequenced on a time line. Each of these steps can be written as a specialized program that works in isolation with other steps in the pipeline. In many applications, simple linear software pipelines are sufficient. However, more complex applications require topologies that contain forks and joins, creating pipelines comprising branches where parallel execution is desirable. It is also increasingly common for pipelines to process very large files or high volume data streams which impose end-to-end performance constraints. Additionally, processes in a pipeline may have specific execution requirements and hence need to be distributed as services across a heterogeneous computing and data management infrastructure. From a software engineering perspective, these more complex pipelines become problematic to implement. While simple linear pipelines can be built using minimal infrastructure such as scripting languages, complex topologies and large, high volume data processing requires suitable abstractions, run-time infrastructures and development tools to construct pipelines with the desired qualities-of-service and flexibility to evolve to handle new requirements. The above summarizes the reasons we created the MeDICi Integration Framework (MIF) that is designed for creating high-performance, scalable and modifiable software pipelines. MIF exploits a low friction, robust, open source middleware platform and extends it with component and service-based programmatic interfaces that make implementing complex pipelines simple. The MIF run-time automatically handles queues between pipeline elements in order to handle request bursts, and automatically executes multiple instances of pipeline elements to increase pipeline throughput. Distributed pipeline elements are supported using a range of configurable communications protocols, and the MIF interfaces provide efficient mechanisms for moving data directly between two distributed pipeline elements.« less
Improving text recall with multiple summaries.
van der Meij, Hans; van der Meij, Jan
2012-06-01
QuikScan (QS) is an innovative design that aims to improve accessibility, comprehensibility, and subsequent recall of expository text by means of frequent within-document summaries that are formatted as numbered list items. The numbers in the QS summaries correspond to numbers placed in the body of the document where the summarized ideas are discussed in full. To examine the influence of QS summaries on participants' perceptions of text quality (i.e., comprehensibility, structure, and interest) and recall, an experimental - control group design compared the effects of a QS text with a structured abstract (SA) text. Forty psychology students participated voluntarily or received course credits. Students first read a control (SA) or experimental (QS) text on flashbulb memory (FBM). Next, their perceptions of text quality were measured through a questionnaire. Recall was assessed with an open answer test with items for facts, comprehension and higher order information. Perceptions of text quality did not vary across conditions. But QS did lead to significantly and substantially (d= 1.57) higher overall recall scores. Participants with the QS text performed significantly better on all item types than participants with the SA text. Studying a QS text led to a substantial improvement in recall compared to an SA text. Further research is needed to examine how readers study QS texts and whether a text model hypothesis or a repetition effect hypothesis accounts for the effectiveness. The first hypothesis posits that the QS summaries support the reader in constructing a text schema. The second attributes the effects of these summaries to their repetition of text topics. ©2011 The British Psychological Society.
ERIC Educational Resources Information Center
Farri, Oladimeji Feyisetan
2012-01-01
Large quantities of redundant clinical data are usually transferred from one clinical document to another, making the review of such documents cognitively burdensome and potentially error-prone. Inadequate designs of electronic health record (EHR) clinical document user interfaces probably contribute to the difficulties clinicians experience while…
Generating Hierarchical Document Indices from Common Denominators in Large Document Collections.
ERIC Educational Resources Information Center
O'Kane, Kevin C.
1996-01-01
Describes an algorithm for computer generation of hierarchical indexes for document collections. The resulting index, when presented with a graphical interface, provides users with a view of a document collection that permits general browsing and informal search activities via an access method that requires no keyboard entry or prior knowledge of…
Data from clinical notes: a perspective on the tension between structure and flexible documentation
Denny, Joshua C; Xu, Hua; Lorenzi, Nancy; Stead, William W; Johnson, Kevin B
2011-01-01
Clinical documentation is central to patient care. The success of electronic health record system adoption may depend on how well such systems support clinical documentation. A major goal of integrating clinical documentation into electronic heath record systems is to generate reusable data. As a result, there has been an emphasis on deploying computer-based documentation systems that prioritize direct structured documentation. Research has demonstrated that healthcare providers value different factors when writing clinical notes, such as narrative expressivity, amenability to the existing workflow, and usability. The authors explore the tension between expressivity and structured clinical documentation, review methods for obtaining reusable data from clinical notes, and recommend that healthcare providers be able to choose how to document patient care based on workflow and note content needs. When reusable data are needed from notes, providers can use structured documentation or rely on post-hoc text processing to produce structured data, as appropriate. PMID:21233086
NASA Technical Reports Server (NTRS)
Frisch, Harold P.
2007-01-01
Engineers, who design systems using text specification documents, focus their work upon the completed system to meet Performance, time and budget goals. Consistency and integrity is difficult to maintain within text documents for a single complex system and more difficult to maintain as several systems are combined into higher-level systems, are maintained over decades, and evolve technically and in performance through updates. This system design approach frequently results in major changes during the system integration and test phase, and in time and budget overruns. Engineers who build system specification documents within a model-based systems environment go a step further and aggregate all of the data. They interrelate all of the data to insure consistency and integrity. After the model is constructed, the various system specification documents are prepared, all from the same database. The consistency and integrity of the model is assured, therefore the consistency and integrity of the various specification documents is insured. This article attempts to define model-based systems relative to such an environment. The intent is to expose the complexity of the enabling problem by outlining what is needed, why it is needed and how needs are being addressed by international standards writing teams.
Multiple Representation Document Development
1987-08-06
Interleaf Pub- lishing System [6], or FrameMaker [5], formatting is an integral part of document editing. Here, the document is reformatted as it is edited...consider the task of laying out a page of windowed text. In a layout-driven WYSIWYG system like PageMaker [11], Ready-Set-Go! [12], or FrameMaker [5...in MSIWord, FrameMaker , and Tioga are all handled by a noninteractive off-line program. Direct manipulation, from the processing point of view
Reeves, Anthony P; Xie, Yiting; Liu, Shuang
2017-04-01
With the advent of fully automated image analysis and modern machine learning methods, there is a need for very large image datasets having documented segmentations for both computer algorithm training and evaluation. This paper presents a method and implementation for facilitating such datasets that addresses the critical issue of size scaling for algorithm validation and evaluation; current evaluation methods that are usually used in academic studies do not scale to large datasets. This method includes protocols for the documentation of many regions in very large image datasets; the documentation may be incrementally updated by new image data and by improved algorithm outcomes. This method has been used for 5 years in the context of chest health biomarkers from low-dose chest CT images that are now being used with increasing frequency in lung cancer screening practice. The lung scans are segmented into over 100 different anatomical regions, and the method has been applied to a dataset of over 20,000 chest CT images. Using this framework, the computer algorithms have been developed to achieve over 90% acceptable image segmentation on the complete dataset.
Code of Federal Regulations, 2010 CFR
2010-01-01
... Leadership on Reducing Text Messaging While Driving 13513 Order 13513 Presidential Documents Executive Orders Executive Order 13513 of October 1, 2009 EO 13513 Federal Leadership on Reducing Text Messaging While... leadership in improving safety on our roads and highways and to enhance the efficiency of Federal contracting...
University Students' Subject Disposition According to Text Types
ERIC Educational Resources Information Center
Batur, Zekerya
2017-01-01
The aim of this study is to reveal the subject trends of university students according to species. This is a qualitative study based on document review. The data of the study was obtained from 67 volunteering in-service Turkish teachers' worksheets. The worksheets were classified according to text types. Text types were determined based on the…
76 FR 10405 - Federal Copyright Protection of Sound Recordings Fixed Before February 15, 1972
Federal Register 2010, 2011, 2012, 2013, 2014
2011-02-24
... file in either the Adobe Portable Document File (PDF) format that contains searchable, accessible text (not an image); Microsoft Word; WordPerfect; Rich Text Format (RTF); or ASCII text file format (not a..., comments may be delivered in hard copy. If hand delivered by a private party, an original [[Page 10406...
Finding Relevant Data in a Sea of Languages
2016-04-26
full machine-translated text , unbiased word clouds , query-biased word clouds , and query-biased sentence...and information retrieval to automate language processing tasks so that the limited number of linguists available for analyzing text and spoken...the crime (stock market). The Cross-LAnguage Search Engine (CLASE) has already preprocessed the documents, extracting text to identify the language
Exploring the Role of Reformulations and a Model Text in EFL Students' Writing Performance
ERIC Educational Resources Information Center
Yang, Luxin; Zhang, Ling
2010-01-01
This study examined the effectiveness of reformulation and model text in a three-stage writing task (composing-comparison-revising) in an EFL writing class in a Beijing university. The study documented 10 university students' writing performance from the composing (Stage 1) and comparing (Stage 2, where students compare their own text to a…
77 FR 50907 - Airspace Designations; Incorporation by Reference
Federal Register 2010, 2011, 2012, 2013, 2014
2012-08-23
... FAA processed all proposed changes of the airspace listings in FAA Order 7400.9V in full text as... in full text as final rules in the Federal Register. This rule reflects the periodic integration of... changes of the airspace listings in FAA Order 7400.9W in full text as proposed rule documents in the...
78 FR 52847 - Airspace Designations; Incorporation by Reference
Federal Register 2010, 2011, 2012, 2013, 2014
2013-08-27
... FAA processed all proposed changes of the airspace listings in FAA Order 7400.9W in full text as... in full text as final rules in the Federal Register. This rule reflects the periodic integration of... changes of the airspace listings in FAA Order 7400.9X in full text as proposed rule documents in the...
EXTENSIBLE DATABASE FRAMEWORK FOR MANAGEMENT OF UNSTRUCTURED AND SEMI-STRUCTURED DOCUMENTS
NASA Technical Reports Server (NTRS)
Gawdiak, Yuri O. (Inventor); La, Tracy T. (Inventor); Lin, Shu-Chun Y. (Inventor); Malof, David A. (Inventor); Tran, Khai Peter B. (Inventor)
2005-01-01
Method and system for querying a collection of Unstructured or semi-structured documents to identify presence of, and provide context and/or content for, keywords and/or keyphrases. The documents are analyzed and assigned a node structure, including an ordered sequence of mutually exclusive node segments or strings. Each node has an associated set of at least four, five or six attributes with node information and can represent a format marker or text, with the last node in any node segment usually being a text node. A keyword (or keyphrase) is specified. and the last node in each node segment is searched for a match with the keyword. When a match is found at a query node, or at a node determined with reference to a query node, the system displays the context andor the content of the query node.
Home Page, Sweet Home Page: Creating a Web Presence.
ERIC Educational Resources Information Center
Falcigno, Kathleen; Green, Tim
1995-01-01
Focuses primarily on design issues and practical concerns involved in creating World Wide Web documents for use within an organization. Concerns for those developing Web home pages are: learning HyperText Markup Language (HTML); defining customer group; allocating staff resources for maintenance of documents; providing feedback mechanism for…
Proceedings-1979 third annual practical conference on communication
DOE Office of Scientific and Technical Information (OSTI.GOV)
Not Available
1981-04-01
Topics covered at the meeting include: nonacademic writing, writer and editor training in technical publications, readability of technical documents, guide for beginning technical editors, a visual aids data base, newsletter publishing, style guide for a project management organization, word processing, computer graphics, text management for technical documentation, and typographical terminology.
E-Texts, Mobile Browsing, and Rich Internet Applications
ERIC Educational Resources Information Center
Godwin-Jones, Robert
2007-01-01
Online reading is evolving beyond the perusal of static documents with Web pages inviting readers to become commentators, collaborators, and critics. The much-ballyhooed Web 2.0 is essentially a transition from online consumer to consumer/producer/participant. An online document may well include embedded multimedia or contain other forms of…
32 CFR 1901.44 - Action by appeals authority.
Code of Federal Regulations, 2010 CFR
2010-07-01
... request, the document(s) (sanitized and full text) at issue, and the findings of the concerned Deputy... appearances shall be permitted without the express permission of the Panel. (c) Decision by the Historical... refer the request to the CIA Historical Records Policy Board which acts as the senior corporate board...
Implicature, Pragmatics, and Documentation: A Comparative Study
ERIC Educational Resources Information Center
Wright, David
2008-01-01
This study investigates the link between the linguistic principles of implicature and pragmatics and software documentation. When implicatures are created in conversation or text, the listener or reader is required to fill in missing information not overtly stated. This information is usually filled in on the basis of previous knowledge or…
Some utilities to help produce Rich Text Files from Stata
Gillman, Matthew S.
2018-01-01
Producing RTF files from Stata can be difficult and somewhat cryptic. Utilities are introduced to simplify this process; one builds up a table row-by-row, another inserts a PNG image file into an RTF document, and the others start and finish the RTF document. PMID:29731697
Electronic Document Management Systems: Where Are They Today?
ERIC Educational Resources Information Center
Koulopoulos, Thomas M.; Frappaolo, Carl
1993-01-01
Discusses developments in document management systems based on a survey of over 400 corporations and government agencies. Text retrieval and imaging markets, architecture and integration, purchasing plans, and vendor market leaders are covered. Five graphs present data on user preferences for improvements. A sidebar article reviews the development…
Targeting Tumor-Initiating Cells for the Therapeutics of Breast Cancer
2017-10-01
policy or decision unless so designated by other documentation. REPORT DOCUMENTATION PAGE Form Approved OMB No. 0704-0188 Public reporting...clarifies or supports the text. Examples include original copies of journal articles, reprints of manuscripts and abstracts, a curriculum vitae, patent applications, study questionnaires, and surveys , etc. None.
Answer Mining from On-Line Documents
2001-01-01
successions occurred at IBM in 1999? In addition, questions may also ask about de- velopments of events or trends that are usually answered by a text ... summary . Since data produc- ing these summaries can be sourced in different documents, summary fusion techniques as pro- posed in (Radev and McKeown
PlateRunner: A Search Engine to Identify EMR Boilerplates.
Divita, Guy; Workman, T Elizabeth; Carter, Marjorie E; Redd, Andrew; Samore, Matthew H; Gundlapalli, Adi V
2016-01-01
Medical text contains boilerplated content, an artifact of pull-down forms from EMRs. Boilerplated content is the source of challenges for concept extraction on clinical text. This paper introduces PlateRunner, a search engine on boilerplates from the US Department of Veterans Affairs (VA) EMR. Boilerplates containing concepts should be identified and reviewed to recognize challenging formats, identify high yield document titles, and fine tune section zoning. This search engine has the capability to filter negated and asserted concepts, save and search query results. This tool can save queries, search results, and documents found for later analysis.
Words, concepts, or both: optimal indexing units for automated information retrieval.
Hersh, W. R.; Hickam, D. H.; Leone, T. J.
1992-01-01
What is the best way to represent the content of documents in an information retrieval system? This study compares the retrieval effectiveness of five different methods for automated (machine-assigned) indexing using three test collections. The consistently best methods are those that use indexing based on the words that occur in the available text of each document. Methods used to map text into concepts from a controlled vocabulary showed no advantage over the word-based methods. This study also looked at an approach to relevance feedback which showed benefit for both word-based and concept-based methods. PMID:1482951
DOE Office of Scientific and Technical Information (OSTI.GOV)
Buican, I.; Hriscu, V.; Amador, M.
For the past four and a half years, the International Communication Committee at Los Alamos National Laboratory has been working to develop a set of guidelines for writing technical and scientific documents in International English, that is, English for those whose native language is not English. Originally designed for documents intended for presentation in English to an international audience of technical experts, the International English guidelines apply equally well to the preparation of English text for translation. This is the second workshop in a series devoted to the topic of translation. The authors focus on the advantages of using Internationalmore » English, rather than various methods of simplifying language, to prepare scientific and technical text for translation.« less
A review on computational systems biology of pathogen–host interactions
Durmuş, Saliha; Çakır, Tunahan; Özgür, Arzucan; Guthke, Reinhard
2015-01-01
Pathogens manipulate the cellular mechanisms of host organisms via pathogen–host interactions (PHIs) in order to take advantage of the capabilities of host cells, leading to infections. The crucial role of these interspecies molecular interactions in initiating and sustaining infections necessitates a thorough understanding of the corresponding mechanisms. Unlike the traditional approach of considering the host or pathogen separately, a systems-level approach, considering the PHI system as a whole is indispensable to elucidate the mechanisms of infection. Following the technological advances in the post-genomic era, PHI data have been produced in large-scale within the last decade. Systems biology-based methods for the inference and analysis of PHI regulatory, metabolic, and protein–protein networks to shed light on infection mechanisms are gaining increasing demand thanks to the availability of omics data. The knowledge derived from the PHIs may largely contribute to the identification of new and more efficient therapeutics to prevent or cure infections. There are recent efforts for the detailed documentation of these experimentally verified PHI data through Web-based databases. Despite these advances in data archiving, there are still large amounts of PHI data in the biomedical literature yet to be discovered, and novel text mining methods are in development to unearth such hidden data. Here, we review a collection of recent studies on computational systems biology of PHIs with a special focus on the methods for the inference and analysis of PHI networks, covering also the Web-based databases and text-mining efforts to unravel the data hidden in the literature. PMID:25914674
Text Processing Differences among Readers.
ERIC Educational Resources Information Center
Garner, Ruth
Explanations for differences in reading proficiency should be constructed around an atlas of reading-related individual differences in cognition. Such an atlas should include well-documented "bottom-up", text-driven reading strategies and less thoroughly investigated "top-down", schema-driven reading strategies. Research…
Bilingual Babel: Cuneiform Texts in Two or More Languages from Ancient Mesopotamia and Beyond.
ERIC Educational Resources Information Center
Cooper, Jerrold
1993-01-01
Discusses bilingualism in written cuneiform texts from ancient Babylonia and Sumeria. Describes the development of formats and techniques that enabled two or more languages on a single document to coexist harmoniously and productively. (SR)
Automatic Extraction of Destinations, Origins and Route Parts from Human Generated Route Directions
NASA Astrophysics Data System (ADS)
Zhang, Xiao; Mitra, Prasenjit; Klippel, Alexander; Maceachren, Alan
Researchers from the cognitive and spatial sciences are studying text descriptions of movement patterns in order to examine how humans communicate and understand spatial information. In particular, route directions offer a rich source of information on how cognitive systems conceptualize movement patterns by segmenting them into meaningful parts. Route directions are composed using a plethora of cognitive spatial organization principles: changing levels of granularity, hierarchical organization, incorporation of cognitively and perceptually salient elements, and so forth. Identifying such information in text documents automatically is crucial for enabling machine-understanding of human spatial language. The benefits are: a) creating opportunities for large-scale studies of human linguistic behavior; b) extracting and georeferencing salient entities (landmarks) that are used by human route direction providers; c) developing methods to translate route directions to sketches and maps; and d) enabling queries on large corpora of crawled/analyzed movement data. In this paper, we introduce our approach and implementations that bring us closer to the goal of automatically processing linguistic route directions. We report on research directed at one part of the larger problem, that is, extracting the three most critical parts of route directions and movement patterns in general: origin, destination, and route parts. We use machine-learning based algorithms to extract these parts of routes, including, for example, destination names and types. We prove the effectiveness of our approach in several experiments using hand-tagged corpora.
Document creation, linking, and maintenance system
Claghorn, Ronald [Pasco, WA
2011-02-15
A document creation and citation system designed to maintain a database of reference documents. The content of a selected document may be automatically scanned and indexed by the system. The selected documents may also be manually indexed by a user prior to the upload. The indexed documents may be uploaded and stored within a database for later use. The system allows a user to generate new documents by selecting content within the reference documents stored within the database and inserting the selected content into a new document. The system allows the user to customize and augment the content of the new document. The system also generates citations to the selected content retrieved from the reference documents. The citations may be inserted into the new document in the appropriate location and format, as directed by the user. The new document may be uploaded into the database and included with the other reference documents. The system also maintains the database of reference documents so that when changes are made to a reference document, the author of a document referencing the changed document will be alerted to make appropriate changes to his document. The system also allows visual comparison of documents so that the user may see differences in the text of the documents.
Aviation System Analysis Capability Quick Response System Report Server User’s Guide.
1996-10-01
primary data sources for the QRS Report Server are the following: ♦ United States Department of Transportation airline service quality per- formance...and to cross-reference sections of this document. is used to indicate quoted text messages from WWW pages. is used for WWW page and section titles...would link the user to another document or another section of the same document. ALL CAPS is used to indicate Report Server variables for which the
ERIC Educational Resources Information Center
Common Core State Standards Initiative, 2010
2010-01-01
The text samples presented in this document primarily serve to exemplify the level of complexity and quality that the Standards require all students in a given grade band to engage with. Additionally, they are suggestive of the breadth of texts that students should encounter in the text types required by the Standards. The choices should serve as…
NASA Technical Reports Server (NTRS)
Ambur, Manjula Y.; Adams, David L.; Trinidad, P. Paul
1997-01-01
NASA Langley Technical Library has been involved in developing systems for full-text information delivery of NACA/NASA technical reports since 1991. This paper will describe the two prototypes it has developed and the present production system configuration. The prototype systems are a NACA CD-ROM of thirty-three classic paper NACA reports and a network-based Full-text Electronic Reports Documents System (FEDS) constructed from both paper and electronic formats of NACA and NASA reports. The production system is the DigiDoc System (DIGItal Documents) presently being developed based on the experiences gained from the two prototypes. DigiDoc configuration integrates the on-line catalog database World Wide Web interface and PDF technology to provide a powerful and flexible search and retrieval system. It describes in detail significant achievements and lessons learned in terms of data conversion, storage technologies, full-text searching and retrieval, and image databases. The conclusions from the experiences of digitization and full- text access and future plans for DigiDoc system implementation are discussed.
A comparison study between MLP and convolutional neural network models for character recognition
NASA Astrophysics Data System (ADS)
Ben Driss, S.; Soua, M.; Kachouri, R.; Akil, M.
2017-05-01
Optical Character Recognition (OCR) systems have been designed to operate on text contained in scanned documents and images. They include text detection and character recognition in which characters are described then classified. In the classification step, characters are identified according to their features or template descriptions. Then, a given classifier is employed to identify characters. In this context, we have proposed the unified character descriptor (UCD) to represent characters based on their features. Then, matching was employed to ensure the classification. This recognition scheme performs a good OCR Accuracy on homogeneous scanned documents, however it cannot discriminate characters with high font variation and distortion.3 To improve recognition, classifiers based on neural networks can be used. The multilayer perceptron (MLP) ensures high recognition accuracy when performing a robust training. Moreover, the convolutional neural network (CNN), is gaining nowadays a lot of popularity for its high performance. Furthermore, both CNN and MLP may suffer from the large amount of computation in the training phase. In this paper, we establish a comparison between MLP and CNN. We provide MLP with the UCD descriptor and the appropriate network configuration. For CNN, we employ the convolutional network designed for handwritten and machine-printed character recognition (Lenet-5) and we adapt it to support 62 classes, including both digits and characters. In addition, GPU parallelization is studied to speed up both of MLP and CNN classifiers. Based on our experimentations, we demonstrate that the used real-time CNN is 2x more relevant than MLP when classifying characters.
Kotecha, Aachal; Longstaff, Simon; Azuara-Blanco, Augusto; Kirwan, James F; Morgan, James Edwards; Spencer, Anne Fiona; Foster, Paul J
2018-04-01
To obtain consensus opinion for the development of a standards framework for the development and implementation of virtual clinics for glaucoma monitoring in the UK using a modified Delphi methodology. A modified Delphi technique was used that involved sampling members of the UK Glaucoma and Eire Society (UKEGS). The first round scored the strength of agreement to a series of standards statements using a 9-point Likert scale. The revised standards were subjected to a second round of scoring and free-text comment. The final standards were discussed and agreed by an expert panel consisting of seven glaucoma subspecialists from across the UK. A version of the standards was submitted to external stakeholders for a 3-month consultation. There was a 44% response rate of UKEGS members to rounds 1 and 2, consisting largely of consultant ophthalmologists with a specialist interest in glaucoma. The final version of the standards document was validated by stakeholder consultation and contains four sections pertaining to the patient groups, testing methods, staffing requirements and governance structure of NHS secondary care glaucoma virtual clinic models. Use of a modified Delphi approach has provided consensus agreement for the standards required for the development of virtual clinics to monitor glaucoma in the UK. It is anticipated that this document will be useful as a guide for those implementing this model of service delivery. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2018. All rights reserved. No commercial use is permitted unless otherwise expressly granted.
ERIC Educational Resources Information Center
Cambridge Conference on School Mathematics, Newton, MA.
This is part of a student text which was written with the aim of reflecting the thinking of The Cambridge Conference on School Mathematics (CCSM) regarding the goals and objectives for mathematics. The instructional materials were developed for teaching geometry in the secondary schools. This document is chapter six and titled Motions and…
Reframing Equity under Common Core: A Commentary on the Text Exemplar List for Grades 9-12
ERIC Educational Resources Information Center
Schieble, Melissa
2013-01-01
This article provides a commentary on the text exemplar list for Grade bands 9-12 included in the Common Core documents in the United States. It is argued that a critical literacy perspective supports ELA teachers to assert a professional voice when making complex text selections based on diverse students' needs and interests. Implications…
ERIC Educational Resources Information Center
Maun, Ian
2006-01-01
This paper examines visual and affective factors involved in the reading of foreign language texts. It draws on the results of a pilot study among students of post-compulsory school stage studying French in England. Through a detailed analysis of students' reactions to texts, it demonstrates that the use of "authentic" documents under…
Improving sexually transmitted infection results notification via mobile phone technology.
Reed, Jennifer L; Huppert, Jill S; Taylor, Regina G; Gillespie, Gordon L; Byczkowski, Terri L; Kahn, Jessica A; Alessandrini, Evaline A
2014-11-01
To improve adolescent notification of positive sexually transmitted infection (STI) tests using mobile phone technology and STI information cards. A randomized intervention among 14- to 21-year olds in a pediatric emergency department (PED). A 2 × 3 factorial design with replication was used to evaluate the effectiveness of six combinations of two factors on the proportion of STI-positive adolescents notified within 7 days of testing. Independent factors included method of notification (call, text message, or call + text message) and provision of an STI information card with or without a phone number to obtain results. Covariates for logistic regression included age, empiric STI treatment, days until first attempted notification, and documentation of confidential phone number. Approximately half of the 383 females and 201 males enrolled were ≥18 years of age. Texting only or type of card was not significantly associated with patient notification rates, and there was no significant interaction between card and notification method. For females, successful notification was significantly greater for call + text message (odds ratio, 3.2; 95% confidence interval, 1.4-6.9), and documenting a confidential phone number was independently associated with successful notification (odds ratio, 3.6; 95% confidence interval, 1.7-7.5). We found no significant predictors of successful notification for males. Of patients with a documented confidential phone number who received a call + text message, 94% of females and 83% of males were successfully notified. Obtaining a confidential phone number and using call + text message improved STI notification rates among female but not male adolescents in a pediatric emergency department. Copyright © 2014 Society for Adolescent Health and Medicine. All rights reserved.
BioTextQuest(+): a knowledge integration platform for literature mining and concept discovery.
Papanikolaou, Nikolas; Pavlopoulos, Georgios A; Pafilis, Evangelos; Theodosiou, Theodosios; Schneider, Reinhard; Satagopam, Venkata P; Ouzounis, Christos A; Eliopoulos, Aristides G; Promponas, Vasilis J; Iliopoulos, Ioannis
2014-11-15
The iterative process of finding relevant information in biomedical literature and performing bioinformatics analyses might result in an endless loop for an inexperienced user, considering the exponential growth of scientific corpora and the plethora of tools designed to mine PubMed(®) and related biological databases. Herein, we describe BioTextQuest(+), a web-based interactive knowledge exploration platform with significant advances to its predecessor (BioTextQuest), aiming to bridge processes such as bioentity recognition, functional annotation, document clustering and data integration towards literature mining and concept discovery. BioTextQuest(+) enables PubMed and OMIM querying, retrieval of abstracts related to a targeted request and optimal detection of genes, proteins, molecular functions, pathways and biological processes within the retrieved documents. The front-end interface facilitates the browsing of document clustering per subject, the analysis of term co-occurrence, the generation of tag clouds containing highly represented terms per cluster and at-a-glance popup windows with information about relevant genes and proteins. Moreover, to support experimental research, BioTextQuest(+) addresses integration of its primary functionality with biological repositories and software tools able to deliver further bioinformatics services. The Google-like interface extends beyond simple use by offering a range of advanced parameterization for expert users. We demonstrate the functionality of BioTextQuest(+) through several exemplary research scenarios including author disambiguation, functional term enrichment, knowledge acquisition and concept discovery linking major human diseases, such as obesity and ageing. The service is accessible at http://bioinformatics.med.uoc.gr/biotextquest. g.pavlopoulos@gmail.com or georgios.pavlopoulos@esat.kuleuven.be Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Task-Oriented Reading of Multiple Documents: Online Comprehension Processes and Offline Products
ERIC Educational Resources Information Center
Anmarkrud, Øistein; McCrudden, Matthew T.; Bråten, Ivar; Strømsø, Helge I.
2013-01-01
We explored readers' judgments of text relevance and strategy use while they read about a controversial scientific issue in multiple conflicting documents using a think-aloud methodology and had them write a short essay after reading. Participants were university-level students. There were three main findings. First, readers discriminated…
77 FR 34943 - Privacy Act of 1974; System of Records
Federal Register 2010, 2011, 2012, 2013, 2014
2012-06-12
... document is the document published in the Federal Register. Free Internet access to the official edition of... text telephone (TTY), call the Federal Relay Service (FRS), toll free, at 1- 800-877-8339. Individuals... use PDF you must have Adobe Acrobat Reader, which is available free at the site. You may also access...
32 CFR 1900.44 - Action by appeals authority.
Code of Federal Regulations, 2010 CFR
2010-07-01
... complete record of the request consisting of the request, the document(s) (sanitized and full text) at... Panel. (c) Decision by the Historical Records Policy Board. In any cases of divided vote by the ARP, any member of that body is authorized to refer the request to the CIA Historical Records Policy Board which...
32 CFR 1908.35 - Action by appeals authority.
Code of Federal Regulations, 2010 CFR
2010-07-01
... consisting of the request, the document(s) (sanitized and full text) at issue, and the findings of the... of the Panel. (b) Action by Historical Records Policy Board. In any cases of divided vote by the ARP, any member of that body is authorized to refer the request to the CIA Historical Records Policy Board...
Multi-Filter String Matching and Human-Centric Entity Matching for Information Extraction
ERIC Educational Resources Information Center
Sun, Chong
2012-01-01
More and more information is being generated in text documents, such as Web pages, emails and blogs. To effectively manage this unstructured information, one broadly used approach includes locating relevant content in documents, extracting structured information and integrating the extracted information for querying, mining or further analysis. In…