Development of a full-text information retrieval system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keizo Oyama; Akira Miyazawa; Atsuhiro Takasu; Kouji Shibano
The authors have carried out a project to build a full-text information retrieval system. The system is designed to handle a document database comprising the full text of a large number of documents, such as academic papers. Document structures are utilized in searching and in extracting appropriate information. This paper describes the concept of structure handling and the configuration of the system.
DOE Research and Development Accomplishments Help
... be used to search, locate, access, and electronically download full-text research and development (R&D) documents. Search allows you to search the OCRed full-text documents and bibliographic information...
75 FR 55267 - Airspace Designations; Incorporation By Reference
Federal Register 2010, 2011, 2012, 2013, 2014
2010-09-10
... airspace listings in FAA Order 7400.9T in full text as proposed rule documents in the Federal Register. Likewise, all amendments of these listings were published in full text as final rules in the Federal... Order 7400.9U in full text as proposed rule documents in the Federal Register. Likewise, all amendments...
76 FR 53328 - Airspace Designations; Incorporation by Reference
Federal Register 2010, 2011, 2012, 2013, 2014
2011-08-26
... proposed changes of the airspace listings in FAA Order 7400.9U in full text as proposed rule documents in the Federal Register. Likewise, all amendments of these listings were published in full text as final... the airspace listings in FAA Order 7400.9V in full text as proposed rule documents in the Federal...
Academic Journal Embargoes and Full Text Databases.
ERIC Educational Resources Information Center
Brooks, Sam
2003-01-01
Documents the reasons for embargoes of academic journals in full text databases (i.e., publisher-imposed delays on the availability of full text content) and provides insight regarding common misconceptions. Tables present data on selected journals covering a cross-section of subjects and publishers and comparing two full text business databases.…
On the Creation of Hypertext Links in Full-Text Documents: Measurement of Inter-Linker Consistency.
ERIC Educational Resources Information Center
Ellis, David; And Others
1994-01-01
Describes a study in which several different sets of hypertext links are inserted by different people in full-text documents. The degree of similarity between the sets is measured using coefficients and topological indices. As in comparable studies of inter-indexer consistency, the sets of links used by different people showed little similarity.…
Garcia Castro, Leyla Jael; Berlanga, Rafael; Garcia, Alexander
2015-10-01
Although full-text articles are provided by the publishers in electronic formats, it remains a challenge to find related work beyond the title and abstract context. Identifying related articles based on their abstract is indeed a good starting point; this process is straightforward and does not consume as many resources as full-text based similarity would require. However, further analyses may require in-depth understanding of the full content. Two articles with highly related abstracts can be substantially different regarding the full content. How similarity differs when considering title-and-abstract versus full-text and which semantic similarity metric provides better results when dealing with full-text articles are the main issues addressed in this manuscript. We have benchmarked three similarity metrics (BM25, PMRA, and Cosine) to determine which one performs best when using concept-based annotations on full-text documents. We also evaluated variations in similarity values based on title-and-abstract against those relying on full-text. Our test dataset comprises the Genomics track article collection from the 2005 Text Retrieval Conference. Initially, we used entity recognition software to semantically annotate titles and abstracts as well as full-text with concepts defined in the Unified Medical Language System (UMLS®). For each article, we created a document profile, i.e., a set of identified concepts, term frequency, and inverse document frequency; we then applied various similarity metrics to those document profiles. We considered correlation, precision, recall, and F1 in order to determine which similarity metric performs best with concept-based annotations. For those full-text articles available in PubMed Central Open Access (PMC-OA), we also performed dispersion analyses in order to understand how similarity varies when considering full-text articles. We have found that the PubMed Related Articles similarity metric is the most suitable for full-text articles annotated with UMLS concepts. For similarity values above 0.8, all metrics exhibited an F1 around 0.2 and a recall around 0.1; BM25 showed the highest precision close to 1; in all cases the concept-based metrics performed better than the word-stem-based one. Our experiments show that similarity values vary when considering only title-and-abstract versus full-text similarity. Therefore, analyses based on full-text become useful when a research question requires going beyond title and abstract, particularly regarding connectivity across articles. Visualization available at ljgarcia.github.io/semsim.benchmark/, data available at http://dx.doi.org/10.5281/zenodo.13323. Copyright © 2015 Elsevier Inc. All rights reserved.
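To make the profile-based comparison concrete, the following sketch builds TF-IDF weighted concept profiles and scores document pairs with cosine similarity, the simplest of the three benchmarked metrics. The documents and UMLS-style concept identifiers are invented placeholders, not items from the TREC Genomics collection, and the code illustrates the general technique rather than the authors' implementation.

import math
from collections import Counter

# Hypothetical concept-annotated documents: each is a list of UMLS-style
# concept identifiers produced by an entity recognizer (invented here).
docs = {
    "doc1": ["C0017337", "C0033684", "C0017337", "C0026882"],
    "doc2": ["C0017337", "C0026882", "C0026882", "C0039194"],
    "doc3": ["C0039194", "C0007082", "C0007082"],
}

def idf(concept, corpus):
    """Inverse document frequency of a concept across the corpus."""
    n_containing = sum(1 for concepts in corpus.values() if concept in concepts)
    return math.log(len(corpus) / n_containing)

def profile(doc_id, corpus):
    """Concept profile: a TF-IDF weight for each concept in the document."""
    counts = Counter(corpus[doc_id])
    total = sum(counts.values())
    return {c: (n / total) * idf(c, corpus) for c, n in counts.items()}

def cosine(p, q):
    """Cosine similarity between two sparse concept profiles."""
    shared = set(p) & set(q)
    num = sum(p[c] * q[c] for c in shared)
    den = math.sqrt(sum(w * w for w in p.values())) * math.sqrt(sum(w * w for w in q.values()))
    return num / den if den else 0.0

profiles = {d: profile(d, docs) for d in docs}
print(cosine(profiles["doc1"], profiles["doc2"]))  # documents sharing concepts score higher
print(cosine(profiles["doc1"], profiles["doc3"]))  # no shared concepts -> 0.0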
76 FR 17353 - Aviation Communications
Federal Register 2010, 2011, 2012, 2013, 2014
2011-03-29
... publication). The full text of this document is available for public inspection and copying during regular... Printing, Inc., 445 12th Street, SW., Room CY-B402, Washington, DC 20554. The full text may also be...
77 FR 50907 - Airspace Designations; Incorporation by Reference
Federal Register 2010, 2011, 2012, 2013, 2014
2012-08-23
... FAA processed all proposed changes of the airspace listings in FAA Order 7400.9V in full text as... in full text as final rules in the Federal Register. This rule reflects the periodic integration of... changes of the airspace listings in FAA Order 7400.9W in full text as proposed rule documents in the...
78 FR 52847 - Airspace Designations; Incorporation by Reference
Federal Register 2010, 2011, 2012, 2013, 2014
2013-08-27
... FAA processed all proposed changes of the airspace listings in FAA Order 7400.9W in full text as... in full text as final rules in the Federal Register. This rule reflects the periodic integration of... changes of the airspace listings in FAA Order 7400.9X in full text as proposed rule documents in the...
NASA Technical Reports Server (NTRS)
Ambur, Manjula Y.; Adams, David L.; Trinidad, P. Paul
1997-01-01
The NASA Langley Technical Library has been developing systems for full-text information delivery of NACA/NASA technical reports since 1991. This paper describes the two prototypes it has developed and the present production system configuration. The prototypes are a CD-ROM of thirty-three classic NACA reports and a network-based Full-text Electronic Reports Documents System (FEDS) built from both paper and electronic formats of NACA and NASA reports. The production system is the DigiDoc System (DIGItal Documents), presently being developed based on the experience gained from the two prototypes. The DigiDoc configuration integrates the online catalog database, a World Wide Web interface, and PDF technology to provide a powerful and flexible search and retrieval system. The paper describes in detail significant achievements and lessons learned in terms of data conversion, storage technologies, full-text searching and retrieval, and image databases. Conclusions from the digitization and full-text access experience and future plans for DigiDoc system implementation are discussed.
Document Delivery from Full-Text Online Files: A Pilot Project.
ERIC Educational Resources Information Center
Gillikin, David P.
1990-01-01
Describes the Electronic Journal Retrieval Project (EJRP) developed at the University of Tennessee, Knoxville Libraries, to provide full-text journal articles from online systems. Highlights include costs of various search strategies; implications for library services; collection development and interlibrary loan considerations; and suggestions…
Subject Retrieval from Full-Text Databases in the Humanities
ERIC Educational Resources Information Center
East, John W.
2007-01-01
This paper examines the problems involved in subject retrieval from full-text databases of secondary materials in the humanities. Ten such databases were studied and their search functionality evaluated, focusing on factors such as Boolean operators, document surrogates, limiting by subject area, proximity operators, phrase searching, wildcards,…
32 CFR 1801.44 - Action by appeals authority.
Code of Federal Regulations, 2010 CFR
2010-07-01
... CENTER PUBLIC RIGHTS UNDER THE PRIVACY ACT OF 1974 Action On Privacy Act Administrative Appeals § 1801.44... request, the document(s) (sanitized and full text) at issue, and the findings of any concerned office...
Research notes : information at your fingertips!
DOT National Transportation Integrated Search
2000-03-01
TRIS Online includes full-text reports or links to publishers or suppliers of the original documents. You will find titles, publication dates, authors, abstracts, and document sources. : Each year over 20,000 new records are added to TRIS. The databa...
Federal Register 2010, 2011, 2012, 2013, 2014
2011-10-12
... 11-134] Facilitating the Deployment of Text-to-911 and Other Next Generation 911 Applications... 911 Public Safety Answering Points (PSAPs) via text, photos, videos, and data and enhance the... 22, 2011. The full text of this document is available for public inspection during regular business...
78 FR 46310 - Petition for Reconsideration of Action in Rulemaking Proceeding
Federal Register 2010, 2011, 2012, 2013, 2014
2013-07-31
... document, Report No. 2985, released June11, 2013. The full text of Report No. 2985 is available for viewing... the Deployment of Text-to-911 and Other Next Generation 911 Applications; Framework for Next...
Müller, H-M; Van Auken, K M; Li, Y; Sternberg, P W
2018-03-09
The biomedical literature continues to grow at a rapid pace, making the challenge of knowledge retrieval and extraction ever greater. Tools that provide a means to search and mine the full text of literature thus represent an important way by which the efficiency of these processes can be improved. We describe the next generation of the Textpresso information retrieval system, Textpresso Central (TPC). TPC builds on the strengths of the original system by expanding the full text corpus to include the PubMed Central Open Access Subset (PMC OA), as well as the WormBase C. elegans bibliography. In addition, TPC allows users to create a customized corpus by uploading and processing documents of their choosing. TPC is UIMA compliant, to facilitate compatibility with external processing modules, and takes advantage of Lucene indexing and search technology for efficient handling of millions of full text documents. Like Textpresso, TPC searches can be performed using keywords and/or categories (semantically related groups of terms), but to provide better context for interpreting and validating queries, search results may now be viewed as highlighted passages in the context of full text. To facilitate biocuration efforts, TPC also allows users to select text spans from the full text and annotate them, create customized curation forms for any data type, and send resulting annotations to external curation databases. As an example of such a curation form, we describe integration of TPC with the Noctua curation tool developed by the Gene Ontology (GO) Consortium. Textpresso Central is an online literature search and curation platform that enables biocurators and biomedical researchers to search and mine the full text of literature by integrating keyword and category searches with viewing search results in the context of the full text. It also allows users to create customized curation interfaces, use those interfaces to make annotations linked to supporting evidence statements, and then send those annotations to any database in the world. Textpresso Central URL: http://www.textpresso.org/tpc.
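The keyword-and-category search idea described above can be sketched with a tiny inverted index; a category here is simply a semantically related group of terms that expands the query. The corpus, category, and function names below are invented for illustration and do not reflect the Textpresso Central or Lucene code.

from collections import defaultdict

# Toy corpus and a toy "category" (a semantically related group of terms).
documents = {
    1: "daf-16 regulates lifespan in C. elegans",
    2: "insulin signaling controls dauer formation",
    3: "the promoter drives expression in neurons",
}
categories = {"regulation": {"regulates", "controls", "represses", "activates"}}

# Simple inverted index: term -> set of document ids.
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(keywords=(), category=None):
    """Return ids of documents containing all keywords and, if given,
    at least one term from the named category."""
    hits = set(documents)
    for kw in keywords:
        hits &= index.get(kw.lower(), set())
    if category:
        cat_hits = set()
        for term in categories[category]:
            cat_hits |= index.get(term, set())
        hits &= cat_hits
    return sorted(hits)

print(search(keywords=["lifespan"], category="regulation"))  # -> [1]
print(search(category="regulation"))                         # -> [1, 2]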
System of HPC content archiving
NASA Astrophysics Data System (ADS)
Bogdanov, A.; Ivashchenko, A.
2017-12-01
This work aims to develop a system that effectively solves the problem of storing and analyzing files containing text data, using modern software development tools, techniques, and approaches. The main challenge, defined at the problem formulation stage, of storing a large number of text documents has to be addressed with functionality such as full-text search and clustering of documents according to their contents. The main system features can be described in terms of a distributed multilevel architecture and the flexibility and interchangeability of components, achieved by encapsulating standard functionality in independent executable modules.
A Survey in Indexing and Searching XML Documents.
ERIC Educational Resources Information Center
Luk, Robert W. P.; Leong, H. V.; Dillon, Tharam S.; Chan, Alvin T. S.; Croft, W. Bruce; Allan, James
2002-01-01
Discussion of XML focuses on indexing techniques for XML documents, grouping them into flat-file, semistructured, and structured indexing paradigms. Highlights include searching techniques, including full text search and multistage search; search result presentations; database and information retrieval system integration; XML query languages; and…
1 CFR 18.16 - Reinstatement of expired regulations.
Code of Federal Regulations, 2011 CFR
2011-01-01
... Section 18.16 General Provisions ADMINISTRATIVE COMMITTEE OF THE FEDERAL REGISTER PREPARATION, TRANSMITTAL, AND PROCESSING OF DOCUMENTS PREPARATION AND TRANSMITTAL OF DOCUMENTS GENERALLY § 18.16 Reinstatement... Regulations data base which have expired by their own terms only by republishing the regulations in full text...
1 CFR 18.16 - Reinstatement of expired regulations.
Code of Federal Regulations, 2010 CFR
2010-01-01
... Section 18.16 General Provisions ADMINISTRATIVE COMMITTEE OF THE FEDERAL REGISTER PREPARATION, TRANSMITTAL, AND PROCESSING OF DOCUMENTS PREPARATION AND TRANSMITTAL OF DOCUMENTS GENERALLY § 18.16 Reinstatement... Regulations data base which have expired by their own terms only by republishing the regulations in full text...
Detection of figure and caption pairs based on disorder measurements
NASA Astrophysics Data System (ADS)
Faure, Claudie; Vincent, Nicole
2010-01-01
Figures inserted in documents convey a kind of information for which the visual modality is more appropriate than text. Fully understanding a figure often requires reading its caption or establishing a relationship with the main text through a numbered figure identifier replicated in both the caption and the main text. A figure and its caption are closely related; together they constitute a single multimodal component (an FC-pair) that Document Image Analysis cannot extract through text and graphics segmentation alone. We propose a method that goes beyond graphics and text segmentation to extract FC-pairs without fully labelling the page components. Horizontal and vertical text lines are detected in the pages, and the graphics are associated with selected text lines to initialize the FC-pair detector. Spatial and visual disorder measures are introduced to define a layout model in terms of properties, which makes it possible to cope with most of the numerous spatial arrangements of graphics and text lines. The FC-pair detector performs operations to eliminate the layout disorder and assigns a quality value to each FC-pair. The processed documents were collected from medic@, the digital historical collection of the BIUM (Bibliothèque InterUniversitaire Médicale). A first set of 98 pages constitutes the design set; a further 298 pages were collected to evaluate the system. The reported performance is the result of the full process, from binarisation of the digital images to detection of FC-pairs.
ICCE/ICCAI 2000 Full & Short Papers (Methodologies).
ERIC Educational Resources Information Center
2000
This document contains the full text of the following full and short papers on methodologies from ICCE/ICCAI 2000 (International Conference on Computers in Education/International Conference on Computer-Assisted Instruction): (1) "A Methodology for Learning Pattern Analysis from Web Logs by Interpreting Web Page Contents" (Chih-Kai Chang and…
Storing and Viewing Electronic Documents.
ERIC Educational Resources Information Center
Falk, Howard
1999-01-01
Discusses the conversion of fragile library materials to computer storage and retrieval to extend the life of the items and to improve accessibility through the World Wide Web. Highlights include entering the images, including scanning; optical character recognition; full text and manual indexing; and available document- and image-management…
A Digital Library in the Mid-Nineties, Ahead or On Schedule?
ERIC Educational Resources Information Center
Dijkstra, Joost
1994-01-01
Discussion of the future possibilities of digital library systems highlights digital projects developed at Tilburg University (Netherlands). Topics addressed include online access to databases; electronic document delivery; agreements between libraries and Elsevier Science publishers to provide journal articles; full text document delivery; and…
Federal Register 2010, 2011, 2012, 2013, 2014
2011-12-01
...: This is a summary of the Commission's document, Report No. 2937, released November 15, 2011. The full text of this document is available for viewing and copying in Room CY-B402, 445 12th Street SW...
Automatic indexing of scanned documents: a layout-based approach
NASA Astrophysics Data System (ADS)
Esser, Daniel; Schuster, Daniel; Muthmann, Klemens; Berger, Michael; Schill, Alexander
2012-01-01
Archiving official written documents such as invoices, reminders, and account statements is becoming more and more important in both business and private settings. Creating appropriate index entries for document archives, such as the sender's name, creation date, or document number, is tedious manual work. We present a novel approach to automatic indexing of documents based on generic positional extraction of index terms. For this purpose we apply knowledge of document templates, stored in a common full-text search index, to find index positions that were successfully extracted in the past.
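A rough sketch of the positional-template idea follows: for a recognized sender, positions at which index fields were successfully extracted in the past are reused on a new document with a similar layout. The data structures and example values are hypothetical, not the authors' system.

# Each OCR token is (text, line, column). A "template" stores, per field,
# the position where the value was confirmed in past documents from that sender.
templates = {
    "ACME Utilities": {"invoice_number": (2, 3), "date": (1, 4)},
}

def tokens_by_position(ocr_tokens):
    """Map (line, column) positions to token text."""
    return {(line, col): text for text, line, col in ocr_tokens}

def extract_fields(sender, ocr_tokens):
    """Reuse known field positions for a recognized sender template."""
    positions = tokens_by_position(ocr_tokens)
    fields = {}
    for field, pos in templates.get(sender, {}).items():
        if pos in positions:
            fields[field] = positions[pos]
    return fields

new_document = [
    ("ACME", 0, 0), ("Utilities", 0, 1),
    ("Date:", 1, 3), ("2012-01-31", 1, 4),
    ("Invoice", 2, 1), ("No.", 2, 2), ("4711", 2, 3),
]
print(extract_fields("ACME Utilities", new_document))
# {'invoice_number': '4711', 'date': '2012-01-31'}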
32 CFR 1901.44 - Action by appeals authority.
Code of Federal Regulations, 2010 CFR
2010-07-01
... request, the document(s) (sanitized and full text) at issue, and the findings of the concerned Deputy... appearances shall be permitted without the express permission of the Panel. (c) Decision by the Historical... refer the request to the CIA Historical Records Policy Board which acts as the senior corporate board...
Using ontology network structure in text mining.
Berndt, Donald J; McCart, James A; Luther, Stephen L
2010-11-13
Statistical text mining treats documents as bags of words, with a focus on term frequencies within documents and across document collections. Unlike natural language processing (NLP) techniques that rely on an engineered vocabulary or a full-featured ontology, statistical approaches do not make use of domain-specific knowledge. The freedom from biases can be an advantage, but at the cost of ignoring potentially valuable knowledge. The approach proposed here investigates a hybrid strategy based on computing graph measures of term importance over an entire ontology and injecting the measures into the statistical text mining process. As a starting point, we adapt existing search engine algorithms such as PageRank and HITS to determine term importance within an ontology graph. The graph-theoretic approach is evaluated using a smoking data set from the i2b2 National Center for Biomedical Computing, cast as a simple binary classification task for categorizing smoking-related documents, demonstrating consistent improvements in accuracy.
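The following sketch illustrates the graph-theoretic step under stated assumptions: a toy ontology graph with invented edges, term importance computed with the networkx PageRank implementation, and the resulting scores injected as weights on raw term frequencies. It shows the general technique, not the authors' exact pipeline or the i2b2 data.

import networkx as nx  # requires the networkx package

# Toy ontology graph (invented): edges point from a term to its parent concept.
ontology = nx.DiGraph()
ontology.add_edges_from([
    ("cigarette", "tobacco"), ("cigar", "tobacco"),
    ("tobacco", "smoking"), ("nicotine", "smoking"),
    ("smoking", "substance use"),
])

# Graph-based importance of every ontology term (PageRank; HITS would be
# nx.hits(ontology) instead).
importance = nx.pagerank(ontology, alpha=0.85)

# Inject the graph measure into a bag-of-words representation by re-weighting
# raw term frequencies for terms that appear in the ontology.
term_frequencies = {"cigarette": 3, "nicotine": 1, "doctor": 2}
weighted = {
    term: tf * (1.0 + importance.get(term, 0.0))
    for term, tf in term_frequencies.items()
}
print(weighted)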
A Full-Text-Based Search Engine for Finding Highly Matched Documents Across Multiple Categories
NASA Technical Reports Server (NTRS)
Nguyen, Hung D.; Steele, Gynelle C.
2016-01-01
This report demonstrates the full-text-based search engine that works on any Web-based mobile application. The engine has the capability to search databases across multiple categories based on a user's queries and identify the most relevant or similar. The search results presented here were found using an Android (Google Co.) mobile device; however, it is also compatible with other mobile phones.
ERIC Educational Resources Information Center
Jul, Erik
1992-01-01
Describes the use of file transfer protocol (FTP) on the INTERNET computer network and considers its use as an electronic publishing system. The differing electronic formats of text files are discussed; the preparation and access of documents are described; and problems are addressed, including a lack of consistency. (LRW)
Finding Relevant Data in a Sea of Languages
2016-04-26
full machine-translated text , unbiased word clouds , query-biased word clouds , and query-biased sentence...and information retrieval to automate language processing tasks so that the limited number of linguists available for analyzing text and spoken...the crime (stock market). The Cross-LAnguage Search Engine (CLASE) has already preprocessed the documents, extracting text to identify the language
32 CFR 1900.44 - Action by appeals authority.
Code of Federal Regulations, 2010 CFR
2010-07-01
... complete record of the request consisting of the request, the document(s) (sanitized and full text) at... Panel. (c) Decision by the Historical Records Policy Board. In any cases of divided vote by the ARP, any member of that body is authorized to refer the request to the CIA Historical Records Policy Board which...
32 CFR 1908.35 - Action by appeals authority.
Code of Federal Regulations, 2010 CFR
2010-07-01
... consisting of the request, the document(s) (sanitized and full text) at issue, and the findings of the... of the Panel. (b) Action by Historical Records Policy Board. In any cases of divided vote by the ARP, any member of that body is authorized to refer the request to the CIA Historical Records Policy Board...
An Optical Disk-Based Information Retrieval System.
ERIC Educational Resources Information Center
Bender, Avi
1988-01-01
Discusses a pilot project by the Nuclear Regulatory Commission to apply optical disk technology to the storage and retrieval of documents related to its high level waste management program. Components and features of the microcomputer-based system which provides full-text and image access to documents are described. A sample search is included.…
Mining protein function from text using term-based support vector machines
Rice, Simon B; Nenadic, Goran; Stapley, Benjamin J
2005-01-01
Background Text mining has spurred huge interest in the domain of biology. The goal of the BioCreAtIvE exercise was to evaluate the performance of current text mining systems. We participated in Task 2, which addressed assigning Gene Ontology terms to human proteins and selecting relevant evidence from full-text documents. We approached it as a modified form of the document classification task. We used a supervised machine-learning approach (based on support vector machines) to assign protein function and select passages that support the assignments. As classification features, we used a protein's co-occurring terms that were automatically extracted from documents. Results The results evaluated by curators were modest, and quite variable for different problems: in many cases we have relatively good assignment of GO terms to proteins, but the selected supporting text was typically non-relevant (precision spanning from 3% to 50%). The method appears to work best when a substantial set of relevant documents is obtained, while it works poorly on single documents and/or short passages. The initial results suggest that our approach can also mine annotations from text even when an explicit statement relating a protein to a GO term is absent. Conclusion A machine learning approach to mining protein function predictions from text can yield good performance only if sufficient training data is available, and significant amount of supporting data is used for prediction. The most promising results are for combined document retrieval and GO term assignment, which calls for the integration of methods developed in BioCreAtIvE Task 1 and Task 2. PMID:15960835
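The general shape of such a supervised approach can be sketched as follows: documents represented by their terms and a linear support vector machine deciding whether a GO term should be assigned. The passages, labels, and the scikit-learn pipeline are illustrative assumptions, not the BioCreAtIvE setup or the authors' feature engineering.

# Minimal sketch (invented toy data): decide whether the GO term
# "apoptotic process" should be assigned, based on passage text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

passages = [
    "caspase activation triggers programmed cell death",
    "the kinase localises to the mitochondrial membrane",
    "overexpression induces apoptosis in HeLa cells",
    "the promoter region contains three binding sites",
]
labels = [1, 0, 1, 0]  # 1 = supports the GO term assignment (invented labels)

classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(passages, labels)

print(classifier.predict(["knockdown suppresses programmed cell death"]))  # likely [1]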
ICCE/ICCAI 2000 Full & Short Papers (Virtual Reality in Education).
ERIC Educational Resources Information Center
2000
This document contains the full text of the following full and short papers on virtual reality in education from ICCE/ICCAI 2000 (International Conference on Computers in Education/International Conference on Computer-Assisted Instruction): (1) "A CAL System for Appreciation of 3D Shapes by Surface Development (C3D-SD)" (Stephen C. F. Chan, Andy…
How We Think and Learn. Lecture Series.
ERIC Educational Resources Information Center
National Learning Center, Washington, DC.
A lecture series was conducted in 1989 to present information on learning theories by learning theorists. This document contains short texts of the lectures; full texts are available on request. In lecture 1, Robert Chase discusses educational reform and Bonnie Guiton examines educational goals from the perspective of White House policy. In…
ERIC Educational Resources Information Center
McClean, Clare M.
1998-01-01
Reviews strengths and weaknesses of five optical character recognition (OCR) software packages used to digitize paper documents before publishing on the Internet. Outlines options available and stages of the conversion process. Describes the learning experience of Eurotext, a United Kingdom-based electronic libraries project (eLib). (PEN)
Desiderata for ontologies to be used in semantic annotation of biomedical documents.
Bada, Michael; Hunter, Lawrence
2011-02-01
A wealth of knowledge valuable to the translational research scientist is contained within the vast biomedical literature, but this knowledge is typically in the form of natural language. Sophisticated natural-language-processing systems are needed to translate text into unambiguous formal representations grounded in high-quality consensus ontologies, and these systems in turn rely on gold-standard corpora of annotated documents for training and testing. To this end, we are constructing the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-text biomedical journal articles that are being manually annotated with the entire sets of terms from select vocabularies, predominantly from the Open Biomedical Ontologies (OBO) library. Our efforts in building this corpus have illuminated infelicities of these ontologies with respect to the semantic annotation of biomedical documents, and we propose desiderata whose implementation could substantially improve their utility in this task; these include the integration of overlapping terms across OBOs, the resolution of OBO-specific ambiguities, the integration of the BFO with the OBOs and the use of mid-level ontologies, the inclusion of noncanonical instances, and the expansion of relations and realizable entities. Copyright © 2010 Elsevier Inc. All rights reserved.
SureChEMBL: a large-scale, chemically annotated patent document database.
Papadatos, George; Davies, Mark; Dedman, Nathan; Chambers, Jon; Gaulton, Anna; Siddle, James; Koks, Richard; Irvine, Sean A; Pettersson, Joe; Goncharoff, Nicko; Hersey, Anne; Overington, John P
2016-01-04
SureChEMBL is a publicly available large-scale resource containing compounds extracted from the full text, images and attachments of patent documents. The data are extracted from the patent literature according to an automated text and image-mining pipeline on a daily basis. SureChEMBL provides access to a previously unavailable, open and timely set of annotated compound-patent associations, complemented with sophisticated combined structure and keyword-based search capabilities against the compound repository and patent document corpus; given the wealth of knowledge hidden in patent documents, analysis of SureChEMBL data has immediate applications in drug discovery, medicinal chemistry and other commercial areas of chemical science. Currently, the database contains 17 million compounds extracted from 14 million patent documents. Access is available through a dedicated web-based interface and data downloads at: https://www.surechembl.org/. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
ICCE/ICCAI 2000 Full & Short Papers (Educational Agent).
ERIC Educational Resources Information Center
2000
This document contains the full text of the following papers on educational agent from ICCE/ICCAI 2000 (International Conference on Computers in Education/International Conference on Computer-Assisted Instruction): (1) "An Agent-Based Intelligent Tutoring System" (C.M. Bruff and M.A. Williams); (2) "Design of Systematic Concept…
Text mining for the biocuration workflow
Hirschman, Lynette; Burns, Gully A. P. C; Krallinger, Martin; Arighi, Cecilia; Cohen, K. Bretonnel; Valencia, Alfonso; Wu, Cathy H.; Chatr-Aryamontri, Andrew; Dowell, Karen G.; Huala, Eva; Lourenço, Anália; Nash, Robert; Veuthey, Anne-Lise; Wiegers, Thomas; Winter, Andrew G.
2012-01-01
Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations took place for the workshop on ‘Text Mining for the BioCuration Workflow’ at the third International Biocuration Conference (Berlin, 2009). We interviewed biocurators to obtain workflows from eight biological databases. This initial study revealed high-level commonalities, including (i) selection of documents for curation; (ii) indexing of documents with biologically relevant entities (e.g. genes); and (iii) detailed curation of specific relations (e.g. interactions); however, the detailed workflows also showed many variabilities. Following the workshop, we conducted a survey of biocurators. The survey identified biocurator priorities, including the handling of full text indexed with biological entities and support for the identification and prioritization of documents for curation. It also indicated that two-thirds of the biocuration teams had experimented with text mining and almost half were using text mining at that time. Analysis of our interviews and survey provide a set of requirements for the integration of text mining into the biocuration workflow. These can guide the identification of common needs across curated databases and encourage joint experimentation involving biocurators, text mining developers and the larger biomedical research community. PMID:22513129
ERIC Educational Resources Information Center
Fitzgerald, Sallyanne H.
1982-01-01
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: LEVEL: College. AUTHOR'S COMMENT: When I first began as a college composition instructor, I gave a standard explanation that definition was necessary if students wished to argue logically or to explain an unfamiliar subject. I showed examples of definitions, discussed ones in the text, and then sent…
ERIC Educational Resources Information Center
International Association of Technological Univ. Libraries, Gothenburg (Sweden).
This proceedings of the 1998 conference of the International Association of Technological University Libraries (IATUL) contains the full text of the following papers: "A Library Ready for 21st Century Services: The Case of the University of Science and Technology (UST) Library, Kumasi, Ghana" (Helena Rebecca Asamoah-Hassan);…
Typograph: Multiscale Spatial Exploration of Text Documents
DOE Office of Scientific and Technical Information (OSTI.GOV)
Endert, Alexander; Burtner, Edwin R.; Cramer, Nicholas O.
2013-12-01
Visualizing large document collections using a spatial layout of terms can enable quick overviews of information. However, these metaphors (e.g., word clouds, tag clouds) often lack interactivity for exploring the information, and the location and rendering of the terms are often not based on mathematical models that maintain relative distances from other information according to similarity metrics. Further, transitioning between levels of detail (i.e., from terms to full documents) can be challenging. In this paper, we present Typograph, a multi-scale spatial exploration visualization for large document collections. Building on term-based visualization methods, Typograph enables multiple levels of detail (terms, phrases, snippets, and full documents) within a single spatialization. Further, information is placed based on its relative similarity to other information to create the “near = similar” geography metaphor. This paper discusses the design principles and functionality of Typograph and presents a use case analyzing Wikipedia to demonstrate usage.
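The “near = similar” placement can be approximated with standard tools: compute pairwise cosine distances between documents and embed them in two dimensions with multidimensional scaling. The sketch below uses scikit-learn on invented documents and only illustrates the layout principle; it is not the Typograph implementation.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances
from sklearn.manifold import MDS

docs = [
    "solar cell efficiency and photovoltaic materials",
    "cadmium telluride thin film solar cells",
    "text mining of biomedical literature",
]

vectors = TfidfVectorizer().fit_transform(docs)
distances = cosine_distances(vectors)

# Multidimensional scaling maps the pairwise distances into 2-D coordinates,
# so that nearby points correspond to similar documents.
layout = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coordinates = layout.fit_transform(distances)

for text, (x, y) in zip(docs, coordinates):
    print(f"({x:+.2f}, {y:+.2f})  {text[:40]}")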
Federal Register 2010, 2011, 2012, 2013, 2014
2013-10-03
.... 98-170; DA 13-1807] Empowering Consumers To Prevent and Detect Billing for Unauthorized Charges... Notice DA 13-1807, released August 27, 2013 in CG Docket Nos. 11-116 and 09-158, and CC Docket No. 98-170. The full text of document DA 13- 1807 and copies of any subsequently filed documents in this matter...
Federal Register 2010, 2011, 2012, 2013, 2014
2010-06-30
... FEDERAL COMMUNICATIONS COMMISSION [CG Docket No. 02-278; DA 10-997] Consumer & Governmental... Notice DA 10-997, which seeks comment on Global Tel's petition. Pursuant to 47 CFR 1.415 and 1.419 of the... 1.1206(b). The full text of document DA 10-997 and any subsequently filed documents in this matter...
The Full Monty: Locating Resources, Creating, and Presenting a Web Enhanced History Course.
ERIC Educational Resources Information Center
Bazillion, Richard J.; Braun, Connie L.
2001-01-01
Discusses how to develop a history course using the World Wide Web; course development software; full text digitized articles, electronic books, primary documents, images, and audio files; and computer equipment such as LCD projectors and interactive whiteboards. Addresses the importance of support for faculty using technology in teaching. (PAL)
Criminal Justice Research in Libraries and on the Internet.
ERIC Educational Resources Information Center
Nelson, Bonnie R.
In addition to covering the enduring elements of traditional research on criminal justice, this new edition provides full coverage on research using the World Wide Web, hypertext documents, computer indexes, and other online resources. It gives an in-depth explanation of such concepts as databases, networks, and full text, and covers the Internet…
Challenges for automatically extracting molecular interactions from full-text articles.
McIntosh, Tara; Curran, James R
2009-09-24
The increasing availability of full-text biomedical articles will allow more biomedical knowledge to be extracted automatically with greater reliability. However, most Information Retrieval (IR) and Extraction (IE) tools currently process only abstracts. The lack of corpora has limited the development of tools that are capable of exploiting the knowledge in full-text articles. As a result, there has been little investigation into the advantages of full-text document structure, and the challenges developers will face in processing full-text articles. We manually annotated passages from full-text articles that describe interactions summarised in a Molecular Interaction Map (MIM). Our corpus tracks the process of identifying facts to form the MIM summaries and captures any factual dependencies that must be resolved to extract the fact completely. For example, a fact in the results section may require a synonym defined in the introduction. The passages are also annotated with negated and coreference expressions that must be resolved. We describe the guidelines for identifying relevant passages and possible dependencies. The corpus includes 2162 sentences from 78 full-text articles. Our corpus analysis demonstrates the necessity of full-text processing; identifies the article sections where interactions are most commonly stated; and quantifies the proportion of interaction statements requiring coherent dependencies. Further, it allows us to report on the relative importance of identifying synonyms and resolving negated expressions. We also experiment with an oracle sentence retrieval system using the corpus as a gold-standard evaluation set. We introduce the MIM corpus, a unique resource that maps interaction facts in a MIM to annotated passages within full-text articles. It is an invaluable case study providing guidance to developers of biomedical IR and IE systems, and can be used as a gold-standard evaluation set for full-text IR tasks.
ERIC Educational Resources Information Center
Texas Higher Education Coordinating Board, Austin.
This document presents two sets of data for Texas public institutions of higher learning: (1) the number of women faculty and (2) enrollment of racial and ethnic minority students. Text summaries and data tables for women include: full-time faculty, including tenured and tenure-track; full-time faculty new hires; full-time faculty promotions;…
Video to Text (V2T) in Wide Area Motion Imagery
2015-09-01
microtext) or a document (e.g., using Sphinx or Apache NLP) as an automated approach [102]. Previous work in natural language full-text searching... language processing (NLP) based module. The heart of the structured text processing module includes the following seven key word banks...
Helios: Understanding Solar Evolution Through Text Analytics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Randazzese, Lucien
This proof-of-concept project focused on developing, testing, and validating a range of bibliometric, text analytic, and machine-learning based methods to explore the evolution of three photovoltaic (PV) technologies: Cadmium Telluride (CdTe), Dye-Sensitized solar cells (DSSC), and Multi-junction solar cells. The analytical approach to the work was inspired by previous work by the same team to measure and predict the scientific prominence of terms and entities within specific research domains. The goal was to create tools that could assist domain-knowledgeable analysts in investigating the history and path of technological developments in general, with a focus on analyzing step-function changes in performance, or “breakthroughs,” in particular. The text-analytics platform developed during this project was dubbed Helios. The project relied on computational methods for analyzing large corpora of technical documents. For this project we ingested technical documents from the following sources into Helios: Thomson Scientific Web of Science (papers), the U.S. Patent & Trademark Office (patents), the U.S. Department of Energy (technical documents), the U.S. National Science Foundation (project funding summaries), and a hand curated set of full-text documents from Thomson Scientific and other sources.
An Introduction to Your College Library: Making It Work for You.
ERIC Educational Resources Information Center
Davis, H. Scott
This document presents the full text of a self-paced library skills workbook which was piloted in fall 1984 in all sections of freshmen English composition courses at Georgia College. The workbook text is divided into four units of instruction: (1) An Introduction to Georgia College's Russell Library; (2) The Divided Card Catalog Revisited and an…
Biblio-MetReS: A bibliometric network reconstruction application and server
2011-01-01
Background Reconstruction of genes and/or protein networks from automated analysis of the literature is one of the current targets of text mining in biomedical research. Some user-friendly tools already perform this analysis on precompiled databases of abstracts of scientific papers. Other tools allow expert users to elaborate and analyze the full content of a corpus of scientific documents. However, to our knowledge, no user friendly tool that simultaneously analyzes the latest set of scientific documents available on line and reconstructs the set of genes referenced in those documents is available. Results This article presents such a tool, Biblio-MetReS, and compares its functioning and results to those of other user-friendly applications (iHOP, STRING) that are widely used. Under similar conditions, Biblio-MetReS creates networks that are comparable to those of other user friendly tools. Furthermore, analysis of full text documents provides more complete reconstructions than those that result from using only the abstract of the document. Conclusions Literature-based automated network reconstruction is still far from providing complete reconstructions of molecular networks. However, its value as an auxiliary tool is high and it will increase as standards for reporting biological entities and relationships become more widely accepted and enforced. Biblio-MetReS is an application that can be downloaded from http://metres.udl.cat/. It provides an easy to use environment for researchers to reconstruct their networks of interest from an always up to date set of scientific documents. PMID:21975133
Richard P. Feynman and the Feynman Diagrams
Available in full text and on the Web. Documents: "A Theorem and Its Application to Finite Tampers," DOE Technical Report; "Fermi-Thomas Theory," DOE Technical Report, April 28, 1947; "Mathematical Formulation of the Quantum Theory..."
Transport telematics, tolling and info systems
DOT National Transportation Integrated Search
1995-10-03
This online document is the full text of the speech delivered by John Dawson, Group Public Affairs Director, The Automobile Association (AA), at the Waldorf Hotel, London, on October 3, 1995. It focuses on intelligent transportation systems, ITS, for...
Building the Digital Library Infrastructure: A Primer.
ERIC Educational Resources Information Center
Tebbetts, Diane R.
1999-01-01
Provides a framework for examining the complex infrastructure needed to successfully implement a digital library. Highlights include database development, online public-access catalogs, interactive technical services, full-text documents, hardware and wiring, licensing, access, and security issues. (Author/LRW)
Literature evidence in open targets - a target validation platform.
Kafkas, Şenay; Dunham, Ian; McEntyre, Johanna
2017-06-06
We present the Europe PMC literature component of Open Targets - a target validation platform that integrates various evidence to aid drug target identification and validation. The component identifies target-disease associations in documents and ranks the documents based on their confidence from the Europe PMC literature database, by using rules utilising expert-provided heuristic information. The confidence score of a given document represents how valuable the document is in the scope of target validation for a given target-disease association by taking into account the credibility of the association based on the properties of the text. The component serves the platform regularly with the up-to-date data since December, 2015. Currently, there are a total number of 1168365 distinct target-disease associations text mined from >26 million PubMed abstracts and >1.2 million Open Access full text articles. Our comparative analyses on the current available evidence data in the platform revealed that 850179 of these associations are exclusively identified by literature mining. This component helps the platform's users by providing the most relevant literature hits for a given target and disease. The text mining evidence along with the other types of evidence can be explored visually through https://www.targetvalidation.org and all the evidence data is available for download in json format from https://www.targetvalidation.org/downloads/data .
Extracting and connecting chemical structures from text sources using chemicalize.org.
Southan, Christopher; Stracz, Andras
2013-04-23
Exploring bioactive chemistry requires navigating between structures and data from a variety of text-based sources. While PubChem currently includes approximately 16 million document-extracted structures (15 million from patents) the extent of public inter-document and document-to-database links is still well below any estimated total, especially for journal articles. A major expansion in access to text-entombed chemistry is enabled by chemicalize.org. This on-line resource can process IUPAC names, SMILES, InChI strings, CAS numbers and drug names from pasted text, PDFs or URLs to generate structures, calculate properties and launch searches. Here, we explore its utility for answering questions related to chemical structures in documents and where these overlap with database records. These aspects are illustrated using a common theme of Dipeptidyl Peptidase 4 (DPPIV) inhibitors. Full-text open URL sources facilitated the download of over 1400 structures from a DPPIV patent and the alignment of specific examples with IC50 data. Uploading the SMILES to PubChem revealed extensive linking to patents and papers, including prior submissions from chemicalize.org as submitting source. A DPPIV medicinal chemistry paper was completely extracted and structures were aligned to the activity results table, as well as linked to other documents via PubChem. In both cases, key structures with data were partitioned from common chemistry by dividing them into individual new PDFs for conversion. Over 500 structures were also extracted from a batch of PubMed abstracts related to DPPIV inhibition. The drug structures could be stepped through each text occurrence and included some converted MeSH-only IUPAC names not linked in PubChem. Performing set intersections proved effective for detecting compounds-in-common between documents and merged extractions. This work demonstrates the utility of chemicalize.org for the exploration of chemical structure connectivity between documents and databases, including structure searches in PubChem, InChIKey searches in Google and the chemicalize.org archive. It has the flexibility to extract text from any internal, external or Web source. It synergizes with other open tools and the application is undergoing continued development. It should thus facilitate progress in medicinal chemistry, chemical biology and other bioactive chemistry domains.
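The set-intersection step mentioned above is straightforward once each document's structures are reduced to canonical identifiers (e.g., InChIKeys): compounds-in-common are literal set intersections. The identifiers and document names below are invented placeholders, not real extractions.

# Sketch of detecting compounds-in-common between documents, assuming each
# document's extracted structures have been converted to canonical identifiers.
extracted = {
    "patent_WO_example":  {"KEY-AAA", "KEY-BBB", "KEY-CCC"},
    "medchem_paper":      {"KEY-BBB", "KEY-CCC", "KEY-DDD"},
    "pubmed_abstracts":   {"KEY-CCC", "KEY-EEE"},
}

def compounds_in_common(*doc_ids):
    """Intersect the identifier sets of the named documents."""
    return set.intersection(*(extracted[d] for d in doc_ids))

print(compounds_in_common("patent_WO_example", "medchem_paper"))
# {'KEY-BBB', 'KEY-CCC'}
print(compounds_in_common("patent_WO_example", "medchem_paper", "pubmed_abstracts"))
# {'KEY-CCC'}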
77 FR 32075 - Petitions for Reconsideration of Action of Rulemaking Proceeding
Federal Register 2010, 2011, 2012, 2013, 2014
2012-05-31
.... 2950, released May 24, 2012. The full text of this document is available for viewing and copying in Room CY-B402, 445 12th Street SW., Washington, DC or may be purchased from the Commission's copy...
76 FR 11737 - Petition for Reconsideration of Action of Rulemaking Proceeding
Federal Register 2010, 2011, 2012, 2013, 2014
2011-03-03
.... 2925, released February 7, 2011. The full text of this document is available for viewing and copying in Room CY-B402, 445 12th Street, SW., Washington, DC or may be purchased from the Commission's copy...
Automatic document classification of biological literature
Chen, David; Müller, Hans-Michael; Sternberg, Paul W
2006-01-01
Background Document classification is a wide-spread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature, which marks up full text according to a shallow ontology that includes terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis elegans literature. Results We present a two-step text categorization algorithm to classify a corpus of C. elegans papers. Our classification method first uses a support vector machine-trained classifier, followed by a novel, phrase-based clustering algorithm. This clustering step autonomously creates cluster labels that are descriptive and understandable by humans. This clustering engine performed better on a standard test-set (Reuters 21578) compared to previously published results (F-value of 0.55 vs. 0.49), while producing cluster descriptions that appear more useful. A web interface allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept. Conclusion We have demonstrated a simple method to classify biological documents that embodies an improvement over current methods. While the classification results are currently optimized for Caenorhabditis elegans papers by human-created rules, the classification engine can be adapted to different types of documents. We have demonstrated this by presenting a web interface that allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept. PMID:16893465
Proceedings of the conference on diameter-limit cutting in northeastern forests
Laura S. Kenefic; Ralph D. Nyland, eds.
2006-01-01
Contains nine papers presented at the conference on diameter-limit cutting in northeastern forests on May 23-24, 2005, at the University of Massachusetts at Amherst. NOTE, this is the full-text document. Individual papers are available via Treesearch.
Fei, Lin; Zhao, Jing; Leng, Jiahao; Zhang, Shujian
2017-10-12
The ALIPORC full-text database is a specialized full-text database of acupuncture literature from the Republic of China period. Under construction since 2015, the database has been progressively completed, focusing on acupuncture-related books, articles, and advertising documents written or published during the Republic of China period. The construction of this database aims to enable shared access to Republic of China acupuncture medical literature through diverse retrieval approaches and accurate content presentation; it facilitates scholarly exchange, reduces the damage to paper originals caused by page turning, and simplifies retrieval of this rare literature. The authors explain the database in terms of its sources, characteristics, and current state of construction, and discuss improving the efficiency and integrity of the database and deepening the development of acupuncture literature of the Republic of China period.
Organizational Influences on the University Electronic Library.
ERIC Educational Resources Information Center
Davies, Clare
1997-01-01
Reviews the literature on the development of full-text electronic libraries in the academic setting. Organizational factors can have impact on electronic library development and ultimate usability. Topics include strategic management, planning and implementation; system specification and design; document provision; user support and training; and…
The Korean War: An ERIC/ChESS Sample.
ERIC Educational Resources Information Center
Pinhey, Laura A.
2000-01-01
Provides a list of teaching materials and general background documents about the Korean War from the ERIC database. Offers directions for obtaining the full text of materials about the division of South and North Korea, the geography of Korea, and South Korea's economic development. (CMK)
Astronomical Software Directory Service
NASA Technical Reports Server (NTRS)
Hanisch, R. J.; Payne, H.; Hayes, J.
1998-01-01
This is the final report on the development of the Astronomical Software Directory Service (ASDS), a distributable, searchable, WWW-based database of software packages and their related documentation. ASDS provides integrated access to 56 astronomical software packages, with more than 16,000 URLs indexed for full-text searching.
García-Sánchez, E; Rubio-Arias, J A; Ávila-Gandía, V; Ramos-Campo, D J; López-Román, J
2016-06-01
To analyse the content of various published studies related to physical exercise and its effects on urinary incontinence and to determine the effectiveness of pelvic floor training programmes. We conducted a search in the databases of PubMed, CINAHL, the Cochrane Plus Library, The Cochrane Library, WOS and SPORTDiscus and a manual search in the Google Scholar metasearcher, using search descriptors for documents published in the last 10 years in Spanish or English. The documents needed to have an abstract or complete text on the treatment of urinary incontinence in female athletes and in women in general. We selected 3 full-text articles on treating urinary incontinence in female athletes and 6 full-text articles and 1 abstract on treating urinary incontinence in women in general. The 9 studies included in the review achieved positive results, i.e., there was improvement in the condition in all of the studies. Physical exercise, specifically pelvic floor muscle training programmes, has positive effects on urinary incontinence. This type of training has been shown to be an effective programme for treating urinary incontinence, especially stress urinary incontinence. Copyright © 2015 AEU. Published by Elsevier España, S.L.U. All rights reserved.
Dynamic "inline" images: context-sensitive retrieval and integration of images into Web documents.
Kahn, Charles E
2008-09-01
Integrating relevant images into web-based information resources adds value for research and education. This work sought to evaluate the feasibility of using "Web 2.0" technologies to dynamically retrieve and integrate pertinent images into a radiology web site. An online radiology reference of 1,178 textual web documents was selected as the set of target documents. The ARRS GoldMiner image search engine, which incorporated 176,386 images from 228 peer-reviewed journals, retrieved images on demand and integrated them into the documents. At least one image was retrieved in real-time for display as an "inline" image gallery for 87% of the web documents. Each thumbnail image was linked to the full-size image at its original web site. Review of 20 randomly selected Collaborative Hypertext of Radiology documents found that 69 of 72 displayed images (96%) were relevant to the target document. Users could click on the "More" link to search the image collection more comprehensively and, from there, link to the full text of the article. A gallery of relevant radiology images can be inserted easily into web pages on any web server. Indexing by concepts and keywords allows context-aware image retrieval, and searching by document title and subject metadata yields excellent results. These techniques allow web developers to easily incorporate a context-sensitive image gallery into their documents.
78 FR 32991 - Connect America Fund
Federal Register 2010, 2011, 2012, 2013, 2014
2013-06-03
..., 2013. The full text of this document is available for public inspection during regular business hours.... Introduction 1. In the USF/ICC Transformation Order, 76 FR 73830, November 29, 2011, the Commission... the USF/ICC Transformation Order, an unsubsidized competitor in areas where the price cap carrier will...
ERIC Educational Resources Information Center
Sequoia Union High School District, Redwood City, CA.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: The Peninsula Academies program helps educationally disadvantaged youth overcome the handicaps of low academic achievement, lack of skills, and chronic unemployment. This is accomplished by providing a high school curriculum that is clearly related to work, training in specific job skills, emphasis…
Nouns/Pronouns. MicroSIFT Courseware Evaluation.
ERIC Educational Resources Information Center
Northwest Regional Educational Lab., Portland, OR.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT (Except for the Evaluation Summary Table): VERSION: 3.3. PRODUCER: Hartley Courseware, Inc., Box 431, Dimondale, Michigan 48821. EVALUATION COMPLETED: January 1983 at the North Clackamas School District, Milwaukie, Oregon, and at Northwest Regional Educational Laboratory, Portland, Oregon. COST:…
Letter Recognition. MicroSIFT Courseware Evaluation.
ERIC Educational Resources Information Center
Northwest Regional Educational Lab., Portland, OR.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT (Except for the Evaluation Summary Table): VERSION: 3.3. PRODUCER: Hartley Courseware, Inc., Box 431, Dimondale, Michigan 48821. EVALUATION COMPLETED: January 1983 at the North Clackamas School District, Milwaukie, Oregon, and at Northwest Regional Educational Laboratory, Portland, Oregon. COST:…
ERIC Educational Resources Information Center
Moreland Elementary School District, San Jose, CA.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: Recognition for special effort and achievement has been noted as a component of effective schools. Schools in the Moreland School District have effectively improved standards of discipline and achievement by providing forty-six different ways for children to receive positive recognition. Good…
The Fifth Annual Pennsylvania Conference on Postsecondary Occupational Education.
ERIC Educational Resources Information Center
Gillie, Angelo C., Ed.
The document contains the full text of the following conference papers: Introduction: Cooperative Ventures in Vocational Education: Pennsylvania Style, by Angelo C. Gillie, Sr.; Cooperation and Coordination Among Secondary and Postsecondary Vocational Education: The Massachusetts Story, Charles H. Buzzell and Vincent P. Lamo; Cooperation and…
ERIC Educational Resources Information Center
Dubois, Barbara R.
1983-01-01
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: LEVEL: High school and college. AUTHOR'S COMMENT: Many would like to abandon the distinction between "lay" and "lie," but I still receive enough questions about it to continue teaching it. Finding that students did not believe me when I taught them to substitute…
Improvement of School Climate.
ERIC Educational Resources Information Center
Sierra Sands Unified School District, Ridgecrest, CA.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: As a part of its School Improvement Program, James Monroe Junior High School planned to improve its school climate. Since the physical school environment was devoid of landscaping and did not provide places for student socialization, all interested groups (PTSA, student council, students, staff, and…
Are CD-ROM LANs a Thing of the Past?
ERIC Educational Resources Information Center
Mehta, Apurva
1996-01-01
Remote access to full-text and CD-ROM databases using the Internet has advantages over a CD-ROM local area network. Topics include speed, document delivery, multiple platforms, technical support, licensing, copyright, and access to graphics. Considerations of duplication of information, platform compatibility, print versus digital media, back…
Sentences. MicroSIFT Courseware Evaluation.
ERIC Educational Resources Information Center
Northwest Regional Educational Lab., Portland, OR.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT (Except for the Evaluation Summary Table): VERSION: Apple II. PRODUCER: Micro Power & Light Company, 12820 Hillcrest Rd., Suite 224, Dallas, Texas 75230. EVALUATION COMPLETED: June 1982 by the staff and constituents of the Portland Public Schools, Multnomah ESD, Portland, Oregon. COST: $24.95.…
Magic Spells. MicroSIFT Courseware Evaluation.
ERIC Educational Resources Information Center
Northwest Regional Educational Lab., Portland, OR.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT (Except for the Evaluation Summary Table): VERSION: Copyright 1981. PRODUCER: Advanced Learning Technology, Inc., 4370 Alpine Road, Portola Valley, CA 94025. EVALUATION COMPLETED: January, 1983 by the Oakland ISD, Pontiac, Michigan. COST: $45.00. ABILITY LEVEL: Grades 1 to 8. SUBJECT: Language Arts.…
Word Families. MicroSIFT Courseware Evaluation.
ERIC Educational Resources Information Center
Northwest Regional Educational Lab., Portland, OR.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT (Except for the Evaluation Summary Table): VERSION: 1981. PRODUCER: Hartley Courseware, Inc., Box 431, Dimondale, Michigan 48821. EVALUATION COMPLETED: January, 1983 at the Clackamas County ESD in Milwaukie, Oregon. COST: $29.95. ABILITY LEVEL: Pre-school through grade 2. SUBJECT: Language Arts.…
Create Spell-It. MicroSIFT Courseware Evaluation.
ERIC Educational Resources Information Center
Northwest Regional Educational Lab., Portland, OR.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT (Except for the Evaluation Summary Table): Version: 1981. PRODUCER: Hartley Courseware, Inc., Box 431, Dimondale, Michigan 48821. EVALUATION COMPLETED: January, 1983 at the Clackamas County ESD, Milwaukie, Oregon. COST: $26.95. ABILITY LEVEL: Pre-school through grade 10. SUBJECT: Language Arts.…
Word Search. MicroSIFT Courseware Evaluation.
ERIC Educational Resources Information Center
Northwest Regional Educational Lab., Portland, OR.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT (Except for the Evaluation Summary Table): VERSION: 1981. PRODUCER: Hartley Courseware, Inc., Box 431, Dimondale, Michigan 48821. EVALUATION COMPLETED: January, 1983 at the Clackamas County ESD in Milwaukie, Oregon. COST: $26.95. ABILITY LEVEL: Grades 2-6. SUBJECT: Language Arts. MEDIUM OF TRANSFER:…
Alphabet Keyboard. MicroSIFT Courseware Evaluation.
ERIC Educational Resources Information Center
Northwest Regional Educational Lab., Portland, OR.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT (Except for the Evaluation Summary Table): PRODUCER: Random House, Inc., School Division, 1970 Brandywine Rd., Atlanta, Georgia 30341. EVALUATION COMPLETED: June 1982 by staff of the Portland Public Schools, Oregon. COST: Cassette: $24 Disk: $34.50. ABILITY LEVEL: K-1. SUBJECT: Reading: location of…
Titration. MicroSIFT Courseware Evaluation.
ERIC Educational Resources Information Center
Northwest Regional Educational Lab., Portland, OR.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT (Except for the Evaluation Summary Table): VERSION: 1980. PRODUCER: Mentor Software, Inc., Box 8082, St. Paul, Minn. 55113. EVALUATION COMPLETED: April 1982, by staff and constituents of the Texas Region X Educational Service Center. COST: $19.95. ABILITY LEVEL: Grades 10-14. SUBJECT: Chemistry:…
Millikan. MicroSIFT Courseware Evaluation.
ERIC Educational Resources Information Center
Northwest Regional Educational Lab., Portland, OR.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT (Except for the Evaluation Summary Table): VERSION: Copyright 1979. PRODUCER: Mentor Software, Inc., Box 8082, St. Paul, Minnesota 55113. EVALUATION COMPLETED: March 14, 1982 by the staff and constituents of Texas Region X Educational Service Center. COST: $19.95. ABILITY LEVEL: Grade 11+. SUBJECT:…
Redefining Information Access to Serials Information.
ERIC Educational Resources Information Center
Chen, Ching-chih
1992-01-01
Describes full-text document delivery services that have been introduced in conjunction with available databases in response to economic and technological changes affecting libraries: (1) CARL System's UnCover database and UnCover2 service; (2) Research Libraries Group's CitaDel delivery service; and (3) Faxon Research Service's Faxon Finder and…
ERIC Educational Resources Information Center
International Federation of Library Associations and Institutions, The Hague (Netherlands).
Papers on acquisitions and exchange presented at the 1986 International Federation of Library Associations (IFLA) conference include: (1) a condensed English version and the full German text of the presentation, "Document Exchange and the Deutsche Forschungsgemeinschaft (German Research Council)--The Acquisition of Grey and Special Literature…
Spelling Strategy. MicroSIFT Courseware Evaluation.
ERIC Educational Resources Information Center
Northwest Regional Educational Lab., Portland, OR.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT (Except for the Evaluation Summary Table): VERSION: Copyright 1981. PRODUCER: Behavioral Engineering, 230 Mt. Hermon Road, Suite 207, Scotts Valley, CA 95066. EVALUATION COMPLETED: January, 1983 at the Oakland ISD in Pontiac, Michigan. COST: $45.00. ABILITY LEVEL: Grades 2 to 8. SUBJECT: Language…
Newton. MicroSIFT Courseware Evaluation.
ERIC Educational Resources Information Center
Northwest Regional Educational Lab., Portland, OR.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT (Except for the Evaluation Summary Table): VERSION: Member's Apple Demonstration Kit. PRODUCER: Conduit, P.O. Box 388, Iowa City, Iowa 52244. EVALUATION COMPLETED: June 1982 by the staff and constituents of the Portland Public Schools, Multnomah ESD, Portland, Oregon. COST: $35.00. ABILITY LEVEL:…
ERIC Educational Resources Information Center
Placer Hills Union Elementary School District, Meadow Vista, CA.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: The "Good Citizen" Program was developed for many reasons: to keep the campus clean, to reward students for improvement, to reward students for good deeds, to improve the total school climate, to reward students for excellence, and to offer staff members a method of reward for positive…
ERIC Educational Resources Information Center
Solana Beach Elementary School District, CA.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: MATH AND BEYOND is a schoolwide math incentive program designed to encourage students--and their parents--to investigate and explore the world of mathematics beyond those experiences provided during the school day. The program focuses on experiences and activities in seven different areas of math:…
Evolut. MicroSIFT Courseware Evaluation.
ERIC Educational Resources Information Center
Northwest Regional Educational Lab., Portland, OR.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT (Except for the Evaluation Summary Table): VERSION: Conduit Demonstration Package. PRODUCER: Conduit, P. O. Box 388, Iowa City, IA 52244. EVALUATION COMPLETED: June, 1982, by staff and constituents of the Portland Public Schools, Multnomah ESD, Oregon. COST: $30.00. ABILITY LEVEL: Post-secondary,…
Digital Libraries: Situating Use in Changing Information Infrastructure.
ERIC Educational Resources Information Center
Bishop, Ann Peterson; Neumann, Laura J.; Star, Susan Leigh; Merkel, Cecelia; Ignacio, Emily; Sandusky, Robert J.
2000-01-01
Reviews empirical studies about how digital libraries evolve for use in scientific and technical work based on the Digital Libraries Initiative (DLI) at the University of Illinois. Discusses how users meet infrastructure and document disaggregation; describes use of the DLI testbed of full text journal articles; and explains research methodology.…
Speed Reader. MicroSIFT Courseware Evaluation.
ERIC Educational Resources Information Center
Northwest Regional Educational Lab., Portland, OR.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT (Except for the Evaluation Summary Table): VERSION: Copyright 1981. PRODUCER: Davidson and Associates, 6069 Groveoak Place #12, Rancho Palos Verdes, CA 90274. EVALUATION COMPLETED: January, 1983 by the Oakland ISD of Pontiac, Michigan. COST: $70.00. ABILITY LEVEL: Secondary. SUBJECT: Reading. MEDIUM…
ERIC Educational Resources Information Center
Rosemead Elementary School District, CA.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: How does a school provide the computer learning experiences for students given the paucity of available funding for hardware, software, and staffing? Here is what one school, Emma W. Shuey in Rosemead, did after exploratory research on computers by a committee of teachers and administrators. The…
75 FR 77872 - Petition for Reconsideration of Action in Rulemaking Proceeding
Federal Register 2010, 2011, 2012, 2013, 2014
2010-12-14
... FEDERAL COMMUNICATIONS COMMISSION [Report No. 2923] Petition for Reconsideration of Action in Rulemaking Proceeding December 3, 2010. A Petition for Reconsideration has been filed in the Commission's Rulemaking proceeding listed in this document and published pursuant to 47 CFR 1.429(e). The full text of...
A strategy for electronic dissemination of NASA Langley technical publications
NASA Technical Reports Server (NTRS)
Roper, Donna G.; Mccaskill, Mary K.; Holland, Scott D.; Walsh, Joanne L.; Nelson, Michael L.; Adkins, Susan L.; Ambur, Manjula Y.; Campbell, Bryan A.
1994-01-01
To demonstrate NASA Langley Research Center's relevance and to transfer technology to external customers in a timely and efficient manner, Langley has formed a working group to study and recommend a course of action for the electronic dissemination of technical reports (EDTR). The working group identified electronic report requirements (e.g., accessibility, file format, search requirements) of customers in U.S. industry through numerous site visits and personal contacts. Internal surveys were also used to determine commonalities in document preparation methods. From these surveys, a set of requirements for an electronic dissemination system was developed. Two candidate systems were identified and evaluated against the set of requirements: the Full-Text Electronic Documents System (FEDS), which is a full-text retrieval system based on the commercial document management package Interleaf, and the Langley Technical Report Server (LTRS), which is a Langley-developed system based on the publicly available World Wide Web (WWW) software system. Factors that led to the selection of LTRS as the vehicle for electronic dissemination included searching and viewing capability, current system operability, and client software availability for multiple platforms at no cost to industry. This report includes the survey results, evaluations, a description of the LTRS architecture, recommended policy statement, and suggestions for future implementations.
Semi automatic indexing of PostScript files using Medical Text Indexer in medical education.
Mollah, Shamim Ara; Cimino, Christopher
2007-10-11
At Albert Einstein College of Medicine, a large part of the online lecture materials consists of PostScript files. As the collection grows, it becomes essential to create a full-text-indexed digital library that offers easy access to relevant sections of the lecture material; building this index requires extracting all the text from the document files that constitute the originals of the lectures. In this study we present a semi-automatic indexing method that uses a robust technique for extracting text from PostScript files and the National Library of Medicine's Medical Text Indexer (MTI) program for indexing the text. This model can be applied to other medical schools for indexing purposes.
CancerNet redistribution via WWW.
Quade, G; Püschel, N; Far, F
1996-01-01
CancerNet from the National Cancer Institute contains nearly 500 ASCII files, updated monthly, with up-to-date information about cancer and the "Golden Standard" in tumor therapy. Perl scripts are used to convert these files to HTML documents. A complex algorithm, using regular expression matching and extensive exception handling, detects headlines, listings and other constructs of the original ASCII text and converts them into their HTML counterparts. A table of contents is also created during the process. The resulting files are indexed for full-text search via WAIS. Building the complete CancerNet WWW redistribution takes less than two hours with a minimum of manual work. With 26,000 information requests to our service per month, the average cost of delivering one document worldwide is about 19 cents.
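The conversion the abstract describes is rule-based: regular expressions recognize headlines and listings in the ASCII source, emit HTML, and build a table of contents along the way. The original scripts are in Perl and their exact rules are not published in the abstract; the Python sketch below only illustrates the general shape, with invented heuristics (all-caps lines as headlines, leading dashes or asterisks as list items).

```python
import html
import re


def ascii_to_html(text: str) -> str:
    """Convert a plain-ASCII document to simple HTML with a table of contents.

    Hypothetical heuristics (not the original CancerNet rules): lines in
    ALL CAPS become <h2> headlines, lines starting with '-' or '*' become
    list items, and everything else becomes a paragraph.
    """
    body, toc, in_list = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if re.fullmatch(r"[A-Z][A-Z0-9 ,.:'()/-]{2,}", line):  # headline heuristic
            if in_list:
                body.append("</ul>")
                in_list = False
            anchor = f"h{len(toc)}"
            toc.append(f'<li><a href="#{anchor}">{html.escape(line)}</a></li>')
            body.append(f'<h2 id="{anchor}">{html.escape(line)}</h2>')
        elif re.match(r"[-*]\s+", line):                        # list-item heuristic
            if not in_list:
                body.append("<ul>")
                in_list = True
            item = re.sub(r"^[-*]\s+", "", line)
            body.append("<li>" + html.escape(item) + "</li>")
        else:
            if in_list:
                body.append("</ul>")
                in_list = False
            body.append("<p>" + html.escape(line) + "</p>")
    if in_list:
        body.append("</ul>")
    return "\n".join(
        ["<html><body>", "<h1>Contents</h1>", "<ul>"] + toc + ["</ul>"]
        + body + ["</body></html>"]
    )
```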
Volcanoes. MicroSIFT Courseware Evaluation.
ERIC Educational Resources Information Center
Northwest Regional Educational Lab., Portland, OR.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT (Except for the Evaluation Summary Table): VERSION: 1.4E. PRODUCER: Earthware Computer Services, P.O. Box 30039, Eugene, OR 07403. EVALUATION COMPLETED: June 1982 by staff of NWREL and constituents of the Alaska Department of Education. COST: $49.50. ABILITY LEVEL: Secondary and College. SUBJECT:…
Highlights from Evaluation of EBCE.
ERIC Educational Resources Information Center
Bucknam, Ronald B.; Brand, Sheara G.
1983-01-01
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT A meta-analysis of 80 third-party evaluations (all that were conducted) of Experience-Based Career Education programs shows that in the large majority of programs: (1) EBCE students made large gains not only in career skills and life attitudes but also in academic skills; (2) EBCE students gained…
Circulation (Organs). MicroSIFT Courseware Evaluation.
ERIC Educational Resources Information Center
Northwest Regional Educational Lab., Portland, OR.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT (Except for the Evaluation Summary Table): PRODUCER: Micro Power and Light Company, Keystone Park, Suite 1108, 13773 N. Central Expressway, Dallas, TX 75243. LOCAL DISTRIBUTORS: Contact producer for list. EVALUATION COMPLETED: Fall 1981. VERSION: Apple II. COST: $29.95. ABILITY LEVEL: Grades 5-12.…
ERIC Educational Resources Information Center
Maye, Jessica, Ed.; Miyashita, Mizuki, Ed.
This document contains the full texts of six papers that were presented at the Southwest workshop on optimality theory. Papers include the following: "Shuswap Diminutive Reduplication" (Sean Hendricks); "On Multiple Sympathy Candidates in Optimality Theory" (Hidehito Hoshi); "A Perceptually Grounded OT Analysis of…
The LENR-CANR.ORG Website, its Past and Future
NASA Astrophysics Data System (ADS)
Rothwell, J.; Storms, E.
2005-12-01
The LENR-CANR.org web site has proven to be a popular source of information about cold fusion. This site has distributed more full text papers about LENR than any other source. In addition, it contains many features that allow easy search and insertion of the discovered references into a document.
Creating a New Definition of Library Cooperation: Past, Present, and Future Models.
ERIC Educational Resources Information Center
Lenzini, Rebecca T.; Shaw, Ward
1991-01-01
Describes the creation and purpose of the Colorado Alliance of Research Libraries (CARL), the subsequent development of CARL Systems, and its current research projects. Topics discussed include online catalogs; UnCover, a journal article database; full text data; document delivery; visual images in computer systems; networks; and implications for…
ERIC Educational Resources Information Center
Atkinson, Roderick D.; Stackpole, Laurie E.
1995-01-01
The Naval Research Laboratory (NRL) Library and the American Physical Society (APS) are experimenting with electronically disseminating journals and reports in a project called TORPEDO (The Optical Retrieval Project: Electronic Documents Online). Scanned journals and reports are converted to ASCII, then attached to bibliographic information, and…
Homonyms in Context. MicroSIFT Courseware Evaluation.
ERIC Educational Resources Information Center
Northwest Regional Educational Lab., Portland, OR.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT (Except for the Evaluation Summary Table): VERSION: Apple II. PRODUCER: Random House, School Division, 2970 Brandywine Road, Atlanta, Georgia 30341. EVALUATION COMPLETED: June 1982 by the staff and constituents of the Portland Public Schools, Portland, Oregon. COST: Apple II and Radio Shack TRS-80…
Odell Lake. MicroSIFT Courseware Evaluation.
ERIC Educational Resources Information Center
Northwest Regional Educational Lab., Portland, OR.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT (Except for the Evaluation Summary Table): PRODUCER: MECC Publications, 2520 Broadway Drive, St. Paul, MN 55113. LOCAL DISTRIBUTORS: Contact producer for list. EVALUATION COMPLETED: Fall 1981, revised February 1, 1982. VERSION: 4.3. COST: Varied; sold in package of several programs on a disk at $30…
ERIC Educational Resources Information Center
San Marcos Unified School District, CA.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: After viewing many computer-literacy programs, we believe San Marcos Junior High School has developed a unique program which will truly develop computer literacy. Our hope is to give all students a comprehensive look at computers as they go through their two years here. They will not only learn the…
My First Alphabet. MicroSIFT Courseware Evaluation.
ERIC Educational Resources Information Center
Northwest Regional Educational Lab., Portland, OR.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT (Except for the Evaluation Summary Table): VERSION: Atari APX-20083. PRODUCER: Atari, Inc., 60 E. Plumeria, P.O. Box 50047, San Jose, California 95050. EVALUATION COMPLETED: September 1982 by the staff and constituents of the Capital Children's Museum. Their evaluation is partly based on observation…
Comparing the Document Representations of Two IR-Systems: CLARIT and TOPIC.
ERIC Educational Resources Information Center
Paijmans, Hans
1993-01-01
Compares two information retrieval systems, CLARIT and TOPIC, in terms of assigned versus derived and precoordinate versus postcoordinate indexing. Models of information retrieval systems are discussed, and a test of the systems using a demonstration database of full-text articles from the "Wall Street Journal" is described. (Contains 21…
American History. Computer Programs.
ERIC Educational Resources Information Center
Lengel, James G.
1983-01-01
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: Seven interactive computer programs are available to help with the study of American History. They cover the period of the 17th century up through the present day, and involve a variety of approaches to instruction. These programs were conceived and programmed by Jim Lengel, a former state social…
Benefits of Coaching on Test Scores Seen as Negligible.
ERIC Educational Resources Information Center
Report on Education Research, 1983
1983-01-01
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: A new study by a pair of Harvard University researchers discounts earlier findings that coaching can substantially improve student performance on the Scholastic Aptitude Test (SAT). "There is simply insufficient evidence that large score increases are a result of a coaching program," write…
The new on-line Czech Food Composition Database.
Machackova, Marie; Holasova, Marie; Maskova, Eva
2013-10-01
The new on-line Czech Food Composition Database (FCDB) was launched at http://www.czfcdb.cz in December 2010 as the main freely available channel for dissemination of Czech food composition data. The application is based on a compiled FCDB documented according to the EuroFIR standardised procedure for full value documentation and indexing of foods by the LanguaL™ Thesaurus. A content management system was implemented for administration of the website and for data export (comma-separated values or EuroFIR XML transport package formats) by a compiler. References are provided for each published value, with links to freely accessible on-line data sources (e.g. full texts, EuroFIR Document Repository, on-line national FCDBs). LanguaL™ codes are displayed within each food record as searchable keywords of the database. A photo (or a photo gallery) is used as a visual descriptor of a food item. The application can be searched by food, component, food group, and alphabetically, and also offers a multi-field advanced search. Copyright © 2013 Elsevier Ltd. All rights reserved.
Machine Aided Indexing and the NASA Thesaurus
NASA Technical Reports Server (NTRS)
vonOfenheim, Bill
2007-01-01
Machine Aided Indexing (MAI) is a Web-based application program for aiding the indexing of literature in the NASA Scientific and Technical Information (STI) Database. MAI was designed to be a convenient, fully interactive tool for determining the subject matter of documents and identifying keywords. The heart of MAI is a natural-language processor that accepts, as input, any user-supplied text, including abstracts, full documents, and Web pages. Within seconds, the text is analyzed and a ranked list of terms is generated. The 17,800 terms of the NASA Thesaurus serve as the foundation of the knowledge base used by MAI. The NASA Thesaurus defines a standard vocabulary, the use of which enables MAI to assist in ensuring that STI documents are uniformly and consistently accessible. Of particular interest to traditional users of the NASA Thesaurus, MAI incorporates a fully searchable thesaurus display module that affords word-search and hierarchy-navigation capabilities that make it much easier and less time-consuming to look up terms and browse, relative to lookup and browsing in older print and Portable Document Format (PDF) digital versions of the Thesaurus. In addition, because MAI is centrally hosted, the Thesaurus data are always current.
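The core behavior described above, analyzing free text and returning a ranked list of controlled-vocabulary terms, can be approximated with a simple frequency matcher. The sketch below is not the MAI natural-language processor; the miniature term list and occurrence-count scoring are illustrative assumptions.

```python
import re
from collections import Counter

# Hypothetical excerpt of a controlled vocabulary; the real NASA Thesaurus
# contains roughly 17,800 terms and is not reproduced here.
THESAURUS = ["aeroelasticity", "flutter", "wind tunnel", "composite material"]


def suggest_terms(text, thesaurus=THESAURUS):
    """Return thesaurus terms found in the text, ranked by occurrence count.

    A naive phrase matcher; a real machine-aided indexer would add stemming,
    synonym mapping, and use of the thesaurus hierarchy.
    """
    lowered = re.sub(r"\s+", " ", text.lower())
    counts = Counter()
    for term in thesaurus:
        hits = lowered.count(term.lower())
        if hits:
            counts[term] = hits
    return counts.most_common()


# Example:
# suggest_terms("Flutter was observed in the wind tunnel; the wind tunnel "
#               "model used a composite material spar.")
# -> [('wind tunnel', 2), ('flutter', 1), ('composite material', 1)]
```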
Improving text recall with multiple summaries.
van der Meij, Hans; van der Meij, Jan
2012-06-01
QuikScan (QS) is an innovative design that aims to improve accessibility, comprehensibility, and subsequent recall of expository text by means of frequent within-document summaries that are formatted as numbered list items. The numbers in the QS summaries correspond to numbers placed in the body of the document where the summarized ideas are discussed in full. To examine the influence of QS summaries on participants' perceptions of text quality (i.e., comprehensibility, structure, and interest) and recall, an experimental-control group design compared the effects of a QS text with a structured abstract (SA) text. Forty psychology students participated voluntarily or received course credits. Students first read a control (SA) or experimental (QS) text on flashbulb memory (FBM). Next, their perceptions of text quality were measured through a questionnaire. Recall was assessed with an open answer test with items for facts, comprehension and higher order information. Perceptions of text quality did not vary across conditions. But QS did lead to significantly and substantially (d = 1.57) higher overall recall scores. Participants with the QS text performed significantly better on all item types than participants with the SA text. Studying a QS text led to a substantial improvement in recall compared to an SA text. Further research is needed to examine how readers study QS texts and whether a text model hypothesis or a repetition effect hypothesis accounts for the effectiveness. The first hypothesis posits that the QS summaries support the reader in constructing a text schema. The second attributes the effects of these summaries to their repetition of text topics. ©2011 The British Psychological Society.
Creating "Informed Interest" in Education. The Editor's Page.
ERIC Educational Resources Information Center
Cole, Robert W.
1983-01-01
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: Something good is happening in Indiana that may be a model for the nation. The Indiana Congress on Education, which convened for the first time last June, could be an unconventional but effective way to change public policy. Throughout the fall, we've been treated to demonstrations of the…
FACES (Friday Afternoon Choices for Enrichment for Our Students).
ERIC Educational Resources Information Center
Myers, Donna
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: It has been the goal of the staff and parents at Ramona Elementary School to provide more enriching opportunities for our students. We want to stimulate learning and expand our horizons in every area of the curriculum. Parents, community members, and the school staff work together to provide these…
Highlights of Research on Right and Left Hemispheres of the Brain.
ERIC Educational Resources Information Center
Levy, Jerre
1983-01-01
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: Simplified interpretations of brain function portraying rationality solely in the left hemisphere and creativity solely in the right are incorrect, but the two sides of the brain do differ in important ways. Researchers have discovered that: In the vast majority of right handers, speech is almost…
Grammar Package 1. MicroSIFT Courseware Evaluation.
ERIC Educational Resources Information Center
Northwest Regional Educational Lab., Portland, OR.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT (Except for the Evaluation Summary Table): VERSION: TRS-80. PRODUCER: Micro Learning Ware, P.O. Box 2134, N. Mankato, MN 56001. EVALUATION COMPLETED: June 22, 1982 by the staff and constituents of the Portland Public Schools, Portland, Oregon. COST: $24.95. ABILITY LEVEL: 4-5. SUBJECT: Language arts.…
ERIC Educational Resources Information Center
Lincoln Unified School District, Stockton, CA.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: A group of seven people, four parents, two teachers and a school principal, launched a program to provide a computer in every classroom. After considerable reading and discussion, the group which had grown to include the P.T.A. Executive Board, two-thirds of the staff of this K-6 elementary school…
Writing Across the Curriculum: Writing Assignments. TWI Resource File.
ERIC Educational Resources Information Center
Rish, Shirley; Lapidus-Saltz, Wendy
1982-01-01
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: If you teach a composition class which is affiliated with a subject-matter course, one of the following assignments may be appropriate for your students. If, on the other hand, you teach a traditionally constituted composition class, you might give your students the entire list with instructions to…
Awareness of Audiences' Needs: A Charade.
ERIC Educational Resources Information Center
Spector, Ann D.
1982-01-01
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: LEVEL: High school and college. AUTHOR'S COMMENT: I used this mini-unit to initiate the class in working effectively as a peer group. Moreover, the task I assigned demands that students develop an awareness of their audience's needs by providing an immediate and concrete response. THE APPROACH: (1)…
Library Skills: What's There and How to Find It. MicroSIFT Courseware Evaluation.
ERIC Educational Resources Information Center
Northwest Regional Educational Lab., Portland, OR.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT (Except for the Evaluation Summary Table): PRODUCER: Micro Power and Light Company, Keystone Park, Suite 1108, 13773 N. Central Expressway, Dallas, TX 75243. LOCAL DISTRIBUTORS: Contact producer for list. EVALUATION COMPLETED: Fall 1981. VERSION: Apple II. COST: $24.95. ABILITY LEVEL: Grades 4+.…
Grammar Problems for Practice: Homonyms. MicroSIFT Courseware Evaluation.
ERIC Educational Resources Information Center
Northwest Regional Educational Lab., Portland, OR.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT (Except for the Evaluation Summary Table): PRODUCER: Milliken Publishing Company, 1100 Research Blvd., St. Louis MO 63132. EVALUATION COMPLETED: June 1982, by staff of the Portland Public Schools, Multnomah ESD, Portland, Oregon. COST: $80 per module; $375 for series of 5 modules. ABILITY LEVEL: 3-9.…
Christmas Program for Elementary School Children.
ERIC Educational Resources Information Center
Taggart, Doris
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: In 1974 Doris Taggart, Public Relations Vice President of Zions First National Bank in Salt Lake City, was serving on the Free Enterprise Committee of the Salt Lake Chamber of Commerce. She developed a plan to involve elementary school children with a large bank by asking the children to make…
Federal Register 2010, 2011, 2012, 2013, 2014
2012-09-17
... FEDERAL COMMUNICATIONS COMMISSION 47 CFR Part 1 [GC Docket No. 10-44; DA 12-1401] Notice of... . SUPPLEMENTARY INFORMATION: This is a synopsis of the Commission's Public Notice, document DA 12-1401, released... Procedure of Serving Parties in an Electronic Format. The full text of DA 12-1401...
10 CFR 2.1013 - Use of the electronic docket during the proceeding.
Code of Federal Regulations, 2012 CFR
2012-01-01
... searchable full text, by header and image, as appropriate. (b) Absent good cause, all exhibits tendered... circumstances where submitters may need to use an image scanned before January 1, 2004, in a document created after January 1, 2004, or the scanning process for a large, one-page image may not successfully complete...
Methods for semi-automated indexing for high precision information retrieval.
Berrios, Daniel C; Cucina, Russell J; Fagan, Lawrence M
2002-01-01
To evaluate a new system, ISAID (Internet-based Semi-automated Indexing of Documents), and to generate textbook indexes that are more detailed and more useful to readers. Pilot evaluation: simple, nonrandomized trial comparing ISAID with manual indexing methods. Methods evaluation: randomized, cross-over trial comparing three versions of ISAID and usability survey. Pilot evaluation: two physicians. Methods evaluation: twelve physicians, each of whom used three different versions of the system for a total of 36 indexing sessions. Total index term tuples generated per document per minute (TPM), with and without adjustment for concordance with other subjects; inter-indexer consistency; ratings of the usability of the ISAID indexing system. Compared with manual methods, ISAID decreased indexing times greatly. Using three versions of ISAID, inter-indexer consistency ranged from 15% to 65% with a mean of 41%, 31%, and 40% for each of three documents. Subjects using the full version of ISAID were faster (average TPM: 5.6) and had higher rates of concordant index generation. There were substantial learning effects, despite our use of a training/run-in phase. Subjects using the full version of ISAID were much faster by the third indexing session (average TPM: 9.1). There was a statistically significant increase in three-subject concordant indexing rate using the full version of ISAID during the second indexing session (p < 0.05). Users of the ISAID indexing system create complex, precise, and accurate indexing for full-text documents much faster than users of manual methods. Furthermore, the natural language processing methods that ISAID uses to suggest indexes contribute substantially to increased indexing speed and accuracy.
Aids to Develop Throwing and Catching Skills.
ERIC Educational Resources Information Center
Schilling, Mary Lou, Ed.
1982-01-01
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: STICKER MITT: (Creative Concepts Unlimited, P.O. Box 176, Elmhurst, IL 60126) A plastic mitt with small suction cups on the palm of the glove so that a plastic ball will easily adhere to it. This ensures a successful experience for children who have never caught a ball!! Approximate cost: $8.95.…
What? A Field Trip on the Playground?
ERIC Educational Resources Information Center
Garbutt, Barb
1983-01-01
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: In this day and age of budget problems, school districts are cutting back on many programs, one of which is field trips. Why worry? There must be dozens of trips that can be made on the playground of your school. Let's look into activities that can be accomplished there. SOIL STUDIES: Have you ever…
"Hi. Your Kid Cut Class Today. At the Tone,..."
ERIC Educational Resources Information Center
Executive Educator, 1983
1983-01-01
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: And you thought you'd tried every trick in the book to cut student absenteeism. You haven't. Now that computers have become an accepted feature in many schools' administrative offices, you might want to check out a new, computerized telephone system that six Chicago schools are using. Each of the…
Are On-Line Data Bases in Your Library's Future?
ERIC Educational Resources Information Center
Deacon, Jim
1983-01-01
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: Today there are over 900 on-line data banks available for public access. Most microcomputers can use them through the aid of a modem and communication program. Major public information utilities that offer access to these on-line data bases are growing and expanding. The Source, a data base utility…
School Fits Three R's into Four Days.
ERIC Educational Resources Information Center
Sun-News (Las Cruces, New Mexico), 1983
1983-01-01
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: The last bell rings at 4 o'clock and kids come tumbling out of classrooms, eager to be free for the weekend. As lockers bang shut and chatter fades out the front door, one teacher sighs, "Thank God it's Thursday." Thursday? For the 250 students and 16 teachers in this southwestern Oregon…
Helping the Visually Impaired Student with Electronic Video Visual Aids.
ERIC Educational Resources Information Center
Visualtek, Inc., Santa Monica, CA.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: Video visual aids are Closed Circuit TV systems (CCTV's) which magnify print and enlarge it electronically upon a screen so partially sighted persons with some residual vision can read and write normal size print. These devices are in use around the world in homes, schools, industries and libraries,…
This Contest Can Give Recognition to Record-Breaking Kids. Front Lines.
ERIC Educational Resources Information Center
Executive Educator, 1983
1983-01-01
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: Your local high school students might never get to see their names up in lights. But with talent, luck, and determination, they might get to see their names in print--as winners in the World Almanac's high school records contest. As a way to recognize and reward teenage achievements (and undoubtedly…
ERIC Educational Resources Information Center
McLean, Ross
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: Last year, the Bruce Trail Association held its first annual Go-To-Blazes Day in which a record number of volunteers gave the 700 kilometres of Trail from Queenston to Tobermory a spring-cleaning. One key section of Trail near Dyer's Bay had been closed for over a year. On this day, over four miles…
NASA Aeroelasticity Handbook Volume 2: Design Guides Part 2
NASA Technical Reports Server (NTRS)
Ramsey, John K. (Editor)
2006-01-01
The NASA Aeroelasticity Handbook comprises a database (in three formats) of NACA and NASA aeroelasticity flutter data through 1998 and a collection of aeroelasticity design guides. The Microsoft Access format provides the capability to search for specific data, retrieve it, and present it in a tabular or graphical form unique to the application. The full-text NACA and NASA documents from which the data originated are provided in portable document format (PDF), and these are hyperlinked to their respective data records. This provides full access to all available information from the data source. Two other electronic formats, one delimited by commas and the other by spaces, are provided for use with other software capable of reading text files. To the best of the author's knowledge, this database represents the most extensive collection of NACA and NASA flutter data in electronic form compiled to date by NASA. Volume 2 of the handbook contains a convenient collection of aeroelastic design guides covering fixed wings, turbomachinery, propellers and rotors, panels, and model scaling. This handbook provides an interactive database and design guides for use in the preliminary aeroelastic design of aerospace systems and can also be used in validating or calibrating flutter-prediction software.
Multimedia proceedings of the 10th Office Information Technology Conference
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hudson, B.
1993-09-10
The CD contains the handouts for all the speakers, demo software from Apple, Adobe, Microsoft, and Zylabs, and video movies of the keynote speakers. Adobe Acrobat is used to provide full-fidelity retrieval of the speakers' slides and Apple's Quicktime for Macintosh and Windows is used for video playback. ZyIndex is included for Windows users to provide a full-text search engine for selected documents. There are separately labelled installation and operating instructions for Macintosh and Windows users and some general materials common to both sets of users.
Rules regarding the health insurance premium tax credit. Final and temporary regulations.
2014-07-28
This document contains final and temporary regulations relating to the health insurance premium tax credit enacted by the Patient Protection and Affordable Care Act and the Health Care and Education Reconciliation Act of 2010, as amended by the Medicare and Medicaid Extenders Act of 2010, the Comprehensive 1099 Taxpayer Protection and Repayment of Exchange Subsidy Overpayments Act of 2011, and the Department of Defense and Full-Year Continuing Appropriations Act of 2011 and the 3% Withholding Repeal and Job Creation Act. These regulations affect individuals who enroll in qualified health plans through Affordable Insurance Exchanges (Exchanges) and claim the premium tax credit, and Exchanges that make qualified health plans available to individuals. The text of the temporary regulations in this document also serves as the text of proposed regulations set forth in a notice of proposed rulemaking (REG-104579-13) on this subject in the Proposed Rules section in this issue of the Federal Register.
The Implementation of Cosine Similarity to Calculate Text Relevance between Two Documents
NASA Astrophysics Data System (ADS)
Gunawan, D.; Sembiring, C. A.; Budiman, M. A.
2018-03-01
The rapidly increasing number of web pages and documents makes topic-specific filtering necessary in order to find web pages or documents efficiently. This preliminary research uses cosine similarity to measure text relevance between documents in order to find topic-specific documents. The research is divided into three parts. The first part is text preprocessing: punctuation is removed, the document is converted to lower case, stop words are removed, and root words are extracted using the Porter stemming algorithm. The second part is keyword weighting, whose output is used by the third part, the text relevance calculation. The text relevance calculation yields a value between 0 and 1; the closer the value is to 1, the more related the two documents are, and vice versa.
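The three-part pipeline described above can be sketched in a few lines. TF-IDF is used here as the keyword-weighting scheme and the stop-word list is a tiny stand-in, so this is a minimal illustration of the approach under those assumptions rather than the authors' implementation.

```python
import math
import re
from collections import Counter

# A tiny illustrative stop-word list; the paper also applies Porter stemming
# at the preprocessing step, which is omitted here for brevity.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are", "for", "on"}


def preprocess(doc):
    """Part one: lowercase, strip punctuation, and drop stop words."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf_vectors(docs):
    """Part two: weight keywords with smoothed TF-IDF (one common choice)."""
    tokenized = [preprocess(d) for d in docs]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1.0) for t in tf})
    return vectors


def cosine_similarity(u, v):
    """Part three: cosine of the angle between two sparse keyword vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = ["Web pages grow rapidly and need topic filtering.",
        "Topic specific filtering finds relevant web documents."]
u, v = tf_idf_vectors(docs)
print(cosine_similarity(u, v))  # between 0 and 1; closer to 1 means more related
```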
"The Play's the Thing"--In Which One Finds Himself and Others.
ERIC Educational Resources Information Center
Corono-Norco Unified School District, Corono, CA.
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: Two-semester program: The working center of the program is the production of two original plays, written by the drama teacher of Raney Junior High and based on ideas or interests researched by teacher and drama students. Four goals direct the writing of each play: (1) to have the original play do…
ERIC Educational Resources Information Center
Smith, Peter, Ed.; Smith, Carol L., Ed.
2005-01-01
These proceedings of the 2005 Association of Small Computer Users in Education (ASCUE) conference present the theme "Campus Technology: Anticipating the Future." The proceedings introduce the ASCUE Officers and Directors and provide abstracts of the pre-conference workshops. The full-text conference papers in this document include: (1) Developing…
ERIC Educational Resources Information Center
Huttemann, Lutz, Ed.; Inganji, Francis K., Ed.
This workshop attended by 21 participants from 13 countries was designed to promote the use of computerized information and documentation services in the eastern and southern African subregion, and increase the exchange of experiences of the personnel involved in the field. The full text is provided for the following papers presented at the…
Going, going, still there: using the WebCite service to permanently archive cited Web pages.
Eysenbach, Gunther
2006-01-01
Scholars are increasingly citing electronic "web references," which are not preserved in libraries or full text archives. WebCite is a new standard for citing web references. "Webciting" a document involves archiving the cited Web page through www.webcitation.org and citing the WebCite permalink instead of (or in addition to) the unstable live Web page.
Coordinating Council. Tenth Meeting: Information retrieval: The role of controlled vocabularies
NASA Technical Reports Server (NTRS)
1993-01-01
The theme of this NASA Scientific and Technical Information Program Coordinating Council meeting was the role of controlled vocabularies (thesauri) in information retrieval. Included are summaries of the presentations and the accompanying visuals. Dr. Raya Fidel addressed 'Retrieval: Free Text, Full Text, and Controlled Vocabularies.' Dr. Bella Hass Weinberg spoke on 'Controlled Vocabularies and Thesaurus Standards.' The presentations were followed by a panel discussion with participation from NASA, the National Library of Medicine, the Defense Technical Information Center, and the Department of Energy; this discussion, however, is not summarized in any detail in this document.
Text extraction method for historical Tibetan document images based on block projections
NASA Astrophysics Data System (ADS)
Duan, Li-juan; Zhang, Xi-qun; Ma, Long-long; Wu, Jian
2017-11-01
Text extraction is an important initial step in digitizing historical documents. In this paper, we present a text extraction method for historical Tibetan document images based on block projections. Text extraction is treated as a text-area detection and location problem. The images are divided into equal blocks, and the blocks are filtered using information about the categories of connected components and corner-point density. By analyzing the filtered blocks' projections, the approximate text areas can be located and the text regions extracted. Experiments on a dataset of historical Tibetan documents demonstrate the effectiveness of the proposed method.
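As a rough illustration of the block-based analysis described above, the NumPy sketch below divides a binarized page into an equal grid and inspects each block's projection profile. The grid size and ink-density thresholds are invented, and the connected-component and corner-point-density filters used in the paper are omitted.

```python
import numpy as np


def text_blocks_by_projection(binary_img: np.ndarray, rows: int = 8, cols: int = 8,
                              min_ink: float = 0.02, max_ink: float = 0.6):
    """Locate candidate text blocks in a binarized page (1 = ink, 0 = background).

    Simplified stand-in for the paper's method: the page is divided into an
    equal grid, and each block's horizontal projection (ink pixels per row) is
    summed into an ink density. Blocks whose density falls in a plausible text
    range are kept as (top, left, height, width) candidates.
    """
    h, w = binary_img.shape
    bh, bw = h // rows, w // cols
    candidates = []
    for r in range(rows):
        for c in range(cols):
            block = binary_img[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
            projection = block.sum(axis=1)        # ink per row within the block
            density = projection.sum() / block.size
            if min_ink <= density <= max_ink:     # too sparse = margin, too dense = border/image
                candidates.append((r * bh, c * bw, bh, bw))
    return candidates
```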
Methods for semi-automated indexing for high precision information retrieval
NASA Technical Reports Server (NTRS)
Berrios, Daniel C.; Cucina, Russell J.; Fagan, Lawrence M.
2002-01-01
OBJECTIVE: To evaluate a new system, ISAID (Internet-based Semi-automated Indexing of Documents), and to generate textbook indexes that are more detailed and more useful to readers. DESIGN: Pilot evaluation: simple, nonrandomized trial comparing ISAID with manual indexing methods. Methods evaluation: randomized, cross-over trial comparing three versions of ISAID and usability survey. PARTICIPANTS: Pilot evaluation: two physicians. Methods evaluation: twelve physicians, each of whom used three different versions of the system for a total of 36 indexing sessions. MEASUREMENTS: Total index term tuples generated per document per minute (TPM), with and without adjustment for concordance with other subjects; inter-indexer consistency; ratings of the usability of the ISAID indexing system. RESULTS: Compared with manual methods, ISAID decreased indexing times greatly. Using three versions of ISAID, inter-indexer consistency ranged from 15% to 65% with a mean of 41%, 31%, and 40% for each of three documents. Subjects using the full version of ISAID were faster (average TPM: 5.6) and had higher rates of concordant index generation. There were substantial learning effects, despite our use of a training/run-in phase. Subjects using the full version of ISAID were much faster by the third indexing session (average TPM: 9.1). There was a statistically significant increase in three-subject concordant indexing rate using the full version of ISAID during the second indexing session (p < 0.05). SUMMARY: Users of the ISAID indexing system create complex, precise, and accurate indexing for full-text documents much faster than users of manual methods. Furthermore, the natural language processing methods that ISAID uses to suggest indexes contribute substantially to increased indexing speed and accuracy.
Methods for Semi-automated Indexing for High Precision Information Retrieval
Berrios, Daniel C.; Cucina, Russell J.; Fagan, Lawrence M.
2002-01-01
Objective. To evaluate a new system, ISAID (Internet-based Semi-automated Indexing of Documents), and to generate textbook indexes that are more detailed and more useful to readers. Design. Pilot evaluation: simple, nonrandomized trial comparing ISAID with manual indexing methods. Methods evaluation: randomized, cross-over trial comparing three versions of ISAID and usability survey. Participants. Pilot evaluation: two physicians. Methods evaluation: twelve physicians, each of whom used three different versions of the system for a total of 36 indexing sessions. Measurements. Total index term tuples generated per document per minute (TPM), with and without adjustment for concordance with other subjects; inter-indexer consistency; ratings of the usability of the ISAID indexing system. Results. Compared with manual methods, ISAID decreased indexing times greatly. Using three versions of ISAID, inter-indexer consistency ranged from 15% to 65% with a mean of 41%, 31%, and 40% for each of three documents. Subjects using the full version of ISAID were faster (average TPM: 5.6) and had higher rates of concordant index generation. There were substantial learning effects, despite our use of a training/run-in phase. Subjects using the full version of ISAID were much faster by the third indexing session (average TPM: 9.1). There was a statistically significant increase in three-subject concordant indexing rate using the full version of ISAID during the second indexing session (p < 0.05). Summary. Users of the ISAID indexing system create complex, precise, and accurate indexing for full-text documents much faster than users of manual methods. Furthermore, the natural language processing methods that ISAID uses to suggest indexes contribute substantially to increased indexing speed and accuracy. PMID:12386114
NASA Astrophysics Data System (ADS)
Hadyan, Fadhlil; Shaufiah; Arif Bijaksana, Moch.
2017-01-01
Automatic summarization helps a reader grasp the core information of a long text quickly by summarizing the text automatically. Many summarization systems have already been developed, but problems remain. This final project proposes a summarization method based on a document index graph. The method adapts the PageRank and HITS formulas, originally used to rank web pages, to score the words in the sentences of a text document. The expected outcome is a system that summarizes a single document by using a document index graph with TextRank and HITS to improve the quality of the automatically generated summary.
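A minimal sketch of the kind of graph-based ranking the abstract adapts: sentences are connected by word overlap and scored with a PageRank-style iteration, and the top-scoring sentences form the summary. The sentence splitting, overlap weighting, and damping factor are illustrative assumptions and do not reproduce the thesis's document index graph or its HITS component.

```python
import re


def textrank_summary(text: str, n_sentences: int = 2,
                     damping: float = 0.85, iters: int = 30) -> str:
    """Rank sentences with a PageRank-style iteration over a word-overlap graph."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(sentences)
    # Edge weight = word overlap between sentence pairs (a simple similarity stand-in).
    weights = [[len(words[i] & words[j]) if i != j else 0 for j in range(n)]
               for i in range(n)]
    scores = [1.0] * n
    for _ in range(iters):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                out_sum = sum(weights[j])
                if weights[j][i] and out_sum:
                    rank += scores[j] * weights[j][i] / out_sum
            new_scores.append((1 - damping) + damping * rank)
        scores = new_scores
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return " ".join(sentences[i] for i in sorted(top))  # keep original order
```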
This School Drug Search Made a Point: We Care Enough To Get Tough with Kids. The Endpaper.
ERIC Educational Resources Information Center
Ryder, Bernard F.
1982-01-01
THE FOLLOWING IS THE FULL TEXT OF THIS DOCUMENT: A parent who notices a gun in his child's room would not hesitate to ask questions and demand answers about its presence. As a school administrator, I believe it is my responsibility to ask questions and take action when I find an equally destructive weapon--drugs--in my schools. The zealous…
ERIC Educational Resources Information Center
Krishnamurthy, Ramesh S.; Mead, Clifford S.
1995-01-01
Presents plan of Oregon State University Libraries to convert all paper documents from the Ava Helen and Linus Pauling archives to digital format. The scope, goals, tasks and objectives set by the project coordinators are outlined, and issues such as protection of equipment, access, copyright and management are discussed. (JKP)
Information services for comparative analysis of biorhythm research
NASA Technical Reports Server (NTRS)
1972-01-01
References and full text documents are presented in support of continuing research and research planning for the NASA behavioral physiology program. Areas covered include: (1) desynchronosis and performance; (2) effects of alcohol, common colds, drugs, and toxic hazards on performance; (3) effects of stress on rhythm of plasma steroids; (4) data processing of biological rhythms; (5) pharmacology and biological rhythms; (6) mechanisms of biological rhythms; and (7) development of biological rhythms.
ERIC Educational Resources Information Center
Horn, Jerry G., Ed.; Havlicek, Barbara, Ed.
This document contains the full text of the conference keynote address and abstracts of conference papers. The keynote address--"Excellence in Rural Education: Our Heritage and Our Future," by Duane M. Nielsen--outlines: (1) demographic changes in rural America in the 1970s and 1980s; (2) the challenges facing rural education due to increasing…
The Angel in the Academy: The Creative Writer as Helpmeet on the Distaff Side of English Studies.
ERIC Educational Resources Information Center
Elliott, Gayle
Women who wish to assume full voice in their writing have no choice but to raise questions regarding their status and the status of creative writing within the academy. Tillie Olsen and Elaine Showalter have documented the bias in texts taught at the university in which women have little place, if at all. The effects are devastating: if the voices…
ERIC Educational Resources Information Center
International Association of Technological Univ. Libraries, Gothenburg (Sweden).
This proceedings of the International Association of Technological University Libraries (IATUL) contains the opening address by IATUL president Nancy Fjallbrant and the full text of the following papers: "Building Info-Skills by Degrees: Embedding Information Literacy in University Study" (Wendy Abbott and Deborah Peach); "UQ…
Utility of social media and crowd-intelligence data for pharmacovigilance: a scoping review.
Tricco, Andrea C; Zarin, Wasifa; Lillie, Erin; Jeblee, Serena; Warren, Rachel; Khan, Paul A; Robson, Reid; Pham, Ba'; Hirst, Graeme; Straus, Sharon E
2018-06-14
A scoping review to characterize the literature on the use of conversations in social media as a potential source of data for detecting adverse events (AEs) related to health products. Our specific research questions were (1) What social media listening platforms exist to detect adverse events related to health products, and what are their capabilities and characteristics? (2) What is the validity and reliability of data from social media for detecting these adverse events? MEDLINE, EMBASE, Cochrane Library, and relevant websites were searched from inception to May 2016. Any type of document (e.g., manuscripts, reports) that described the use of social media data for detecting health product AEs was included. Two reviewers independently screened citations and full-texts, and one reviewer and one verifier performed data abstraction. Descriptive synthesis was conducted. After screening 3631 citations and 321 full-texts, 70 unique documents with 7 companion reports available from 2001 to 2016 were included. Forty-six documents (66%) described an automated or semi-automated information extraction system to detect health product AEs from social media conversations (in the developmental phase). Seven pre-existing information extraction systems to mine social media data were identified in eight documents. Nineteen documents compared AEs reported in social media data with validated data and found consistent AE discovery in all except two documents. None of the documents reported the validity and reliability of the overall system, but some reported on the performance of individual steps in processing the data. The validity and reliability results were found for the following steps in the data processing pipeline: data de-identification (n = 1), concept identification (n = 3), concept normalization (n = 2), and relation extraction (n = 8). The methods varied widely, and some approaches yielded better results than others. Our results suggest that the use of social media conversations for pharmacovigilance is in its infancy. Although social media data has the potential to supplement data from regulatory agency databases; is able to capture less frequently reported AEs; and can identify AEs earlier than official alerts or regulatory changes, the utility and validity of the data source remains under-studied. Open Science Framework ( https://osf.io/kv9hu/ ).
Building a comprehensive syntactic and semantic corpus of Chinese clinical texts.
He, Bin; Dong, Bin; Guan, Yi; Yang, Jinfeng; Jiang, Zhipeng; Yu, Qiubin; Cheng, Jianyi; Qu, Chunyan
2017-05-01
To build a comprehensive corpus covering syntactic and semantic annotations of Chinese clinical texts with corresponding annotation guidelines and methods, as well as to develop tools trained on the annotated corpus, which supplies baselines for research on Chinese texts in the clinical domain. An iterative annotation method was proposed to train annotators and to develop annotation guidelines. Then, by using annotation quality assurance measures, a comprehensive corpus was built, containing annotations of part-of-speech (POS) tags, syntactic tags, entities, assertions, and relations. Inter-annotator agreement (IAA) was calculated to evaluate the annotation quality, and a Chinese clinical text processing and information extraction system (CCTPIES) was developed based on our annotated corpus. The syntactic corpus consists of 138 Chinese clinical documents with 47,426 tokens and 2612 full parsing trees, while the semantic corpus includes 992 documents in which 39,511 entities, with their assertions, and 7693 relations are annotated. IAA evaluation shows that this comprehensive corpus is of good quality, and the system modules are effective. The annotated corpus makes a considerable contribution to natural language processing (NLP) research into Chinese texts in the clinical domain. However, this corpus has a number of limitations. Some additional types of clinical text should be introduced to improve corpus coverage, and active learning methods should be utilized to promote annotation efficiency. In this study, several annotation guidelines and an annotation method for Chinese clinical texts were proposed, and a comprehensive corpus with its NLP modules was constructed, providing a foundation for further study of applying NLP techniques to Chinese texts in the clinical domain. Copyright © 2017. Published by Elsevier Inc.
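As an illustration of the IAA step mentioned above (the abstract does not state which agreement coefficient was used), the following minimal Python sketch computes Cohen's kappa over hypothetical entity labels from two annotators; the labels and tag set are assumptions, not data from the corpus.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    # Cohen's kappa for two annotators labelling the same items.
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement from the annotators' marginal label distributions.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels assigned to the same ten tokens by two annotators.
ann1 = ["DISEASE", "O", "SYMPTOM", "O", "O", "TEST", "O", "DISEASE", "O", "O"]
ann2 = ["DISEASE", "O", "O", "O", "O", "TEST", "O", "DISEASE", "O", "SYMPTOM"]
print(round(cohens_kappa(ann1, ann2), 3))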
GeoDeepDive: Towards a Machine Reading-Ready Digital Library and Information Integration Resource
NASA Astrophysics Data System (ADS)
Husson, J. M.; Peters, S. E.; Livny, M.; Ross, I.
2015-12-01
Recent developments in machine reading and learning approaches to text and data mining hold considerable promise for accelerating the pace and quality of literature-based data synthesis, but these advances have outpaced even basic levels of access to the published literature. For many geoscience domains, particularly those based on physical samples and field-based descriptions, this limitation is significant. Here we describe a general infrastructure to support published literature-based machine reading and learning approaches to information integration and knowledge base creation. This infrastructure supports rate-controlled automated fetching of original documents, along with full bibliographic citation metadata, from remote servers, the secure storage of original documents, and the utilization of considerable high-throughput computing resources for the pre-processing of these documents by optical character recognition, natural language parsing, and other document annotation and parsing software tools. New tools and versions of existing tools can be automatically deployed against original documents when they are made available. The products of these tools (text/XML files) are managed by MongoDB and are available for use in data extraction applications. Basic search and discovery functionality is provided by ElasticSearch, which is used to identify documents of potential relevance to a given data extraction task. Relevant files derived from the original documents are then combined into basic starting points for application building; these starting points are kept up-to-date as new relevant documents are incorporated into the digital library. Currently, our digital library contains more than 360K documents supplied by Elsevier and the USGS, and we are actively seeking additional content providers. By focusing on building a dependable infrastructure to support the retrieval, storage, and pre-processing of published content, we are establishing a foundation for complex, and continually improving, information integration and data extraction applications. We have developed one such application, which we present as an example, and invite new collaborations to develop other such applications.
Spatial Paradigm for Information Retrieval and Exploration
DOE Office of Scientific and Technical Information (OSTI.GOV)
The SPIRE system consists of software for visual analysis of primarily text-based information sources. This technology enables the content analysis of text documents without reading all the documents. It employs several algorithms for text and word proximity analysis and identifies the key themes within the text documents. From this analysis, it projects the results onto a visual spatial proximity display (Galaxies or Themescape), where items (documents and/or themes) that appear visually close to each other are known to have similar content. Innovative interaction techniques then allow for dynamic visual analysis of large text-based information spaces.
SPIRE1.03. Spatial Paradigm for Information Retrieval and Exploration
DOE Office of Scientific and Technical Information (OSTI.GOV)
Adams, K.J.; Bohn, S.; Crow, V.
The SPIRE system consists of software for visual analysis of primarily text-based information sources. This technology enables the content analysis of text documents without reading all the documents. It employs several algorithms for text and word proximity analysis and identifies the key themes within the text documents. From this analysis, it projects the results onto a visual spatial proximity display (Galaxies or Themescape), where items (documents and/or themes) that appear visually close to each other are known to have similar content. Innovative interaction techniques then allow for dynamic visual analysis of large text-based information spaces.
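The specific proximity algorithms behind the Galaxies and Themescape displays are not described in these records, so the sketch below only illustrates the general idea in Python: documents close in term space are placed close on a 2-D layout, with TF-IDF and a truncated SVD standing in for SPIRE's own analysis. The example documents are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "visual analysis of text documents",              # hypothetical documents
    "spatial proximity display of document themes",
    "reactor neutron flux measurements",
]
tfidf = TfidfVectorizer().fit_transform(docs)         # documents in term space
coords = TruncatedSVD(n_components=2).fit_transform(tfidf)
for doc, (x, y) in zip(docs, coords):
    print(f"({x:+.2f}, {y:+.2f})  {doc}")             # nearby points = similar content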
Text Mining in Biomedical Domain with Emphasis on Document Clustering.
Renganathan, Vinaitheerthan
2017-07-01
With the exponential increase in the number of articles published every year in the biomedical domain, there is a need to build automated systems to extract unknown information from the articles published. Text mining techniques enable the extraction of unknown knowledge from unstructured documents. This paper reviews text mining processes in detail and the software tools available to carry out text mining. It also reviews the roles and applications of text mining in the biomedical domain. Text mining processes, such as search and retrieval of documents, pre-processing of documents, natural language processing, methods for text clustering, and methods for text classification are described in detail. Text mining techniques can facilitate the mining of vast amounts of knowledge on a given topic from published biomedical research articles and draw meaningful conclusions that are not possible otherwise.
Feature extraction for document text using Latent Dirichlet Allocation
NASA Astrophysics Data System (ADS)
Prihatini, P. M.; Suryawan, I. K.; Mandia, IN
2018-01-01
Feature extraction is one of the stages in an information retrieval system and is used to extract the distinctive feature values of a text document. Feature extraction can be done by several methods, one of which is Latent Dirichlet Allocation. However, research on text feature extraction using the Latent Dirichlet Allocation method is rarely found for Indonesian text. Therefore, through this research, text feature extraction is implemented for Indonesian text. The research method consists of data acquisition, text pre-processing, initialization, topic sampling and evaluation. The evaluation is done by comparing Precision, Recall and F-Measure values between Latent Dirichlet Allocation and the Term Frequency-Inverse Document Frequency with KMeans approach commonly used for feature extraction. The evaluation results show that the Precision, Recall and F-Measure values of the Latent Dirichlet Allocation method are higher than those of the Term Frequency-Inverse Document Frequency with KMeans method. This shows that the Latent Dirichlet Allocation method is able to extract features and cluster Indonesian text better than the Term Frequency-Inverse Document Frequency with KMeans method.
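A hedged sketch of the comparison described above, using scikit-learn rather than the authors' implementation: per-document topic proportions from Latent Dirichlet Allocation as features versus a TF-IDF plus KMeans baseline. The toy documents and all parameters are assumptions; nothing here reproduces the paper's pre-processing or evaluation.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

docs = ["harga pasar naik", "pasar saham turun",
        "tim sepak bola menang", "pertandingan bola berakhir imbang"]

# LDA features: each document becomes a distribution over topics.
counts = CountVectorizer().fit_transform(docs)
lda_features = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

# Baseline: TF-IDF vectors clustered directly with KMeans.
tfidf = TfidfVectorizer().fit_transform(docs)
baseline_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)

print(lda_features.round(2))
print(baseline_labels)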
Script-independent text line segmentation in freestyle handwritten documents.
Li, Yi; Zheng, Yefeng; Doermann, David; Jaeger, Stefan; Li, Yi
2008-08-01
Text line segmentation in freestyle handwritten documents remains an open document analysis problem. Curvilinear text lines and small gaps between neighboring text lines present a challenge to algorithms developed for machine printed or hand-printed documents. In this paper, we propose a novel approach based on density estimation and a state-of-the-art image segmentation technique, the level set method. From an input document image, we estimate a probability map, where each element represents the probability that the underlying pixel belongs to a text line. The level set method is then exploited to determine the boundary of neighboring text lines by evolving an initial estimate. Unlike connected component based methods ( [1], [2] for example), the proposed algorithm does not use any script-specific knowledge. Extensive quantitative experiments on freestyle handwritten documents with diverse scripts, such as Arabic, Chinese, Korean, and Hindi, demonstrate that our algorithm consistently outperforms previous methods [1]-[3]. Further experiments show the proposed algorithm is robust to scale change, rotation, and noise.
Document segmentation for high-quality printing
NASA Astrophysics Data System (ADS)
Ancin, Hakan
1997-04-01
A technique to segment dark text on the light background of mixed-mode color documents is presented. This process does not perceptually change graphics and photo regions. Color documents are scanned and printed from various media which usually do not have a clean background. This is especially the case for printouts generated from thin magazine samples; such printouts usually include text and figures from the back of the page, an artifact called bleeding. Removal of bleeding artifacts improves the perceptual quality of the printed document and reduces color ink usage. By detecting the light background of the document, these artifacts are removed from background regions. Detection of dark text regions also enables the halftoning algorithms to use true black ink for the black text pixels instead of composite black. The processed document contains sharp black text on a white background, resulting in improved perceptual quality and better ink utilization. The described method is memory efficient and requires a small number of scan lines of high-resolution color documents during processing.
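The following is only a minimal illustration of the basic idea of keeping dark text pixels while discarding a light background and faint bleed-through, not the paper's color segmentation algorithm; the synthetic grayscale page and the threshold value are assumptions.
import numpy as np

page = np.full((60, 80), 230, dtype=np.uint8)   # light page background
page[10:14, 5:60] = 20                          # a dark text line
page[40:44, 5:60] = 180                         # faint bleed-through from the back page

background_level = np.median(page)              # estimate the light background level
text_mask = page < background_level - 100       # keep only clearly dark pixels
cleaned = np.where(text_mask, 0, 255).astype(np.uint8)  # sharp black text on white
print(int(text_mask.sum()), "pixels kept as text")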
Challenges and methodology for indexing the computerized patient record.
Ehrler, Frédéric; Ruch, Patrick; Geissbuhler, Antoine; Lovis, Christian
2007-01-01
Patient records contain the most crucial documents for managing the treatment and healthcare of patients in the hospital. Retrieving information from these records in an easy, quick and safe way helps care providers save time and find important facts about their patients' health. This paper presents the scalability issues induced by the indexing and retrieval of the information contained in patient records. For this study, EasyIR, an information retrieval tool performing full-text queries and retrieving the related documents, has been used. An evaluation of the performance reveals that the indexing process suffers from overhead as a consequence of the particular structure of the patient records. Most IR tools are designed to manage very large numbers of documents in a single index, whereas in our setting one index per record, which usually implies few documents, has been imposed. As the number of modifications and creations of patient records in a day is significant, a specialized and efficient indexing tool is required.
Typograph: Multiscale Spatial Exploration of Text Documents
DOE Office of Scientific and Technical Information (OSTI.GOV)
Endert, Alexander; Burtner, Edwin R.; Cramer, Nicholas O.
2013-10-06
Visualizing large document collections using a spatial layout of terms can enable quick overviews of information. These visual metaphors (e.g., word clouds, tag clouds, etc.) traditionally show a series of terms organized by space-filling algorithms. However, often lacking in these views is the ability to interactively explore the information to gain more detail, and the location and rendering of the terms are often not based on mathematical models that maintain relative distances from other information based on similarity metrics. In this paper, we present Typograph, a multi-scale spatial exploration visualization for large document collections. Building on term-based visualization methods, Typograph enables multiple levels of detail (terms, phrases, snippets, and full documents) within a single spatialization. Further, the information is placed based on its relative similarity to other information to create the “near = similar” geographic metaphor. This paper discusses the design principles and functionality of Typograph and presents a use case analyzing Wikipedia to demonstrate usage.
Computation of term dominance in text documents
Bauer, Travis L [Albuquerque, NM; Benz, Zachary O [Albuquerque, NM; Verzi, Stephen J [Albuquerque, NM
2012-04-24
An improved entropy-based term dominance metric is useful for characterizing a corpus of text documents and for comparing the term dominance metrics of a first corpus of documents to those of a second corpus having a different number of documents.
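The record does not give the metric's formula, so the sketch below is only one plausible entropy-based notion of term dominance: a term concentrated in few documents of a corpus (low normalized entropy) is treated as more dominant than one spread evenly across the corpus. The example corpus is hypothetical.
import math

def term_dominance(term, corpus):
    counts = [doc.lower().split().count(term) for doc in corpus]
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log(p, 2) for p in probs)
    max_entropy = math.log(len(corpus), 2)      # assumes the corpus has more than one document
    return 1.0 - entropy / max_entropy          # 1 = concentrated, 0 = evenly spread

corpus = ["reactor core flux", "reactor shielding", "neutron flux in the reactor"]
print(round(term_dominance("reactor", corpus), 3))    # spread evenly -> low dominance
print(round(term_dominance("shielding", corpus), 3))  # concentrated -> high dominance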
Document of standardization of enteral nutrition access in adults.
Arribas, Lorena; Frías, Laura; Creus, Gloria; Parejo, Juana; Urzola, Carmen; Ashbaugh, Rosana; Pérez-Portabella, Cleofé; Cuerda, Cristina
2014-07-01
The standardization and protocols group of the Spanish Society of Parenteral and Enteral Nutrition (SENPE) published in 2011 a SENPE/SEGHNP/ANECIPN/SECP consensus document on enteral access for paediatric nutritional support. Along the lines of this document, we have developed another document on adult patients to homogenize clinical practice and improve the quality of care in enteral access in this age group. The working group included health professionals (nurses, dietitians and a doctor) with extensive experience in enteral nutrition and access. We sought scientific evidence through a literature review and used the criteria of the Agency for Healthcare Research and Quality (AHRQ) to classify the evidence (Grade of Recommendation A, B or C). The document was later reviewed by experts external to the group, and the endorsement of the Scientific and Educational Committee (CCE) and the home artificial nutrition group (NADYA) of the SENPE was requested. The full text will be published as a monograph number in this journal. Copyright AULA MEDICA EDICIONES 2014. Published by AULA MEDICA. All rights reserved.
Duplicate document detection in DocBrowse
NASA Astrophysics Data System (ADS)
Chalana, Vikram; Bruce, Andrew G.; Nguyen, Thien
1998-04-01
Duplicate documents are frequently found in large databases of digital documents, such as those found in digital libraries or in the government declassification effort. Efficient duplicate document detection is important not only to allow querying for similar documents, but also to filter out redundant information in large document databases. We have designed three different algorithms to identify duplicate documents. The first algorithm is based on features extracted from the textual content of a document, the second algorithm is based on wavelet features extracted from the document image itself, and the third algorithm is a combination of the first two. These algorithms are integrated within the DocBrowse system for information retrieval from document images, which is currently under development at MathSoft. DocBrowse supports duplicate document detection by allowing (1) automatic filtering to hide duplicate documents, and (2) ad hoc querying for similar or duplicate documents. We have tested the duplicate document detection algorithms on 171 documents and found that the text-based method has an average 11-point precision of 97.7 percent while the image-based method has an average 11-point precision of 98.9 percent. However, in general, the text-based method performs better when the document contains enough high-quality machine-printed text, while the image-based method performs better when the document contains little or no quality machine-readable text.
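The exact features used by DocBrowse are not given in this record; the sketch below only illustrates a generic text-content duplicate check in the same spirit as the text-based algorithm, comparing word 3-gram shingles with Jaccard similarity over hypothetical documents.
def shingles(text, n=3):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

doc1 = "the quarterly report summarizes declassified project results in detail"
doc2 = "the quarterly report summarizes declassified project results briefly"
doc3 = "wavelet features are extracted from the scanned document image"

print(round(jaccard(shingles(doc1), shingles(doc2)), 2))  # near-duplicates -> high
print(round(jaccard(shingles(doc1), shingles(doc3)), 2))  # unrelated -> low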
Text Mining in Biomedical Domain with Emphasis on Document Clustering
2017-01-01
Objectives With the exponential increase in the number of articles published every year in the biomedical domain, there is a need to build automated systems to extract unknown information from the articles published. Text mining techniques enable the extraction of unknown knowledge from unstructured documents. Methods This paper reviews text mining processes in detail and the software tools available to carry out text mining. It also reviews the roles and applications of text mining in the biomedical domain. Results Text mining processes, such as search and retrieval of documents, pre-processing of documents, natural language processing, methods for text clustering, and methods for text classification are described in detail. Conclusions Text mining techniques can facilitate the mining of vast amounts of knowledge on a given topic from published biomedical research articles and draw meaningful conclusions that are not possible otherwise. PMID:28875048
BioRAT: extracting biological information from full-length papers.
Corney, David P A; Buxton, Bernard F; Langdon, William B; Jones, David T
2004-11-22
Converting the vast quantity of free-format text found in journals into a concise, structured format makes the researcher's quest for information easier. Recently, several information extraction systems have been developed that attempt to simplify the retrieval and analysis of biological and medical data. Most of this work has used the abstract alone, owing to the convenience of access and the quality of data. Abstracts are generally available through central collections with easy direct access (e.g. PubMed). The full-text papers contain more information, but are distributed across many locations (e.g. publishers' web sites, journal web sites and local repositories), making access more difficult. In this paper, we present BioRAT, a new information extraction (IE) tool, specifically designed to perform biomedical IE, and which is able to locate and analyse both abstracts and full-length papers. BioRAT is a Biological Research Assistant for Text mining, and incorporates a document search ability with domain-specific IE. We show first, that BioRAT performs as well as existing systems, when applied to abstracts; and second, that significantly more information is available to BioRAT through the full-length papers than via the abstracts alone. Typically, less than half of the available information is extracted from the abstract, with the majority coming from the body of each paper. Overall, BioRAT recalled 20.31% of the target facts from the abstracts with 55.07% precision, and achieved 43.6% recall with 51.25% precision on full-length papers.
Investigation into Text Classification With Kernel Based Schemes
2010-03-01
Abbreviations recoverable from the report's front matter: TDM, term-document matrix; TMG, Text to Matrix Generator; TN, true negative; TP, true positive; VSM, vector space model. The surviving chapter fragments indicate that documents are represented as a term-document matrix, that common evaluation metrics are applied, and that indexing is performed with the Text to Matrix Generator (TMG) toolbox.
GeneView: a comprehensive semantic search engine for PubMed.
Thomas, Philippe; Starlinger, Johannes; Vowinkel, Alexander; Arzt, Sebastian; Leser, Ulf
2012-07-01
Research results are primarily published in scientific literature and curation efforts cannot keep up with the rapid growth of published literature. The plethora of knowledge remains hidden in large text repositories like MEDLINE. Consequently, life scientists have to spend a great amount of time searching for specific information. The enormous ambiguity among most names of biomedical objects such as genes, chemicals and diseases often produces too large and unspecific search results. We present GeneView, a semantic search engine for biomedical knowledge. GeneView is built upon a comprehensively annotated version of PubMed abstracts and openly available PubMed Central full texts. This semi-structured representation of biomedical texts enables a number of features extending classical search engines. For instance, users may search for entities using unique database identifiers or they may rank documents by the number of specific mentions they contain. Annotation is performed by a multitude of state-of-the-art text-mining tools for recognizing mentions from 10 entity classes and for identifying protein-protein interactions. GeneView currently contains annotations for >194 million entities from 10 classes for ∼21 million citations with 271,000 full text bodies. GeneView can be searched at http://bc3.informatik.hu-berlin.de/.
Desktop document delivery using portable document format (PDF) files and the Web.
Shipman, J P; Gembala, W L; Reeder, J M; Zick, B A; Rainwater, M J
1998-01-01
Desktop access to electronic full-text literature was rated one of the most desirable services in a client survey conducted by the University of Washington Libraries. The University of Washington Health Sciences Libraries (UW HSL) conducted a ten-month pilot test from August 1996 to May 1997 to determine the feasibility of delivering electronic journal articles via the Internet to remote faculty. Articles were scanned into Adobe Acrobat Portable Document Format (PDF) files and delivered to individuals using Multipurpose Internet Mail Extensions (MIME) standard e-mail attachments and the Web. Participants retrieved scanned articles and used the Adobe Acrobat Reader software to view and print files. The pilot test required a special programming effort to automate the client notification and file deletion processes. Test participants were satisfied with the pilot test despite some technical difficulties. Desktop delivery is now offered as a routine delivery method from the UW HSL. PMID:9681165
Desktop document delivery using portable document format (PDF) files and the Web.
Shipman, J P; Gembala, W L; Reeder, J M; Zick, B A; Rainwater, M J
1998-07-01
Desktop access to electronic full-text literature was rated one of the most desirable services in a client survey conducted by the University of Washington Libraries. The University of Washington Health Sciences Libraries (UW HSL) conducted a ten-month pilot test from August 1996 to May 1997 to determine the feasibility of delivering electronic journal articles via the Internet to remote faculty. Articles were scanned into Adobe Acrobat Portable Document Format (PDF) files and delivered to individuals using Multipurpose Internet Mail Extensions (MIME) standard e-mail attachments and the Web. Participants retrieved scanned articles and used the Adobe Acrobat Reader software to view and print files. The pilot test required a special programming effort to automate the client notification and file deletion processes. Test participants were satisfied with the pilot test despite some technical difficulties. Desktop delivery is now offered as a routine delivery method from the UW HSL.
Astronomical Software Directory Service
NASA Astrophysics Data System (ADS)
Hanisch, Robert J.; Payne, Harry; Hayes, Jeffrey
1997-01-01
With the support of NASA's Astrophysics Data Program (NRA 92-OSSA-15), we have developed the Astronomical Software Directory Service (ASDS): a distributed, searchable, WWW-based database of software packages and their related documentation. ASDS provides integrated access to 56 astronomical software packages, with more than 16,000 URLs indexed for full-text searching. Users are performing about 400 searches per month. A new aspect of our service is the inclusion of telescope and instrumentation manuals, which prompted us to change the name to the Astronomical Software and Documentation Service. ASDS was originally conceived to serve two purposes: to provide a useful Internet service in an area of expertise of the investigators (astronomical software), and as a research project to investigate various architectures for searching through a set of documents distributed across the Internet. Two of the co-investigators were then installing and maintaining astronomical software as their primary job responsibility. We felt that a service which incorporated our experience in this area would be more useful than a straightforward listing of software packages. The original concept was for a service based on the client/server model, which would function as a directory/referral service rather than as an archive. For performing the searches, we began our investigation with a decision to evaluate the Isite software from the Center for Networked Information Discovery and Retrieval (CNIDR). This software was intended as a replacement for Wide-Area Information Service (WAIS), a client/server technology for performing full-text searches through a set of documents. Isite had some additional features that we considered attractive, and we enjoyed the cooperation of the Isite developers, who were happy to have ASDS as a demonstration project. We ended up staying with the software throughout the project, making modifications to take advantage of new features as they came along, as well as influencing the software development. The Web interface to the search engine is provided by a gateway program written in C++ by a consultant to the project (A. Warnock).
Representation-based user interfaces for the audiovisual library of the year 2000
NASA Astrophysics Data System (ADS)
Aigrain, Philippe; Joly, Philippe; Lepain, Philippe; Longueville, Veronique
1995-03-01
The audiovisual library of the future will be based on computerized access to digitized documents. In this communication, we address the user interface issues which will arise from this new situation. One cannot simply transfer a user interface designed for the piece by piece production of some audiovisual presentation and make it a tool for accessing full-length movies in an electronic library. One cannot take a digital sound editing tool and propose it as a means to listen to a musical recording. In our opinion, when computers are used as mediations to existing contents, document representation-based user interfaces are needed. With such user interfaces, a structured visual representation of the document contents is presented to the user, who can then manipulate it to control perception and analysis of these contents. In order to build such manipulable visual representations of audiovisual documents, one needs to automatically extract structural information from the documents contents. In this communication, we describe possible visual interfaces for various temporal media, and we propose methods for the economically feasible large scale processing of documents. The work presented is sponsored by the Bibliotheque Nationale de France: it is part of the program aiming at developing for image and sound documents an experimental counterpart to the digitized text reading workstation of this library.
Development of an information retrieval tool for biomedical patents.
Alves, Tiago; Rodrigues, Rúben; Costa, Hugo; Rocha, Miguel
2018-06-01
The volume of biomedical literature has been increasing in recent years. Patent documents have followed this trend, being important sources of biomedical knowledge, technical details and curated data, which are put together during the granting process. The field of biomedical text mining (BioTM) has been creating solutions for the problems posed by the unstructured nature of natural language, which makes the search for information a challenging task. Several BioTM techniques can be applied to patents. Among them, Information Retrieval (IR) includes processes where relevant data are obtained from collections of documents. In this work, the main goal was to build a patent pipeline addressing IR tasks over patent repositories to make these documents amenable to BioTM tasks. The pipeline was developed within @Note2, an open-source computational framework for BioTM, adding a number of modules to the core libraries, including patent metadata and full-text retrieval, PDF-to-text conversion and optical character recognition. User interfaces were also developed for the main operations, materialized in a new @Note2 plug-in. The integration of these tools in @Note2 opens opportunities to run BioTM tools over patent texts, including tasks from Information Extraction, such as Named Entity Recognition or Relation Extraction. We demonstrated the pipeline's main functions with a case study, using an available benchmark dataset from the BioCreative challenges. We also show the use of the plug-in with a user query related to the production of vanillin. This work makes all the relevant content from patents available to the scientific community, drastically decreasing the time required for this task, and provides graphical interfaces to ease the use of these tools. Copyright © 2018 Elsevier B.V. All rights reserved.
Techniques of Document Management: A Review of Text Retrieval and Related Technologies.
ERIC Educational Resources Information Center
Veal, D. C.
2001-01-01
Reviews present and possible future developments in the techniques of electronic document management, the major ones being text retrieval and scanning and OCR (optical character recognition). Also addresses document acquisition, indexing and thesauri, publishing and dissemination standards, impact of the Internet, and the document management…
Benchmarking of neutron production of heavy-ion transport codes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Remec, I.; Ronningen, R. M.; Heilbronn, L.
Document available in abstract form only, full text of document follows: Accurate prediction of radiation fields generated by heavy ion interactions is important in medical applications, space missions, and in design and operation of rare isotope research facilities. In recent years, several well-established computer codes in widespread use for particle and radiation transport calculations have been equipped with the capability to simulate heavy ion transport and interactions. To assess and validate these capabilities, we performed simulations of a series of benchmark-quality heavy ion experiments with the computer codes FLUKA, MARS15, MCNPX, and PHITS. We focus on the comparisons of secondary neutron production. Results are encouraging; however, further improvements in models and codes and additional benchmarking are required. (authors)
Translations from KOMMUNIST No. 12, August 1978
1978-10-23
of respect, full of ideal thrusts; on the other, the scientist whose "utopia" of a classless society has been "buried in the catacombs of history...wishes to be democratic" (M. Duverger, "Lettre ouverte aux socialistes" [Open Letter to the Socialists], Paris, 1976, p 54). Naturally, the bour...83-87 [Article by V. Sedykh, Paris-Moscow, August 1978] [Text] Documents which inform us of new details of Lev Nikolayevich Tolstoy's life and
Spacecraft Fire Safety 1956 to 1999: An Annotated Bibliography
NASA Technical Reports Server (NTRS)
Friedman, Robert; Ruff, Gary A.
2013-01-01
Knowledge of fire safety in spacecraft has resulted from over 50 years of investigation and experience in space flight. Current practices and procedures for the operation of the Space Transportation System (STS) shuttle and the International Space Station (ISS) have been developed from this expertise, much of which has been documented in various reports. Extending manned space exploration from low Earth orbit to lunar or Martian habitats and beyond will require continued research in microgravity combustion and fire protection in low gravity. This descriptive bibliography has been produced to document and summarize significant work in the area of spacecraft fire safety that was published between 1956 and July 1999. Although some important work published in the late 1990s may be missing, these citations as well as work since 2000 can generally be found in Web-based resources that are easily accessed and searched. In addition to the citation, each reference includes a short description of the contents and conclusions of the article. The bibliography contains over 800 citations that are cross-referenced both by topic and the authors and editors. There is a DVD that accompanies this bibliography (available by request from the Center for Aerospace Information) containing the full-text articles of selected citations as well as an electronic version of this report that has these citations as active links to their corresponding full-text article.
Ambiguity and variability of database and software names in bioinformatics.
Duck, Geraint; Kovacevic, Aleksandar; Robertson, David L; Stevens, Robert; Nenadic, Goran
2015-01-01
There are numerous options available to achieve various tasks in bioinformatics, but until recently, there were no tools that could systematically identify mentions of databases and tools within the literature. In this paper we explore the variability and ambiguity of database and software name mentions and compare dictionary and machine learning approaches to their identification. Through the development and analysis of a corpus of 60 full-text documents manually annotated at the mention level, we report high variability and ambiguity in database and software mentions. On a test set of 25 full-text documents, a baseline dictionary look-up achieved an F-score of 46 %, highlighting not only variability and ambiguity but also the extensive number of new resources introduced. A machine learning approach achieved an F-score of 63 % (with precision of 74 %) and 70 % (with precision of 83 %) for strict and lenient matching respectively. We characterise the issues with various mention types and propose potential ways of capturing additional database and software mentions in the literature. Our analyses show that identification of mentions of databases and tools is a challenging task that cannot be achieved by relying on current manually-curated resource repositories. Although machine learning shows improvement and promise (primarily in precision), more contextual information needs to be taken into account to achieve a good degree of accuracy.
Detection of text strings from mixed text/graphics images
NASA Astrophysics Data System (ADS)
Tsai, Chien-Hua; Papachristou, Christos A.
2000-12-01
A robust system for text string separation from mixed text/graphics images is presented. Based on a union-find (region growing) strategy, the algorithm is able to separate text from graphics and adapts to changes in document type, language category (e.g., English, Chinese and Japanese), text font style and size, and text string orientation within digital images. In addition, it tolerates the document skew that usually occurs in documents, without skew correction prior to discrimination, whereas methods such as projection profiles or run-length coding are not always suitable under this condition. The method has been tested with a variety of printed documents from different origins using one common set of parameters, and experimental results on several test images from the evaluation demonstrate the algorithm's performance and computational efficiency.
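The paper's grouping criteria are not reproduced here; the sketch below only illustrates the union-find (region growing) building block it names, merging hypothetical character-box centres whose distance falls under an assumed threshold so that each resulting group can later be tested as a candidate text string.
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Hypothetical character-box centres (x, y); merge those closer than a threshold.
centres = [(10, 10), (18, 11), (26, 10), (80, 60)]
uf = UnionFind(len(centres))
for i in range(len(centres)):
    for j in range(i + 1, len(centres)):
        dx, dy = centres[i][0] - centres[j][0], centres[i][1] - centres[j][1]
        if dx * dx + dy * dy <= 15 * 15:
            uf.union(i, j)
print([uf.find(i) for i in range(len(centres))])  # the first three share a root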
Layout-aware text extraction from full-text PDF of scientific articles.
Ramakrishnan, Cartic; Patnia, Abhishek; Hovy, Eduard; Burns, Gully Apc
2012-05-28
The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the 'Layout-Aware PDF Text Extraction' (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications. Our paper describes the construction and performance of an open source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) detecting contiguous text blocks using spatial layout processing to locate and identify blocks of contiguous text, (2) classifying text blocks into rhetorical categories using a rule-based method and (3) stitching classified text blocks together in the correct order, resulting in the extraction of text from section-wise grouped blocks. We show that our system can identify text blocks and classify them into rhetorical categories with precision = 0.96, recall = 0.89 and F1 = 0.91. We also present an evaluation of the accuracy of the block detection algorithm used in step 2. Additionally, we have compared the accuracy of the text extracted by LA-PDFText to the text from the Open Access subset of PubMed Central. We then compared this accuracy with that of the text extracted by the PDF2Text system, commonly used to extract text from PDF. Finally, we discuss preliminary error analysis for our system and identify further areas of improvement. LA-PDFText is an open-source tool for accurately extracting text from full-text scientific articles. The release of the system is available at http://code.google.com/p/lapdftext/.
Layout-aware text extraction from full-text PDF of scientific articles
2012-01-01
Background The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the ‘Layout-Aware PDF Text Extraction’ (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications. Results Our paper describes the construction and performance of an open source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) detecting contiguous text blocks using spatial layout processing to locate and identify blocks of contiguous text, (2) classifying text blocks into rhetorical categories using a rule-based method and (3) stitching classified text blocks together in the correct order, resulting in the extraction of text from section-wise grouped blocks. We show that our system can identify text blocks and classify them into rhetorical categories with precision = 0.96, recall = 0.89 and F1 = 0.91. We also present an evaluation of the accuracy of the block detection algorithm used in step 2. Additionally, we have compared the accuracy of the text extracted by LA-PDFText to the text from the Open Access subset of PubMed Central. We then compared this accuracy with that of the text extracted by the PDF2Text system, commonly used to extract text from PDF. Finally, we discuss preliminary error analysis for our system and identify further areas of improvement. Conclusions LA-PDFText is an open-source tool for accurately extracting text from full-text scientific articles. The release of the system is available at http://code.google.com/p/lapdftext/. PMID:22640904
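For comparison with the layout-aware system described above, the snippet below is only a plain extraction baseline in the spirit of PDF2Text, using the high-level API of pdfminer.six; it performs none of the block detection or rhetorical classification from the paper, and the file name is hypothetical.
from pdfminer.high_level import extract_text

raw_text = extract_text("article.pdf")          # whole-document text, layout largely ignored
blocks = [b.strip() for b in raw_text.split("\n\n") if b.strip()]
print(len(blocks), "rough text blocks extracted")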
Graph-based layout analysis for PDF documents
NASA Astrophysics Data System (ADS)
Xu, Canhui; Tang, Zhi; Tao, Xin; Li, Yun; Shi, Cao
2013-03-01
To increase the flexibility and enrich the reading experience of e-books on small portable screens, a graph-based method is proposed to perform layout analysis on Portable Document Format (PDF) documents. Digital-born documents have inherent advantages, such as representing text and fractional images in explicit form, which can be straightforwardly exploited. To integrate traditional image-based document analysis with the inherent metadata provided by a PDF parser, the page primitives including text, image and path elements are processed to produce a text layer and a non-text layer for respective analysis. The graph-based method is developed at the superpixel representation level, and page text elements corresponding to vertices are used to construct an undirected graph. Euclidean distance between adjacent vertices is applied in a top-down manner to cut the graph tree formed by Kruskal's algorithm, and edge orientation is then used in a bottom-up manner to extract text lines from each subtree. On the other hand, non-textual objects are segmented by connected component analysis. For each segmented text and non-text composite, a 13-dimensional feature vector is extracted for labelling purposes. The experimental results on selected pages from PDF books are presented.
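A hedged sketch of the top-down step described above: text elements become graph vertices, a minimum spanning tree is built over Euclidean distances (Kruskal-style), and long edges are cut so that each remaining component approximates a block or line. The coordinates and the threshold are hypothetical, and the bottom-up orientation step and the 13-dimensional features are omitted.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

# (x, y) centres of text elements from a hypothetical PDF page.
points = np.array([[10, 10], [22, 10], [34, 11],   # one text line
                   [10, 60], [23, 61]])            # another line, far below

dist = squareform(pdist(points))                   # pairwise Euclidean distances
mst = minimum_spanning_tree(dist).toarray()
mst[mst > 20] = 0                                  # cut edges longer than the threshold
n_groups, labels = connected_components(mst, directed=False)
print(n_groups, labels)                            # expect two groups: [0 0 0 1 1]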
Telemetry Standards, RCC Standard 106-17, Annex A.1, Pulse Amplitude Modulation Standards
2017-07-01
The waveform shall conform to one of the two figures referenced in the standard. The first figure shows 50 percent duty cycle PAM with amplitude synchronization. A 20-25 percent deviation reserved for pulse synchronization is recommended. (Running header: Telemetry Standards, RCC Standard 106-17, Annex A.1, July 2017.)
Full Text and Figure Display Improves Bioscience Literature Search
Divoli, Anna; Wooldridge, Michael A.; Hearst, Marti A.
2010-01-01
When reading bioscience journal articles, many researchers focus attention on the figures and their captions. This observation led to the development of the BioText literature search engine [1], a freely available Web-based application that allows biologists to search over the contents of Open Access Journals, and see figures from the articles displayed directly in the search results. This article presents a qualitative assessment of this system in the form of a usability study with 20 biologist participants using and commenting on the system. 19 out of 20 participants expressed a desire to use a bioscience literature search engine that displays articles' figures alongside the full text search results. 15 out of 20 participants said they would use a caption search and figure display interface either frequently or sometimes, while 4 said rarely and 1 said undecided. 10 out of 20 participants said they would use a tool for searching the text of tables and their captions either frequently or sometimes, while 7 said they would use it rarely if at all, 2 said they would never use it, and 1 was undecided. This study found evidence, supporting results of an earlier study, that bioscience literature search systems such as PubMed should show figures from articles alongside search results. It also found evidence that full text and captions should be searched along with the article title, metadata, and abstract. Finally, for a subset of users and information needs, allowing for explicit search within captions for figures and tables is a useful function, but it is not entirely clear how to cleanly integrate this within a more general literature search interface. Such a facility supports Open Access publishing efforts, as it requires access to full text of documents and the lifting of restrictions in order to show figures in the search interface. PMID:20418942
Reading and Writing in the 21st Century.
ERIC Educational Resources Information Center
Soloway, Elliot; And Others
1993-01-01
Describes MediaText, a multimedia document processor developed at the University of Michigan that allows the incorporation of video, music, sound, animations, still images, and text into one document. Interactive documents are discussed, and the need for users to be able to write documents as well as read them is emphasized. (four references) (LRW)
Electronic Documentation Support Tools and Text Duplication in the Electronic Medical Record
ERIC Educational Resources Information Center
Wrenn, Jesse
2010-01-01
In order to ease the burden of electronic note entry on physicians, electronic documentation support tools have been developed to assist in note authoring. There is little evidence of the effects of these tools on attributes of clinical documentation, including document quality. Furthermore, the resultant abundance of duplicated text and…
Semi-Automated Methods for Refining a Domain-Specific Terminology Base
2011-02-01
The terminology base serves not only as a resource for written and oral translation, but also for Natural Language Processing (NLP) applications, text retrieval, document indexing, and other knowledge management tasks.
Thematic clustering of text documents using an EM-based approach
2012-01-01
Clustering textual contents is an important step in mining useful information on the web or other text-based resources. The common task in text clustering is to handle text in a multi-dimensional space, and to partition documents into groups, where each group contains documents that are similar to each other. However, this strategy lacks a comprehensive view for humans in general since it cannot explain the main subject of each cluster. Utilizing semantic information can solve this problem, but it needs a well-defined ontology or pre-labeled gold standard set. In this paper, we present a thematic clustering algorithm for text documents. Given text, subject terms are extracted and used for clustering documents in a probabilistic framework. An EM approach is used to ensure documents are assigned to correct subjects, hence it converges to a locally optimal solution. The proposed method is distinctive because its results are sufficiently explanatory for human understanding as well as efficient for clustering performance. The experimental results show that the proposed method provides a competitive performance compared to other state-of-the-art approaches. We also show that the extracted themes from the MEDLINE® dataset represent the subjects of clusters reasonably well. PMID:23046528
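The paper's probabilistic model over extracted subject terms is not reproduced here; the sketch below only illustrates EM-style soft assignment of documents to clusters, with a Gaussian mixture fitted by EM over TF-IDF vectors standing in as a generic example. The documents are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture

docs = ["gene expression in tumour cells",
        "tumour suppressor gene pathways",
        "stellar spectra of distant galaxies",
        "galaxy cluster spectra and redshift"]

X = TfidfVectorizer().fit_transform(docs).toarray()
gm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0).fit(X)
print(gm.predict_proba(X).round(2))   # soft (probabilistic) cluster memberships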
Fast words boundaries localization in text fields for low quality document images
NASA Astrophysics Data System (ADS)
Ilin, Dmitry; Novikov, Dmitriy; Polevoy, Dmitry; Nikolaev, Dmitry
2018-04-01
The paper examines the problem of precise localization of word boundaries in document text zones. Document processing on a mobile device consists of document localization, perspective correction, localization of individual fields, finding words in separate zones, segmentation and recognition. When capturing an image with a mobile digital camera under uncontrolled conditions, digital noise, perspective distortions or glares may occur. Further document processing is complicated by document specifics: layout elements, complex background, static text, document security elements, and a variety of text fonts. However, the problem of word boundary localization has to be solved at runtime on a mobile CPU with limited computing capabilities under the specified restrictions. At the moment, there are several groups of methods optimized for different conditions. Methods for scanned printed text are quick but limited to images of high quality. Methods for text in the wild have an excessively high computational complexity and thus are hardly suitable for running on mobile devices as part of a mobile document recognition system. The method presented in this paper solves a more specialized problem than the task of finding text in natural images. It uses local features, a sliding window and a lightweight neural network in order to achieve an optimal speed-precision ratio. The algorithm takes 12 ms per field running on an ARM processor of a mobile device. The error rate for boundary localization on a test sample of 8000 fields is 0.3
Documents Similarity Measurement Using Field Association Terms.
ERIC Educational Resources Information Center
Atlam, El-Sayed; Fuketa, M.; Morita, K.; Aoe, Jun-ichi
2003-01-01
Discussion of text analysis and information retrieval and measurement of document similarity focuses on a new text manipulation system called FA (field association)-Sim that is useful for retrieving information in large heterogeneous texts and for recognizing content similarity in text excerpts. Discusses recall and precision, automatic indexing…
Means of storage and automated monitoring of versions of text technical documentation
NASA Astrophysics Data System (ADS)
Leonovets, S. A.; Shukalov, A. V.; Zharinov, I. O.
2018-03-01
The paper considers the automation of preparing, storing and monitoring versions of textual design and program documentation by means of specialized software. Automation of documentation preparation is based on processing of the engineering data contained in the specifications and technical documentation or in the specification. Data handling assumes the existence of strictly structured electronic documents prepared in widespread formats according to templates based on industry standards, and the automated generation of the program or design text document. The further life cycle of the document and the engineering data it contains is then controlled, with archival data storage carried out at each stage of the life cycle. Performance studies of the use of different widespread document formats under automated monitoring and storage are given. The newly developed software and the workbenches available to the developer of instrumentation equipment are described.
Essential medicines availability is still suboptimal in many countries: a scoping review.
Mahmić-Kaknjo, Mersiha; Jeličić-Kadić, Antonia; Utrobičić, Ana; Chan, Kit; Bero, Lisa; Marušić, Ana
2018-06-01
To identify uses of WHO Model list of essential medicines (EMs) and summarize studies examining EM and national EM lists (NEMLs). In this scoping review, we searched PubMed, Scopus, WHO website and WHO Regional Databases for studies on NEMLs, reimbursement medicines lists, and WHO EML, with no date or language restrictions. Three thousand one hundred forty-four retrieved documents were independently screened by two reviewers; 100 full-text documents were analyzed; 37 contained data suitable for quantitative and qualitative analysis on EMs availability (11 documents), medicines for specific diseases (13 documents), and comparison of WHO EML and NEMLs (13 documents). From the latter, two documents analyzed the relevance of evidence from Cochrane systematic reviews for medicines that were on NEMLs but not on the WHO EML. EMs availability is still suboptimal in low-income countries. Availability of children formulations and EMs for specific diseases such as chronic, cancer, pain, and reproductive health is suboptimal even in middle-income countries. WHO EML can be used as a basic set of medicines for different settings. More evidence is needed into how NEMLs can contribute to better availability of children formulations, pain, and cancer medicines in developing countries. Copyright © 2018 Elsevier Inc. All rights reserved.
Biotea: RDFizing PubMed Central in support for the paper as an interface to the Web of Data
2013-01-01
Background The World Wide Web has become a dissemination platform for scientific and non-scientific publications. However, most of the information remains locked up in discrete documents that are not always interconnected or machine-readable. The connectivity tissue provided by RDF technology has not yet been widely used to support the generation of self-describing, machine-readable documents. Results In this paper, we present our approach to the generation of self-describing machine-readable scholarly documents. We understand the scientific document as an entry point and interface to the Web of Data. We have semantically processed the full-text, open-access subset of PubMed Central. Our RDF model and resulting dataset make extensive use of existing ontologies and semantic enrichment services. We expose our model, services, prototype, and datasets at http://biotea.idiginfo.org/ Conclusions The semantic processing of biomedical literature presented in this paper embeds documents within the Web of Data and facilitates the execution of concept-based queries against the entire digital library. Our approach delivers a flexible and adaptable set of tools for metadata enrichment and semantic processing of biomedical documents. Our model delivers a semantically rich and highly interconnected dataset with self-describing content so that software can make effective use of it. PMID:23734622
ERIC Educational Resources Information Center
Giordano, Richard
1994-01-01
Describes the Text Encoding Initiative (TEI) project and the TEI header, which documents electronic text in a standard interchange format understandable to both librarian catalogers and nonlibrarian text encoders. The form and function of the TEI header is introduced, and its relationship to the MARC record is explained. (10 references) (KRN)
Handwritten text line segmentation by spectral clustering
NASA Astrophysics Data System (ADS)
Han, Xuecheng; Yao, Hui; Zhong, Guoqiang
2017-02-01
Since handwritten text lines are generally skewed and not obviously separated, text line segmentation of handwritten document images is still a challenging problem. In this paper, we propose a novel text line segmentation algorithm based on spectral clustering. Given a handwritten document image, we convert it to a binary image first, and then compute the adjacency matrix of the pixel points. We apply spectral clustering on this similarity matrix and use the orthogonal k-means clustering algorithm to group the text lines. Experiments on the Chinese handwritten document database (HIT-MW) demonstrate the effectiveness of the proposed method.
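A minimal sketch of the general pipeline described above rather than the authors' exact construction: foreground pixel coordinates from a synthetic binary image are grouped with scikit-learn's spectral clustering so that each cluster approximates one text line. The orthogonal k-means refinement is omitted and the affinity parameter is an assumption.
import numpy as np
from sklearn.cluster import SpectralClustering

image = np.zeros((40, 100), dtype=np.uint8)
image[8:12, 5:90] = 1                           # first synthetic text line
image[26:30, 10:95] = 1                         # second, slightly offset line

ys, xs = np.nonzero(image)                      # foreground pixel coordinates
pixels = np.column_stack([xs, ys]).astype(float)
labels = SpectralClustering(n_clusters=2, affinity="rbf", gamma=0.05,
                            random_state=0).fit_predict(pixels)
print(np.bincount(labels))                      # roughly one count per text line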
Machine printed text and handwriting identification in noisy document images.
Zheng, Yefeng; Li, Huiping; Doermann, David
2004-03-01
In this paper, we address the problem of the identification of text in noisy document images. We are especially focused on segmenting and distinguishing between handwriting and machine printed text because: 1) handwriting in a document often indicates corrections, additions, or other supplemental information that should be treated differently from the main content, and 2) the segmentation and recognition techniques required for machine printed and handwritten text are significantly different. A novel aspect of our approach is that we treat noise as a separate class and model noise based on selected features. Trained Fisher classifiers are used to identify machine printed text and handwriting from noise, and we further exploit context to refine the classification. A Markov Random Field (MRF) based approach is used to model the geometrical structure of the printed text, handwriting, and noise to rectify misclassifications. Experimental results show that our approach is robust and can significantly improve page segmentation in noisy document collections.
The Text Encoding Initiative: Flexible and Extensible Document Encoding.
ERIC Educational Resources Information Center
Barnard, David T.; Ide, Nancy M.
1997-01-01
The Text Encoding Initiative (TEI), an international collaboration aimed at producing a common encoding scheme for complex texts, examines the requirement for generality versus the requirement to handle specialized text types. Discusses how documents and users tax the limits of fixed schemes requiring flexible extensible encoding to support…
BioTextQuest: a web-based biomedical text mining suite for concept discovery.
Papanikolaou, Nikolas; Pafilis, Evangelos; Nikolaou, Stavros; Ouzounis, Christos A; Iliopoulos, Ioannis; Promponas, Vasilis J
2011-12-01
BioTextQuest combines automated discovery of significant terms in article clusters with structured knowledge annotation, via Named Entity Recognition services, offering interactive, user-friendly visualization. Terms labeling each document cluster are illustrated as a tag cloud and semantically annotated according to biological entity type, and a list of document titles enables users to simultaneously compare terms and documents of each cluster, facilitating concept association and hypothesis generation. BioTextQuest allows customization of analysis parameters, e.g. clustering/stemming algorithms and exclusion of documents/significant terms, to better match the biological question addressed. http://biotextquest.biol.ucy.ac.cy vprobon@ucy.ac.cy; iliopj@med.uoc.gr Supplementary data are available at Bioinformatics online.
Jiang, Xiangying; Ringwald, Martin; Blake, Judith; Shatkay, Hagit
2017-01-01
The Gene Expression Database (GXD) is a comprehensive online database within the Mouse Genome Informatics resource, aiming to provide available information about endogenous gene expression during mouse development. The information stems primarily from many thousands of biomedical publications that database curators must go through and read. Given the very large number of biomedical papers published each year, automatic document classification plays an important role in biomedical research. Specifically, an effective and efficient document classifier is needed for supporting the GXD annotation workflow. We present here an effective yet relatively simple classification scheme, which uses readily available tools while employing feature selection, aiming to assist curators in identifying publications relevant to GXD. We examine the performance of our method over a large manually curated dataset, consisting of more than 25 000 PubMed abstracts, of which about half are curated as relevant to GXD while the other half as irrelevant to GXD. In addition to text from title-and-abstract, we also consider image captions, an important information source that we integrate into our method. We apply a captions-based classifier to a subset of about 3300 documents, for which the full text of the curated articles is available. The results demonstrate that our proposed approach is robust and effectively addresses the GXD document classification. Moreover, using information obtained from image captions clearly improves performance, compared to title and abstract alone, affirming the utility of image captions as a substantial evidence source for automatically determining the relevance of biomedical publications to a specific subject area. www.informatics.jax.org. © The Author(s) 2017. Published by Oxford University Press.
Yuan, Soe-Tsyr; Sun, Jerry
2005-10-01
Development of algorithms for automated text categorization in massive text document sets is an important research area of data mining and knowledge discovery. Most text-clustering methods are grounded in a term-based measurement of distance or similarity, ignoring the structure of the documents. In this paper, we present a novel method named structured cosine similarity (SCS) that furnishes document clustering with a new way of modeling document summarization, taking the structure of the documents into account so as to improve the performance of document clustering in terms of quality, stability, and efficiency. This study was motivated by the problem of clustering speech documents (which lack rich document features) obtained from oral experience sharing over wireless devices by the mobile workforce of enterprises, supporting audio-based knowledge management. In other words, this problem aims to facilitate knowledge acquisition and sharing by speech. The evaluations also show fairly promising results for our method of structured cosine similarity.
Page layout analysis and classification for complex scanned documents
NASA Astrophysics Data System (ADS)
Erkilinc, M. Sezer; Jaber, Mustafa; Saber, Eli; Bauer, Peter; Depalov, Dejan
2011-09-01
A framework for region/zone classification in color and gray-scale scanned documents is proposed in this paper. The algorithm includes modules for extracting text, photo, and strong edge/line regions. First, a text detection module based on wavelet analysis and the Run Length Encoding (RLE) technique is employed. Local and global energy maps in the high-frequency bands of the wavelet domain are generated and used as initial text maps; further analysis using RLE yields a final text map. The second module detects image/photo and pictorial regions in the input document. A block-based classifier using basis vector projections is employed to identify photo candidate regions, and a final photo map is obtained by applying a probabilistic Markov random field (MRF) based maximum a posteriori (MAP) optimization with iterated conditional modes (ICM). The final module detects lines and strong edges using the Hough transform and edge-linkage analysis, respectively. The text, photo, and strong edge/line maps are combined to generate a page layout classification of the scanned target document. Experimental results and objective evaluation show that the proposed technique performs very effectively on a variety of simple and complex scanned document types obtained from the MediaTeam Oulu document database. The proposed page layout classifier can be used in systems for efficient document storage, content based document retrieval, optical character recognition, mobile phone imagery, and augmented reality.
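A small sketch of the first module's idea, assuming PyWavelets: take one level of a Haar wavelet decomposition and threshold the high-frequency energy to get an initial text map (the RLE refinement and the actual thresholds used by the authors are not shown):

    # Sketch: initial text map from high-frequency wavelet energy.
    import numpy as np
    import pywt

    def initial_text_map(gray_img, thresh=0.2):
        cA, (cH, cV, cD) = pywt.dwt2(gray_img.astype(float), "haar")
        energy = cH ** 2 + cV ** 2 + cD ** 2       # local high-frequency energy
        energy /= energy.max() + 1e-9              # normalize to [0, 1]
        return energy > thresh                     # candidate text blocks (half resolution)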
The Electronic Documentation Project in the NASA mission control center environment
NASA Technical Reports Server (NTRS)
Wang, Lui; Leigh, Albert
1994-01-01
NASA's space programs, like many other technical programs of this magnitude, are supported by a large volume of technical documents. These documents are not only diverse but also abundant. Management, maintenance, and retrieval of these documents is a challenging problem by itself; relating and cross-referencing this wealth of information when it is all on paper is an even greater challenge. The Electronic Documentation Project (EDP) provides an electronic system capable of developing, distributing and controlling changes for crew/ground controller procedures and related documents. There are two primary motives for the solution. The first is to reduce the cost of maintaining the current paper based method of operations by replacing paper documents with electronic information storage and retrieval. The other is to improve efficiency and provide enhanced flexibility in document usage. Initially, the current paper based system will be faithfully reproduced in an electronic format to be used in the document viewing system. In addition, this metaphor will have hypertext extensions. Hypertext features support basic functions such as full text searches, key word searches, data retrieval, and traversal between nodes of information, as well as speeding up the data access rate. They enable related but separate documents to have relationships, and allow the user to explore information naturally through non-linear link traversals. The basic operational requirements of the document viewing system are to: provide an electronic corollary to the current method of paper based document usage; supplement and ultimately replace paper-based documents; remain focused on control center operations such as Flight Data File, Flight Rules and Console Handbook viewing; and be available NASA wide.
NASA Astrophysics Data System (ADS)
Manzella, Giuseppe M. R.; Bartolini, Andrea; Bustaffa, Franco; D'Angelo, Paolo; De Mattei, Maurizio; Frontini, Francesca; Maltese, Maurizio; Medone, Daniele; Monachini, Monica; Novellino, Antonio; Spada, Andrea
2016-04-01
The MAPS (Marine Planning and Service Platform) project aims at building a computer platform supporting a Marine Information and Knowledge System. One of the main objectives of the project is to develop a repository that gathers, classifies and structures marine scientific literature and data, thus guaranteeing their accessibility to researchers and institutions by means of standard protocols. In oceanography the cost of data collection is very high, and the new paradigm is based on the concept of collecting once and re-using many times (for re-analysis, marine environment assessment, studies on trends, etc.). This concept requires access to quality controlled data and to information that is provided in reports (grey literature) and/or in the relevant scientific literature. Hence, new technology needs to be created by integrating several disciplines such as data management, information systems, and knowledge management. In one of the most important EC projects on data management, namely SeaDataNet (www.seadatanet.org), an initial example of knowledge management is provided through the Common Data Index, which provides links to data and (eventually) to papers. There are efforts to develop search engines that find authors' contributions to scientific literature or publications. This implies the use of persistent identifiers (such as DOI), as is done in ORCID. However, very few efforts are dedicated to linking publications to the data cited or used, or that can be of importance for the published studies. This is the objective of MAPS. Full-text technologies are often unsuccessful since they assume the presence of specific keywords in the text; to address this problem, the MAPS project proposes to use semantic technologies for retrieving text and data, thus yielding much more relevant results. The main parts of our design of the search engine are:
• Syntactic parser - This module is responsible for the extraction of "rich words" from the text: the whole document is parsed to extract the words that are most meaningful for the main argument of the document, and the extraction is applied in the form of N-grams (mono-grams, bi-grams, tri-grams).
• MAPS database - This module is a simple database which contains all the N-grams used by MAPS (physical parameters from SeaDataNet vocabularies) to define our marine "ontology".
• Relation identifier - This module performs the most important task of identifying relationships between the N-grams extracted from the text by the parser and the provided oceanographic terminology. It checks N-grams supplied by the Syntactic parser and matches them with the terms stored in the MAPS database. Found matches are returned to the parser with the inflected form appearing in the source text.
• A "relaxed" extractor - This option can be activated when the search engine is launched. It was introduced to give the user a chance to create new N-grams by combining existing mono-grams and bi-grams in the database with rich words found within the source text.
The innovation of a semantic engine lies in the fact that the process is not just the retrieval of already known documents by means of a simple term query, but rather the retrieval of a population of documents whose existence was unknown.
The system answers by showing a screen of results ordered according to the following criteria:
• Relevance - of the document with respect to the concept that is searched
• Date - of publication of the paper
• Source - data provider as defined in the SeaDataNet Common Data Index
• Matrix - environmental matrices as defined in the oceanographic field
• Geographic area - area specified in the text
• Clustering - the process of organizing objects into groups whose members are similar
The clustering returns the related documents as its output. For each document the MAPS visualization provides:
• Title, author, source/provider of data, web address
• Tagging of key terms or concepts
• Summary of the document
• Visualization of the whole document
The possibility of including the number of citations for each document among the criteria of the advanced search is currently under investigation; in this case the engine should be able to connect to any of the existing bibliographic citation systems (such as Google Scholar, Scopus, etc.).
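A toy sketch of the Syntactic parser / Relation identifier idea: extract mono-, bi- and tri-grams from a text and match them against a controlled vocabulary. The vocabulary below is a placeholder, not the SeaDataNet term list:

    # Sketch: n-gram extraction and matching against a small controlled vocabulary.
    import re

    VOCAB = {"sea surface temperature", "salinity", "chlorophyll a"}   # placeholder terms

    def ngrams(tokens, n):
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def match_terms(text):
        tokens = re.findall(r"[a-z]+", text.lower())
        candidates = ngrams(tokens, 1) + ngrams(tokens, 2) + ngrams(tokens, 3)
        return sorted(c for c in set(candidates) if c in VOCAB)

    print(match_terms("Monthly sea surface temperature and salinity profiles"))
    # -> ['salinity', 'sea surface temperature']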
Three-Dimensional Display of Document Set
Lantrip, David B.; Pennock, Kelly A.; Pottier, Marc C.; Schur, Anne; Thomas, James J.; Wise, James A.
2003-06-24
A method for spatializing text content for enhanced visual browsing and analysis. The invention is applied to large text document corpora such as digital libraries, regulations and procedures, archived reports, and the like. The text content from these sources may be transformed to a spatial representation that preserves informational characteristics from the documents. The three-dimensional representation may then be visually browsed and analyzed in ways that avoid language processing and that reduce the analysts' effort.
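One simple way to obtain such a spatial layout is sketched below with TF-IDF vectors reduced to three dimensions; this is a generic stand-in, not the patented spatialization algorithm:

    # Sketch: project documents into 3-D coordinates for visual browsing.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["regulations for archived reports", "digital library procedures",
            "archived procedure reports", "library regulations"]
    X = TfidfVectorizer().fit_transform(docs)
    coords = TruncatedSVD(n_components=3, random_state=0).fit_transform(X)
    print(coords.shape)   # (n_documents, 3): one point per document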
Three-dimensional display of document set
Lantrip, David B [Oxnard, CA; Pennock, Kelly A [Richland, WA; Pottier, Marc C [Richland, WA; Schur, Anne [Richland, WA; Thomas, James J [Richland, WA; Wise, James A [Richland, WA
2006-09-26
A method for spatializing text content for enhanced visual browsing and analysis. The invention is applied to large text document corpora such as digital libraries, regulations and procedures, archived reports, and the like. The text content from these sources may be transformed to a spatial representation that preserves informational characteristics from the documents. The three-dimensional representation may then be visually browsed and analyzed in ways that avoid language processing and that reduce the analysts' effort.
Three-dimensional display of document set
Lantrip, David B [Oxnard, CA; Pennock, Kelly A [Richland, WA; Pottier, Marc C [Richland, WA; Schur, Anne [Richland, WA; Thomas, James J [Richland, WA; Wise, James A [Richland, WA
2001-10-02
A method for spatializing text content for enhanced visual browsing and analysis. The invention is applied to large text document corpora such as digital libraries, regulations and procedures, archived reports, and the like. The text content from these sources may be transformed to a spatial representation that preserves informational characteristics from the documents. The three-dimensional representation may then be visually browsed and analyzed in ways that avoid language processing and that reduce the analysts' effort.
Three-dimensional display of document set
Lantrip, David B [Oxnard, CA; Pennock, Kelly A [Richland, WA; Pottier, Marc C [Richland, WA; Schur, Anne [Richland, WA; Thomas, James J [Richland, WA; Wise, James A [Richland, WA; York, Jeremy [Bothell, WA
2009-06-30
A method for spatializing text content for enhanced visual browsing and analysis. The invention is applied to large text document corpora such as digital libraries, regulations and procedures, archived reports, and the like. The text content from these sources may be transformed to a spatial representation that preserves informational characteristics from the documents. The three-dimensional representation may then be visually browsed and analyzed in ways that avoid language processing and that reduce the analysts' effort.
Eyrolle, Hélène; Virbel, Jacques; Lemarié, Julie
2008-03-01
Based on previous research in the field of cognitive psychology, highlighting the facilitatory effects of titles on several text-related activities, this paper looks at the extent to which titles reflect text content. An exploratory study of real-life technical documents investigated the content of their Subject lines, which linguistic analyses had led us to regard as titles. The study showed that most of the titles supplied by the writers failed to represent the documents' contents and that most users failed to detect this lack of validity.
Semantic retrieval and navigation in clinical document collections.
Kreuzthaler, Markus; Daumke, Philipp; Schulz, Stefan
2015-01-01
Patients with chronic diseases undergo numerous in- and outpatient treatment periods, and therefore many documents accumulate in their electronic records. We report on an on-going project focussing on the semantic enrichment of medical texts, in order to support recall-oriented navigation across a patient's complete documentation. A document pool of 1,696 de-identified discharge summaries was used for prototyping. A natural language processing toolset for document annotation (based on the text-mining framework UIMA) and indexing (Solr) was used to support a browser-based platform for document import, search and navigation. The integrated search engine combines free text and concept-based querying, supported by dynamically generated facets (diagnoses, procedures, medications, lab values, and body parts). The prototype demonstrates the feasibility of semantic document enrichment within document collections of a single patient. Originally conceived as an add-on for the clinical workplace, this technology could also be adapted to support personalised health record platforms, as well as cross-patient search for cohort building and other secondary use scenarios.
Text-line extraction in handwritten Chinese documents based on an energy minimization framework.
Koo, Hyung Il; Cho, Nam Ik
2012-03-01
Text-line extraction in unconstrained handwritten documents remains a challenging problem due to nonuniform character scale, spatially varying text orientation, and the interference between text lines. In order to address these problems, we propose a new cost function that considers the interactions between text lines and the curvilinearity of each text line. Precisely, we achieve this goal by introducing normalized measures for them, which are based on an estimated line spacing. We also present an optimization method that exploits the properties of our cost function. Experimental results on a database consisting of 853 handwritten Chinese document images have shown that our method achieves a detection rate of 99.52% and an error rate of 0.32%, which outperforms conventional methods.
Mapping annotations with textual evidence using an scLDA model.
Jin, Bo; Chen, Vicky; Chen, Lujia; Lu, Xinghua
2011-01-01
Most of the knowledge regarding genes and proteins is stored in biomedical literature as free text. Extracting information from complex biomedical texts demands techniques capable of inferring biological concepts from local text regions and mapping them to controlled vocabularies. To this end, we present a sentence-based correspondence latent Dirichlet allocation (scLDA) model which, when trained with a corpus of PubMed documents with known GO annotations, performs the following tasks: 1) learning major biological concepts from the corpus, 2) inferring the biological concepts existing within text regions (sentences), and 3) identifying the text regions in a document that provides evidence for the observed annotations. When applied to new gene-related documents, a trained scLDA model is capable of predicting GO annotations and identifying text regions as textual evidence supporting the predicted annotations. This study uses GO annotation data as a testbed; the approach can be generalized to other annotated data, such as MeSH and MEDLINE documents.
Ontology-based content analysis of US patent applications from 2001-2010.
Weber, Lutz; Böhme, Timo; Irmer, Matthias
2013-01-01
Ontology-based semantic text analysis methods make it possible to automatically extract knowledge relationships and data from text documents. In this review, we have applied these technologies to the systematic analysis of pharmaceutical patents. Hierarchical concepts from the knowledge domains of chemical compounds, diseases and proteins were used to annotate full-text US patent applications that deal with pharmacological activities of chemical compounds and were filed in the years 2001-2010. Compounds claimed in these applications have been classified into their respective compound classes to review the distribution of scaffold types or general compound classes such as natural products in a time-dependent manner. Similarly, the target proteins and claimed utility of the compounds have been classified and the most relevant were extracted. The method presented allows the discovery of the main areas of innovation as well as emerging fields of patenting activity, providing a broad statistical basis for competitor analysis and decision-making efforts.
How to use the WWW to distribute STI
NASA Technical Reports Server (NTRS)
Roper, Donna G.
1994-01-01
This presentation explains how to use the World Wide Web (WWW) to distribute scientific and technical information as hypermedia. WWW clients and servers use the HyperText Transfer Protocol (HTTP) to transfer documents containing links to other text, graphics, video, and sound. The standard language for these documents is the HyperText Markup Language (HTML). These are simply text files with formatting codes that contain layout information and hyperlinks. HTML documents can be created with any text editor or with one of the publicly available HTML editors or converters. HTML can also include links to available image formats. This presentation is available online at http://sti.larc.nasa.gov/demos/workshop/introtext.html.
Comparing Latent Dirichlet Allocation and Latent Semantic Analysis as Classifiers
ERIC Educational Resources Information Center
Anaya, Leticia H.
2011-01-01
In the Information Age, a proliferation of unstructured text electronic documents exists. Processing these documents by humans is a daunting task as humans have limited cognitive abilities for processing large volumes of documents that can often be extremely lengthy. To address this problem, text data computer algorithms are being developed.…
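For context, a generic way to set up such a comparison with scikit-learn is sketched below: the same classifier is trained on LDA topic features and on LSA (truncated SVD) features. The pipeline shapes and parameters are illustrative, not those used in the study:

    # Sketch: LDA vs. LSA document representations feeding the same classifier.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    lda_clf = make_pipeline(CountVectorizer(),
                            LatentDirichletAllocation(n_components=10, random_state=0),
                            LogisticRegression(max_iter=1000))
    lsa_clf = make_pipeline(TfidfVectorizer(),
                            TruncatedSVD(n_components=10, random_state=0),
                            LogisticRegression(max_iter=1000))
    # Fit each pipeline on labeled documents and compare held-out accuracy to see
    # which representation supports classification better.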
NASA Astrophysics Data System (ADS)
David, Peter; Hansen, Nichole; Nolan, James J.; Alcocer, Pedro
2015-05-01
The growth in text data available online is accompanied by a growth in the diversity of available documents. Corpora with extreme heterogeneity in terms of file formats, document organization, page layout, text style, and content are common. The absence of meaningful metadata describing the structure of online and open-source data leads to text extraction results that contain no information about document structure and are cluttered with page headers and footers, web navigation controls, advertisements, and other items that are typically considered noise. We describe an approach to document structure and metadata recovery that uses visual analysis of documents to infer the communicative intent of the author. Our algorithm identifies the components of documents such as titles, headings, and body content, based on their appearance. Because it operates on an image of a document, our technique can be applied to any type of document, including scanned images. Our approach to document structure recovery considers a finer-grained set of component types than prior approaches. In this initial work, we show that a machine learning approach to document structure recovery using a feature set based on the geometry and appearance of images of documents achieves a 60% greater F1-score than a baseline random classifier.
Text line extraction in free style document
NASA Astrophysics Data System (ADS)
Shen, Xiaolu; Liu, Changsong; Ding, Xiaoqing; Zou, Yanming
2009-01-01
This paper addresses text line extraction in free style documents, such as business cards, envelopes, posters, etc. In free style documents, global properties such as character size and line direction can hardly be determined, which reveals a serious limitation of traditional layout analysis. 'Line' is the most prominent and the highest-level structure in our bottom-up method. First, we apply a novel intensity function, based on gradient information, to locate text areas where gradients within a window have large magnitude and various directions, and we split such areas into text pieces. We build a probability model of lines consisting of text pieces via statistics on training data. For an input image, we group text pieces into lines using a simulated annealing algorithm with a cost function based on the probability model.
Brooks, Ingrid A; Sayre, Michael R; Spencer, Caroline; Archer, Frank L
2016-02-01
The Emergency Medical Services (EMS) approach to emergency prehospital care in the United States (US) has global influence. As the 50-year anniversary of modern US EMS approaches, there is value in examining US EMS education development over this period. This report describes US EMS education milestones and identifies themes that provide context to readers outside the US. As US EMS education is described mainly in publications of federal US EMS agencies and associations, a Google search and hand searching of documents identified publications in the public domain. MEDLINE and CINAHL Plus were searched for peer reviewed publications. Documents were reviewed using both a chronological and thematic approach. Seventy-eight documents and 685 articles were screened, the full texts of 175 were reviewed, and 41 were selected for full review. Four historical periods in US EMS education became apparent: EMS education development (1966-1980); EMS education consolidation and review (1981-1989); EMS education reflection and change (1990-1999); and EMS education for the future (2000-2014). Four major themes emerged: legislative authority, physician direction, quality, and development of the profession. Documents produced through broad interprofessional consultations, with support from federal and US EMS authorities, reflect the catalysts for US EMS education development. The current model of US EMS education provides a structure to enhance educational quality into the future. Implementation evaluation of this model would be a valuable addition to the US EMS literature. The themes emerging from this review assist the understanding of the characteristics of US EMS education.
Semantic Theme Analysis of Pilot Incident Reports
NASA Technical Reports Server (NTRS)
Thirumalainambi, Rajkumar
2009-01-01
Pilots report accidents or incidents during take-off, in flight, and on landing to airline authorities as well as to the Federal Aviation Administration. The description in a pilot report of an incident contains technical terms related to flight instruments and operations. Normal text mining approaches collect keywords from text documents and relate them among documents stored in a database. The present approach extracts a specific theme analysis of incident reports and semantically relates a hierarchy of terms, assigning weights to themes. Once theme extraction has been performed for a given document, a unique key can be assigned to that document to cross-link the documents. Semantic linking is used to categorize the documents based on specific rules that can help an end-user analyze certain types of accidents. This presentation outlines the architecture of text mining for pilot incident reports for autonomous categorization of pilot incident reports using semantic theme analysis.
An automated system for generating program documentation
NASA Technical Reports Server (NTRS)
Hanney, R. J.
1970-01-01
A documentation program was developed in which the emphasis is placed on text content rather than flowcharting. It is keyword oriented, with 26 keywords that control the program. Seventeen of those keywords are recognized by the flowchart generator, three are related to text generation, and three have to do with control card and deck displays. The strongest advantage offered by the documentation program is that it produces the entire document. The document is prepared on 35mm microfilm, which is easy to store, and letter-size reproductions can be made inexpensively on bond paper.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fischer, G.A.
2011-07-01
Document available in abstract form only, full text of document follows: The dosimetry from the H. B. Robinson Unit 2 Pressure Vessel Benchmark is analyzed with a suite of Westinghouse-developed codes and data libraries. The radiation transport from the reactor core to the surveillance capsule and ex-vessel locations is performed by RAPTOR-M3G, a parallel deterministic radiation transport code that calculates high-resolution neutron flux information in three dimensions. The cross-section library used in this analysis is the ALPAN library, an Evaluated Nuclear Data File (ENDF)/B-VII.0-based library designed for reactor dosimetry and fluence analysis applications. Dosimetry is evaluated with the industry-standard SNLRML reactor dosimetry cross-section data library. (authors)
Large Scale Document Inversion using a Multi-threaded Computing System
Jung, Sungbo; Chang, Dar-Jen; Park, Juw Won
2018-01-01
Current microprocessor architecture is moving towards multi-core/multi-threaded systems. This trend has led to a surge of interest in using multi-threaded computing devices, such as the Graphics Processing Unit (GPU), for general purpose computing. We can utilize the GPU in computation as a massively parallel coprocessor because the GPU consists of multiple cores. The GPU is also an affordable, attractive, and user-programmable commodity. Nowadays, a flood of information has entered the digital domain around the world. Huge volumes of data, such as digital libraries, social networking services, e-commerce product data, reviews, etc., are produced or collected every moment with dramatic growth in size. Although the inverted index is a useful data structure that can be used for full text searches or document retrieval, a large number of documents will require a tremendous amount of time to create the index. The performance of document inversion can be improved by a multi-thread or multi-core GPU. Our approach is to implement a linear-time, hash-based, single program multiple data (SPMD), document inversion algorithm on the NVIDIA GPU/CUDA programming platform, utilizing the huge computational power of the GPU to develop high performance solutions for document indexing. Our proposed parallel document inversion system shows 2-3 times faster performance than a sequential system on two different test datasets from PubMed abstracts and e-commerce product reviews. CCS Concepts: Information systems → Information retrieval; Computing methodologies → Massively parallel and high-performance simulations. PMID:29861701
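For reference, the sequential, hash-based baseline that such a GPU implementation is typically compared against can be sketched in a few lines; the CUDA kernel itself is not reproduced here:

    # Sketch: a sequential hash-based inverted index (term -> set of document ids).
    from collections import defaultdict
    import re

    def build_inverted_index(docs):
        index = defaultdict(set)
        for doc_id, text in enumerate(docs):
            for term in re.findall(r"[a-z0-9]+", text.lower()):
                index[term].add(doc_id)
        return index

    index = build_inverted_index(["GPU document inversion",
                                  "document retrieval on the GPU"])
    print(sorted(index["gpu"]))   # -> [0, 1]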
Large Scale Document Inversion using a Multi-threaded Computing System.
Jung, Sungbo; Chang, Dar-Jen; Park, Juw Won
2017-06-01
Current microprocessor architecture is moving towards multi-core/multi-threaded systems. This trend has led to a surge of interest in using multi-threaded computing devices, such as the Graphics Processing Unit (GPU), for general purpose computing. We can utilize the GPU in computation as a massively parallel coprocessor because the GPU consists of multiple cores. The GPU is also an affordable, attractive, and user-programmable commodity. Nowadays, a flood of information has entered the digital domain around the world. Huge volumes of data, such as digital libraries, social networking services, e-commerce product data, reviews, etc., are produced or collected every moment with dramatic growth in size. Although the inverted index is a useful data structure that can be used for full text searches or document retrieval, a large number of documents will require a tremendous amount of time to create the index. The performance of document inversion can be improved by a multi-thread or multi-core GPU. Our approach is to implement a linear-time, hash-based, single program multiple data (SPMD), document inversion algorithm on the NVIDIA GPU/CUDA programming platform, utilizing the huge computational power of the GPU to develop high performance solutions for document indexing. Our proposed parallel document inversion system shows 2-3 times faster performance than a sequential system on two different test datasets from PubMed abstracts and e-commerce product reviews. Information systems → Information retrieval; Computing methodologies → Massively parallel and high-performance simulations.
Migration Amendment Act 1987 (No. 133 of 1987), 16 December 1987.
1988-01-01
This Act does the following, among other things: 1) requires that certain requests relating to entry permits, visas, and return endorsements are not considered to have been made unless in writing in accordance with the relevant approved forms and unless any fee payable has been made; 2) makes "operators," as well as charterers, liable for the carriage of persons to Australia without documentation; and 3) provides that international air operators shall pay immigration clearance fees payable by passengers whether or not the operator has collected the fees from the passengers. full text
Generating Researcher Networks with Identified Persons on a Semantic Service Platform
NASA Astrophysics Data System (ADS)
Jung, Hanmin; Lee, Mikyoung; Kim, Pyung; Lee, Seungwoo
This paper describes a Semantic Web-based method to acquire researcher networks by means of an identification scheme, ontology, and reasoning. Three steps are required to realize it: resolving co-references, finding experts, and generating researcher networks. We adopt OntoFrame as an underlying semantic service platform and apply reasoning to make direct relations between far-off classes in the ontology schema. 453,124 Elsevier journal articles with metadata and full-text documents in the information technology and biomedical domains have been loaded and served on the platform as a test set.
NASA Astrophysics Data System (ADS)
Zakaria, Chahnez; Curé, Olivier; Salzano, Gabriella; Smaïli, Kamel
In Computer Supported Cooperative Work (CSCW), it is crucial for project leaders to detect conflicting situations as early as possible. Generally, this task is performed manually by studying a set of documents exchanged between team members. In this paper, we propose a full-fledged automatic solution that identifies documents, subjects and actors involved in relational conflicts. Our approach detects conflicts in emails, probably the most popular type of document in CSCW, but the methods used can handle other text-based documents. These methods rely on the combination of statistical and ontological operations. The proposed solution is decomposed into several steps: (i) we enrich a simple negative emotion ontology with terms occurring in the corpus of emails, (ii) we categorize each conflicting email according to the concepts of this ontology, and (iii) we identify emails, subjects and team members involved in conflicting emails using possibilistic description logic and a set of proposed measures. Each of these steps is evaluated and validated on concrete examples. Moreover, the approach's framework is generic and can be easily adapted to domains other than conflicts, e.g. security issues, and extended with operations making use of our proposed set of measures.
On the map: Nature and Science editorials.
Waaijer, Cathelijn J F; van Bochove, Cornelis A; van Eck, Nees Jan
2011-01-01
Bibliometric mapping of scientific articles based on keywords and technical terms in abstracts is now frequently used to chart scientific fields. In contrast, no significant mapping has been applied to the full texts of non-specialist documents. Editorials in Nature and Science are such non-specialist documents, reflecting the views of the two most read scientific journals on science, technology and policy issues. We use the VOSviewer mapping software to chart the topics of these editorials. A term map and a document map are constructed and clusters are distinguished in both of them. The validity of the document clustering is verified by a manual analysis of a sample of the editorials. This analysis confirms the homogeneity of the clusters obtained by mapping and augments the latter with further detail. As a result, the analysis provides reliable information on the distribution of the editorials over topics, and on differences between the journals. The most striking difference is that Nature devotes more attention to internal science policy issues and Science more to the political influence of scientists. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1007/s11192-010-0205-9) contains supplementary material, which is available to authorized users.
A Feature Mining Based Approach for the Classification of Text Documents into Disjoint Classes.
ERIC Educational Resources Information Center
Nieto Sanchez, Salvador; Triantaphyllou, Evangelos; Kraft, Donald
2002-01-01
Proposes a new approach for classifying text documents into two disjoint classes. Highlights include a brief overview of document clustering; a data mining approach called the One Clause at a Time (OCAT) algorithm which is based on mathematical logic; vector space model (VSM); and comparing the OCAT to the VSM. (Author/LRW)
Text Categorization for Multi-Page Documents: A Hybrid Naive Bayes HMM Approach.
ERIC Educational Resources Information Center
Frasconi, Paolo; Soda, Giovanni; Vullo, Alessandro
Text categorization is typically formulated as a concept learning problem where each instance is a single isolated document. This paper is interested in a more general formulation where documents are organized as page sequences, as naturally occurring in digital libraries of scanned books and magazines. The paper describes a method for classifying…
DOE Office of Scientific and Technical Information (OSTI.GOV)
Burchard, Ross L.; Pierson, Kathleen P.; Trumbo, Derek
Tarjetas is used to generate requirements from source documents. These source documents are in a hierarchical XML format produced from PDF documents processed through the “Reframe” software package. The software includes the ability to create Topics and associate text Snippets with those Topics. Requirements are then generated, and text Snippets with their associated Topics are referenced to the requirement. The software maintains traceability from the requirement ultimately to the source document that produced the snippet.
Using phrases and document metadata to improve topic modeling of clinical reports.
Speier, William; Ong, Michael K; Arnold, Corey W
2016-06-01
Probabilistic topic models provide an unsupervised method for analyzing unstructured text, and have the potential to be integrated into clinical automatic summarization systems. Clinical documents are accompanied by metadata in a patient's medical history and frequently contain multiword concepts that can be valuable for accurately interpreting the included text. While existing methods have attempted to address these problems individually, we present a unified model for free-text clinical documents that integrates contextual patient- and document-level data and discovers multi-word concepts. In the proposed model, phrases are represented by chained n-grams and a Dirichlet hyper-parameter is weighted by both document-level and patient-level context. This method and three other latent Dirichlet allocation models were fit to a large collection of clinical reports. Examples of resulting topics demonstrate the results of the new model, and the quality of the representations is evaluated using empirical log likelihood. The proposed model was able to create informative prior probabilities based on patient and document information, and captured phrases that represented various clinical concepts. The representation using the proposed model had a significantly higher empirical log likelihood than the compared methods. Integrating document metadata and capturing phrases in clinical text greatly improves the topic representation of clinical documents. The resulting clinically informative topics may effectively serve as the basis for an automatic summarization system for clinical reports. Copyright © 2016 Elsevier Inc. All rights reserved.
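The phrase-capture idea can be approximated with off-the-shelf tools; the sketch below, assuming gensim, chains frequent word pairs into single tokens before fitting a plain LDA model. It omits the paper's key contribution of weighting the Dirichlet hyper-parameter with patient- and document-level context:

    # Sketch: detect multiword phrases, then fit a standard LDA topic model.
    from gensim.models.phrases import Phrases
    from gensim import corpora, models

    reports = [["chest", "pain", "ruled", "out"],
               ["chest", "pain", "and", "shortness", "of", "breath"],
               ["no", "chest", "pain", "today"]]
    bigram = Phrases(reports, min_count=2, threshold=1.0)   # chains frequent pairs
    texts = [bigram[r] for r in reports]                    # e.g. "chest_pain"
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)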
ERIC Educational Resources Information Center
Beghtol, Clare
1986-01-01
Explicates a definition and theory of "aboutness" and aboutness analysis developed by text linguist van Dijk; explores implications of text linguistics for bibliographic classification theory; suggests the elements that a theory of the cognitive process of classifying documents needs to encompass; and delineates how people identify…
An Efficiency Comparison of Document Preparation Systems Used in Academic Research and Development
Knauff, Markus; Nejasmic, Jelica
2014-01-01
The choice of an efficient document preparation system is an important decision for any academic researcher. To assist the research community, we report a software usability study in which 40 researchers across different disciplines prepared scholarly texts with either Microsoft Word or LaTeX. The probe texts included simple continuous text, text with tables and subheadings, and complex text with several mathematical equations. We show that LaTeX users were slower than Word users, wrote less text in the same amount of time, and produced more typesetting, orthographical, grammatical, and formatting errors. On most measures, expert LaTeX users performed even worse than novice Word users. LaTeX users, however, more often report enjoying using their respective software. We conclude that even experienced LaTeX users may suffer a loss in productivity when LaTeX is used, relative to other document preparation systems. Individuals, institutions, and journals should carefully consider the ramifications of this finding when choosing document preparation strategies, or requiring them of authors. PMID:25526083
Multioriented and curved text lines extraction from Indian documents.
Pal, U; Roy, Partha Pratim
2004-08-01
There are printed artistic documents where the text lines of a single page may not be parallel to each other. These text lines may have different orientations, or the text lines may be curved in shape. For the optical character recognition (OCR) of these documents, we need to extract such lines properly. In this paper, we propose a novel scheme, mainly based on the concept of a water reservoir analogy, to extract individual text lines from printed Indian documents containing multioriented and/or curved text lines. A reservoir is a metaphor to illustrate the cavity region of a character where water can be stored. In the proposed scheme, connected components are first labeled and identified as either isolated or touching. Next, each touching component is classified as either straight type (S-type) or curve type (C-type), depending on the reservoir base-area and envelope points of the component. Based on the type (S-type or C-type) of a component, two candidate points are computed from each touching component. Finally, candidate regions (neighborhoods of the candidate points) of the candidate points of each component are detected, and after analyzing these candidate regions, components are grouped to obtain individual text lines.
Sumathipala, Athula; Siribaddana, Sisira; Hewege, Suwin; Lekamwattage, Manura; Athukorale, Manjula; Siriwardhana, Chesmal; Murray, Joanna; Prince, Martin
2008-01-01
Background: International guidelines on research have focused on protecting research participants. Ethical Research Committee (ERC) approval and informed consent are the cornerstones. Externally sponsored research requires approval through ethical review in both the host and the sponsoring country. This study aimed to determine to what extent ERC approval and informed consent procedures are documented in locally and internationally published human subject research carried out in Sri Lanka.
Methods: We obtained ERC approval in Sri Lanka and the United Kingdom. Theses from 1985 to 2005 available at the Postgraduate Institute of Medicine (PGIM) library affiliated to the University of Colombo were scrutinised using checklists agreed in consultation with senior research collaborators. A Medline search was carried out with MeSH major and minor heading 'Sri Lanka' as the search term for international publications originating in Sri Lanka during 1999 to 2004. All research publications from CMJ during 1999 to 2005 were also scrutinized.
Results: Of 291 theses, 34% documented ERC approvals and 61% documented obtaining consent. From the international journal survey, 250 publications originated from Sri Lanka of which only 79 full text original research publications could be accessed electronically. Of these 38% documented ERC approval and 39% documented obtaining consent. In the Ceylon Medical Journal 36% documented ERC approval and 37% documented obtaining consent.
Conclusion: Only one third of the publications scrutinized recorded ERC approval and procurement of informed consent. However, there is a positive trend in documenting these ethical requirements in local postgraduate research and in the local medical journal. PMID:18267015
Automatic Generation of Conditional Diagnostic Guidelines.
Baldwin, Tyler; Guo, Yufan; Syeda-Mahmood, Tanveer
2016-01-01
The diagnostic workup for many diseases can be extraordinarily nuanced, and as such reference material text often contains extensive information regarding when it is appropriate to have a patient undergo a given procedure. In this work we employ a three-task pipeline for the extraction of statements indicating the conditions under which a procedure should be performed, given a suspected diagnosis. First, we identify each instance in the text where a procedure is being recommended. Next, we examine the context around these recommendations to extract conditional statements that dictate the conditions under which the recommendation holds. Finally, coreferring recommendations across the document are linked to produce a full recommendation summary. Results indicate that each underlying task can be performed with above-baseline performance, and the output can be used to produce concise recommendation summaries.
Chen, Changyou; Buntine, Wray; Ding, Nan; Xie, Lexing; Du, Lan
2015-02-01
In applications we may want to compare different document collections: they could have shared content but also aspects that are unique to particular collections. This task has been called comparative text mining or cross-collection modeling. We present a differential topic model for this application that models both topic differences and similarities. For this we use hierarchical Bayesian nonparametric models. Moreover, we found it was important to properly model power-law phenomena in topic-word distributions, and thus we used the full Pitman-Yor process rather than just a Dirichlet process. Furthermore, we propose the transformed Pitman-Yor process (TPYP) to incorporate prior knowledge, such as vocabulary variations in different collections, into the model. To deal with the non-conjugacy between the model prior and likelihood in the TPYP, we propose an efficient sampling algorithm using a data augmentation technique based on the multinomial theorem. Experimental results show the model discovers interesting aspects of different collections. We also show the proposed MCMC-based algorithm achieves a dramatically reduced test perplexity compared to some existing topic models. Finally, we show our model outperforms the state-of-the-art for document classification/ideology prediction on a number of text collections.
Retrieving clinical evidence: a comparison of PubMed and Google Scholar for quick clinical searches.
Shariff, Salimah Z; Bejaimal, Shayna Ad; Sontrop, Jessica M; Iansavichus, Arthur V; Haynes, R Brian; Weir, Matthew A; Garg, Amit X
2013-08-15
Physicians frequently search PubMed for information to guide patient care. More recently, Google Scholar has gained popularity as another freely accessible bibliographic database. To compare the performance of searches in PubMed and Google Scholar. We surveyed nephrologists (kidney specialists) and provided each with a unique clinical question derived from 100 renal therapy systematic reviews. Each physician provided the search terms they would type into a bibliographic database to locate evidence to answer the clinical question. We executed each of these searches in PubMed and Google Scholar and compared results for the first 40 records retrieved (equivalent to 2 default search pages in PubMed). We evaluated the recall (proportion of relevant articles found) and precision (ratio of relevant to nonrelevant articles) of the searches performed in PubMed and Google Scholar. Primary studies included in the systematic reviews served as the reference standard for relevant articles. We further documented whether relevant articles were available as free full-texts. Compared with PubMed, the average search in Google Scholar retrieved twice as many relevant articles (PubMed: 11%; Google Scholar: 22%; P<.001). Precision was similar in both databases (PubMed: 6%; Google Scholar: 8%; P=.07). Google Scholar provided significantly greater access to free full-text publications (PubMed: 5%; Google Scholar: 14%; P<.001). For quick clinical searches, Google Scholar returns twice as many relevant articles as PubMed and provides greater access to free full-text articles.
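The evaluation itself is simple to reproduce for any search: given the set of relevant articles (the primary studies of the review) and the first 40 retrieved records, recall and precision follow directly. A small helper is sketched below with hypothetical identifiers; precision here is computed as the share of examined records that are relevant, a common convention, whereas the paper reports the ratio of relevant to nonrelevant articles:

    # Sketch: recall and precision over the first k retrieved records.
    def recall_precision(retrieved, relevant, k=40):
        top_k = retrieved[:k]
        hits = sum(1 for r in top_k if r in relevant)
        recall = hits / len(relevant) if relevant else 0.0
        precision = hits / len(top_k) if top_k else 0.0   # relevant share of examined records
        return recall, precision

    # Hypothetical record ids, for illustration only.
    print(recall_precision(["a", "b", "c", "d"], {"b", "d", "e", "f"}))   # (0.5, 0.5)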
Benge, James; Beach, Thomas; Gladding, Connie; Maestas, Gail
2008-01-01
The Military Health System (MHS) deployed its electronic health record (EHR), AHLTA, to Military Treatment Facilities (MTFs) around the world. This paper focuses on the approach and barriers to using structured text in AHLTA to document care encounters and illustrates the direct correlation between the use of structured text and achievement of expected benefits. AHLTA uses commercially available products, a health data dictionary and standardized medical terminology, enabling the capture of structured, computable data. With structured text stored in the AHLTA Clinical Data Repository (CDR), the MHS has seen a return on its EHR investment, with improvements in the accuracy and completeness of coding and the documentation of care provided. Determining the aspects of documentation where structured text is most beneficial, as well as the degree of structure needed, has been a significant challenge. This paper describes how the economic value framework aligns the enterprise strategic objectives with the EHR investment features, performance metrics and expected benefits. The framework analyses focus on return-on-investment calculations, baseline assessment and post-implementation benefits validation. Cost avoidance, revenue enhancements and operational improvements, such as evidence-based medicine and medical surveillance, can be directly attributed to the use of structured text.
Redd, Andrew M; Gundlapalli, Adi V; Divita, Guy; Carter, Marjorie E; Tran, Le-Thuy; Samore, Matthew H
2017-07-01
Templates in text notes pose challenges for automated information extraction algorithms. We propose a method that identifies novel templates in plain text medical notes. The identification can then be used to either include or exclude templates when processing notes for information extraction. The two-module method is based on the framework of information foraging and addresses the hypothesis that documents containing templates and the templates within those documents can be identified by common features. The first module takes documents from the corpus and groups those with common templates. This is accomplished through a binned word count hierarchical clustering algorithm. The second module extracts the templates. It uses the groupings and performs a longest common subsequence (LCS) algorithm to obtain the constituent parts of the templates. The method was developed and tested on a random document corpus of 750 notes derived from a large database of US Department of Veterans Affairs (VA) electronic medical notes. The grouping module, using hierarchical clustering, identified 23 groups with 3 documents or more, consisting of 120 documents from the 750 documents in our test corpus. Of these, 18 groups had at least one common template that was present in all documents in the group for a positive predictive value of 78%. The LCS extraction module performed with 100% positive predictive value, 94% sensitivity, and 83% negative predictive value. The human review determined that in 4 groups the template covered the entire document, with the remaining 14 groups containing a common section template. Among documents with templates, the number of templates per document ranged from 1 to 14. The mean and median number of templates per group was 5.9 and 5, respectively. The grouping method was successful in finding like documents containing templates. Of the groups of documents containing templates, the LCS module was successful in deciphering text belonging to the template and text that was extraneous. Major obstacles to improved performance included documents composed of multiple templates, templates that included other templates embedded within them, and variants of templates. We demonstrate proof of concept of the grouping and extraction method of identifying templates in electronic medical records in this pilot study and propose methods to improve performance and scaling up. Published by Elsevier Inc.
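The template-extraction step rests on a longest common subsequence over note text; a token-level LCS between two notes can be sketched as below (the word-count clustering that first groups similar notes is omitted, and the example notes are fabricated):

    # Sketch: token-level longest common subsequence between two notes.
    def lcs_tokens(a, b):
        n, m = len(a), len(b)
        dp = [[0] * (m + 1) for _ in range(n + 1)]       # dp[i][j] = LCS length of a[i:], b[j:]
        for i in range(n - 1, -1, -1):
            for j in range(m - 1, -1, -1):
                dp[i][j] = dp[i + 1][j + 1] + 1 if a[i] == b[j] else max(dp[i + 1][j], dp[i][j + 1])
        out, i, j = [], 0, 0
        while i < n and j < m:                           # walk the table to recover the subsequence
            if a[i] == b[j]:
                out.append(a[i]); i += 1; j += 1
            elif dp[i + 1][j] >= dp[i][j + 1]:
                i += 1
            else:
                j += 1
        return out

    note1 = "ALLERGIES : none known MEDICATIONS : aspirin".split()
    note2 = "ALLERGIES : penicillin MEDICATIONS : none".split()
    print(" ".join(lcs_tokens(note1, note2)))            # shared template skeleton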
Scalable ranked retrieval using document images
NASA Astrophysics Data System (ADS)
Jain, Rajiv; Oard, Douglas W.; Doermann, David
2013-12-01
Despite the explosion of text on the Internet, hard copy documents that have been scanned as images still play a significant role for some tasks. The best method to perform ranked retrieval on a large corpus of document images, however, remains an open research question. The most common approach has been to perform text retrieval using terms generated by optical character recognition. This paper, by contrast, examines whether a scalable segmentation-free image retrieval algorithm, which matches sub-images containing text or graphical objects, can provide additional benefit in satisfying a user's information needs on a large, real world dataset. Results on 7 million scanned pages from the CDIP v1.0 test collection show that content based image retrieval finds a substantial number of documents that text retrieval misses, and that when used as a basis for relevance feedback can yield improvements in retrieval effectiveness.
Global and Local Features Based Classification for Bleed-Through Removal
NASA Astrophysics Data System (ADS)
Hu, Xiangyu; Lin, Hui; Li, Shutao; Sun, Bin
2016-12-01
The text on one side of historical documents often seeps through and appears on the other side, so the bleed-through is a common problem in historical document images. It makes the document images hard to read and the text difficult to recognize. To improve the image quality and readability, the bleed-through has to be removed. This paper proposes a global and local features extraction based bleed-through removal method. The Gaussian mixture model is used to get the global features of the images. Local features are extracted by the patch around each pixel. Then, the extreme learning machine classifier is utilized to classify the scanned images into the foreground text and the bleed-through component. Experimental results on real document image datasets show that the proposed method outperforms the state-of-the-art bleed-through removal methods and preserves the text strokes well.
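As a rough stand-in for the global-feature step, a Gaussian mixture over pixel intensities can separate foreground strokes, bleed-through, and background; the patch-level features and the extreme learning machine classifier of the paper are not reproduced here:

    # Sketch: per-pixel labeling with a 3-component Gaussian mixture on intensity.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def global_pixel_labels(gray_img, n_components=3):
        x = gray_img.reshape(-1, 1).astype(float)
        gmm = GaussianMixture(n_components=n_components, random_state=0).fit(x)
        return gmm.predict(x).reshape(gray_img.shape)    # 0/1/2 component label per pixel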
Fuzzy Document Clustering Approach using WordNet Lexical Categories
NASA Astrophysics Data System (ADS)
Gharib, Tarek F.; Fouad, Mohammed M.; Aref, Mostafa M.
Text mining refers generally to the process of extracting interesting information and knowledge from unstructured text. This area is growing rapidly, mainly because of the strong need to analyse the huge amount of textual data that resides on internal file systems and the Web. Text document clustering provides an effective navigation mechanism to organize this large amount of data by grouping documents into a small number of meaningful classes. In this paper we propose a fuzzy text document clustering approach using WordNet lexical categories and the Fuzzy c-Means algorithm. Some experiments are performed to compare the efficiency of the proposed approach with recently reported approaches. Experimental results show that fuzzy clustering leads to strong performance: the Fuzzy c-means algorithm outperforms classical clustering algorithms such as k-means and bisecting k-means in both clustering quality and running time efficiency.
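A plain NumPy fuzzy c-means, which could be applied to (dense) TF-IDF document vectors, is sketched below; the WordNet lexical-category feature construction described above is not shown:

    # Sketch: fuzzy c-means clustering of document vectors (dense NumPy array X).
    import numpy as np

    def fuzzy_cmeans(X, c, m=2.0, iters=100, eps=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        U = rng.random((X.shape[0], c))
        U /= U.sum(axis=1, keepdims=True)                # memberships sum to 1 per document
        for _ in range(iters):
            W = U ** m
            centers = (W.T @ X) / W.sum(axis=0)[:, None]
            dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
            inv = dist ** (-2.0 / (m - 1.0))
            U_new = inv / inv.sum(axis=1, keepdims=True) # standard membership update
            if np.abs(U_new - U).max() < eps:
                U = U_new
                break
            U = U_new
        return U, centers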
Yoo, Illhoi; Hu, Xiaohua; Song, Il-Yeol
2007-11-27
A huge amount of biomedical textual information has been produced and collected in MEDLINE for decades. In order to easily utilize this biomedical information in free text, document clustering and text summarization together are used as a solution to the text information overload problem. In this paper, we introduce a coherent graph-based semantic clustering and summarization approach for biomedical literature. Our extensive experimental results show that the approach achieves a 45% improvement in cluster quality and a 72% improvement in clustering reliability, in terms of misclassification index, over Bisecting K-means, a leading document clustering approach. In addition, our approach provides a concise but rich text summary in key concepts and sentences. Our coherent biomedical literature clustering and summarization approach, which takes advantage of ontology-enriched graphical representations, significantly improves the quality of document clusters and the understandability of documents through summaries.
Yoo, Illhoi; Hu, Xiaohua; Song, Il-Yeol
2007-01-01
Background A huge amount of biomedical textual information has been produced and collected in MEDLINE for decades. In order to easily utilize the biomedical information in this free text, document clustering and text summarization are used together as a solution to the text information overload problem. In this paper, we introduce a coherent graph-based semantic clustering and summarization approach for biomedical literature. Results Our extensive experimental results show that the approach achieves 45% cluster quality improvement and 72% clustering reliability improvement, in terms of misclassification index, over Bisecting K-means, a leading document clustering approach. In addition, our approach provides a concise but rich text summary in key concepts and sentences. Conclusion Our coherent biomedical literature clustering and summarization approach, which takes advantage of ontology-enriched graphical representations, significantly improves the quality of document clusters and the understandability of documents through summaries. PMID:18047705
Text Generation: The State of the Art and the Literature.
ERIC Educational Resources Information Center
Mann, William C.; And Others
This report comprises two documents which describe the state of the art of computer generation of natural language text. Both were prepared by a panel of individuals who are active in research on text generation. The first document assesses the techniques now available for use in systems design, covering all of the technical methods by which…
DOCU-TEXT: A tool before the data dictionary
NASA Technical Reports Server (NTRS)
Carter, B.
1983-01-01
DOCU-TEXT, a proprietary software package that aids in the production of documentation for a data processing organization and can be installed and operated only on IBM computers, is discussed. In organizing information that ultimately will reside in a data dictionary, DOCU-TEXT proved to be a useful documentation tool for extracting information from existing production jobs, procedure libraries, system catalogs, control data sets, and related files. DOCU-TEXT reads these files to derive data that is useful at the system level. The output of DOCU-TEXT is a series of user-selectable reports. These reports can reflect the interactions within a single job stream, a complete system, or all the systems in an installation. Any single report, or group of reports, can be generated in an independent documentation pass.
Data mining of text as a tool in authorship attribution
NASA Astrophysics Data System (ADS)
Visa, Ari J. E.; Toivonen, Jarmo; Autio, Sami; Maekinen, Jarno; Back, Barbro; Vanharanta, Hannu
2001-03-01
It is common for text documents to be characterized and classified by the keywords that their authors assign to them. Visa et al. have developed a new methodology based on prototype matching. The prototype is an interesting document, or a part of an extracted interesting text. This prototype is matched against the document database of the monitored document flow. The new methodology is capable of extracting the meaning of a document to a certain degree. Our claim is that the new methodology is also capable of authenticating authorship. To verify this claim, two tests were designed. The test hypothesis was that the words and the word order in the sentences could authenticate the author. In the first test, three authors were selected: William Shakespeare, Edgar Allan Poe, and George Bernard Shaw. Three texts from each author were examined. Every text was used in turn as a prototype, and the two nearest matches to the prototype were noted. The second test uses the Reuters-21578 financial news database, in which a group of 25 short financial news reports from five different authors is examined. Our new methodology and the interesting results from the two tests are reported in this paper. In the first test, all cases were successful for Shakespeare and for Poe; for Shaw, one text was confused with Poe. In the second test, the Reuters-21578 financial news reports were attributed to the correct author relatively well. The conclusion is that our text mining methodology seems to be capable of authorship attribution.
A suffix arrays based approach to semantic search in P2P systems
NASA Astrophysics Data System (ADS)
Shi, Qingwei; Zhao, Zheng; Bao, Hu
2007-09-01
Building a semantic search system on top of peer-to-peer (P2P) networks is becoming an attractive and promising alternative scheme for reasons of scalability, data freshness, and search cost. In this paper, we present a Suffix Arrays based algorithm for Semantic Search (SASS) in P2P systems, which generates a distributed Semantic Overlay Network (SON) construction for full-text search in P2P networks. For each node in the P2P network, SASS distributes document indices based on a set of suffix arrays, by which clusters are created according to words or phrases shared between documents; the search cost for a given query is therefore decreased by scanning only semantically related documents. In contrast to recently announced SON schemes designed using metadata or predefined classes, SASS is an unsupervised approach for decentralized generation of SONs. SASS is also an incremental, linear-time algorithm that efficiently handles node updates in P2P networks. Our simulation results demonstrate that SASS yields high search efficiency in dynamic environments.
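A minimal, single-node sketch of suffix-array indexing and lookup of the kind SASS distributes across peers; the peer-to-peer distribution and clustering layers are not modelled here:

def build_suffix_array(text):
    # Offsets of all suffixes of `text`, sorted lexicographically (simple toy construction).
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(text, sa, pattern):
    # Binary search the sorted suffixes for every position where `pattern` occurs.
    lo, hi = 0, len(sa)
    while lo < hi:                                        # leftmost suffix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                                        # leftmost suffix whose prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "full-text search in p2p networks"
print(occurrences(text, build_suffix_array(text), "text"))   # [5]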
Document Exploration and Automatic Knowledge Extraction for Unstructured Biomedical Text
NASA Astrophysics Data System (ADS)
Chu, S.; Totaro, G.; Doshi, N.; Thapar, S.; Mattmann, C. A.; Ramirez, P.
2015-12-01
We describe our work on building a web-browser based document reader with a built-in exploration tool and automatic concept extraction of medical entities for biomedical text. Vast amounts of biomedical information are offered in unstructured text form through scientific publications and R&D reports. Text mining can help us mine information and extract relevant knowledge from this plethora of biomedical text. The ability to employ such technologies to aid researchers in coping with information overload is greatly desirable. In recent years, there has been increased interest in automatic biomedical concept extraction [1, 2] and intelligent PDF reader tools with the ability to search on content and find related articles [3]. Such reader tools are typically desktop applications and are limited to specific platforms. Our goal is to provide researchers with a simple tool to aid them in finding, reading, and exploring documents. Thus, we propose a web-based document explorer, called Shangri-Docs, which combines a document reader with automatic concept extraction and highlighting of relevant terms. Shangri-Docs also provides the ability to evaluate a wide variety of document formats (e.g., PDF, Word, PPT, text) and to exploit the linked nature of the Web and personal content by performing searches on content from public sites (e.g., Wikipedia, PubMed) and private cataloged databases simultaneously. Shangri-Docs utilizes Apache cTAKES (clinical Text Analysis and Knowledge Extraction System) [4] and the Unified Medical Language System (UMLS) to automatically identify and highlight terms and concepts, such as specific symptoms, diseases, drugs, and anatomical sites, mentioned in the text. cTAKES was originally designed specifically to extract information from clinical medical records. Our investigation led us to extend the automatic knowledge extraction process of cTAKES to the biomedical research domain by improving the ontology-guided information extraction process. We describe our experience and the implementation of our system and share lessons learned from our development. We also discuss ways in which this could be adapted to other science fields. [1] Funk et al., 2014. [2] Kang et al., 2014. [3] Utopia Documents, http://utopiadocs.com [4] Apache cTAKES, http://ctakes.apache.org
PDF text classification to leverage information extraction from publication reports.
Bui, Duy Duc An; Del Fiol, Guilherme; Jonnalagadda, Siddhartha
2016-06-01
Data extraction from original study reports is a time-consuming, error-prone process in systematic review development. Information extraction (IE) systems have the potential to assist humans in the extraction task; however, the majority of IE systems were not designed to work on Portable Document Format (PDF) documents, an important and common extraction source for systematic reviews. In a PDF document, narrative content is often mixed with publication metadata or semi-structured text, which adds challenges for the underlying natural language processing algorithm. Our goal is to categorize PDF texts for strategic use by IE systems. We used an open-source tool to extract raw texts from a PDF document and developed a text classification algorithm that follows a multi-pass sieve framework to automatically classify PDF text snippets (for brevity, texts) into TITLE, ABSTRACT, BODYTEXT, SEMISTRUCTURE, and METADATA categories. To validate the algorithm, we developed a gold standard of PDF reports that were included in the development of previous systematic reviews by the Cochrane Collaboration. In a two-step procedure, we evaluated (1) classification performance, compared with a machine learning classifier, and (2) the effects of the algorithm on an IE system that extracts clinical outcome mentions. The multi-pass sieve algorithm achieved an accuracy of 92.6%, which was 9.7% (p<0.001) higher than the best performing machine learning classifier, which used a logistic regression algorithm. F-measure improvements were observed in the classification of TITLE (+15.6%), ABSTRACT (+54.2%), BODYTEXT (+3.7%), SEMISTRUCTURE (+34%), and METADATA (+14.2%). In addition, use of the algorithm to filter semi-structured texts and publication metadata improved performance of the outcome extraction system (F-measure +4.1%, p=0.002). It also reduced the number of sentences to be processed by 44.9% (p<0.001), which corresponds to a processing time reduction of 50% (p=0.005). The rule-based multi-pass sieve framework can be used effectively to categorize texts extracted from PDF documents. Text classification is an important prerequisite step to leverage information extraction from PDF documents. Copyright © 2016 Elsevier Inc. All rights reserved.
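A hedged sketch of a multi-pass sieve classifier in the same spirit: each pass applies a high-precision rule to still-unlabelled snippets, with BODYTEXT as the default. The rules below are illustrative placeholders, not the paper's actual sieves:

import re

def looks_like_metadata(t):
    return bool(re.search(r"(doi:|©|\bvol\.\s*\d+|\bpp\.\s*\d+)", t, re.I))

def looks_like_title(t, position):
    return position == 0 and len(t.split()) < 25 and not t.endswith(".")

def looks_like_abstract(t):
    return t.lower().lstrip().startswith(("abstract", "summary"))

def looks_semistructured(t):
    return t.count("\t") > 2 or bool(re.match(r"^\s*table\s+\d+", t, re.I))

def sieve_classify(snippets):
    labels = [None] * len(snippets)
    passes = [
        ("METADATA", lambda t, i: looks_like_metadata(t)),
        ("TITLE", lambda t, i: looks_like_title(t, i)),
        ("ABSTRACT", lambda t, i: looks_like_abstract(t)),
        ("SEMISTRUCTURE", lambda t, i: looks_semistructured(t)),
    ]
    for label, rule in passes:                    # earlier passes = higher precision
        for i, t in enumerate(snippets):
            if labels[i] is None and rule(t, i):
                labels[i] = label
    return [l or "BODYTEXT" for l in labels]      # final default pass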
Görg, Carsten; Liu, Zhicheng; Kihm, Jaeyeon; Choo, Jaegul; Park, Haesun; Stasko, John
2013-10-01
Investigators across many disciplines and organizations must sift through large collections of text documents to understand and piece together information. Whether they are fighting crime, curing diseases, deciding what car to buy, or researching a new field, inevitably investigators will encounter text documents. Taking a visual analytics approach, we integrate multiple text analysis algorithms with a suite of interactive visualizations to provide a flexible and powerful environment that allows analysts to explore collections of documents while sensemaking. Our particular focus is on the process of integrating automated analyses with interactive visualizations in a smooth and fluid manner. We illustrate this integration through two example scenarios: an academic researcher examining InfoVis and VAST conference papers and a consumer exploring car reviews while pondering a purchase decision. Finally, we provide lessons learned toward the design and implementation of visual analytics systems for document exploration and understanding.
Tank waste remediation system functions and requirements document
DOE Office of Scientific and Technical Information (OSTI.GOV)
Carpenter, K.E
1996-10-03
This is the Tank Waste Remediation System (TWRS) Functions and Requirements Document derived from the TWRS Technical Baseline. The document consists of several text sections that provide the purpose, scope, background information, and an explanation of how this document assists the application of Systems Engineering to the TWRS. The primary functions identified in the TWRS Functions and Requirements Document are identified in Figure 4.1 (Section 4.0). Currently, this document is part of the overall effort to develop the TWRS Functional Requirements Baseline, and contains the functions and requirements needed to properly define the top three TWRS function levels. TWRS Technical Baseline information (RDD-100 database) included in the appendices of the attached document contains the TWRS functions, requirements, and architecture necessary to define the TWRS Functional Requirements Baseline. Document organization and user directions are provided in the introductory text. This document will continue to be modified during the TWRS life-cycle.
The Galileo PPS expert monitoring and diagnostic prototype
NASA Technical Reports Server (NTRS)
Bahrami, Khosrow
1989-01-01
The Galileo PPS Expert Monitoring Module (EMM) is a prototype system implemented on the SUN workstation that will demonstrate a knowledge-based approach to monitoring and diagnosis for the Galileo spacecraft Power/Pyro subsystems. The prototype will simulate an analysis module functioning within the SFOC Engineering Analysis Subsystem Environment (EASE). This document describes the implementation of a prototype EMM for the Galileo spacecraft Power Pyro Subsystem. Section 2 of this document provides an overview of the issues in monitoring and diagnosis and comparison between traditional and knowledge-based solutions to this problem. Section 3 describes various tradeoffs which must be considered when designing a knowledge-based approach to monitoring and diagnosis, and section 4 discusses how these issues were resolved in constructing the prototype. Section 5 presents conclusions and recommendations for constructing a full-scale demonstration of the EMM. A Glossary provides definitions of terms used in this text.
User Evaluation of the NASA Technical Report Server Recommendation Service
NASA Technical Reports Server (NTRS)
Nelson, Michael L.; Bollen, Johan; Calhoun, JoAnne R.; Mackey, Calvin E.
2004-01-01
We present the user evaluation of two recommendation server methodologies implemented for the NASA Technical Report Server (NTRS). One methodology for generating recommendations uses log analysis to identify co-retrieval events on full-text documents. For comparison, we used the Vector Space Model (VSM) as the second methodology. We calculated cosine similarities and used the top 10 most similar documents (based on metadata) as 'recommendations'. We then ran an experiment with NASA Langley Research Center (LaRC) staff members to gather their feedback on which method produced the most 'quality' recommendations. We found that in most cases VSM outperformed log analysis of co-retrievals. However, analyzing the data revealed the evaluations may have been structurally biased in favor of the VSM generated recommendations. We explore some possible methods for combining log analysis and VSM generated recommendations and suggest areas of future work.
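A small sketch of the VSM side of such a service: TF-IDF vectors over metadata text, pairwise cosine similarity, and the top 10 neighbours returned as recommendations (generic scikit-learn usage, not the NTRS implementation):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def vsm_recommendations(metadata_texts, k=10):
    # metadata_texts: one string of title/abstract/keyword metadata per document.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(metadata_texts)
    sims = cosine_similarity(tfidf)               # pairwise document similarities
    recs = {}
    for i in range(sims.shape[0]):
        order = sims[i].argsort()[::-1]           # most similar first
        recs[i] = [j for j in order if j != i][:k]
    return recs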
Improving PHENIX search with Solr, Nutch and Drupal.
NASA Astrophysics Data System (ADS)
Morrison, Dave; Sourikova, Irina
2012-12-01
During its 20 years of R&D, construction and operation, the PHENIX experiment at the Relativistic Heavy Ion Collider (RHIC) has accumulated large amounts of proprietary collaboration data that is hosted on many servers around the world and is not open to commercial search engines for indexing and searching. The legacy search infrastructure did not scale well with the fast-growing PHENIX document base and produced results inadequate in both precision and recall. After considering the possible alternatives that would provide an aggregated, fast, full-text search of a variety of data sources and file formats, we decided to use Nutch [1] as a web crawler and Solr [2] as a search engine. To present XML-based Solr search results in a user-friendly format, we use Drupal [3] as a web interface to Solr. We describe the experience of building a federated search for a heterogeneous collection of 10 million PHENIX documents with Nutch, Solr and Drupal.
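As a rough illustration, a query of this kind can be issued against Solr's standard select endpoint; the host, the core name ('phenix'), and the field name ('text') below are assumptions, not the collaboration's actual configuration:

import requests

def solr_search(query, rows=10):
    # Standard Solr select handler; returns document ids of the top matches.
    params = {"q": f"text:({query})", "wt": "json", "rows": rows}
    resp = requests.get("http://localhost:8983/solr/phenix/select", params=params)
    resp.raise_for_status()
    return [doc.get("id") for doc in resp.json()["response"]["docs"]]

# Example: solr_search("heavy ion collider run plan")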
NASA Astrophysics Data System (ADS)
Fume, Kosei; Ishitani, Yasuto
2008-01-01
We propose a document categorization method based on a document model that can be defined externally for each task and that categorizes Web content or business documents into a target category in accordance with their similarity to the model. The main feature of the proposed method consists of two aspects of semantics extraction from an input document. The semantics of terms are extracted by semantic pattern analysis, and implicit meanings of the document substructure are identified by a bottom-up text clustering technique focusing on the similarity of text-line attributes. We have constructed a system based on the proposed method for trial purposes. The experimental results show that the system achieves more than 80% classification accuracy in categorizing Web content and business documents into 15 or 70 categories.
39 CFR 3001.10 - Form and number of copies of documents.
Code of Federal Regulations, 2010 CFR
2010-07-01
... service must be printed from a text-based pdf version of the document, where possible. Otherwise, they may... generated in either Acrobat (pdf), Word, or WordPerfect, or Rich Text Format (rtf). [67 FR 67559, Nov. 6...
Use of speech-to-text technology for documentation by healthcare providers.
Ajami, Sima
2016-01-01
Medical records are a critical component of a patient's treatment. However, documentation of patient-related information is considered a secondary activity in the provision of healthcare services, often leading to incomplete medical records and patient data of low quality. Advances in information technology (IT) in the health system and registration of information in electronic health records (EHR) using speech-to-text conversion software have facilitated service delivery. This narrative review is based on a literature search using libraries, books, conference proceedings, the databases Science Direct, PubMed, ProQuest, Springer, and SID (Scientific Information Database), and search engines such as Yahoo and Google. I used the following keywords and their combinations: speech recognition, automatic report documentation, voice to text software, healthcare, information, and voice recognition. Due to lack of knowledge of other languages, I searched all texts in English or Persian with no time limits. Of a total of 70 articles, only 42 were selected. Speech-to-text conversion technology offers opportunities to improve the documentation process of medical records, reduce the cost and time of recording information, enhance the quality of documentation, improve the quality of services provided to patients, and support healthcare providers in legal matters. Healthcare providers should recognize the impact of this technology on service delivery.
Global synthesis of the documented and projected effects of climate change on inland fishes
Myers, Bonnie; Lynch, Abigail; Bunnell, David; Chu, Cindy; Falke, Jeffrey A.; Kovach, Ryan; Krabbenhoft, Trevor J.; Kwak, Thomas J.; Paukert, Craig P.
2017-01-01
Although climate change is an important factor affecting inland fishes globally, a comprehensive review of how climate change has impacted and will continue to impact inland fishes worldwide does not currently exist. We conducted an extensive, systematic primary literature review to identify English-language, peer-reviewed journal publications with projected and documented examples of climate change impacts on inland fishes globally. Since the mid-1980s, scientists have projected the effects of climate change on inland fishes, and more recently, documentation of climate change impacts on inland fishes has increased. Of the thousands of titles and abstracts reviewed, we selected 624 publications for full-text review: 63 of these publications documented an effect of climate change on inland fishes, while 116 publications projected inland fishes' response to future climate change. Documented and projected impacts of climate change varied, but several trends emerged, including differences between documented and projected impacts of climate change on salmonid abundance (P = 0.0002). Salmonid abundance decreased in 89.5% of documented effects compared to 35.7% of projected effects, where variable effects were more commonly reported (64.3%). Studies focused on responses of salmonids (61% of total) to climate change in North America and Europe, highlighting major gaps in the literature for other taxonomic groups and geographic regions. Elucidating global patterns and identifying knowledge gaps of climate change effects on inland fishes will help managers better anticipate local changes in fish populations and assemblages, resulting in better development of management plans, particularly in systems with little information on climate change effects on fish.
Annotating Socio-Cultural Structures in Text
2012-10-31
parts of speech (POS) within text, using the Stanford Part of Speech Tagger (Stanford Log-Linear, 2011). The ERDC-CERL taxonomy is then used to...annotated NP/VP Pane: Shows the sentence parsed using the Parts of Speech tagger Document View Pane: Specifies the document (being annotated) in three...first parsed using the Stanford Parts of Speech tagger and converted to an XML document both components which are done through the Import function
Neural networks for data mining electronic text collections
NASA Astrophysics Data System (ADS)
Walker, Nicholas; Truman, Gregory
1997-04-01
The use of neural networks in information retrieval and text analysis has primarily suffered from the issues of adequate document representation, the ability to scale to very large collections, dynamism in the face of new information, and the practical difficulty of basing the design on supervised training sets. Perhaps the most important approach to begin solving these problems is the use of 'intermediate entities', which reduce the dimensionality of document representations and the size of document collections to manageable levels, coupled with the use of unsupervised neural network paradigms. This paper describes the issues; a fully configured neural network-based text analysis system, dataHARVEST, aimed at data mining text collections, which begins this process; and the remaining difficulties and potential ways forward.
Identifying and Overcoming Obstacles to Point-of-Care Data Collection for Eye Care Professionals
Lobach, David F.; Silvey, Garry M.; Macri, Jennifer M.; Hunt, Megan; Kacmaz, Roje O.; Lee, Paul P.
2005-01-01
Supporting data entry by clinicians is considered one of the greatest challenges in implementing electronic health records. In this paper we describe a formative evaluation study using three different methodologies through which we identified obstacles to point-of-care data entry for eye care and then used the formative process to develop and test solutions to overcome these obstacles. The greatest obstacles were supporting free text annotation of clinical observations and accommodating the creation of detailed diagrams in multiple colors. To support free text entry, we arrived at an approach that captures an image of a free text note and associates this image with related data elements in an encounter note. The detailed diagrams included a color pallet that allowed changing pen color with a single stroke and also captured the diagrams as an image associated with related data elements. During observed sessions with simulated patients, these approaches satisfied the clinicians’ documentation needs by capturing the full range of clinical complexity that arises in practice. PMID:16779083
Raising the IQ in full-text searching via intelligent querying
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kero, R.; Russell, L.; Swietlik, C.
1994-11-01
Current Information Retrieval (IR) technologies allow for efficient access to relevant information, provided that user-selected query terms coincide with the specific linguistic choices made by the authors whose works constitute the text base. Therefore, the challenge is to enhance the limited searching capability of state-of-the-practice IR. This can be done either with augmented clients that overcome current server searching deficiencies, or with added capabilities that augment searching algorithms on the servers. The technology being investigated is that of deductive databases, with a set of new techniques called cooperative answering. This technology utilizes semantic networks to allow for navigation between possible query search term alternatives. The augmented search terms are passed to an IR engine and the results can be compared. The project utilizes the OSTI Environment, Safety and Health Thesaurus to populate the domain-specific semantic network, and the text base of ES&H-related documents from the Facility Profile Information Management System as the domain-specific search space.
Using Bitmap Indexing Technology for Combined Numerical and TextQueries
DOE Office of Scientific and Technical Information (OSTI.GOV)
Stockinger, Kurt; Cieslewicz, John; Wu, Kesheng
2006-10-16
In this paper, we describe a strategy of using compressed bitmap indices to speed up queries on both numerical data and text documents. By using an efficient compression algorithm, these compressed bitmap indices are compact even for indices with millions of distinct terms. Moreover, bitmap indices can be used very efficiently to answer Boolean queries over text documents involving multiple query terms. Existing inverted indices for text searches are usually inefficient for corpora with a very large number of terms as well as for queries involving a large number of hits. We demonstrate that our compressed bitmap index technology overcomes both of those shortcomings. In a performance comparison against a commonly used database system, our indices answer queries 30 times faster on average. To provide full SQL support, we integrated our indexing software, called FastBit, with MonetDB. The integrated system MonetDB/FastBit provides not only efficient searches on a single table as FastBit does, but also answers join queries efficiently. Furthermore, MonetDB/FastBit also provides a very efficient retrieval mechanism for result records.
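A toy sketch of the idea of answering a combined text-and-numeric Boolean query with bitmaps, using Python integers as uncompressed bit vectors; FastBit's compressed indices and the MonetDB integration are far more sophisticated:

from collections import defaultdict

class BitmapIndex:
    def __init__(self):
        self.term_bits = defaultdict(int)     # one bitmap per term
        self.value_bits = defaultdict(int)    # one bitmap per (column, binned integer value)
        self.n = 0

    def add(self, doc_id, terms, numeric):
        for t in terms:
            self.term_bits[t] |= 1 << doc_id
        for col, val in numeric.items():
            self.value_bits[(col, val)] |= 1 << doc_id
        self.n = max(self.n, doc_id + 1)

    def query(self, terms, col, lo, hi):
        # Documents containing all `terms` AND whose `col` value lies in [lo, hi].
        bits = (1 << self.n) - 1
        for t in terms:
            bits &= self.term_bits.get(t, 0)
        rng = 0
        for v in range(lo, hi + 1):           # OR the bitmaps covering the value range
            rng |= self.value_bits.get((col, v), 0)
        bits &= rng
        return [i for i in range(self.n) if bits >> i & 1]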
The digital darkroom, part 3: digital presentation in plastic surgery.
Galdino, G M; Chiaramonte, M; Klatsky, S A
2001-01-01
We summarize here the third and final part of our series on the Digital Darkroom. In this part, we review the use of digital technology for medical and other presentations, including the kinds of equipment available, the advantages and disadvantages of digital projection, and the most common pitfalls encountered in preparing and presenting material in digital presentations. The full text of the complete series, including expanded illustrative material and complete bibliographic documentation, is now available at our journal web site at . Please see page 39 for instructions on how to access Aesthetic Surgery Journal Online and view the entire series.
Effective Web and Desktop Retrieval with Enhanced Semantic Spaces
NASA Astrophysics Data System (ADS)
Daoud, Amjad M.
We describe the design and implementation of the NETBOOK prototype system for collecting, structuring, and efficiently creating semantic vectors for concepts, noun phrases, and documents from a corpus of free full-text ebooks available on the World Wide Web. Automatic generation of concept maps from correlated index terms and extracted noun phrases is used to build a powerful conceptual index of individual pages. To ensure the scalability of our system, dimension reduction is performed using Random Projection [13]. Furthermore, we present a complete evaluation of the relative effectiveness of the NETBOOK system versus the Google Desktop [8].
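A minimal random-projection sketch for mapping TF-IDF document vectors into a lower-dimensional semantic space; the concept-map construction described above is omitted, and the target dimension is illustrative:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def project_documents(texts, target_dim=256, seed=0):
    X = TfidfVectorizer().fit_transform(texts).toarray()            # (n_docs, n_terms)
    rng = np.random.default_rng(seed)
    # Gaussian random projection matrix; scaling roughly preserves vector norms.
    R = rng.normal(0.0, 1.0 / np.sqrt(target_dim), size=(X.shape[1], target_dim))
    return X @ R                                                     # (n_docs, target_dim)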
Replacement Attack: A New Zero Text Watermarking Attack
NASA Astrophysics Data System (ADS)
Bashardoost, Morteza; Mohd Rahim, Mohd Shafry; Saba, Tanzila; Rehman, Amjad
2017-03-01
The main objective of zero watermarking methods that are suggested for the authentication of textual properties is to increase the fragility of produced watermarks against tampering attacks. On the other hand, zero watermarking attacks intend to alter the contents of a document without changing the watermark. In this paper, the Replacement attack is proposed, which focuses on maintaining the location of the words in the document. The proposed text watermarking attack is particularly effective on watermarking approaches that exploit word transitions in the document. The evaluation outcomes show that the tested word-based methods are unable to detect the existence of the Replacement attack in the document. Moreover, the comparison results show that the size of the Replacement attack is estimated less accurately than that of other common types of zero text watermarking attacks.
C4ISR Architecture Working Group (AWG), Architecture Framework Version 2.0.
1997-12-18
Vision - Name: name/identifier of document that contains doctrine, goals, or vision; Type: doctrine, goals, or vision; Description: text summary description... (e.g., organization, directive, order); Description: text summary of tasking. • Rules, Criteria, or Conventions - Name: name/identifier of document that... contains rules, criteria, or conventions; Type: one of rules, criteria, or conventions; Description: text summary description of contents or
García-Remesal, Miguel; Maojo, Victor; Crespo, José
2010-01-01
In this paper we present a knowledge engineering approach to automatically recognize and extract genetic sequences from scientific articles. To carry out this task, we use a preliminary recognizer based on a finite state machine to extract all candidate DNA/RNA sequences. The latter are then fed into a knowledge-based system that automatically discards false positives and refines noisy and incorrectly merged sequences. We created the knowledge base by manually analyzing different manuscripts containing genetic sequences. Our approach was evaluated using a test set of 211 full-text articles in PDF format containing 3134 genetic sequences, for which we achieved 87.76% precision and 97.70% recall. This method can facilitate different research tasks, including text mining, information extraction, and information retrieval research dealing with large collections of documents containing genetic sequences.
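A hedged stand-in for the preliminary recognizer stage: a regular expression that flags long runs of nucleotide letters, which a knowledge-based post-filter would then refine. The pattern and length threshold are illustrative, not the authors' finite state machine:

import re

# Runs of nucleotide letters, allowing internal whitespace and hyphens from PDF layout.
CANDIDATE = re.compile(r"\b[ACGTUacgtu][ACGTUNacgtun \-]{18,}[ACGTUacgtu]\b")

def candidate_sequences(text, min_len=20):
    hits = []
    for m in CANDIDATE.finditer(text):
        seq = re.sub(r"[\s\-]", "", m.group())   # strip layout whitespace and hyphens
        if len(seq) >= min_len:
            hits.append(seq.upper())
    return hits

print(candidate_sequences("primer 5'-ATG GCc ttA CGT AAC GGT TAA CCG-3' was used"))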
Getting more out of biomedical documents with GATE's full lifecycle open source text analytics.
Cunningham, Hamish; Tablan, Valentin; Roberts, Angus; Bontcheva, Kalina
2013-01-01
This software article describes the GATE family of open source text analysis tools and processes. GATE is one of the most widely used systems of its type with yearly download rates of tens of thousands and many active users in both academic and industrial contexts. In this paper we report three examples of GATE-based systems operating in the life sciences and in medicine. First, in genome-wide association studies which have contributed to discovery of a head and neck cancer mutation association. Second, medical records analysis which has significantly increased the statistical power of treatment/outcome models in the UK's largest psychiatric patient cohort. Third, richer constructs in drug-related searching. We also explore the ways in which the GATE family supports the various stages of the lifecycle present in our examples. We conclude that the deployment of text mining for document abstraction or rich search and navigation is best thought of as a process, and that with the right computational tools and data collection strategies this process can be made defined and repeatable. The GATE research programme is now 20 years old and has grown from its roots as a specialist development tool for text processing to become a rather comprehensive ecosystem, bringing together software developers, language engineers and research staff from diverse fields. GATE now has a strong claim to cover a uniquely wide range of the lifecycle of text analysis systems. It forms a focal point for the integration and reuse of advances that have been made by many people (the majority outside of the authors' own group) who work in text processing for biomedicine and other areas. GATE is available online <1> under GNU open source licences and runs on all major operating systems. Support is available from an active user and developer community and also on a commercial basis.
Contemporary issues in HIM. The application layer--III.
Wear, L L; Pinkert, J R
1993-07-01
We have seen document preparation systems evolve from basic line editors through powerful, sophisticated desktop publishing programs. This component of the application layer is probably one of the most used, and most readily identifiable. Ask grade school children nowadays, and many will tell you that they have written a paper on a computer. Next month will be a "fun" tour through a number of other application programs we find useful. They will range from a simple notebook reminder to a sophisticated photograph processor. Application layer: Software targeted for the end user, focusing on a specific application area, and typically residing in the computer system as distinct components on top of the OS. Desktop publishing: A document preparation program that begins with the text features of a word processor, then adds the ability for a user to incorporate outputs from a variety of graphic programs, spreadsheets, and other applications. Line editor: A document preparation program that manipulates text in a file on the basis of numbered lines. Word processor: A document preparation program that can, among other things, reformat sections of documents, move and replace blocks of text, use multiple character fonts, automatically create a table of contents and index, create complex tables, and combine text and graphics.
Retrieving Clinical Evidence: A Comparison of PubMed and Google Scholar for Quick Clinical Searches
Bejaimal, Shayna AD; Sontrop, Jessica M; Iansavichus, Arthur V; Haynes, R Brian; Weir, Matthew A; Garg, Amit X
2013-01-01
Background Physicians frequently search PubMed for information to guide patient care. More recently, Google Scholar has gained popularity as another freely accessible bibliographic database. Objective To compare the performance of searches in PubMed and Google Scholar. Methods We surveyed nephrologists (kidney specialists) and provided each with a unique clinical question derived from 100 renal therapy systematic reviews. Each physician provided the search terms they would type into a bibliographic database to locate evidence to answer the clinical question. We executed each of these searches in PubMed and Google Scholar and compared results for the first 40 records retrieved (equivalent to 2 default search pages in PubMed). We evaluated the recall (proportion of relevant articles found) and precision (ratio of relevant to nonrelevant articles) of the searches performed in PubMed and Google Scholar. Primary studies included in the systematic reviews served as the reference standard for relevant articles. We further documented whether relevant articles were available as free full-texts. Results Compared with PubMed, the average search in Google Scholar retrieved twice as many relevant articles (PubMed: 11%; Google Scholar: 22%; P<.001). Precision was similar in both databases (PubMed: 6%; Google Scholar: 8%; P=.07). Google Scholar provided significantly greater access to free full-text publications (PubMed: 5%; Google Scholar: 14%; P<.001). Conclusions For quick clinical searches, Google Scholar returns twice as many relevant articles as PubMed and provides greater access to free full-text articles. PMID:23948488
Segmentation-driven compound document coding based on H.264/AVC-INTRA.
Zaghetto, Alexandre; de Queiroz, Ricardo L
2007-07-01
In this paper, we explore H.264/AVC operating in intraframe mode to compress a mixed image, i.e., one composed of text, graphics, and pictures. Even though mixed-content (compound) documents usually require the use of multiple compressors, we apply a single compressor to both text and pictures. For that, distortion is taken into account differently for text and picture regions. Our approach is to use a segmentation-driven adaptation strategy to change the H.264/AVC quantization parameter on a macroblock-by-macroblock basis, i.e., we divert bits from pictorial regions to text in order to keep text edges sharp. We show results of the segmentation-driven quantizer adaptation method applied to compress documents. Our reconstructed images have better text sharpness compared to straight unadapted coding, at negligible visual losses in pictorial regions. Our results also highlight the fact that H.264/AVC-INTRA outperforms coders such as JPEG-2000 as a single coder for compound images.
Mining the pharmacogenomics literature--a survey of the state of the art.
Hahn, Udo; Cohen, K Bretonnel; Garten, Yael; Shah, Nigam H
2012-07-01
This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research.
ERIC Educational Resources Information Center
Thomas, Georgelle; Fishburne, Robert P.
Part of the Anthropology Curriculum Project, the document contains a programmed text on evolution and a vocabulary pronunciation guide. The unit is intended for use by students in social studies and science courses in the 5th, 6th, and 7th grades. The bulk of the document, the programmed text, is organized in a question answer format. Students are…
NASA Astrophysics Data System (ADS)
Suzuki, Izumi; Mikami, Yoshiki; Ohsato, Ario
A technique that acquires documents in the same category as a given short text is introduced. Regarding the given text as a training document, the system marks up the most similar document, or sufficiently similar documents, from the document domain (or the entire Web). The system then adds the marked documents to the training set, learns from the set, and repeats this process until no more documents are marked. Giving the similarity a monotone increasing property as the system learns enables it to 1) detect the correct point at which no more documents remain to be marked and 2) decide the threshold value that the classifier uses. In addition, under the condition that normalization is limited to dividing term weights by a p-norm of the weights, a linear classifier in which training documents are indexed in a binary manner is the only instance that satisfies the monotone increasing property. The feasibility of the proposed technique was confirmed through an examination of binary similarity using English and German documents randomly selected from the Web.
Mujtaba, Ghulam; Shuib, Liyana; Raj, Ram Gopal; Rajandram, Retnagowri; Shaikh, Khairunisa; Al-Garadi, Mohammed Ali
2018-06-01
Text categorization has been used extensively in recent years to classify plain-text clinical reports. This study employs text categorization techniques for the classification of open narrative forensic autopsy reports. One of the key steps in text classification is document representation. In document representation, a clinical report is transformed into a format that is suitable for classification. The traditional document representation technique for text categorization is the bag-of-words (BoW) technique. In this study, the traditional BoW technique is ineffective in classifying forensic autopsy reports because it merely extracts frequent, but not necessarily discriminative, features from clinical reports. Moreover, this technique fails to capture word inversion, as well as word-level synonymy and polysemy, when classifying autopsy reports. Hence, the BoW technique suffers from low accuracy and low robustness unless it is improved with contextual and application-specific information. To overcome the aforementioned limitations of the BoW technique, this research aims to develop an effective conceptual graph-based document representation (CGDR) technique to classify 1500 forensic autopsy reports from four (4) manners of death (MoD) and sixteen (16) causes of death (CoD). Term-based and Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) based conceptual features were extracted and represented through graphs. These features were then used to train a two-level text classifier. The first level classifier was responsible for predicting MoD. In addition, the second level classifier was responsible for predicting CoD using the proposed conceptual graph-based document representation technique. To demonstrate the significance of the proposed technique, its results were compared with those of six (6) state-of-the-art document representation techniques. Lastly, this study compared the effects of one-level classification and two-level classification on the experimental results. The experimental results indicated that the CGDR technique achieved 12% to 15% improvement in accuracy compared with fully automated document representation baseline techniques. Moreover, two-level classification obtained better results compared with one-level classification. The promising results of the proposed conceptual graph-based document representation technique suggest that pathologists can adopt the proposed system as their basis for a second opinion, thereby supporting them in effectively determining CoD. Copyright © 2018 Elsevier Inc. All rights reserved.
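A small sketch of the two-level arrangement described above: one classifier predicts the manner of death, and a separate per-manner classifier then predicts the cause of death. Plain TF-IDF features and logistic regression stand in for the paper's conceptual graph-based representation:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_two_level(reports, mod_labels, cod_labels):
    # Level 1: manner of death over all reports.
    level1 = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    level1.fit(reports, mod_labels)
    # Level 2: one cause-of-death model per manner of death.
    level2 = {}
    for mod in set(mod_labels):
        idx = [i for i, m in enumerate(mod_labels) if m == mod]
        clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        clf.fit([reports[i] for i in idx], [cod_labels[i] for i in idx])
        level2[mod] = clf
    return level1, level2

def predict_two_level(level1, level2, report):
    mod = level1.predict([report])[0]
    return mod, level2[mod].predict([report])[0]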
Unapparent Information Revelation: Text Mining for Counterterrorism
NASA Astrophysics Data System (ADS)
Srihari, Rohini K.
Unapparent information revelation (UIR) is a special case of text mining that focuses on detecting possible links between concepts across multiple text documents by generating an evidence trail explaining the connection. A traditional search involving, for example, two or more person names will attempt to find documents mentioning both of these individuals. This research focuses on a different interpretation of such a query: what is the best evidence trail across documents that explains a connection between these individuals? For example, all may be good golfers. A generalization of this task involves query terms representing general concepts (e.g. indictment, foreign policy). Previous approaches to this problem have focused on graph mining involving hyperlinked documents, and link analysis exploiting named entities. A new robust framework is presented, based on (i) generating concept chain graphs, a hybrid content representation, (ii) performing graph matching to select candidate subgraphs, and (iii) subsequently using graphical models to validate hypotheses using ranked evidence trails. We adapt the DUC data set for cross-document summarization to evaluate evidence trails generated by this approach.
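A much-simplified illustration of an evidence trail: if concepts are linked whenever they co-occur in a sentence, a chain connecting two query concepts can be read off as a shortest path in that co-occurrence graph (this ignores the ranking and validation stages described above):

import itertools
import networkx as nx

def concept_chain(sentence_concepts, source, target):
    # sentence_concepts: list of concept sets, one per sentence across the corpus.
    g = nx.Graph()
    for concepts in sentence_concepts:
        for a, b in itertools.combinations(sorted(concepts), 2):
            g.add_edge(a, b)                       # co-occurrence within a sentence
    return nx.shortest_path(g, source, target)     # candidate evidence trail

print(concept_chain(
    [{"person_a", "golf"}, {"golf", "person_b"}, {"person_b", "indictment"}],
    "person_a", "indictment"))                     # ['person_a', 'golf', 'person_b', 'indictment']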
Real-time text extraction based on the page layout analysis system
NASA Astrophysics Data System (ADS)
Soua, M.; Benchekroun, A.; Kachouri, R.; Akil, M.
2017-05-01
Several approaches have been proposed to extract text from scanned documents. However, text extraction from heterogeneous documents remains a real challenge. Indeed, text extraction in this context is a difficult task because of the variation of the text due to differences in size, style and orientation, as well as the complexity of the document region background. Recently, we proposed the improved hybrid binarization based on K-means method (I-HBK) [5] to suitably extract text from heterogeneous documents. In this method, the Page Layout Analysis (PLA), part of the Tesseract OCR engine, is used to identify text and image regions. Afterwards, our hybrid binarization is applied separately to each kind of region. On one side, gamma correction is employed before processing image regions. On the other side, binarization is performed directly on text regions. Then, a foreground and background color study is performed to correct inverted region colors. Finally, characters are located in the binarized regions based on the PLA algorithm. In this work, we extend the integration of the PLA algorithm within the I-HBK method. In addition, to speed up the text and image separation step, we employ an efficient GPU acceleration. Through the performed experiments, we demonstrate the high F-measure accuracy of the PLA algorithm, reaching 95% on the LRDE dataset. In addition, we compare the sequential and parallel PLA versions. The obtained results give a speedup of 3.7x when comparing the parallel PLA implementation on a GPU GTX 660 to the CPU version.
Improving Text Recall with Multiple Summaries
ERIC Educational Resources Information Center
van der Meij, Hans; van der Meij, Jan
2012-01-01
Background. QuikScan (QS) is an innovative design that aims to improve accessibility, comprehensibility, and subsequent recall of expository text by means of frequent within-document summaries that are formatted as numbered list items. The numbers in the QS summaries correspond to numbers placed in the body of the document where the summarized…
FEQinput—An editor for the full equations (FEQ) hydraulic modeling system
Ancalle, David S.; Ancalle, Pablo J.; Domanski, Marian M.
2017-10-30
Introduction: The Full Equations Model (FEQ) is a computer program that solves the full, dynamic equations of motion for one-dimensional unsteady hydraulic flow in open channels and through control structures. As a result, hydrologists have used FEQ to design and operate flood-control structures, delineate inundation maps, and analyze peak-flow impacts. To aid in fighting floods, hydrologists are using the software to develop a system that uses flood-plain models to simulate real-time streamflow. Input files for FEQ are composed of text files that contain large amounts of parameters, data, and instructions that are written in a format exclusive to FEQ. Although documentation exists that can aid in the creation and editing of these input files, new users face a steep learning curve in order to understand the specific format and language of the files. FEQinput provides a set of tools to help a new user overcome the steep learning curve associated with creating and modifying input files for the FEQ hydraulic model and the related utility tool, Full Equations Utilities (FEQUTL).
Analysis of line structure in handwritten documents using the Hough transform
NASA Astrophysics Data System (ADS)
Ball, Gregory R.; Kasiviswanathan, Harish; Srihari, Sargur N.; Narayanan, Aswin
2010-01-01
In the analysis of handwriting in documents a central task is that of determining line structure of the text, e.g., number of text lines, location of their starting and end-points, line-width, etc. While simple methods can handle ideal images, real world documents have complexities such as overlapping line structure, variable line spacing, line skew, document skew, noisy or degraded images etc. This paper explores the application of the Hough transform method to handwritten documents with the goal of automatically determining global document line structure in a top-down manner which can then be used in conjunction with a bottom-up method such as connected component analysis. The performance is significantly better than other top-down methods, such as the projection profile method. In addition, we evaluate the performance of skew analysis by the Hough transform on handwritten documents.
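A hedged sketch of Hough-based line detection on a handwritten page using OpenCV; the binarization settings and Hough parameters below are illustrative, not the values used in the paper:

import cv2
import numpy as np

def estimate_text_lines(page_path):
    img = cv2.imread(page_path, cv2.IMREAD_GRAYSCALE)
    binary = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                   cv2.THRESH_BINARY_INV, 25, 15)    # ink -> white
    lines = cv2.HoughLinesP(binary, rho=1, theta=np.pi / 180, threshold=200,
                            minLineLength=img.shape[1] // 3, maxLineGap=40)
    if lines is None:
        return [], 0.0
    segments = lines[:, 0]                                           # rows of (x1, y1, x2, y2)
    angles = [np.degrees(np.arctan2(y2 - y1, x2 - x1)) for x1, y1, x2, y2 in segments]
    return segments, float(np.median(angles))                        # segments plus a skew estimate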
Document image cleanup and binarization
NASA Astrophysics Data System (ADS)
Wu, Victor; Manmatha, Raghaven
1998-04-01
Image binarization is a difficult task for documents with text over textured or shaded backgrounds, poor contrast, and/or considerable noise. Current optical character recognition (OCR) and document analysis technology does not handle such documents well. We have developed a simple yet effective algorithm for document image clean-up and binarization. The algorithm consists of two basic steps. In the first step, the input image is smoothed using a low-pass filter. The smoothing operation enhances the text relative to any background texture, because background texture normally has higher frequency than text does; it also removes speckle noise. In the second step, the intensity histogram of the smoothed image is computed and a threshold automatically selected as follows. For black text, the first peak of the histogram corresponds to text. Thresholding the image at the value of the valley between the first and second peaks of the histogram binarizes the image well. In order to reliably identify the valley, the histogram is smoothed by a low-pass filter before the threshold is computed. The algorithm has been applied to some 50 images from a wide variety of sources: digitized video frames, photos, newspapers, advertisements in magazines or sales flyers, personal checks, etc. There are 21820 characters and 4406 words in these images. 91 percent of the characters and 86 percent of the words are successfully cleaned up and binarized. A commercial OCR was applied to the binarized text when it consisted of OCR-recognizable fonts. The recognition rate was 84 percent for the characters and 77 percent for the words.
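A small sketch of the two-step algorithm as described: low-pass smoothing, then thresholding at the valley between the first two peaks of the smoothed histogram (SciPy utilities are used for brevity; parameter values are illustrative):

import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_filter1d
from scipy.signal import find_peaks

def binarize(gray):
    smoothed = gaussian_filter(gray.astype(float), sigma=1.5)       # step 1: low-pass filter
    hist, _ = np.histogram(smoothed, bins=256, range=(0, 255))
    hist = gaussian_filter1d(hist.astype(float), sigma=3)           # smooth the histogram
    peaks, _ = find_peaks(hist)
    if len(peaks) < 2:
        threshold = smoothed.mean()                                 # fallback for flat histograms
    else:
        p1, p2 = peaks[0], peaks[1]                                 # first two peaks
        threshold = p1 + int(np.argmin(hist[p1:p2 + 1]))            # valley between them
    return (smoothed > threshold).astype(np.uint8) * 255            # dark text maps to 0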
Leveraging Text Content for Management of Construction Project Documents
ERIC Educational Resources Information Center
Alqady, Mohammed
2012-01-01
The construction industry is a knowledge intensive industry. Thousands of documents are generated by construction projects. Documents, as information carriers, must be managed effectively to ensure successful project management. The fact that a single project can produce thousands of documents and that a lot of the documents are generated in a…
The Islamic State Battle Plan: Press Release Natural Language Processing
2016-06-01
Processing, text mining, corpus, generalized linear model, cascade, R Shiny, leaflet, data visualization ...Terrorism and Responses to Terrorism. TDM: Term Document Matrix; TF: Term Frequency; TF-IDF: Term Frequency-Inverse Document Frequency; tm: text mining (R package). ...package=leaflet. Feinerer I, Hornik K (2015) Text Mining Package "tm," Version 0.6-2. (Jul 3) https://cran.r-project.org/web/packages/tm/tm.pdf
A Semi-supervised Heat Kernel Pagerank MBO Algorithm for Data Classification
2016-07-01
financial predictions, etc. and is finding growing use in text mining studies. In this paper, we present an efficient algorithm for classification of high...video data, set of images, hyperspectral data, medical data, text data, etc. Moreover, the framework provides a way to analyze data whose different...also be incorporated. For text classification, one can use tfidf (term frequency inverse document frequency) to form feature vectors for each document
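The last fragment above mentions forming TF-IDF feature vectors for each document; for reference, this is how such vectors are typically built (scikit-learn here, with a toy corpus invented for the example):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "graph based classification of hyperspectral data",
    "heat kernel pagerank for semi supervised learning",
    "text classification with tf idf feature vectors",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)   # one sparse TF-IDF vector per document
print(X.shape)
print(vectorizer.get_feature_names_out()[:5])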
ERIC Educational Resources Information Center
Stahl, Steven A.; And Others
To examine the effects of students reading multiple documents on their perceptions of a historical event, in this case the "discovery" of America by Christopher Columbus, 85 high school freshmen read 3 of 4 different texts (or sets of texts) dealing with Columbus. One text was an encyclopedia article, one a set of articles from…
"Understanding" medical school curriculum content using KnowledgeMap.
Denny, Joshua C; Smithers, Jeffrey D; Miller, Randolph A; Spickard, Anderson
2003-01-01
The objective was to describe the development and evaluation of computational tools to identify concepts within medical curricular documents, using information derived from the National Library of Medicine's Unified Medical Language System (UMLS). The long-term goal of the KnowledgeMap (KM) project is to provide faculty and students with an improved ability to develop, review, and integrate components of the medical school curriculum. The KM concept identifier uses lexical resources partially derived from the UMLS (SPECIALIST lexicon and Metathesaurus), heuristic language processing techniques, and an empirical scoring algorithm. KM differentiates among potentially matching Metathesaurus concepts within a source document. The authors manually identified important "gold standard" biomedical concepts within selected medical school full-content lecture documents and used these documents to compare KM concept recognition with that of a known state-of-the-art "standard", the National Library of Medicine's MetaMap program. The outcome measures were the number of "gold standard" concepts in each lecture document identified by either KM or MetaMap, and the cause of each failure or relative success in a random subset of documents. For 4,281 "gold standard" concepts, MetaMap matched 78% and KM 82%. Precision for "gold standard" concepts was 85% for MetaMap and 89% for KM. The heuristics of KM accurately matched acronyms, concepts underspecified in the document, and ambiguous matches. The most frequent cause of matching failures was absence of target concepts from the UMLS Metathesaurus. The prototypic KM system provided an encouraging rate of concept extraction for representative medical curricular texts. Future versions of KM should be evaluated for their ability to allow administrators, lecturers, and students to navigate through the medical curriculum to locate redundancies, find interrelated information, and identify omissions. In addition, the ability of KM to meet specific, personal information needs should be assessed.
M68000 RNF text formatter user's manual
NASA Technical Reports Server (NTRS)
Will, R. W.; Grantham, C.
1985-01-01
A powerful, flexible text formatting program, RNF, is described. It is designed to automate many of the tedious elements of typing, including breaking a document into pages with titles and page numbers, formatting chapter and section headings, keeping track of page numbers for use in a table of contents, justifying lines by inserting blanks to give an even right margin, and inserting figures and footnotes at appropriate places on the page. The RNF program greatly facilitates both preparing and modifying a document because it allows you to concentrate your efforts on the content of the document instead of its appearance and because it removes the necessity of retyping text that has not changed.
Inferring Group Processes from Computer-Mediated Affective Text Analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schryver, Jack C; Begoli, Edmon; Jose, Ajith
2011-02-01
Political communications in the form of unstructured text convey rich connotative meaning that can reveal underlying group social processes. Previous research has focused on sentiment analysis at the document level, but we extend this analysis to sub-document levels through a detailed analysis of affective relationships between entities extracted from a document. Instead of pure sentiment analysis, which is just positive or negative, we explore nuances of affective meaning in 22 affect categories. Our affect propagation algorithm automatically calculates and displays extracted affective relationships among entities in graphical form in our prototype (TEAMSTER), starting with seed lists of affect terms. Several useful metrics are defined to infer underlying group processes by aggregating affective relationships discovered in a text. Our approach has been validated with annotated documents from the MPQA corpus, achieving a performance gain of 74% over comparable random guessers.
Evaluation of PHI Hunter in Natural Language Processing Research.
Redd, Andrew; Pickard, Steve; Meystre, Stephane; Scehnet, Jeffrey; Bolton, Dan; Heavirland, Julia; Weaver, Allison Lynn; Hope, Carol; Garvin, Jennifer Hornung
2015-01-01
We introduce and evaluate a new, easily accessible tool using a common statistical analysis and business analytics software suite, SAS, which can be programmed to remove specific protected health information (PHI) from a text document. Removal of PHI is important because the quantity of text documents used for research with natural language processing (NLP) is increasing. When using existing data for research, an investigator must remove all PHI not needed for the research to comply with human subjects' right to privacy. This process is similar, but not identical, to de-identification of a given set of documents. PHI Hunter removes PHI from free-form text. It is a set of rules to identify and remove patterns in text. PHI Hunter was applied to 473 Department of Veterans Affairs (VA) text documents randomly drawn from a research corpus stored as unstructured text in VA files. PHI Hunter performed well with PHI in the form of identification numbers such as Social Security numbers, phone numbers, and medical record numbers. The most commonly missed PHI items were names and locations. Incorrect removal of information occurred with text that looked like identification numbers. PHI Hunter fills a niche role that is related to but not equal to the role of de-identification tools. It gives research staff a tool to reasonably increase patient privacy. It performs well for highly sensitive PHI categories that are rarely used in research, but still shows possible areas for improvement. More development for patterns of text and linked demographic tables from electronic health records (EHRs) would improve the program so that more precise identifiable information can be removed. PHI Hunter is an accessible tool that can flexibly remove PHI not needed for research. If it can be tailored to the specific data set via linked demographic tables, its performance will improve in each new document set.
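PHI Hunter itself is a SAS-based rule set that is not reproduced here; purely as an illustration of the pattern-matching idea, the sketch below scrubs a few identifier-like PHI patterns (SSNs, US-style phone numbers, long digit strings such as medical record numbers) from free text. The patterns are simplified assumptions and, as the abstract notes for real systems, will both miss items and over-match.

import re

PHI_PATTERNS = [
    (r"\b\d{3}-\d{2}-\d{4}\b", "[SSN]"),
    (r"\(?\b\d{3}\)?[-. ]\d{3}[-.]\d{4}\b", "[PHONE]"),
    (r"\b\d{6,10}\b", "[ID]"),   # crude MRN-like pattern; deliberately broad
]

def scrub(text: str) -> str:
    # Apply each pattern in turn, replacing matches with a category tag.
    for pattern, tag in PHI_PATTERNS:
        text = re.sub(pattern, tag, text)
    return text

print(scrub("Pt 123-45-6789, MRN 00012345, call (555) 123-4567."))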
An Introduction to the Extensible Markup Language (XML).
ERIC Educational Resources Information Center
Bryan, Martin
1998-01-01
Describes Extensible Markup Language (XML), a subset of the Standard Generalized Markup Language (SGML) that is designed to make it easy to interchange structured documents over the Internet. Topics include Document Type Definition (DTD), components of XML, the use of XML, text and non-text elements, and uses for XML-coded files. (LRW)
Extracting biomedical events from pairs of text entities
2015-01-01
Background Huge amounts of electronic biomedical documents, such as molecular biology reports or genomic papers are generated daily. Nowadays, these documents are mainly available in the form of unstructured free texts, which require heavy processing for their registration into organized databases. This organization is instrumental for information retrieval, enabling to answer the advanced queries of researchers and practitioners in biology, medicine, and related fields. Hence, the massive data flow calls for efficient automatic methods of text-mining that extract high-level information, such as biomedical events, from biomedical text. The usual computational tools of Natural Language Processing cannot be readily applied to extract these biomedical events, due to the peculiarities of the domain. Indeed, biomedical documents contain highly domain-specific jargon and syntax. These documents also describe distinctive dependencies, making text-mining in molecular biology a specific discipline. Results We address biomedical event extraction as the classification of pairs of text entities into the classes corresponding to event types. The candidate pairs of text entities are recursively provided to a multiclass classifier relying on Support Vector Machines. This recursive process extracts events involving other events as arguments. Compared to joint models based on Markov Random Fields, our model simplifies inference and hence requires shorter training and prediction times along with lower memory capacity. Compared to usual pipeline approaches, our model passes over a complex intermediate problem, while making a more extensive usage of sophisticated joint features between text entities. Our method focuses on the core event extraction of the Genia task of BioNLP challenges yielding the best result reported so far on the 2013 edition. PMID:26201478
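A schematic sketch of the pairwise-classification formulation described above (not the authors' system): candidate (trigger, argument) text-entity pairs are turned into feature dictionaries and classified into event types with a linear SVM. The features, labels, and toy examples are all invented for illustration, and the recursive handling of events-as-arguments is omitted.

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Toy candidate pairs; real features would come from parsed sentences.
pairs = [
    {"trigger": "phosphorylation", "arg_type": "Protein", "token_dist": 2},
    {"trigger": "expression",      "arg_type": "Protein", "token_dist": 1},
    {"trigger": "binds",           "arg_type": "Protein", "token_dist": 4},
    {"trigger": "said",            "arg_type": "Protein", "token_dist": 6},
]
labels = ["Phosphorylation", "Gene_expression", "Binding", "None"]

vec = DictVectorizer()
X = vec.fit_transform(pairs)
clf = LinearSVC().fit(X, labels)

new_pair = {"trigger": "expression", "arg_type": "Protein", "token_dist": 3}
print(clf.predict(vec.transform([new_pair])))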
What Can Pictures Tell Us About Web Pages? Improving Document Search Using Images.
Rodriguez-Vaamonde, Sergio; Torresani, Lorenzo; Fitzgibbon, Andrew W
2015-06-01
Traditional Web search engines do not use the images in the HTML pages to find relevant documents for a given query. Instead, they typically operate by computing a measure of agreement between the keywords provided by the user and only the text portion of each page. In this paper we study whether the content of the pictures appearing in a Web page can be used to enrich the semantic description of an HTML document and consequently boost the performance of a keyword-based search engine. We present a Web-scalable system that exploits a pure text-based search engine to find an initial set of candidate documents for a given query. Then, the candidate set is reranked using visual information extracted from the images contained in the pages. The resulting system retains the computational efficiency of traditional text-based search engines with only a small additional storage cost needed to encode the visual information. We test our approach on one of the TREC Million Query Track benchmarks where we show that the exploitation of visual content yields improvement in accuracies for two distinct text-based search engines, including the system with the best reported performance on this benchmark. We further validate our approach by collecting document relevance judgements on our search results using Amazon Mechanical Turk. The results of this experiment confirm the improvement in accuracy produced by our image-based reranker over a pure text-based system.
Web Prep: How to Prepare NAS Reports For Publication on the Web
NASA Technical Reports Server (NTRS)
Walatka, Pamela; Balakrishnan, Prithika; Clucas, Jean; McCabe, R. Kevin; Felchle, Gail; Brickell, Cristy
1996-01-01
This document contains specific advice and requirements for NASA Ames Code IN authors of NAS reports. Much of the information may be of interest to other authors writing for the Web. WebPrep has a graphic Table of Contents in the form of a WebToon, which simulates a discussion between a scientist and a Web publishing consultant. In the WebToon, Frequently Asked Questions about preparing reports for the Web are linked to relevant text in the body of this document. We also provide a text-only Table of Contents. The text for this document is divided into chapters: each chapter corresponds to one frame of the WebToons. The chapter topics are: converting text to HTML, converting 2D graphic images to gif, creating imagemaps and tables, converting movie and audio files to Web formats, supplying 3D interactive data, and (briefly) JAVA capabilities. The last chapter is specifically for NAS staff authors. The Glossary-Index lists web related words and links to topics covered in the main text.
Text-interpreter language for flexible generation of patient notes and instructions.
Forker, T S
1992-01-01
An interpreted computer language has been developed along with a windowed user interface and multi-printer-support formatter to allow preparation of documentation of patient visits, including progress notes, prescriptions, excuses for work/school, outpatient laboratory requisitions, and patient instructions. Input is by trackball or mouse with little or no keyboard skill required. For clinical problems with specific protocols, the clinician can be prompted with problem-specific items of history, exam, and lab data to be gathered and documented. The language implements a number of text-related commands as well as branching logic and arithmetic commands. In addition to generating text, it is simple to implement arithmetic calculations such as weight-specific drug dosages; multiple branching decision-support protocols for paramedical personnel (or physicians); and calculation of clinical scores (e.g., coma or trauma scores) while simultaneously documenting the status of each component of the score. ASCII text files produced by the interpreter are available for computerized quality audit. Interpreter instructions are contained in text files users can customize with any text editor.
Matrimonial Causes Rules, 1986, 30 January 1987.
1987-01-01
These Rules are made under Section 4 of the Matrimonial Causes Law, 1976 and contain provisions on applications for leave to present a petition for divorce, documents to accompany the petition, information to be contained in the petition, service of the petition, pleadings subsequent to the petition, directions for trial, security for costs, decrees, and enforcement of orders, among other things. The Rules also stipulate that when "it appears that there is a child of the marriage under the age of sixteen, the record shall show specifically that the question of provision for such child has been considered and dealt with by the Court."
ERIC Educational Resources Information Center
1976
The full texts of all prepared statements and supplemental materials presented during five days of oversight hearings held on rehabilitation of the handicapped programs and implementation of these programs by agencies under the Rehabilitation Act of 1973 are contained in this document. Statements are made by (1) State and local directors and other…
Full-scale system impact analysis: Digital document storage project
NASA Technical Reports Server (NTRS)
1989-01-01
The Digital Document Storage Full Scale System can provide cost effective electronic document storage, retrieval, hard copy reproduction, and remote access for users of NASA Technical Reports. The desired functionality of the DDS system is highly dependent on the assumed requirements for remote access used in this Impact Analysis. It is highly recommended that NASA proceed with a phased, communications requirement analysis to ensure that adequate communications service can be supplied at a reasonable cost in order to validate recent working assumptions upon which the success of the DDS Full Scale System is dependent.
Full-Text Databases in Medicine.
ERIC Educational Resources Information Center
Sievert, MaryEllen C.; And Others
1995-01-01
Describes types of full-text databases in medicine; discusses features for searching full-text journal databases available through online vendors; reviews research on full-text databases in medicine; and describes the MEDLINE/Full-Text Research Project at the University of Missouri (Columbia) which investigated precision, recall, and relevancy.…
10 CFR 2.1013 - Use of the electronic docket during the proceeding.
Code of Federal Regulations, 2010 CFR
2010-01-01
... bi-tonal documents. (v) Electronic submissions must be generated in the appropriate PDF output format by using: (A) PDF—Formatted Text and Graphics for textual documents converted from native applications; (B) PDF—Searchable Image (Exact) for textual documents converted from scanned documents; and (C...
VisualUrText: A Text Analytics Tool for Unstructured Textual Data
NASA Astrophysics Data System (ADS)
Zainol, Zuraini; Jaymes, Mohd T. H.; Nohuddin, Puteri N. E.
2018-05-01
The amount of unstructured text on the Internet is growing tremendously. Text repositories come from Web 2.0, business intelligence, and social networking applications. It is also believed that 80-90% of future data growth will be in the form of unstructured text databases that may contain interesting patterns and trends. Text mining is a well-known technique for discovering such non-trivial patterns and trends in massive unstructured text data. It spans multiple disciplines, including information retrieval (IR), text analysis, natural language processing (NLP), data mining, machine learning, statistics, and computational linguistics. This paper discusses the development of a text analytics tool for extracting, processing, and analyzing unstructured text data and visualizing the cleaned text in multiple forms such as a document-term matrix (DTM), frequency graph, network analysis graph, word cloud, and dendrogram. This tool, VisualUrText, is developed to assist students and researchers in extracting interesting patterns and trends in document analysis.
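Two of the outputs named above, the document-term matrix and the term-frequency table behind a frequency graph, can be sketched in a few lines (scikit-learn and pandas, with made-up documents):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "unstructured text data keeps growing on the web",
    "text mining discovers patterns and trends in text data",
    "word clouds and dendrograms visualize cleaned text",
]
vec = CountVectorizer(stop_words="english")
dtm = pd.DataFrame(vec.fit_transform(docs).toarray(),
                   columns=vec.get_feature_names_out())
print(dtm)                                             # document-term matrix
print(dtm.sum().sort_values(ascending=False).head())   # top term frequencies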
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dimmick, Ross
This document contains updates to the Supplemental Information Sandia National Laboratories/New Mexico Site-Wide Environmental Impact Statement Source Documents that were developed in 2010. In general, this addendum provides calendar year 2010 data, along with changes or additions to text in the original documents.
Simpao, Allan F; Tan, Jonathan M; Lingappan, Arul M; Gálvez, Jorge A; Morgan, Sherry E; Krall, Michael A
2017-10-01
Anesthesia information management systems (AIMS) are sophisticated hardware and software technology solutions that can provide electronic feedback to anesthesia providers. This feedback can be tailored to provide clinical decision support (CDS) to aid clinicians with patient care processes, documentation compliance, and resource utilization. We conducted a systematic review of peer-reviewed articles on near real-time and point-of-care CDS within AIMS using the Preferred Reporting Items for Systematic Review and Meta-Analysis Protocols. Studies were identified by searches of the electronic databases Medline and EMBASE. Two reviewers screened studies based on title, abstract, and full text. Studies that were similar in intervention and desired outcome were grouped into CDS categories. Three reviewers graded the evidence within each category. The final analysis included 25 articles on CDS as implemented within AIMS. CDS categories included perioperative antibiotic prophylaxis, post-operative nausea and vomiting prophylaxis, vital sign monitors and alarms, glucose management, blood pressure management, ventilator management, clinical documentation, and resource utilization. Of these categories, the reviewers graded perioperative antibiotic prophylaxis and clinical documentation as having strong evidence per the peer reviewed literature. There is strong evidence for the inclusion of near real-time and point-of-care CDS in AIMS to enhance compliance with perioperative antibiotic prophylaxis and clinical documentation. Additional research is needed in many other areas of AIMS-based CDS.
Wang, Xinglong; Rak, Rafal; Restificar, Angelo; Nobata, Chikashi; Rupp, C J; Batista-Navarro, Riza Theresa B; Nawaz, Raheel; Ananiadou, Sophia
2011-10-03
The selection of relevant articles for curation and the linking of those articles to the experimental techniques confirming the findings became one of the primary subjects of the recent BioCreative III contest. The contest's Protein-Protein Interaction (PPI) task consisted of two sub-tasks: Article Classification Task (ACT) and Interaction Method Task (IMT). ACT aimed to automatically select relevant documents for PPI curation, whereas the goal of IMT was to recognise the methods used in experiments for identifying the interactions in full-text articles. We proposed and compared several classification-based methods for both tasks, employing rich contextual features as well as features extracted from external knowledge sources. For IMT, a new method that classifies pair-wise relations between every text phrase and candidate interaction method obtained promising results with an F1 score of 64.49%, as tested on the task's development dataset. We also explored ways to combine this new approach and more conventional, multi-label document classification methods. For ACT, our classifiers exploited automatically detected named entities and other linguistic information. The evaluation results on the BioCreative III PPI test datasets showed that our systems were very competitive: one of our IMT methods yielded the best performance among all participants, as measured by F1 score, Matthews correlation coefficient and AUC iP/R; whereas for ACT, our best classifier was ranked second as measured by AUC iP/R, and also competitive according to other metrics. Our novel approach that converts the multi-class, multi-label classification problem to a binary classification problem showed much promise in IMT. Nevertheless, on the test dataset the best performance was achieved by taking the union of the output of this method and that of a multi-class, multi-label document classifier, which indicates that the two types of systems complement each other in terms of recall. For ACT, our system exploited a rich set of features and also obtained encouraging results. We examined the features with respect to their contributions to the classification results, and concluded that contextual words surrounding named entities, as well as the MeSH headings associated with the documents, were among the main contributors to the performance.
NASA Technical Reports Server (NTRS)
Panait, Claudia M.
2004-01-01
The NASA Glenn Library is a science and engineering research library providing the most current books, journals, CD-ROMs and documents to support the study of aeronautics, space propulsion and power, communications technology, materials and structures and microgravity science. The GRC technical library also supports the research and development efforts of all scientists and engineers on site via full-text electronic files, literature searching, technical reports, etc. As an intern in the NASA Glenn Library, I attempt to support these objectives through efficiently and effectively fulfilling the assignment that was given to me. The assignment that was relegated to me was to catalog National Advisory Committee for Aeronautics and NASA Technical Documents into NASA Galaxie. This process consists of holdings being added to existing Galaxie records, upgrades and editing done to the bibliographic records when needed, and adding URLs into Galaxie when they were missing from the record. NASA ASAP and Digidoc were used to locate URLs of PDFs that were not in Galaxie. A spreadsheet of documents with no URLs was maintained. Also, a subject channel of web, full-text, paid and free, journal and other subject-specific pages was developed and expanded from current content of intranet pages. To expand upon the second half of my assignment, I was given the project of taking inventory of the library's book collection. I kept record of the books that were not accounted for on a master list I was given to work from and submitted them for correction and addition. I also made sure the books were placed in the appropriate order and made corrections to any discrepancies that existed between the master list and what was on the shelf. Upon completion of this assignment, I will have verified that 21,113 books were in the correct location, order and have the correct corresponding serial number and barcode. In conclusion, as of this date I have input around 750 documents into NASA Galaxie, inputting about half of the NASA Technical Documents into the system. The rest of my tenure in this program will consist of finishing the other half of the reports. In regard to the second assignment, I still have about three-quarters of the collection to record and correct.
ERIC Educational Resources Information Center
Göçer, Ali
2014-01-01
In this study, Turkish text-based written examination questions posed to students in secondary schools were examined. In this research, document analysis method within the framework of the qualitative research approach was used. The data obtained from the documents consisting of written examination papers were analyzed with content analysis…
ERIC Educational Resources Information Center
Stromso, Helge I.; Braten, Ivar; Britt, M. Anne
2010-01-01
In many situations, readers are asked to learn from multiple documents. Many studies have found that evaluating the trustworthiness and usefulness of document sources is an important skill in such learning situations. There has been, however, no direct evidence that attending to source information helps readers learn from and interpret a…
Automatic Identification of Topic Tags from Texts Based on Expansion-Extraction Approach
ERIC Educational Resources Information Center
Yang, Seungwon
2013-01-01
Identifying topics of a textual document is useful for many purposes. We can organize the documents by topics in digital libraries. Then, we could browse and search for the documents with specific topics. By examining the topics of a document, we can quickly understand what the document is about. To augment the traditional manual way of topic…
Document reconstruction by layout analysis of snippets
NASA Astrophysics Data System (ADS)
Kleber, Florian; Diem, Markus; Sablatnig, Robert
2010-02-01
Document analysis is done to analyze entire forms (e.g. intelligent form analysis, table detection) or to describe the layout/structure of a document. Skew detection of scanned documents is also performed to support OCR algorithms that are sensitive to skew. In this paper document analysis is applied to snippets of torn documents to calculate features for the reconstruction. Documents can be destroyed either intentionally, to make the printed content unavailable (e.g. tax fraud investigation, business crime), or through time-induced degradation of ancient documents (e.g. bad storage conditions). Current reconstruction methods for manually torn documents deal with shape, inpainting, and texture synthesis techniques. In this paper we show how document analysis techniques applied to snippets can support the matching algorithm by providing additional features: a rotational analysis, a color analysis, and line detection. As future work it is planned to extend the feature set with the paper type (blank, checked, lined), the type of the writing (handwritten vs. machine printed) and the text layout of a snippet (text size, line spacing). Preliminary results show that these pre-processing steps can be performed reliably on a real dataset consisting of 690 snippets.
Probing the Topological Properties of Complex Networks Modeling Short Written Texts
Amancio, Diego R.
2015-01-01
In recent years, graph theory has been widely employed to probe several language properties. More specifically, the so-called word adjacency model has been proven useful for tackling several practical problems, especially those relying on textual stylistic analysis. The most common approach to treat texts as networks has simply considered either large pieces of texts or entire books. This approach has certainly worked well—many informative discoveries have been made this way—but it raises an uncomfortable question: could there be important topological patterns in small pieces of texts? To address this problem, the topological properties of subtexts sampled from entire books was probed. Statistical analyses performed on a dataset comprising 50 novels revealed that most of the traditional topological measurements are stable for short subtexts. When the performance of the authorship recognition task was analyzed, it was found that a proper sampling yields a discriminability similar to the one found with full texts. Surprisingly, the support vector machine classification based on the characterization of short texts outperformed the one performed with entire books. These findings suggest that a local topological analysis of large documents might improve its global characterization. Most importantly, it was verified, as a proof of principle, that short texts can be analyzed with the methods and concepts of complex networks. As a consequence, the techniques described here can be extended in a straightforward fashion to analyze texts as time-varying complex networks. PMID:25719799
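As a proof-of-principle illustration of the word adjacency model referred to above (not the paper's code), consecutive words of a text become linked nodes and the standard topological measurements are read off the resulting graph; the sample sentence is invented.

import networkx as nx

text = "the cat sat on the mat and the dog sat on the cat"
tokens = text.split()

G = nx.Graph()
G.add_edges_from(zip(tokens, tokens[1:]))   # adjacent words share an edge

print("nodes:", G.number_of_nodes(), "edges:", G.number_of_edges())
print("average clustering:", nx.average_clustering(G))
print("degree of 'the':", G.degree["the"])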
Miwa, Makoto; Ohta, Tomoko; Rak, Rafal; Rowley, Andrew; Kell, Douglas B.; Pyysalo, Sampo; Ananiadou, Sophia
2013-01-01
Motivation: To create, verify and maintain pathway models, curators must discover and assess knowledge distributed over the vast body of biological literature. Methods supporting these tasks must understand both the pathway model representations and the natural language in the literature. These methods should identify and order documents by relevance to any given pathway reaction. No existing system has addressed all aspects of this challenge. Method: We present novel methods for associating pathway model reactions with relevant publications. Our approach extracts the reactions directly from the models and then turns them into queries for three text mining-based MEDLINE literature search systems. These queries are executed, and the resulting documents are combined and ranked according to their relevance to the reactions of interest. We manually annotate document-reaction pairs with the relevance of the document to the reaction and use this annotation to study several ranking methods, using various heuristic and machine-learning approaches. Results: Our evaluation shows that the annotated document-reaction pairs can be used to create a rule-based document ranking system, and that machine learning can be used to rank documents by their relevance to pathway reactions. We find that a Support Vector Machine-based system outperforms several baselines and matches the performance of the rule-based system. The success of the query extraction and ranking methods are used to update our existing pathway search system, PathText. Availability: An online demonstration of PathText 2 and the annotated corpus are available for research purposes at http://www.nactem.ac.uk/pathtext2/. Contact: makoto.miwa@manchester.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23813008
Rectification of curved document images based on single view three-dimensional reconstruction.
Kang, Lai; Wei, Yingmei; Jiang, Jie; Bai, Liang; Lao, Songyang
2016-10-01
Since distortions in camera-captured document images significantly affect the accuracy of optical character recognition (OCR), distortion removal plays a critical role for document digitalization systems using a camera for image capturing. This paper proposes a novel framework that performs three-dimensional (3D) reconstruction and rectification of camera-captured document images. While most existing methods rely on additional calibrated hardware or multiple images to recover the 3D shape of a document page, or make a simple but not always valid assumption on the corresponding 3D shape, our framework is more flexible and practical since it only requires a single input image and is able to handle a general locally smooth document surface. The main contributions of this paper include a new iterative refinement scheme for baseline fitting from connected components of text line, an efficient discrete vertical text direction estimation algorithm based on convex hull projection profile analysis, and a 2D distortion grid construction method based on text direction function estimation using 3D regularization. In order to examine the performance of our proposed method, both qualitative and quantitative evaluation and comparison with several recent methods are conducted in our experiments. The experimental results demonstrate that the proposed method outperforms relevant approaches for camera-captured document image rectification, in terms of improvements on both visual distortion removal and OCR accuracy.
Reminder Cards Improve Physician Documentation of Obesity But Not Obesity Counseling.
Shungu, Nicholas; Miller, Marshal N; Mills, Geoffrey; Patel, Neesha; de la Paz, Amanda; Rose, Victoria; Kropa, Jill; Edi, Rina; Levy, Emily; Crenshaw, Margaret; Hwang, Chris
2015-01-01
Physicians frequently fail to document obesity and obesity-related counseling. We sought to determine whether attaching a physical reminder card to patient encounter forms would increase electronic medical record (EMR) assessment of and documentation of obesity and dietary counseling. Reminder cards for obesity documentation were attached to encounter forms for patient encounters over a 2-week intervention period. For visits in the intervention period, the EMR was retrospectively reviewed for BMI, assessment of "obesity" or "morbid obesity" as an active problem, free-text dietary counseling within physician notes, and assessment of "dietary counseling" as an active problem. These data were compared to those collected through a retrospective chart review during a 2-week pre-intervention period. We also compared physician self-report of documentation via reminder cards with EMR documentation. We found significant improvement in the primary endpoint of assessment of "obesity" or "morbid obesity" as an active problem (42.5% versus 28%) compared to the pre-intervention period. There was no significant difference in the primary endpoints of free-text dietary counseling or assessment of "dietary counseling" as an active problem between the groups. Physician self-reporting of assessment of "obesity" or "morbid obesity" as an active problem (77.7% versus 42.5%), free-text dietary counseling on obesity (69.1% versus 35.4%) and assessment of "dietary counseling" as an active problem (54.3% versus 25.2%) were all significantly higher than those reflected in EMR documentation. This study demonstrates that physical reminder cards are a successful means of increasing obesity documentation rates among providers but do not necessarily increase rates of obesity-related counseling or documentation of counseling. Our study suggests that even with such interventions, physicians are likely under-documenting obesity and counseling compared to self-reported rates.
SGML and HTML: The Merging of Document Management and Electronic Document Publishing.
ERIC Educational Resources Information Center
Dixon, Ross
1996-01-01
Document control is an issue for organizations that use SGML/HTML. The prevalent approach is to apply the same techniques to document elements that are applied to full documents, a practice that has led to an overlap of electronic publishing and document management. Lists requirements for the management of SGML/HTML documents. (PEN)
Let Documents Talk to Each Other: A Computer Model for Connection of Short Documents.
ERIC Educational Resources Information Center
Chen, Z.
1993-01-01
Discusses the integration of scientific texts through the connection of documents and describes a computer model that can connect short documents. Information retrieval and artificial intelligence are discussed; a prototype system of the model is explained; and the model is compared to other computer models. (17 references) (LRW)
30 CFR 285.115 - Documents incorporated by reference.
Code of Federal Regulations, 2011 CFR
2011-07-01
... incorporating by reference the documents listed in the table in paragraph (e) of this section. The Director of...: ER29AP09.104 (e) This paragraph lists documents incorporated by reference. To easily reference text of the... 30 Mineral Resources 2 2011-07-01 2011-07-01 false Documents incorporated by reference. 285.115...
75 FR 28594 - Ready-to-Learn Television Program
Federal Register 2010, 2011, 2012, 2013, 2014
2010-05-21
... Federal Register. Free Internet access to the official edition of the Federal Register and the Code of... Access to This Document: You can view this document, as well as all other documents of this Department published in the Federal Register, in text or Adobe Portable Document Format (PDF) on the Internet at the...
BoB, a best-of-breed automated text de-identification system for VHA clinical documents.
Ferrández, Oscar; South, Brett R; Shen, Shuying; Friedlin, F Jeffrey; Samore, Matthew H; Meystre, Stéphane M
2013-01-01
De-identification allows faster and more collaborative clinical research while protecting patient confidentiality. Clinical narrative de-identification is a tedious process that can be alleviated by automated natural language processing methods. The goal of this research is the development of an automated text de-identification system for Veterans Health Administration (VHA) clinical documents. We devised a novel stepwise hybrid approach designed to improve the current strategies used for text de-identification. The proposed system is based on a previous study on the best de-identification methods for VHA documents. This best-of-breed automated clinical text de-identification system (aka BoB) tackles the problem as two separate tasks: (1) maximize patient confidentiality by redacting as much protected health information (PHI) as possible; and (2) leave de-identified documents in a usable state preserving as much clinical information as possible. We evaluated BoB with a manually annotated corpus of a variety of VHA clinical notes, as well as with the 2006 i2b2 de-identification challenge corpus. We present evaluations at the instance- and token-level, with detailed results for BoB's main components. Moreover, an existing text de-identification system was also included in our evaluation. BoB's design efficiently takes advantage of the methods implemented in its pipeline, resulting in high sensitivity values (especially for sensitive PHI categories) and a limited number of false positives. Our system successfully addressed VHA clinical document de-identification, and its hybrid stepwise design demonstrates robustness and efficiency, prioritizing patient confidentiality while leaving most clinical information intact.
A systematic review of nursing research priorities on health system and services in the Americas.
Garcia, Alessandra Bassalobre; Cassiani, Silvia Helena De Bortoli; Reveiz, Ludovic
2015-03-01
The objective was to systematically review the literature on priorities in nursing research on health systems and services in the Region of the Americas as a step toward developing a nursing research agenda that will advance the Regional Strategy for Universal Access to Health and Universal Health Coverage. This was a systematic review of the literature available from the following databases: Web of Science, PubMed, LILACS, and Google. Documents considered were published in 2008-2014; in English, Spanish, or Portuguese; and addressed the topic in the Region of the Americas. The documents selected had their priority-setting process evaluated according to the "nine common themes for good practice in health research priorities." A content analysis collected all study questions and topics, and sorted them by category and subcategory. Of 185 full-text articles/documents that were assessed for eligibility, 23 were selected: 12 were from peer-reviewed journals; 6 from nursing publications; 4 from Ministries of Health; and 1 from an international organization. Journal publications had stronger methodological rigor; the majority did not present a clear implementation or evaluation plan. After compiling the 444 study questions and topics from the selected documents, the content analysis resulted in a document with 5 categories and 16 subcategories regarding nursing research priorities on health systems and services. Research priority-setting is a highly important process for health services improvement and resources optimization, but implementation and evaluation plans are rarely included. The resulting document will serve as the basis for the development of a new nursing research agenda focused on health systems and services, and shaped to advance universal health coverage and universal access to health.
OrganismTagger: detection, normalization and grounding of organism entities in biomedical documents.
Naderi, Nona; Kappler, Thomas; Baker, Christopher J O; Witte, René
2011-10-01
Semantic tagging of organism mentions in full-text articles is an important part of literature mining and semantic enrichment solutions. Tagged organism mentions also play a pivotal role in disambiguating other entities in a text, such as proteins. A high-precision organism tagging system must be able to detect the numerous forms of organism mentions, including common names as well as the traditional taxonomic groups: genus, species and strains. In addition, such a system must resolve abbreviations and acronyms, assign the scientific name and if possible link the detected mention to the NCBI Taxonomy database for further semantic queries and literature navigation. We present the OrganismTagger, a hybrid rule-based/machine learning system to extract organism mentions from the literature. It includes tools for automatically generating lexical and ontological resources from a copy of the NCBI Taxonomy database, thereby facilitating system updates by end users. Its novel ontology-based resources can also be reused in other semantic mining and linked data tasks. Each detected organism mention is normalized to a canonical name through the resolution of acronyms and abbreviations and subsequently grounded with an NCBI Taxonomy database ID. In particular, our system combines a novel machine-learning approach with rule-based and lexical methods for detecting strain mentions in documents. On our manually annotated OT corpus, the OrganismTagger achieves a precision of 95%, a recall of 94% and a grounding accuracy of 97.5%. On the manually annotated corpus of Linnaeus-100, the results show a precision of 99%, recall of 97% and grounding accuracy of 97.4%. The OrganismTagger, including supporting tools, resources, training data and manual annotations, as well as end user and developer documentation, is freely available under an open-source license at http://www.semanticsoftware.info/organism-tagger. witte@semanticsoftware.info.
Discovering functional modules by topic modeling RNA-Seq based toxicogenomic data.
Yu, Ke; Gong, Binsheng; Lee, Mikyung; Liu, Zhichao; Xu, Joshua; Perkins, Roger; Tong, Weida
2014-09-15
Toxicogenomics (TGx) endeavors to elucidate the underlying molecular mechanisms by exploring gene expression profiles in response to toxic substances. Recently, RNA-Seq has increasingly been regarded as a more powerful alternative to microarrays in TGx studies. However, realizing RNA-Seq's full potential requires novel approaches to extracting information from the complex TGx data. Considering read counts as the number of times a word occurs in a document, gene expression profiles from RNA-Seq are analogous to the word-by-document matrix used in text mining. Topic modeling, which aims to discover the latent structures in text corpora, would therefore be helpful for exploring RNA-Seq based TGx data. In this study, topic modeling was applied to a typical RNA-Seq based TGx data set to discover hidden functional modules. The RNA-Seq based gene expression profiles were transformed into "documents", on which latent Dirichlet allocation (LDA) was used to build a topic model. We found that samples treated by compounds with the same modes of action (MoAs) could be clustered based on topic similarities. The topic most relevant to each cluster was identified as a "marker" topic, which was interpreted by gene enrichment analysis with MoAs and then confirmed by compound and pathway associations mined from the literature. To further validate the "marker" topics, we tested topic transferability from RNA-Seq to microarrays. The RNA-Seq based gene expression profile of a topic specifically associated with the peroxisome proliferator-activated receptor (PPAR) signaling pathway was used to query samples with similar expression profiles in two different microarray data sets, yielding an accuracy of about 85%. This proof-of-concept study demonstrates the applicability of topic modeling to discover functional modules in RNA-Seq data and suggests a valuable computational tool for leveraging information within TGx data in the RNA-Seq era.
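A minimal sketch of the read-counts-as-word-counts analogy described above: fit an LDA topic model to a sample-by-gene count matrix and inspect the genes that load most heavily on a topic, as one would for a candidate "marker" topic. The matrix here is random placeholder data, and the component count is arbitrary.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
counts = rng.poisson(lam=5, size=(30, 200))    # 30 "documents" (samples) x 200 "words" (genes)

lda = LatentDirichletAllocation(n_components=5, random_state=0)
doc_topics = lda.fit_transform(counts)          # sample-by-topic proportions

# Genes with the largest weight in topic 0 would be the ones submitted to
# enrichment analysis in the real workflow.
top_genes = np.argsort(lda.components_[0])[::-1][:10]
print(doc_topics.shape, top_genes)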
In Situ Soil Venting - Full Scale Test Hill AFB, Guidance Document, Literature Review. Volume 1
1991-08-01
AD-A254 924, Vol. I. In Situ Soil Venting - Full Scale Test, Hill AFB: Guidance Document, Literature Review. D. W. Depao, S. E. Herbes, J. H. Wilson, D. K. Solomon, and H. L. Jennings. Martin Marietta Energy Systems, Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge, TN 37831. August 1991.
Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature
Müller, Hans-Michael; Kenny, Eimear E
2004-01-01
We have developed Textpresso, a new text-mining system for scientific literature whose capabilities go far beyond those of a simple keyword search engine. Textpresso's two major elements are a collection of the full text of scientific articles split into individual sentences, and the implementation of categories of terms for which a database of articles and individual sentences can be searched. The categories are classes of biological concepts (e.g., gene, allele, cell or cell group, phenotype, etc.) and classes that relate two objects (e.g., association, regulation, etc.) or describe one (e.g., biological process, etc.). Together they form a catalog of types of objects and concepts called an ontology. After this ontology is populated with terms, the whole corpus of articles and abstracts is marked up to identify terms of these categories. The current ontology comprises 33 categories of terms. A search engine enables the user to search for one or a combination of these tags and/or keywords within a sentence or document, and as the ontology allows word meaning to be queried, it is possible to formulate semantic queries. Full text access increases recall of biological data types from 45% to 95%. Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies, with Textpresso automatically performing nearly as well as expert curators to identify sentences; in searches for two uniquely named genes and an interaction term, the ontology confers a 3-fold increase of search efficiency. Textpresso currently focuses on Caenorhabditis elegans literature, with 3,800 full text articles and 16,000 abstracts. The lexicon of the ontology contains 14,500 entries, each of which includes all versions of a specific word or phrase, and it includes all categories of the Gene Ontology database. Textpresso is a useful curation tool, as well as search engine for researchers, and can readily be extended to other organism-specific corpora of text. Textpresso can be accessed at http://www.textpresso.org or via WormBase at http://www.wormbase.org. PMID:15383839
"What is relevant in a text document?": An interpretable machine learning approach
Arras, Leila; Horn, Franziska; Montavon, Grégoire; Müller, Klaus-Robert
2017-01-01
Text documents can be described by a number of abstract concepts such as semantic category, writing style, or sentiment. Machine learning (ML) models have been trained to automatically map documents to these abstract concepts, allowing to annotate very large text collections, more than could be processed by a human in a lifetime. Besides predicting the text’s category very accurately, it is also highly desirable to understand how and why the categorization process takes place. In this paper, we demonstrate that such understanding can be achieved by tracing the classification decision back to individual words using layer-wise relevance propagation (LRP), a recently developed technique for explaining predictions of complex non-linear classifiers. We train two word-based ML models, a convolutional neural network (CNN) and a bag-of-words SVM classifier, on a topic categorization task and adapt the LRP method to decompose the predictions of these models onto words. Resulting scores indicate how much individual words contribute to the overall classification decision. This enables one to distill relevant information from text documents without an explicit semantic information extraction step. We further use the word-wise relevance scores for generating novel vector-based document representations which capture semantic information. Based on these document vectors, we introduce a measure of model explanatory power and show that, although the SVM and CNN models perform similarly in terms of classification accuracy, the latter exhibits a higher level of explainability which makes it more comprehensible for humans and potentially more useful for other applications. PMID:28800619
Robust keyword retrieval method for OCRed text
NASA Astrophysics Data System (ADS)
Fujii, Yusaku; Takebe, Hiroaki; Tanaka, Hiroshi; Hotta, Yoshinobu
2011-01-01
Document management systems have become important because of the growing popularity of electronic filing of documents and scanning of books, magazines, manuals, etc., through a scanner or a digital camera, for storage or reading on a PC or an electronic book. Text information acquired by optical character recognition (OCR) is usually added to the electronic documents for document retrieval. Since texts generated by OCR generally include character recognition errors, robust retrieval methods have been introduced to overcome this problem. In this paper, we propose a retrieval method that is robust against both character segmentation and recognition errors. In the proposed method, the insertion of noise characters and dropping of characters in the keyword retrieval enables robustness against character segmentation errors, and character substitution in the keyword of the recognition candidate for each character in OCR or any other character enables robustness against character recognition errors. The recall rate of the proposed method was 15% higher than that of the conventional method. However, the precision rate was 64% lower.
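The proposed candidate-lattice method is not reproduced here, but the underlying idea of tolerating a few character errors can be illustrated with a simple sliding-window approximate match (difflib); the similarity threshold and window sizes are arbitrary assumptions.

from difflib import SequenceMatcher

def fuzzy_find(keyword: str, ocr_text: str, min_ratio: float = 0.8):
    k = len(keyword)
    hits = []
    # Windows slightly shorter or longer than the keyword absorb dropped
    # or inserted characters caused by segmentation errors.
    for size in (k - 1, k, k + 1):
        for i in range(max(0, len(ocr_text) - size) + 1):
            window = ocr_text[i:i + size]
            ratio = SequenceMatcher(None, keyword, window).ratio()
            if ratio >= min_ratio:
                hits.append((i, window, round(ratio, 2)))
    return hits

# "docurnent" simulates the classic OCR confusion of "m" with "rn".
print(fuzzy_find("document", "eIectronic docurnent management systerns"))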
Text-image alignment for historical handwritten documents
NASA Astrophysics Data System (ADS)
Zinger, S.; Nerbonne, J.; Schomaker, L.
2009-01-01
We describe our work on text-image alignment in context of building a historical document retrieval system. We aim at aligning images of words in handwritten lines with their text transcriptions. The images of handwritten lines are automatically segmented from the scanned pages of historical documents and then manually transcribed. To train automatic routines to detect words in an image of handwritten text, we need a training set - images of words with their transcriptions. We present our results on aligning words from the images of handwritten lines and their corresponding text transcriptions. Alignment based on the longest spaces between portions of handwriting is a baseline. We then show that relative lengths, i.e. proportions of words in their lines, can be used to improve the alignment results considerably. To take into account the relative word length, we define the expressions for the cost function that has to be minimized for aligning text words with their images. We apply right to left alignment as well as alignment based on exhaustive search. The quality assessment of these alignments shows correct results for 69% of words from 100 lines, or 90% of partially correct and correct alignments combined.
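The relative-length baseline mentioned above can be written down in a few lines: each transcribed word is allotted a horizontal span of the line image proportional to its share of the characters. This is only the starting point; the paper's cost-function alignment additionally exploits the detected spaces between word images.

def align_by_length(words, line_width_px):
    """Assign each word an (x_start, x_end) span proportional to its length."""
    total_chars = sum(len(w) for w in words)
    spans, x = [], 0.0
    for w in words:
        width = line_width_px * len(w) / total_chars
        spans.append((w, round(x), round(x + width)))
        x += width
    return spans

# Hypothetical transcription of a 900-pixel-wide handwritten line.
print(align_by_length(["anno", "domini", "1586"], line_width_px=900))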
Vogel, Markus; Kaisers, Wolfgang; Wassmuth, Ralf; Mayatepek, Ertan
2015-11-03
Clinical documentation has undergone a change due to the usage of electronic health records. The core element is to capture clinical findings and document therapy electronically. Health care personnel spend a significant portion of their time on the computer. Alternatives to self-typing, such as speech recognition, are currently believed to increase documentation efficiency and quality, as well as satisfaction of health professionals while accomplishing clinical documentation, but few studies in this area have been published to date. This study describes the effects of using a Web-based medical speech recognition system for clinical documentation in a university hospital on (1) documentation speed, (2) document length, and (3) physician satisfaction. Reports of 28 physicians were randomized to be created with (intervention) or without (control) the assistance of a Web-based system of medical automatic speech recognition (ASR) in the German language. The documentation was entered into a browser's text area and the time to complete the documentation including all necessary corrections, correction effort, number of characters, and mood of participant were stored in a database. The underlying time comprised text entering, text correction, and finalization of the documentation event. Participants self-assessed their moods on a scale of 1-3 (1=good, 2=moderate, 3=bad). Statistical analysis was done using permutation tests. The number of clinical reports eligible for further analysis stood at 1455. Out of 1455 reports, 718 (49.35%) were assisted by ASR and 737 (50.65%) were not assisted by ASR. Average documentation speed without ASR was 173 (SD 101) characters per minute, while it was 217 (SD 120) characters per minute using ASR. The overall increase in documentation speed through Web-based ASR assistance was 26% (P=.04). Participants documented an average of 356 (SD 388) characters per report when not assisted by ASR and 649 (SD 561) characters per report when assisted by ASR. Participants' average mood rating was 1.3 (SD 0.6) using ASR assistance compared to 1.6 (SD 0.7) without ASR assistance (P<.001). We conclude that medical documentation with the assistance of Web-based speech recognition leads to an increase in documentation speed, document length, and participant mood when compared to self-typing. Speech recognition is a meaningful and effective tool for the clinical documentation process.
Mining the Text: 34 Text Features that Can Ease or Obstruct Text Comprehension and Use
ERIC Educational Resources Information Center
White, Sheida
2012-01-01
This article presents 34 characteristics of texts and tasks ("text features") that can make continuous (prose), noncontinuous (document), and quantitative texts easier or more difficult for adolescents and adults to comprehend and use. The text features were identified by examining the assessment tasks and associated texts in the national…
Assessing semantic similarity of texts - Methods and algorithms
NASA Astrophysics Data System (ADS)
Rozeva, Anna; Zerkova, Silvia
2017-12-01
Assessing the semantic similarity of texts is an important part of many text-related applications such as educational systems, information retrieval, and text summarization. The task is performed by sophisticated analysis that implements text-mining techniques. Text mining involves several pre-processing steps that yield a structured, representative model of the documents in a corpus by extracting and selecting the features that characterize their content. Generally, the model is vector-based and enables further analysis with knowledge discovery approaches. Algorithms and measures are used for assessing texts at the syntactic and semantic levels. An important text-mining method and similarity measure is latent semantic analysis (LSA), which reduces the dimensionality of the document vector space and better captures the text semantics. The mathematical background of LSA for deriving the meaning of the words in a given text by exploring their co-occurrence is examined, and the algorithm for obtaining the vector representation of words and their corresponding latent concepts in a reduced multidimensional space, as well as the similarity calculation, is presented.
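As a rough illustration of LSA-style similarity, independent of this paper's data, one can build TF-IDF vectors, reduce them with a truncated SVD, and compare documents by cosine similarity; the corpus and the number of latent dimensions below are placeholders.

    # Minimal LSA-style similarity sketch: TF-IDF -> truncated SVD -> cosine similarity.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "information retrieval of full text documents",
        "retrieval systems search full text articles",
        "latent semantic analysis reduces vector space dimensionality",
    ]

    tfidf = TfidfVectorizer().fit_transform(docs)          # document-term matrix
    lsa = TruncatedSVD(n_components=2, random_state=0)     # reduced latent space
    reduced = lsa.fit_transform(tfidf)
    print(cosine_similarity(reduced))                      # pairwise document similarity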
Relevance popularity: A term event model based feature selection scheme for text classification.
Feng, Guozhong; An, Baiguo; Yang, Fengqin; Wang, Han; Zhang, Libiao
2017-01-01
Feature selection is a practical approach for improving the performance of text classification methods by optimizing the feature subsets input to classifiers. In traditional feature selection methods such as information gain and chi-square, the number of documents that contain a particular term (i.e. the document frequency) is often used. However, the frequency of a given term appearing in each document has not been fully investigated, even though it is a promising feature to produce accurate classifications. In this paper, we propose a new feature selection scheme based on a term event multinomial naive Bayes probabilistic model. According to the model assumptions, the matching score function, which is based on the prediction probability ratio, can be factorized. Finally, we derive a feature selection measurement for each term after replacing inner parameters by their estimators. On a benchmark English text dataset (20 Newsgroups) and a Chinese text dataset (MPH-20), our numerical experiment results obtained from using two widely used text classifiers (naive Bayes and support vector machine) demonstrate that our method outperforms representative feature selection methods.
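The paper derives its exact measurement from the term event model; purely as a loose illustration of frequency-aware term scoring with a multinomial naive Bayes flavor, the sketch below ranks terms by the magnitude of a smoothed log ratio of class-conditional term probabilities. The scoring function, the toy data, and the smoothing are assumptions, not the authors' formula.

    # Loose illustration (not the paper's exact measure): score terms using
    # multinomial naive Bayes class-conditional probabilities estimated from
    # term frequencies, then keep the top-scoring terms.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["cheap loans offer now", "meeting agenda project plan",
            "loans loans cheap offer", "project meeting notes plan"]
    labels = np.array([1, 0, 1, 0])  # 1 = spam-like, 0 = ham-like (toy data)

    vec = CountVectorizer().fit(docs)
    counts = vec.transform(docs).toarray()
    terms = vec.get_feature_names_out()

    def class_term_probs(counts, mask, alpha=1.0):
        tf = counts[mask].sum(axis=0) + alpha          # Laplace-smoothed term counts
        return tf / tf.sum()

    p_pos = class_term_probs(counts, labels == 1)
    p_neg = class_term_probs(counts, labels == 0)
    scores = np.abs(np.log(p_pos / p_neg))             # magnitude of the log probability ratio

    top = np.argsort(scores)[::-1][:5]
    print([(terms[i], round(float(scores[i]), 2)) for i in top])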
Federal Register 2010, 2011, 2012, 2013, 2014
2010-07-02
... published in the Federal Register. Free Internet access to the official edition of the Federal Register and.... Electronic Access to This Document: You can view this document, as well as all other documents of this Department published in the Federal Register, in text or Adobe Portable Document Format (PDF) on the Internet...
Can physicians recognize their own patients in de-identified notes?
Meystre, Stéphane; Shen, Shuying; Hofmann, Deborah; Gundlapalli, Adi
2014-01-01
The adoption of Electronic Health Records is growing at a fast pace, and this growth results in very large quantities of patient clinical information becoming available in electronic format, with tremendous potential, but also equally growing concern for patient confidentiality breaches. De-identification of patient information has been proposed as a solution to both facilitate secondary uses of clinical information and protect patient information confidentiality. Automated approaches based on Natural Language Processing have been implemented and evaluated, allowing for much faster text de-identification than manual approaches. A U.S. Veterans Affairs clinical text de-identification project focused on investigating the current state of the art of automatic clinical text de-identification, on developing a best-of-breed de-identification application for clinical documents, and on evaluating its impact on subsequent text uses and the risk for re-identification. To evaluate this risk, we de-identified discharge summaries from 86 patients using our 'best-of-breed' text de-identification application with resynthesis of the identifiers detected. We then asked physicians working in the ward the patients were hospitalized in whether they could recognize these patients when reading the de-identified documents. Each document was examined by at least one resident and one attending physician, and for 4.65% of the documents, physicians thought they recognized the patient because of specific clinical information, but after verification, none was correctly re-identified.
Is searching full text more effective than searching abstracts?
Lin, Jimmy
2009-01-01
Background With the growing availability of full-text articles online, scientists and other consumers of the life sciences literature now have the ability to go beyond searching bibliographic records (title, abstract, metadata) to directly access full-text content. Motivated by this emerging trend, I posed the following question: is searching full text more effective than searching abstracts? This question is answered by comparing text retrieval algorithms on MEDLINE® abstracts, full-text articles, and spans (paragraphs) within full-text articles using data from the TREC 2007 genomics track evaluation. Two retrieval models are examined: bm25 and the ranking algorithm implemented in the open-source Lucene search engine. Results Experiments show that treating an entire article as an indexing unit does not consistently yield higher effectiveness compared to abstract-only search. However, retrieval based on spans, or paragraph-sized segments of full-text articles, consistently outperforms abstract-only search. Results suggest that highest overall effectiveness may be achieved by combining evidence from spans and full articles. Conclusion Users searching full text are more likely to find relevant articles than searching only abstracts. This finding affirms the value of full text collections for text retrieval and provides a starting point for future work in exploring algorithms that take advantage of rapidly-growing digital archives. Experimental results also highlight the need to develop distributed text retrieval algorithms, since full-text articles are significantly longer than abstracts and may require the computational resources of multiple machines in a cluster. The MapReduce programming model provides a convenient framework for organizing such computations. PMID:19192280
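The span-indexing idea, paragraphs as retrieval units, can be sketched with the open-source rank_bm25 package rather than the exact systems evaluated here; the spans and the query are placeholders.

    # Sketch: index paragraph-sized spans with BM25 and score a query against them.
    # Requires the rank_bm25 package (pip install rank-bm25); data are placeholders.
    from rank_bm25 import BM25Okapi

    spans = [
        "abstract only search over medline records",
        "paragraph sized spans of full text articles improve retrieval",
        "mapreduce distributes indexing of long full text articles",
    ]
    tokenized = [s.split() for s in spans]

    bm25 = BM25Okapi(tokenized)
    query = "full text retrieval".split()
    scores = bm25.get_scores(query)                 # one BM25 score per span
    for score, span in sorted(zip(scores, spans), reverse=True):
        print(round(float(score), 3), span)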
ERIC Educational Resources Information Center
Stadtler, Marc; Scharrer, Lisa; Brummernhenrich, Benjamin; Bromme, Rainer
2013-01-01
Past research has shown that readers often fail to notice conflicts in text. In our present study we investigated whether accessing information from multiple documents instead of a single document might alleviate this problem by motivating readers to integrate information. We further tested whether this effect would be moderated by source…
Transcript mapping for handwritten English documents
NASA Astrophysics Data System (ADS)
Jose, Damien; Bharadwaj, Anurag; Govindaraju, Venu
2008-01-01
Transcript mapping or text alignment with handwritten documents is the automatic alignment of words in a text file with word images in a handwritten document. Such a mapping has several applications in fields ranging from machine learning, where large quantities of truth data are required for evaluating handwriting recognition algorithms, to data mining, where word image indexes are used in ranked retrieval of scanned documents in a digital library. The alignment also aids "writer identity" verification algorithms. Interfaces which display scanned handwritten documents may use this alignment to highlight manuscript tokens when a person examines the corresponding transcript word. We propose an adaptation of the True DTW dynamic programming algorithm for English handwritten documents. Our primary contribution is the use, as the cost metric in the DTW algorithm, of dissimilarity scores from a word-model word recognizer combined with the Levenshtein distance between the recognized word and the lexicon word, which leads to fast and accurate alignment. The results provided confirm the effectiveness of our approach.
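The cost-combination idea can be sketched as an ordinary DTW over transcript words and recognizer outputs, where each cell's local cost mixes a recognizer dissimilarity score with a normalized Levenshtein distance; the equal weighting, the stand-in recognizer scores, and the helper names are assumptions rather than the paper's True DTW implementation.

    # Sketch of DTW alignment between transcript words and recognized word images,
    # with a local cost combining a (stand-in) recognizer dissimilarity and a
    # normalized Levenshtein distance. Weights and helpers are illustrative.
    import numpy as np

    def levenshtein(a, b):
        d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
        d[:, 0] = np.arange(len(a) + 1)
        d[0, :] = np.arange(len(b) + 1)
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                              d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
        return d[len(a), len(b)]

    def local_cost(transcript_word, recognized_word, recognizer_score, w=0.5):
        norm_lev = levenshtein(transcript_word, recognized_word) \
            / max(len(transcript_word), len(recognized_word), 1)
        return w * recognizer_score + (1 - w) * norm_lev

    def dtw(transcript, recognized, rec_scores):
        n, m = len(transcript), len(recognized)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                c = local_cost(transcript[i - 1], recognized[j - 1], rec_scores[j - 1])
                D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    # toy usage: recognizer output and its dissimilarity scores are invented
    print(dtw(["the", "quick", "fox"], ["the", "quik", "fox"], [0.1, 0.4, 0.2]))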
Text grouping in patent analysis using adaptive K-means clustering algorithm
NASA Astrophysics Data System (ADS)
Shanie, Tiara; Suprijadi, Jadi; Zulhanif
2017-03-01
Patents are one form of intellectual property, and analyzing them is necessary for understanding how technology is developing in each country and worldwide. This study uses patent documents about green tea obtained from the Espacenet server. Because patent documents related to tea technology are numerous and widespread, information retrieval (IR) is difficult for users, so it is necessary to group documents according to the related terms they contain. The study applies statistical text mining to the titles of the green tea patents in two phases: a data preparation stage and a data analysis stage. The data preparation stage uses text-mining methods, and the data analysis stage uses statistics; the cluster analysis algorithm employed is the adaptive K-means clustering algorithm. Results show that, based on the maximum silhouette value, the method generates 87 clusters associated with fifteen terms that can be used to support information retrieval.
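The clustering step can be approximated, outside the paper's adaptive variant, with plain K-means over TF-IDF vectors of patent titles, keeping the cluster count k that maximizes the silhouette value; the titles, the range of k, and the use of scikit-learn are illustrative assumptions.

    # Sketch: choose the number of clusters for patent-title vectors by the
    # maximum silhouette value (plain K-means as a stand-in for the adaptive variant).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    titles = [
        "green tea extraction process", "tea polyphenol beverage composition",
        "catechin purification from green tea", "packaging machine for tea bags",
        "tea leaf drying apparatus", "antioxidant drink containing green tea",
    ]
    X = TfidfVectorizer().fit_transform(titles)

    best_k, best_s = None, -1.0
    for k in range(2, 5):                       # candidate cluster counts
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        s = silhouette_score(X, labels)
        if s > best_s:
            best_k, best_s = k, s
    print(best_k, round(best_s, 3))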
Building Background Knowledge through Reading: Rethinking Text Sets
ERIC Educational Resources Information Center
Lupo, Sarah M.; Strong, John Z.; Lewis, William; Walpole, Sharon; McKenna, Michael C.
2018-01-01
To increase reading volume and help students access challenging texts, the authors propose a four-dimensional framework for text sets. The quad text set framework is designed around a target text: a challenging content area text, such as a canonical literary work, research article, or historical primary source document. The three remaining…
Southern Salish Sea Habitat Map Series: Admiralty Inlet
Cochrane, Guy R.; Dethier, Megan N.; Hodson, Timothy O.; Kull, Kristine K.; Golden, Nadine E.; Ritchie, Andrew C.; Moegling, Crescent; Pacunski, Robert E.; Cochrane, Guy R.
2015-01-01
This publication includes four map sheets, explanatory text, and a descriptive pamphlet. Each map sheet is published as a portable document format (PDF) file. ESRI ArcGIS compatible geotiffs (for example, bathymetry) and shapefiles (for example video observation points) will be available for download in the data catalog associated with this publication (Cochrane, 2015). An ArcGIS Project File with the symbology used to generate the map sheets is also provided. For those who do not own the full suite of ESRI GIS and mapping software, the data can be read using ESRI ArcReader, a free viewer that is available at http://www.esri.com/software/arcgis/arcreader/index.html.
Microfilm and Computer Full Text of Archival Documents
1988-10-13
Assessing usage patterns of electronic clinical documentation templates.
Vawdrey, David K
2008-11-06
Many vendors of electronic medical records support structured and free-text entry of clinical documents using configurable templates. At a healthcare institution comprising two large academic medical centers, a documentation management data mart and a custom, Web-accessible business intelligence application were developed to track the availability and usage of electronic documentation templates. For each medical center, template availability and usage trends were measured from November 2007 through February 2008. By February 2008, approximately 65,000 electronic notes were authored per week on the two campuses. One site had 934 available templates, with 313 being used to author at least one note. The other site had 765 templates, of which 480 were used. The most commonly used template at both campuses was a free text note called "Miscellaneous Nursing Note," which accounted for 33.3% of total documents generated at one campus and 15.2% at the other.
Davis, Philip M
2013-07-01
Does PubMed Central--a government-run digital archive of biomedical articles--compete with scientific society journals? A longitudinal, retrospective cohort analysis of 13,223 articles (5999 treatment, 7224 control) published in 14 society-run biomedical research journals in nutrition, experimental biology, physiology, and radiology between February 2008 and January 2011 reveals a 21.4% reduction in full-text hypertext markup language (HTML) article downloads and a 13.8% reduction in portable document format (PDF) article downloads from the journals' websites when U.S. National Institutes of Health-sponsored articles (treatment) become freely available from the PubMed Central repository. In addition, the effect of PubMed Central on reducing PDF article downloads is increasing over time, growing at a rate of 1.6% per year. There was no longitudinal effect for full-text HTML downloads. While PubMed Central may be providing complementary access to readers traditionally underserved by scientific journals, the loss of article readership from the journal website may weaken the ability of the journal to build communities of interest around research papers, impede the communication of news and events to scientific society members and journal readers, and reduce the perceived value of the journal to institutional subscribers.
Mucke, Hermann A M
2011-01-01
This investigation identifies patent applications published under the international Patent Cooperation Treaty between July 2010 and January 2011 in three significant fields of vascular risk management (arterial hypertension, atherosclerosis, and aneurysms) and investigates whether the inventors have also published peer-reviewed papers directly describing their claimed invention. Out of only 48 patent documents that specifically addressed at least one of the above-mentioned fields, 15 had immediate companion papers, of which 13 were published earlier than the corresponding patent applications; the majority of these papers were published by noncorporate patentees. Although the majority of patent applications (30 documents) had at least one corporate assignee, 18 came from academic environments. As expected, medical devices dominated in the aneurysm segment while pharmacology dominated hypertension and atherosclerosis. Although information related to hypertension, atherosclerosis, or aneurysms that was claimed in international patent applications reached the public quicker through the corresponding peer-review document if one was published, more than two-thirds of the patent applications had no such companion paper in a scientific journal. The patent literature, which is freely available online as full text, offers information to scientists and developers in the fields of vascular risk management that is not available from the peer-reviewed literature.
NCBI Bookshelf: books and documents in life sciences and health care
Hoeppner, Marilu A.
2013-01-01
Bookshelf (http://www.ncbi.nlm.nih.gov/books/) is a full-text electronic literature resource of books and documents in life sciences and health care at the National Center for Biotechnology Information (NCBI). Created in 1999 with a single book as an encyclopedic reference for resources such as PubMed and GenBank, it has grown to its current size of >1300 titles. Unlike other NCBI databases, such as GenBank and Gene, which have a strict data structure, books come in all forms; they are diverse in publication types, formats, sizes and authoring models. The Bookshelf data format is XML tagged in the NCBI Book DTD (Document Type Definition), modeled after the National Library of Medicine journal article DTDs. The book DTD has been used for systematically tagging the diverse data formats of books, a move that has set the foundation for the growth of this resource. Books at NCBI followed the route of journal articles in the PubMed Central project, using the PubMed Central architectural framework, workflows and processes. Through integration with other NCBI molecular databases, books at NCBI can be used to provide reference information for biological data and facilitate its discovery. This article describes Bookshelf at NCBI: its growth, data handling and retrieval and integration with molecular databases. PMID:23203889
Adaptive removal of background and white space from document images using seam categorization
NASA Astrophysics Data System (ADS)
Fillion, Claude; Fan, Zhigang; Monga, Vishal
2011-03-01
Document images are obtained regularly by rasterization of document content and as scans of printed documents. Resizing via background and white space removal is often desired for better consumption of these images, whether on displays or in print. While white space and background are easy to identify in images, existing methods such as naïve removal and content aware resizing (seam carving) each have limitations that can lead to undesirable artifacts, such as uneven spacing between lines of text or poor arrangement of content. An adaptive method based on image content is hence needed. In this paper we propose an adaptive method to intelligently remove white space and background content from document images. Document images are different from pictorial images in structure. They typically contain objects (text letters, pictures and graphics) separated by uniform background, which include both white paper space and other uniform color background. Pixels in uniform background regions are excellent candidates for deletion if resizing is required, as they introduce less change in document content and style, compared with deletion of object pixels. We propose a background deletion method that exploits both local and global context. The method aims to retain the document structural information and image quality.
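A much simpler illustration of the background-deletion idea, ignoring the paper's seam categorization and local/global context, is to drop image rows that are nearly uniform background; the threshold, margin, and synthetic page below are assumptions.

    # Simplified illustration: remove nearly uniform background rows from a
    # grayscale document image (the paper's seam categorization is far richer).
    import numpy as np

    def remove_background_rows(img, background=255, tol=2, keep_margin=1):
        """Keep rows containing content; drop long runs of uniform background."""
        is_content = (np.abs(img.astype(int) - background) > tol).any(axis=1)
        # keep a small margin of background rows around content for readability
        keep = is_content.copy()
        for shift in range(1, keep_margin + 1):
            keep[:-shift] |= is_content[shift:]
            keep[shift:] |= is_content[:-shift]
        return img[keep]

    # toy document: white page with two dark "text lines"
    page = np.full((10, 20), 255, dtype=np.uint8)
    page[2, 3:15] = 0
    page[7, 3:18] = 0
    print(remove_background_rows(page).shape)   # fewer rows, text lines preserved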
49 CFR 1104.2 - Document specifications.
Code of Federal Regulations, 2014 CFR
2014-10-01
... to facilitate automated processing in document sheet feeders, original documents of more than one... textual submissions. Use of color in filings is limited to images such as graphs, maps and photographs. To facilitate automated processing of color pages, color pages may not be inserted among pages containing text...
49 CFR 1104.2 - Document specifications.
Code of Federal Regulations, 2010 CFR
2010-10-01
... to facilitate automated processing in document sheet feeders, original documents of more than one... textual submissions. Use of color in filings is limited to images such as graphs, maps and photographs. To facilitate automated processing of color pages, color pages may not be inserted among pages containing text...
49 CFR 1104.2 - Document specifications.
Code of Federal Regulations, 2012 CFR
2012-10-01
... to facilitate automated processing in document sheet feeders, original documents of more than one... textual submissions. Use of color in filings is limited to images such as graphs, maps and photographs. To facilitate automated processing of color pages, color pages may not be inserted among pages containing text...
49 CFR 1104.2 - Document specifications.
Code of Federal Regulations, 2011 CFR
2011-10-01
... to facilitate automated processing in document sheet feeders, original documents of more than one... textual submissions. Use of color in filings is limited to images such as graphs, maps and photographs. To facilitate automated processing of color pages, color pages may not be inserted among pages containing text...
49 CFR 1104.2 - Document specifications.
Code of Federal Regulations, 2013 CFR
2013-10-01
... to facilitate automated processing in document sheet feeders, original documents of more than one... textual submissions. Use of color in filings is limited to images such as graphs, maps and photographs. To facilitate automated processing of color pages, color pages may not be inserted among pages containing text...
Enhancement of Text Representations Using Related Document Titles.
ERIC Educational Resources Information Center
Salton, G.; Zhang, Y.
1986-01-01
Briefly reviews various methodologies for constructing enhanced document representations, discusses their general lack of usefulness, and describes a method of document indexing which uses title words taken from bibliographically related items. Evaluation of this process indicates that it is not sufficiently reliable to warrant incorporation into…
Simple-random-sampling-based multiclass text classification algorithm.
Liu, Wuying; Wang, Lin; Yi, Mianzhu
2014-01-01
Multiclass text classification (MTC) is a challenging issue and the corresponding MTC algorithms can be used in many applications. The space-time overhead of these algorithms is a major concern in the era of big data. Through the investigation of the token frequency distribution in a Chinese web document collection, this paper reexamines the power law and proposes a simple-random-sampling-based MTC (SRSMTC) algorithm. Supported by a token-level memory to store labeled documents, the SRSMTC algorithm uses a text retrieval approach to solve text classification problems. The experimental results on the TanCorp data set show that the SRSMTC algorithm can achieve state-of-the-art performance with greatly reduced space-time requirements.
Basic test framework for the evaluation of text line segmentation and text parameter extraction.
Brodić, Darko; Milivojević, Dragan R; Milivojević, Zoran
2010-01-01
Text line segmentation is an essential stage in off-line optical character recognition (OCR) systems. It is key because inaccurately segmented text lines will lead to OCR failure. Text line segmentation of handwritten documents is a complex and diverse problem, complicated by the nature of handwriting. Hence, text line segmentation is a leading challenge in handwritten document image processing. Due to inconsistencies in the measurement and evaluation of text segmentation algorithm quality, a basic set of measurement methods is required. Currently, there is no commonly accepted one, and all algorithm evaluation is custom oriented. In this paper, a basic test framework for the evaluation of text feature extraction algorithms is proposed. This test framework consists of a few experiments primarily linked to text line segmentation, skew rate, and reference text line evaluation. Although they are mutually independent, the results obtained are strongly cross-linked. In the end, its suitability for different types of letters and languages as well as its adaptability are its main advantages. Thus, the paper presents an efficient evaluation method for text analysis algorithms.
49 CFR 1016.203 - Documentation of fees and expenses.
Code of Federal Regulations, 2010 CFR
2010-10-01
... 49 Transportation 8 2010-10-01 2010-10-01 false Documentation of fees and expenses. 1016.203 Section 1016.203 Transportation Other Regulations Relating to Transportation (Continued) SURFACE... § 1016.203 Documentation of fees and expenses. The application shall be accompanied by full documentation...
49 CFR 1016.203 - Documentation of fees and expenses.
Code of Federal Regulations, 2011 CFR
2011-10-01
... 49 Transportation 8 2011-10-01 2011-10-01 false Documentation of fees and expenses. 1016.203 Section 1016.203 Transportation Other Regulations Relating to Transportation (Continued) SURFACE... § 1016.203 Documentation of fees and expenses. The application shall be accompanied by full documentation...
40 CFR Appendix A to Part 66 - Technical Support Document
Code of Federal Regulations, 2010 CFR
2010-07-01
... 40 Protection of Environment 15 2010-07-01 2010-07-01 false Technical Support Document A Appendix A to Part 66 Protection of Environment ENVIRONMENTAL PROTECTION AGENCY (CONTINUED) AIR PROGRAMS...—Technical Support Document Note: For text of appendix A see appendix A to part 67. ...
Implementation of the common phrase index method on the phrase query for information retrieval
NASA Astrophysics Data System (ADS)
Fatmawati, Triyah; Zaman, Badrus; Werdiningsih, Indah
2017-08-01
As technology develops, finding information in news text has become easier, because news is distributed not only in print media such as newspapers but also in electronic media that can be accessed through search engines. When searching for relevant documents, a phrase is often used as the query, and the number of words making up the phrase query and their positions clearly affect the relevance of the documents returned and, consequently, the accuracy of the information obtained. Based on this problem, the purpose of this research was to analyze the implementation of the common phrase index method for information retrieval. The research was conducted on English news text and implemented in a prototype to determine the relevance of the documents produced. The system is built in stages: pre-processing, indexing, term-weighting calculation, and cosine similarity calculation; search results are then displayed in order of cosine similarity. System testing was conducted using 100 documents and 20 queries, and the results were used for evaluation in two steps: first, relevant documents were determined using the kappa statistic; second, the system's success rate was determined using precision, recall, and F-measure. The kappa statistic was 0.71, so the relevance judgements were suitable for evaluating the system. The evaluation yielded a precision of 0.37, a recall of 0.50, and an F-measure of 0.43, indicating that the system's ability to produce relevant documents is low.
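The ranking and evaluation stages can be sketched independently of the phrase-index details: rank documents by TF-IDF cosine similarity to a phrase query, then compute precision, recall, and F-measure against a set of relevance judgements; the documents, the score threshold, and the judgements below are placeholders.

    # Sketch of the ranking and evaluation stages: TF-IDF cosine ranking for a
    # phrase query, then precision/recall/F-measure against toy relevance judgements.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["election results announced in the capital",
            "football match results and scores",
            "capital city election turnout rises"]
    query = "election results"

    vec = TfidfVectorizer()
    D = vec.fit_transform(docs)
    q = vec.transform([query])
    scores = cosine_similarity(q, D).ravel()
    retrieved = {i for i, s in enumerate(scores) if s > 0.1}   # arbitrary score threshold
    relevant = {0, 2}                                          # toy relevance judgements

    precision = len(retrieved & relevant) / max(len(retrieved), 1)
    recall = len(retrieved & relevant) / max(len(relevant), 1)
    f_measure = 2 * precision * recall / max(precision + recall, 1e-9)
    print(sorted(retrieved), round(precision, 2), round(recall, 2), round(f_measure, 2))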
tmBioC: improving interoperability of text-mining tools with BioC.
Khare, Ritu; Wei, Chih-Hsuan; Mao, Yuqing; Leaman, Robert; Lu, Zhiyong
2014-01-01
The lack of interoperability among biomedical text-mining tools is a major bottleneck in creating more complex applications. Despite the availability of numerous methods and techniques for various text-mining tasks, combining different tools requires substantial effort and time owing to heterogeneity and variety in data formats. In response, BioC is a recent proposal that offers a minimalistic approach to tool interoperability by stipulating minimal changes to existing tools and applications. BioC is a family of XML formats that define how to present text documents and annotations, and also provides easy-to-use functions to read/write documents in the BioC format. In this study, we introduce our text-mining toolkit, which is designed to perform several challenging and significant tasks in the biomedical domain, and repackage the toolkit into BioC to enhance its interoperability. Our toolkit consists of six state-of-the-art tools for named-entity recognition, normalization and annotation (PubTator) of genes (GenNorm), diseases (DNorm), mutations (tmVar), species (SR4GN) and chemicals (tmChem). Although developed within the same group, each tool is designed to process input articles and output annotations in a different format. We modify these tools and enable them to read/write data in the proposed BioC format. We find that, using the BioC family of formats and functions, only minimal changes were required to build the newer versions of the tools. The resulting BioC-wrapped toolkit, which we have named tmBioC, consists of our tools in BioC, an annotated full-text corpus in BioC, and a format detection and conversion tool. Furthermore, through participation in the 2013 BioCreative IV Interoperability Track, we empirically demonstrate that the tools in tmBioC can be more efficiently integrated with each other as well as with external tools: our experimental results show that using BioC reduces the lines of code needed for text-mining tool integration by more than 60%. The tmBioC toolkit is publicly available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/. Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/. Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US.
Design and realization of the compound text-based test questions library management system
NASA Astrophysics Data System (ADS)
Shi, Lei; Feng, Lin; Zhao, Xin
2011-12-01
The test questions library management system is an essential part of an on-line examination system. Its basic requirement is to handle compound text containing information such as images and formulae and to create the corresponding Word documents. After comparing the two current approaches to document creation, this paper presents a design based on the Word Automation mechanism and OLE/COM technology, discusses the application of Word Automation in detail, and finally provides the operating results of the system, which serve as a useful reference for improving the efficiency of generating project documents and report forms.
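On Windows, the Word Automation route can be sketched through the COM interface, here driven from Python with pywin32 rather than the paper's own implementation; the object and method names follow the standard Word object model, while the output path and content are placeholders.

    # Sketch (Windows only, requires pywin32): drive Word through COM automation to
    # create a document containing text; images and formulae would be inserted
    # through the same object model. Paths and content are placeholders.
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    doc = word.Documents.Add()                       # new blank document
    doc.Content.Text = "Question 1: define full-text retrieval.\n"
    doc.SaveAs(r"C:\temp\question_sheet.doc")        # placeholder output path
    doc.Close()
    word.Quit()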
Introducing Text Analytics as a Graduate Business School Course
ERIC Educational Resources Information Center
Edgington, Theresa M.
2011-01-01
Text analytics refers to the process of analyzing unstructured data from documented sources, including open-ended surveys, blogs, and other types of web dialog. Text analytics has enveloped the concept of text mining, an analysis approach influenced heavily by data mining. While text mining has been covered extensively in various computer…
Validating a strategy for psychosocial phenotyping using a large corpus of clinical text.
Gundlapalli, Adi V; Redd, Andrew; Carter, Marjorie; Divita, Guy; Shen, Shuying; Palmer, Miland; Samore, Matthew H
2013-12-01
To develop algorithms to improve efficiency of patient phenotyping using natural language processing (NLP) on text data. Of a large number of note titles available in our database, we sought to determine those with highest yield and precision for psychosocial concepts. From a database of over 1 billion documents from US Department of Veterans Affairs medical facilities, a random sample of 1500 documents from each of 218 enterprise note titles were chosen. Psychosocial concepts were extracted using a UIMA-AS-based NLP pipeline (v3NLP), using a lexicon of relevant concepts with negation and template format annotators. Human reviewers evaluated a subset of documents for false positives and sensitivity. High-yield documents were identified by hit rate and precision. Reasons for false positivity were characterized. A total of 58 707 psychosocial concepts were identified from 316 355 documents for an overall hit rate of 0.2 concepts per document (median 0.1, range 1.6-0). Of 6031 concepts reviewed from a high-yield set of note titles, the overall precision for all concept categories was 80%, with variability among note titles and concept categories. Reasons for false positivity included templating, negation, context, and alternate meaning of words. The sensitivity of the NLP system was noted to be 49% (95% CI 43% to 55%). Phenotyping using NLP need not involve the entire document corpus. Our methods offer a generalizable strategy for scaling NLP pipelines to large free text corpora with complex linguistic annotations in attempts to identify patients of a certain phenotype.
47 CFR 1.1513 - Documentation of fees and expenses.
Code of Federal Regulations, 2010 CFR
2010-10-01
... 47 Telecommunication 1 2010-10-01 2010-10-01 false Documentation of fees and expenses. 1.1513... Applicants § 1.1513 Documentation of fees and expenses. The application shall be accompanied by full documentation of the fees and expenses, including the cost of any study, analysis, engineering report, test...
The Flip Sides of Full-Text: Superindex and the Harvard Business Review/Online.
ERIC Educational Resources Information Center
Dadlez, Eva M.
1984-01-01
This article illustrates similarities between two different types of full-text databases (Superindex and Harvard Business Review/Online) and uses them as an arena to demonstrate search and display applications of full-text. The selection of logical operators, full-text search strategies, and keywords and Bibliographic Retrieval Service's Occurrence…
New Framework for Cross-Domain Document Classification
2011-03-01
classification. The following paragraphs will introduce these related works in more detail. Wang et al. attempted to improve the accuracy of text document...of using Wikipedia to develop a thesaurus [20]. Gabrilovich et al. had an approach that is more elaborate in its use of Wikipedia text [21]. The...did show a modest improvement when it is performed using the Wikipedia information. Wang et al. improved on the results of co-clustering algorithm [24
Enhancement of the Shared Graphics Workspace.
1987-12-31
participants to share videodisc images and computer graphics displayed in color and text and facsimile information displayed in black on amber. They...could annotate the information in up to five colors and print the annotated version at both sites, using a standard fax machine. The SGWS also used a fax...system to display a document, whether text or photo, the camera scans the document, digitizes the data, and sends it via direct memory access (DMA) to
Essie: A Concept-based Search Engine for Structured Biomedical Text
Ide, Nicholas C.; Loane, Russell F.; Demner-Fushman, Dina
2007-01-01
This article describes the algorithms implemented in the Essie search engine that is currently serving several Web sites at the National Library of Medicine. Essie is a phrase-based search engine with term and concept query expansion and probabilistic relevancy ranking. Essie’s design is motivated by an observation that query terms are often conceptually related to terms in a document, without actually occurring in the document text. Essie’s performance was evaluated using data and standard evaluation methods from the 2003 and 2006 Text REtrieval Conference (TREC) Genomics track. Essie was the best-performing search engine in the 2003 TREC Genomics track and achieved results comparable to those of the highest-ranking systems on the 2006 TREC Genomics track task. Essie shows that a judicious combination of exploiting document structure, phrase searching, and concept based query expansion is a useful approach for information retrieval in the biomedical domain. PMID:17329729
An IR-Based Approach Utilizing Query Expansion for Plagiarism Detection in MEDLINE.
Nawab, Rao Muhammad Adeel; Stevenson, Mark; Clough, Paul
2017-01-01
The identification of duplicated and plagiarized passages of text has become an increasingly active area of research. In this paper, we investigate methods for plagiarism detection that aim to identify potential sources of plagiarism from MEDLINE, particularly when the original text has been modified through the replacement of words or phrases. A scalable approach based on Information Retrieval is used to perform candidate document selection (the identification of a subset of potential source documents given a suspicious text) from MEDLINE. Query expansion is performed using the UMLS Metathesaurus to deal with situations in which original documents are obfuscated. Various approaches to Word Sense Disambiguation are investigated to deal with cases where there are multiple Concept Unique Identifiers (CUIs) for a given term. Results using the proposed IR-based approach outperform a state-of-the-art baseline based on Kullback-Leibler Distance.
Automatic target validation based on neuroscientific literature mining for tractography
Vasques, Xavier; Richardet, Renaud; Hill, Sean L.; Slater, David; Chappelier, Jean-Cedric; Pralong, Etienne; Bloch, Jocelyne; Draganski, Bogdan; Cif, Laura
2015-01-01
Target identification for tractography studies requires solid anatomical knowledge validated by an extensive literature review across species for each seed structure to be studied. Manual literature review to identify targets for a given seed region is tedious and potentially subjective. Therefore, complementary approaches would be useful. We propose to use text-mining models to automatically suggest potential targets from the neuroscientific literature, full-text articles and abstracts, so that they can be used for anatomical connection studies and more specifically for tractography. We applied text-mining models to three structures: two well-studied structures that are validated deep brain stimulation targets, the internal globus pallidus and the subthalamic nucleus, and the nucleus accumbens, an exploratory target for treating psychiatric disorders. We performed a systematic review of the literature to document the projections of the three selected structures and compared it with the targets proposed by text-mining models, both in rat and primate (including human). We ran probabilistic tractography on the nucleus accumbens and compared the output with the results of the text-mining models and literature review. Overall, text-mining the literature could find three times as many targets as two man-weeks of curation could. The overall efficiency of the text-mining against literature review in our study was 98% recall (at 36% precision), meaning that over all the targets for the three selected seeds, only one target has been missed by text-mining. We demonstrate that connectivity for a structure of interest can be extracted from a very large amount of publications and abstracts. We believe this tool will be useful in helping the neuroscience community to facilitate connectivity studies of particular brain regions. The text mining tools used for the study are part of the HBP Neuroinformatics Platform, publicly available at http://connectivity-brainer.rhcloud.com/. PMID:26074781
ERIC Educational Resources Information Center
Sheehan, Kathleen M.
2015-01-01
The "TextEvaluator"® text analysis tool is a fully automated text complexity evaluation tool designed to help teachers, curriculum specialists, textbook publishers, and test developers select texts that are consistent with the text complexity guidelines specified in the Common Core State Standards. This paper documents the procedure used…
Angle comparison using an autocollimator
NASA Astrophysics Data System (ADS)
Geckeler, Ralf D.; Just, Andreas; Vasilev, Valentin; Prieto, Emilio; Dvorácek, František; Zelenika, Slobodan; Przybylska, Joanna; Duta, Alexandru; Victorov, Ilya; Pisani, Marco; Saraiva, Fernanda; Salgado, Jose-Antonio; Gao, Sitian; Anusorn, Tonmueanwai; Leng Tan, Siew; Cox, Peter; Watanabe, Tsukasa; Lewis, Andrew; Chaudhary, K. P.; Thalmann, Ruedi; Banreti, Edit; Nurul, Alfiyati; Fira, Roman; Yandayan, Tanfer; Chekirda, Konstantin; Bergmans, Rob; Lassila, Antti
2018-01-01
Autocollimators are versatile optical devices for the contactless measurement of the tilt angles of reflecting surfaces. An international key comparison (KC) on autocollimator calibration, EURAMET.L-K3.2009, was initiated by the European Association of National Metrology Institutes (EURAMET) to provide information on the capabilities in this field. The Physikalisch-Technische Bundesanstalt (PTB) acted as the pilot laboratory, with a total of 25 international participants from EURAMET and from the Asia Pacific Metrology Programme (APMP) providing measurements. This KC was the first one to utilise a high-resolution electronic autocollimator as a standard. In contrast to KCs in angle metrology which usually involve the full plane angle, it focused on relatively small angular ranges (+/-10 arcsec and +/-1000 arcsec) and step sizes (10 arcsec and 0.1 arcsec, respectively). This document represents the approved final report on the results of the KC. Main text To reach the main text of this paper, click on Final Report. Note that this text is that which appears in Appendix B of the BIPM key comparison database kcdb.bipm.org/. The final report has been peer-reviewed and approved for publication by the CCL, according to the provisions of the CIPM Mutual Recognition Arrangement (CIPM MRA).
Calculation algorithms for breath-by-breath alveolar gas exchange: the unknowns!
Golja, Petra; Cettolo, Valentina; Francescato, Maria Pia
2018-06-25
Several papers (algorithm papers) describe computational algorithms that assess alveolar breath-by-breath gas exchange by accounting for changes in lung gas stores. It is unclear, however, if the effects of the latter are actually considered in the literature. We evaluated dissemination of algorithm papers and the relevant provided information. The list of documents investigating exercise transients (in 1998-2017) was extracted from the Scopus database. Documents citing the algorithm papers in the same period were analyzed in full text to check consistency of the relevant information provided. Less than 8% (121/1522) of documents dealing with exercise transients cited at least one algorithm paper; the paper of Beaver et al. (J Appl Physiol 51:1662-1675, 1981) was cited most often, with others being cited tenfold less. Among the documents citing the algorithm paper of Beaver et al. (J Appl Physiol 51:1662-1675, 1981) (N = 251), only 176 cited it for the application of their algorithm/s; in turn, 61% (107/176) of them stated the alveolar breath-by-breath gas exchange measurement, but only 1% (1/107) of the latter also reported the assessment of volunteers' functional residual capacity, a crucial parameter for the application of the algorithm. Information related to gas exchange was provided consistently in the methods and in the results in 1 of the 107 documents. Dissemination of algorithm papers in the literature investigating exercise transients is by far narrower than expected. The information provided about the actual application of gas exchange algorithms is often inadequate and/or ambiguous. Some guidelines are provided that can help to improve the quality of future publications in the field.
76 FR 27309 - Committee on Measures of Student Success
Federal Register 2010, 2011, 2012, 2013, 2014
2011-05-11
... version of this document is the document published in the Federal Register. Free Internet access to the... text or Adobe Portable Document Format (PDF) on the Internet at the following site: http://www.ed.gov/news/fed-register/index.html . To use PDF you must have Adobe Acrobat Reader, which is available free...
76 FR 50198 - Committee on Measures of Student Success
Federal Register 2010, 2011, 2012, 2013, 2014
2011-08-12
...: The official version of this document is the document published in the Federal Register. Free Internet... Federal Register, in text or Adobe Portable Document Format (PDF) on the Internet at the following site... is available free at this site. If you have questions about using PDF, call the U.S. Government...
10 CFR 2.304 - Formal requirements for documents; signatures; acceptance for filing.
Code of Federal Regulations, 2010 CFR
2010-01-01
... documents. In addition to the requirements in this part, paper documents must be stapled or bound on the left side; typewritten, printed, or otherwise reproduced in permanent form on good unglazed paper of... not less than one inch. Text must be double-spaced, except that quotations may be single-spaced and...
Evaluating Combinations of Ranked Lists and Visualizations of Inter-Document Similarity.
ERIC Educational Resources Information Center
Allan, James; Leuski, Anton; Swan, Russell; Byrd, Donald
2001-01-01
Considers how ideas from document clustering can be used to improve retrieval accuracy of ranked lists in interactive systems and how to evaluate system effectiveness. Describes a TREC (Text Retrieval Conference) study that constructed and evaluated systems that present the user with ranked lists and a visualization of inter-document similarities.…
Combining approaches to on-line handwriting information retrieval
NASA Astrophysics Data System (ADS)
Peña Saldarriaga, Sebastián; Viard-Gaudin, Christian; Morin, Emmanuel
2010-01-01
In this work, we propose to combine two quite different approaches for retrieving handwritten documents. Our hypothesis is that different retrieval algorithms should retrieve different sets of documents for the same query; therefore, significant improvements in retrieval performance can be expected. The first approach is based on information retrieval techniques carried out on the noisy texts obtained through handwriting recognition, while the second approach is recognition-free, using a word spotting algorithm. Results show that for texts having a word error rate (WER) lower than 23%, the performance obtained with the combined system is close to the performance obtained on clean digital texts. In addition, for poorly recognized texts (WER > 52%), an improvement of nearly 17% can be observed with respect to the best available baseline method.
Ebola Outbreak Response: The Role of Information Resources and the National Library of Medicine
Love, Cynthia B.; Arnesen, Stacey J.; Phillips, Steven J.
2016-01-01
The US National Library of Medicine (NLM) offers Internet-based, no-cost resources useful for responding to the 2014 West Africa Ebola outbreak. Resources for health professionals, planners, responders, and researchers include PubMed, Disaster Lit, the Web page “Ebola Outbreak 2014: Information Resources,” and the Virus Variation database of sequences for Ebolavirus. In cooperation with participating publishers, NLM offers free access to full-text articles from over 650 biomedical journals and 4000 online reference books through the Emergency Access Initiative. At the start of a prolonged disaster event or disease outbreak, the documents and information of most immediate use may not be in the peer-reviewed biomedical journal literature. To maintain current awareness may require using any of the following: news outlets; social media; preliminary online data, maps, and situation reports; and documents published by nongovernmental organizations, international associations, and government agencies. Similar to the pattern of interest shown in the news and social media, use of NLM Ebola-related resources is also increasing since the start of the outbreak was first reported in March 2014 PMID:25325189
Névéol, Aurélie; Pereira, Suzanne; Kerdelhué, Gaetan; Dahamna, Badisse; Joubert, Michel; Darmoni, Stéfan J
2007-01-01
The growing number of resources to be indexed in the catalogue of online health resources in French (CISMeF) calls for curating strategies involving automatic indexing tools while maintaining the catalogue's high indexing quality standards. To develop a simple automatic tool that retrieves MeSH descriptors from document titles. In parallel with research on advanced indexing methods, a bag-of-words tool was developed for timely inclusion in CISMeF's maintenance system. An evaluation was carried out on a corpus of 99 documents. The indexing sets retrieved by the automatic tool were compared to manual indexing based on the title and on the full text of resources. 58% of the major main headings were retrieved by the bag-of-words algorithm, and the precision on main heading retrieval was 69%. Bag-of-words indexing has effectively been used on selected resources to be included in CISMeF since August 2006. Meanwhile, ongoing work aims at improving the current version of the tool.
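The bag-of-words step can be illustrated by matching normalized title tokens against a small dictionary of entry terms mapped to descriptors; the dictionary, the tokenization, and the descriptor names below are stand-ins for the MeSH resources the CISMeF tool actually uses.

    # Illustration of bag-of-words title indexing: match normalized title tokens
    # against a toy dictionary mapping entry terms to descriptors (stand-in for MeSH).
    import re

    entry_terms = {                      # toy dictionary; the real tool uses MeSH
        "asthma": "Asthma",
        "asthme": "Asthma",              # French entry term
        "enfant": "Child",
        "child": "Child",
        "vaccination": "Vaccination",
    }

    def index_title(title):
        tokens = re.findall(r"\w+", title.lower())
        return sorted({entry_terms[t] for t in tokens if t in entry_terms})

    print(index_title("Asthme de l'enfant : place de la vaccination"))
    # -> ['Asthma', 'Child', 'Vaccination']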
Patent citation network in nanotechnology (1976-2004)
NASA Astrophysics Data System (ADS)
Li, Xin; Chen, Hsinchun; Huang, Zan; Roco, Mihail C.
2007-06-01
The patent citation networks are described using critical node, core network, and network topological analysis. The main objective is to understand the knowledge transfer processes between technical fields, institutions, and countries. This includes identifying key influential players and subfields, the knowledge transfer patterns among them, and the overall knowledge transfer efficiency. The proposed framework is applied to the field of nanoscale science and engineering (NSE), including the citation networks of patent documents, submitting institutions, technology fields, and countries. The NSE patents were identified by keyword searching of the full text of patents at the United States Patent and Trademark Office (USPTO). The analysis shows that the United States is the most important citation center in NSE research. The institution citation network illustrates a more efficient knowledge transfer between institutions than a random network. The country citation network displays a knowledge transfer capability as efficient as a random network. The technology field citation network and the patent document citation network exhibit a less efficient knowledge diffusion capability than a random network. All four citation networks show a tendency to form local citation clusters.
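The network-level measures can be sketched with networkx on an invented citation graph: in-degree picks out heavily cited nodes, and comparing the average shortest path length with that of a random graph of the same size gives a crude view of knowledge transfer efficiency; the graph and node names are assumptions.

    # Sketch: citation-network analysis on a toy directed graph with networkx.
    # In-degree identifies heavily cited nodes; comparing average shortest path
    # length with a same-size random graph gives a crude efficiency check.
    import networkx as nx

    G = nx.DiGraph()
    G.add_edges_from([
        ("US_patent_A", "US_patent_B"), ("US_patent_C", "US_patent_B"),
        ("JP_patent_D", "US_patent_B"), ("US_patent_B", "US_patent_E"),
        ("JP_patent_D", "US_patent_E"),
    ])

    print(sorted(G.in_degree(), key=lambda x: -x[1])[:3])   # most cited nodes

    U = G.to_undirected()
    if nx.is_connected(U):
        print(nx.average_shortest_path_length(U))
    R = nx.gnm_random_graph(U.number_of_nodes(), U.number_of_edges(), seed=0)
    if nx.is_connected(R):
        print(nx.average_shortest_path_length(R))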
Shuttle-Mir, CD-ROM Supplement
NASA Technical Reports Server (NTRS)
Morgan, Clay; Launius, Roger (Technical Monitor)
2001-01-01
This CD-ROM is a companion to an illustrated history book with the same title. This multi-media, searchable CD includes the full text and images in the book, as well as additional material. Interviews, photographs, and biographies of the U.S. Astronauts, cosmonauts, and team members for the Shuttle-Mir Program are available. STS Mission Summaries for each mission involved can be viewed, including launch and landing details, crew lists, and mission highlights. Photographs and videos from each mission are included, as well as diagrams of different spacecraft, and computer-generated animations of the Mir deorbit, collision, and flyaround. Additional documents include mission status reports, published documents, news releases, personal letters, and oral histories. The experiments carried out on Mir are described, highlighting combustion and fluid physics research, life in microgravity, and research of the development of the solar system. The focus on improving space technology and planning for the International Space Station is explained. The main features of the book itself include: (1) Training and Operations; (2) Long Duration Psychology; (3) Bilingual Blues; and (4) Earth Observations.
Human Rights Texts: Converting Human Rights Primary Source Documents into Data
Fariss, Christopher J.; Linder, Fridolin J.; Jones, Zachary M.; Crabtree, Charles D.; Biek, Megan A.; Ross, Ana-Sophia M.; Kaur, Taranamol; Tsai, Michael
2015-01-01
We introduce and make publicly available a large corpus of digitized primary source human rights documents which are published annually by monitoring agencies that include Amnesty International, Human Rights Watch, the Lawyers Committee for Human Rights, and the United States Department of State. In addition to the digitized text, we also make available and describe document-term matrices, which are datasets that systematically organize the word counts from each unique document by each unique term within the corpus of human rights documents. To contextualize the importance of this corpus, we describe the development of coding procedures in the human rights community and several existing categorical indicators that have been created by human coding of the human rights documents contained in the corpus. We then discuss how the new human rights corpus and the existing human rights datasets can be used with a variety of statistical analyses and machine learning algorithms to help scholars understand how human rights practices and reporting have evolved over time. We close with a discussion of our plans for dataset maintenance, updating, and availability. PMID:26418817
National Survey of Patients’ Bill of Rights Statutes
Jacob, Dan M.; Hochhauser, Mark; Parker, Ruth M.
2009-01-01
BACKGROUND Despite vigorous national debate between 1999 and 2001, the federal patients’ bill of rights (PBOR) was not enacted. However, states have enacted legislation and the Joint Commission defined an accreditation standard to present patients with their rights. Because such initiatives can be undermined by overly complex language, we surveyed the readability of hospital PBOR documents as well as texts mandated by state law. METHODS State Web sites and codes were searched to identify PBOR statutes for general patient populations. The rights addressed were compared with the 12 themes presented in the American Hospital Association’s (AHA) PBOR text of 2002. In addition, we obtained PBOR texts from a sample of hospitals in each state. Readability was evaluated using Prose, a software program that reports the average of eight readability formulas. RESULTS Of 23 states with a PBOR statute for the general public, all establish a grievance policy, four protect a private right of action, and one stipulates fines for violations. These laws address an average of 7.4 of the 12 AHA themes. Nine states’ statutes specify PBOR text for distribution to patients. These documents have an average readability of 15th grade (range, 11.6, New York, to 17.0, Minnesota). PBOR documents from 240 US hospitals have an average readability of 14th grade (range, 8.2 to 17.0). CONCLUSIONS While the average U.S. adult reads at an 8th grade level, an advanced college reading level is routinely required to read PBOR documents. Patients are not likely to learn about their rights from documents they cannot read. PMID:19189192
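For readers unfamiliar with grade-level readability scoring, the sketch below computes one common formula (Flesch-Kincaid grade level) with a crude syllable heuristic. It is not the Prose software used in the study, which averages eight formulas, and the sample sentence is invented.

```python
# Illustrative sketch only: one common readability formula (Flesch-Kincaid
# grade level), not the Prose software used in the study.
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Flesch-Kincaid grade level formula.
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

sample = ("The patient has the right to considerate and respectful care, "
          "and to obtain complete and current information concerning diagnosis, "
          "treatment, and prognosis.")
print(f"Estimated grade level: {fk_grade(sample):.1f}")
```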
The Ecological Approach to Text Visualization.
ERIC Educational Resources Information Center
Wise, James A.
1999-01-01
Presents both theoretical and technical bases on which to build a "science of text visualization." The Spatial Paradigm for Information Retrieval and Exploration (SPIRE) text-visualization system, which images information from free-text documents as natural terrains, serves as an example of the "ecological approach" in its visual metaphor, its…
2011-01-01
Background The selection of relevant articles for curation, and the linking of those articles to the experimental techniques confirming the findings, became one of the primary subjects of the recent BioCreative III contest. The contest’s Protein-Protein Interaction (PPI) task consisted of two sub-tasks: Article Classification Task (ACT) and Interaction Method Task (IMT). ACT aimed to automatically select relevant documents for PPI curation, whereas the goal of IMT was to recognise the methods used in experiments for identifying the interactions in full-text articles. Results We proposed and compared several classification-based methods for both tasks, employing rich contextual features as well as features extracted from external knowledge sources. For IMT, a new method that classifies pair-wise relations between every text phrase and candidate interaction method obtained promising results with an F1 score of 64.49%, as tested on the task’s development dataset. We also explored ways to combine this new approach and more conventional, multi-label document classification methods. For ACT, our classifiers exploited automatically detected named entities and other linguistic information. The evaluation results on the BioCreative III PPI test datasets showed that our systems were very competitive: one of our IMT methods yielded the best performance among all participants, as measured by F1 score, Matthews Correlation Coefficient and AUC iP/R; whereas for ACT, our best classifier was ranked second as measured by AUC iP/R, and was also competitive according to other metrics. Conclusions Our novel approach that converts the multi-class, multi-label classification problem to a binary classification problem showed much promise in IMT. Nevertheless, on the test dataset the best performance was achieved by taking the union of the output of this method and that of a multi-class, multi-label document classifier, which indicates that the two types of systems complement each other in terms of recall. For ACT, our system exploited a rich set of features and also obtained encouraging results. We examined the features with respect to their contributions to the classification results, and concluded that contextual words surrounding named entities, as well as the MeSH headings associated with the documents, were among the main contributors to the performance. PMID:22151769
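The "conventional, multi-label document classification" baseline mentioned above can be illustrated with a one-binary-classifier-per-label setup. The sketch below is not the authors' BioCreative system; it uses scikit-learn and a few invented toy sentences and interaction-method labels.

```python
# Minimal sketch of a multi-label document classification baseline (one binary
# classifier per interaction-method label). Not the authors' system; the toy
# sentences and labels below are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

docs = [
    "binding was confirmed by coimmunoprecipitation of the complex",
    "a yeast two-hybrid screen identified the interacting partner",
    "pull-down assays and two-hybrid analysis support the interaction",
]
labels = [["coimmunoprecipitation"], ["two hybrid"], ["pull down", "two hybrid"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)                  # one binary column per method label

X = TfidfVectorizer().fit_transform(docs)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

pred = clf.predict(X)
print(mlb.inverse_transform(pred))
```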
Scholarly Information Extraction Is Going to Make a Quantum Leap with PubMed Central (PMC).
Matthies, Franz; Hahn, Udo
2017-01-01
With the increasing availability of complete full texts (journal articles), rather than their surrogates (titles, abstracts), as resources for text analytics, entirely new opportunities arise for information extraction and text mining from scholarly publications. Yet, we gathered evidence that a range of problems is encountered in full-text processing when biomedical text analytics simply reuse existing NLP pipelines which were developed on the basis of abstracts (rather than full texts). We conducted experiments with four different relation extraction engines, all of which were top performers in previous BioNLP Event Extraction Challenges. We found that abstract-trained engines lose up to 6.6 percentage points of F-score when run on full-text data. Hence, the reuse of existing abstract-based NLP software in a full-text scenario is considered harmful because of heavy performance losses. Given the current lack of annotated full-text resources to train on, our study quantifies the price paid for this shortcut.
Text Classification for Organizational Researchers
Kobayashi, Vladimer B.; Mol, Stefan T.; Berkers, Hannah A.; Kismihók, Gábor; Den Hartog, Deanne N.
2017-01-01
Organizations are increasingly interested in classifying texts or parts thereof into categories, as this enables more effective use of their information. Manual procedures for text classification work well for up to a few hundred documents. However, when the number of documents is larger, manual procedures become laborious, time-consuming, and potentially unreliable. Techniques from text mining facilitate the automatic assignment of text strings to categories, making classification expedient, fast, and reliable, which creates potential for its application in organizational research. The purpose of this article is to familiarize organizational researchers with text mining techniques from machine learning and statistics. We describe the text classification process in several roughly sequential steps, namely training data preparation, preprocessing, transformation, application of classification techniques, and validation, and provide concrete recommendations at each step. To help researchers develop their own text classifiers, the R code associated with each step is presented in a tutorial. The tutorial draws from our own work on job vacancy mining. We end the article by discussing how researchers can validate a text classification model and the associated output. PMID:29881249
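The tutorial accompanying the article is written in R; as a rough, language-shifted illustration of the same sequential steps (preparation, preprocessing, transformation, classification, validation), a minimal scikit-learn pipeline might look like the following, with invented toy vacancy texts.

```python
# Rough scikit-learn analogue of the steps described above; the article's own
# tutorial is in R, and the toy vacancy texts here are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

texts = ["data analyst vacancy requiring sql and statistics",
         "nurse position in the cardiology department",
         "software engineer vacancy, python and cloud experience",
         "registered nurse needed for night shifts"] * 5
labels = ["IT", "healthcare", "IT", "healthcare"] * 5

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),  # preprocessing + transformation
    ("clf", LinearSVC()),                                              # classification technique
])

# Validation step: cross-validated accuracy.
scores = cross_val_score(pipeline, texts, labels, cv=5)
print(f"mean accuracy: {scores.mean():.2f}")
```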
New directions in biomedical text annotation: definitions, guidelines and corpus construction
Wilbur, W John; Rzhetsky, Andrey; Shatkay, Hagit
2006-01-01
Background While biomedical text mining is emerging as an important research area, practical results have proven difficult to achieve. We believe that an important first step towards more accurate text-mining lies in the ability to identify and characterize text that satisfies various types of information needs. We report here the results of our inquiry into properties of scientific text that have sufficient generality to transcend the confines of a narrow subject area, while supporting practical mining of text for factual information. Our ultimate goal is to annotate a significant corpus of biomedical text and train machine learning methods to automatically categorize such text along certain dimensions that we have defined. Results We have identified five qualitative dimensions that we believe characterize a broad range of scientific sentences, and are therefore useful for supporting a general approach to text-mining: focus, polarity, certainty, evidence, and directionality. We define these dimensions and describe the guidelines we have developed for annotating text with regard to them. To examine the effectiveness of the guidelines, twelve annotators independently annotated the same set of 101 sentences that were randomly selected from current biomedical periodicals. Analysis of these annotations shows 70–80% inter-annotator agreement, suggesting that our guidelines indeed present a well-defined, executable and reproducible task. Conclusion We present our guidelines defining a text annotation task, along with annotation results from multiple independently produced annotations, demonstrating the feasibility of the task. The annotation of a very large corpus of documents along these guidelines is currently ongoing. These annotations form the basis for the categorization of text along multiple dimensions, to support viable text mining for experimental results, methodology statements, and other forms of information. We are currently developing machine learning methods, to be trained and tested on the annotated corpus, that would allow for the automatic categorization of biomedical text along the general dimensions that we have presented. The guidelines in full detail, along with annotated examples, are publicly available. PMID:16867190
The NASA Astrophysics Data System joins the Revolution
NASA Astrophysics Data System (ADS)
Accomazzi, Alberto; Kurtz, Michael J.; Henneken, Edwin; Grant, Carolyn S.; Thompson, Donna M.; Chyla, Roman; Holachek, Alexandra; Sudilovsky, Vladimir; Elliott, Jonathan; Murray, Stephen S.
2015-08-01
Whether or not scholarly publications are going through an evolution or revolution, one comforting certainty remains: the NASA Astrophysics Data System (ADS) is here to help the working astronomer and librarian navigate through the increasingly complex communication environment we find ourselves in. Born as a bibliographic database, today's ADS is best described as an "aggregator" of scholarly resources relevant to the needs of researchers in astronomy and physics. In addition to indexing content from a variety of publishers, data and software archives, the ADS enriches its records by text-mining and indexing the full-text articles, and enriches its metadata through the extraction of citations and acknowledgments and the ingest of bibliographies and data links maintained by astronomy institutions and data archives. In addition, ADS generates and maintains citation and co-readership networks to support discovery and bibliometric analysis. In this talk I will summarize new and ongoing curation activities and technology developments of the ADS in the face of the ever-changing world of scholarly publishing and the trends in information-sharing behavior of astronomers. Recent curation efforts include the indexing of non-standard scholarly content (such as software packages, IVOA documents and standards, and NASA award proposals); the indexing of additional content (full text of articles, acknowledgments, affiliations, ORCID ids); and enhanced support for bibliographic groups and data links. Recent technology developments include a new Application Programming Interface which provides access to a variety of ADS microservices, a new user interface featuring a variety of visualizations and bibliometric analysis, and integration with ORCID services to support paper claiming.
Using complex networks for text classification: Discriminating informative and imaginative documents
NASA Astrophysics Data System (ADS)
de Arruda, Henrique F.; Costa, Luciano da F.; Amancio, Diego R.
2016-01-01
Statistical methods have been widely employed in recent years to grasp many language properties. The application of such techniques has allowed an improvement of several linguistic applications, such as machine translation and document classification. In the latter, many approaches have emphasised the semantic content of texts, as is the case with bag-of-words language models. These approaches have certainly yielded reasonable performance. However, some potential features, such as the structural organization of texts, have been used only in a few studies. In this context, we probe how features derived from textual structure analysis can be effectively employed in a classification task. More specifically, we performed a supervised classification aiming at discriminating informative from imaginative documents. Using a networked model that describes the local topological/dynamical properties of function words, we achieved an accuracy rate of up to 95%, which is much higher than similar networked approaches. A systematic analysis of feature relevance revealed that symmetry and accessibility measurements are among the most prominent network measurements. Our results suggest that these measurements could be used in related language applications, as they play a complementary role in characterising texts.
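The general idea of deriving classification features from a text's network structure can be sketched as follows. This is not the authors' model: their accessibility and symmetry measurements are more elaborate, so standard networkx measures are used here as stand-ins, on an invented sample sentence.

```python
# Rough sketch of the general idea only: build a word-adjacency network for a
# text and extract topological features that could feed a classifier. The
# paper's own measurements (accessibility, symmetry) are more elaborate;
# standard networkx measures are used here as stand-ins.
import networkx as nx

def text_to_network(text: str) -> nx.Graph:
    words = text.lower().split()
    G = nx.Graph()
    # Link each word to the next one (adjacency network).
    G.add_edges_from(zip(words, words[1:]))
    return G

def topological_features(G: nx.Graph) -> dict:
    degrees = [d for _, d in G.degree()]
    return {
        "avg_degree": sum(degrees) / len(degrees),
        "clustering": nx.average_clustering(G),
        "density": nx.density(G),
    }

sample = "the cat sat on the mat and the dog sat on the rug"
print(topological_features(text_to_network(sample)))
```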
Necrotising pneumonia caused by non-PVL Staphylococcus aureus with 2-year follow-up.
Hilton, Bryn; Tavare, Aniket N; Creer, Dean
2017-12-07
Necrotising pneumonia (NP) is a rare but life-threatening complication of pulmonary infection. It is characterised by progressive necrosis of lung parenchyma with cavitating foci evident upon radiological investigation. This article reports the case of a 52-year-old woman, an immunocompetent healthcare professional, presenting to Accident and Emergency with NP and Staphylococcus aureus septicaemia. The cavitating lesion was not identified on the initial chest X-ray, leading to a delay in antimicrobial optimisation. However, the patient went on to achieve a full symptomatic recovery in 1 month and complete radiological recovery at 2-year follow-up. Long-term prognosis for adult cases of NP currently remains undocumented. This case serves as the first piece of published evidence documenting full physiological and radiological recovery following appropriate treatment of NP in an immunocompetent adult patient. © BMJ Publishing Group Ltd (unless otherwise stated in the text of the article) 2017. All rights reserved. No commercial use is permitted unless otherwise expressly granted.
Protocols for Scholarly Communication
NASA Astrophysics Data System (ADS)
Pepe, A.; Yeomans, J.
2007-10-01
CERN, the European Organization for Nuclear Research, has operated an institutional preprint repository for more than 10 years. The repository contains over 850,000 records of which more than 450,000 are full-text OA preprints, mostly in the field of particle physics, and it is integrated with the library's holdings of books, conference proceedings, journals and other grey literature. In order to encourage effective propagation and open access to scholarly material, CERN is implementing a range of innovative library services into its document repository: automatic keywording, reference extraction, collaborative management tools and bibliometric tools. Some of these services, such as user reviewing and automatic metadata extraction, could make up an interesting testbed for future publishing solutions and certainly provide an exciting environment for e-science possibilities. The future protocol for scientific communication should guide authors naturally towards OA publication, and CERN wants to help reach a full open access publishing environment for the particle physics community and related sciences in the next few years.
Overview of Historical Earthquake Document Database in Japan and Future Development
NASA Astrophysics Data System (ADS)
Nishiyama, A.; Satake, K.
2014-12-01
In Japan, damage and disasters from large historical earthquakes have been documented and preserved. Compilation of historical earthquake documents started in the early 20th century, and 33 volumes of historical document source books (about 27,000 pages) have been published. However, these source books are not used effectively by researchers, because low-reliability historical records are mixed in and because keyword searching by characters and dates is difficult. To overcome these problems and to promote historical earthquake studies in Japan, construction of text databases started in the 21st century. For historical earthquakes from the beginning of the 7th century to the early 17th century, the "Online Database of Historical Documents in Japanese Earthquakes and Eruptions in the Ancient and Medieval Ages" (Ishibashi, 2009) has already been constructed. Its compilers investigated the source books or original texts of the historical literature, emended the descriptions, and assigned a reliability to each historical document on the basis of the age at which it was written. Another project compiled the historical documents for seven damaging earthquakes that occurred along the Sea of Japan coast in Honshu, central Japan, in the Edo period (from the beginning of the 17th century to the middle of the 19th century) and constructed a text database and a seismic intensity database. These are now published on the web (in Japanese only). However, only about 9% of the earthquake source books have been digitized so far. Therefore, we plan to digitize all of the remaining historical documents under a research program that started in 2014. The specification of the database will be similar to the previous ones. We also plan to combine this database with a liquefaction traces database, which will be constructed by another research program, by adding the location information described in the historical documents. The constructed database would be used to estimate the distributions of seismic intensities and tsunami heights.
Full-field optical coherence tomography used for security and document identity
NASA Astrophysics Data System (ADS)
Chang, Shoude; Mao, Youxin; Sherif, Sherif; Flueraru, Costel
2006-09-01
Optical coherence tomography (OCT) is an emerging technology for high-resolution cross-sectional imaging of 3D structures. In recent years, OCT systems have been used mainly for medical, especially ophthalmological, diagnostics. Because an OCT system is capable of exploring the internal features of an object, we apply OCT technology to directly retrieve 2D information pre-stored in a multiple-layer information carrier. The standard depth resolution of an OCT system is at the micrometer level. If a 20 mm by 20 mm sampling area with a 1024 x 1024 CCD array is used in an OCT system with 10 μm depth resolution, an information carrier with a volume of 20 mm x 20 mm x 2 mm could contain 200 megapixel images. Because of its tiny size and large information volume, the information carrier, with its OCT retrieval system, will have potential applications in document security and object identification. In addition, as the information carrier can be made of low-scattering transparent material, the signal-to-noise ratio will be improved dramatically. As a consequence, the specific hardware and complicated software can also be greatly simplified. Owing to the absence of scanning along the X-Y axes, full-field OCT could be the simplest and most economical imaging system for extracting information from such a multilayer information carrier. In this paper, the design and implementation of a full-field OCT system are described and the related algorithms are introduced. In our experiments, a four-layer information carrier is used, which contains four layers of image patterns: two text images and two fingerprint images. The extracted tomography images of each layer are also provided.
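A back-of-envelope reading of the capacity figure quoted above, under the assumption that the 10 μm value refers to depth resolution:

```python
# Back-of-envelope check of the capacity figure (one possible reading of the
# numbers above): a 2 mm thick carrier sampled at 10 um depth resolution gives
# ~200 layers, each imaged on a 1024 x 1024 CCD.
thickness_mm = 2.0
depth_resolution_um = 10.0
layers = thickness_mm * 1000 / depth_resolution_um            # ~200 layers
pixels_per_layer = 1024 * 1024                                 # ~1.05 megapixels per layer
total_pixels = layers * pixels_per_layer
print(f"{layers:.0f} layers x {pixels_per_layer/1e6:.2f} Mpixel ≈ {total_pixels/1e6:.0f} Mpixels total")
```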
Möller, Ingrid; Loza, Estibaliz; Uson, Jacqueline; Acebes, Carlos; Andreu, Jose Luis; Batlle, Enrique; Bueno, Ángel; Collado, Paz; Fernández-Gallardo, Juan Manuel; González, Carlos; Jiménez Palop, Mercedes; Lisbona, María Pilar; Macarrón, Pilar; Maymó, Joan; Narváez, Jose Antonio; Navarro-Compán, Victoria; Sanz, Jesús; Rosario, M Piedad; Vicente, Esther; Naredo, Esperanza
To develop evidence-based recommendations on the use of ultrasound (US) and magnetic resonance imaging (MRI) in patients with rheumatoid arthritis (RA). Recommendations were generated following a nominal group technique. A panel of experts, consisting of 15 rheumatologists and 3 radiologists, was established in the first panel meeting to define the scope and purpose of the consensus document, as well as chapters, potential recommendations and systematic literature reviews (we used and updated those from previous EULAR documents). A first draft of recommendations and text was generated. Then, an electronic Delphi process (2 rounds) was carried out. Recommendations were voted on from 1 (total disagreement) to 10 (total agreement). Agreement was defined as at least 70% of experts voting ≥7. The level of evidence and grade of recommendation were assessed using the Oxford Centre for Evidence-based Medicine Levels of Evidence. The full text was circulated and reviewed by the panel. The consensus was coordinated by an expert methodologist. A total of 20 recommendations were proposed. They cover the validity of US and MRI regarding inflammation and damage detection, diagnosis, prediction (structural damage progression, flare, treatment response, etc.), monitoring and the use of US-guided injections/biopsies. These recommendations will help clinicians use US and MRI in RA patients. Copyright © 2016 Elsevier España, S.L.U. and Sociedad Española de Reumatología y Colegio Mexicano de Reumatología. All rights reserved.
This Guide to Documenting and Managing Cost and Performance Information for Remediation Projects provides the recommended procedures for documenting the results of completed and on-going full-scale and demonstration-scale remediation projects.
NASA Astrophysics Data System (ADS)
Zhang, Hui; Wang, Deqing; Wu, Wenjun; Hu, Hongping
2012-11-01
In today's business environment, enterprises are increasingly under pressure to process the vast amount of data produced every day within the enterprise. One approach is to focus on business intelligence (BI) applications and to increase the commercial added value through such business analytics activities. Term weighting, which is used to represent documents as vectors in the term space, is a vital task in enterprise information retrieval (IR), text categorisation, text analytics, etc. When determining a term's weight in a document, the traditional TF-IDF scheme considers only the term's occurrence frequency within the document and in the entire set of documents, which leads to some meaningful terms not receiving appropriate weight. In this article, we propose a new term weighting scheme called Term Frequency - Function of Document Frequency (TF-FDF) to address this issue. Instead of using a monotonically decreasing function such as Inverse Document Frequency, FDF is a convex function that dynamically adjusts weights according to the significance of the words in a document set. This function can be manually tuned based on the distribution of the most meaningful words, which semantically represent the document set. Our experiments show that TF-FDF achieves higher Normalised Discounted Cumulative Gain values in IR than TF-IDF and its variants, improving the accuracy of relevance ranking in IR results.
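The abstract does not give the exact functional form of FDF, so the sketch below pairs the classic TF-IDF weight with a purely hypothetical convex, non-monotone function of document frequency, simply to illustrate the contrast the authors describe; the pivot parameter is invented.

```python
# Sketch only: the exact FDF function is not given in the abstract, so the
# convex "fdf" below is a hypothetical stand-in, shown next to the classic
# TF-IDF weight for contrast.
import math

def tf_idf(tf: int, df: int, n_docs: int) -> float:
    # Classic scheme: weight decreases monotonically as document frequency grows.
    return tf * math.log(n_docs / df)

def tf_fdf(tf: int, df: int, n_docs: int, pivot: float = 0.5) -> float:
    # Hypothetical convex, non-monotone function of document frequency:
    # weights fall toward a pivot document-frequency ratio and rise again after it.
    ratio = df / n_docs
    return tf * (1.0 + (ratio - pivot) ** 2)

n_docs = 1000
for df in (5, 500, 900):
    print(df, round(tf_idf(3, df, n_docs), 2), round(tf_fdf(3, df, n_docs), 2))
```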
Yu, Hong; Agarwal, Shashank; Johnston, Mark; Cohen, Aaron
2009-01-06
Biomedical scientists need to access figures to validate research facts and to formulate or to test novel research hypotheses. However, figures are difficult to comprehend without associated text (e.g., figure legend and other reference text). We are developing automated systems to extract the relevant explanatory information along with figures extracted from full text articles. Such systems could be very useful in improving figure retrieval and in reducing the workload of biomedical scientists, who otherwise have to retrieve and read the entire full-text journal article to determine which figures are relevant to their research. As a crucial step, we studied the importance of associated text in biomedical figure comprehension. Twenty subjects evaluated three figure-text combinations: figure+legend, figure+legend+title+abstract, and figure+full-text. Using a Likert scale, each subject scored each figure+text according to the extent to which the subject thought he/she understood the meaning of the figure and the confidence in providing the assigned score. Additionally, each subject entered a free text summary for each figure-text. We identified missing information using indicator words present within the text summaries. Both the Likert scores and the missing information were statistically analyzed for differences among the figure-text types. We also evaluated the quality of text summaries with the text-summarization evaluation method the ROUGE score. Our results showed statistically significant differences in figure comprehension when varying levels of text were provided. When the full-text article is not available, presenting just the figure+legend left biomedical researchers lacking 39-68% of the information about a figure as compared to having complete figure comprehension; adding the title and abstract improved the situation, but still left biomedical researchers missing 30% of the information. When the full-text article is available, figure comprehension increased to 86-97%; this indicates that researchers felt that only 3-14% of the necessary information for full figure comprehension was missing when full text was available to them. Clearly there is information in the abstract and in the full text that biomedical scientists deem important for understanding the figures that appear in full-text biomedical articles. We conclude that the texts that appear in full-text biomedical articles are useful for understanding the meaning of a figure, and an effective figure-mining system needs to unlock the information beyond figure legend. Our work provides important guidance to the figure mining systems that extract information only from figure and figure legend.
2009-01-01
Background Biomedical scientists need to access figures to validate research facts and to formulate or to test novel research hypotheses. However, figures are difficult to comprehend without associated text (e.g., figure legend and other reference text). We are developing automated systems to extract the relevant explanatory information along with figures extracted from full text articles. Such systems could be very useful in improving figure retrieval and in reducing the workload of biomedical scientists, who otherwise have to retrieve and read the entire full-text journal article to determine which figures are relevant to their research. As a crucial step, we studied the importance of associated text in biomedical figure comprehension. Methods Twenty subjects evaluated three figure-text combinations: figure+legend, figure+legend+title+abstract, and figure+full-text. Using a Likert scale, each subject scored each figure+text according to the extent to which the subject thought he/she understood the meaning of the figure and the confidence in providing the assigned score. Additionally, each subject entered a free text summary for each figure-text. We identified missing information using indicator words present within the text summaries. Both the Likert scores and the missing information were statistically analyzed for differences among the figure-text types. We also evaluated the quality of text summaries with the text-summarization evaluation method the ROUGE score. Results Our results showed statistically significant differences in figure comprehension when varying levels of text were provided. When the full-text article is not available, presenting just the figure+legend left biomedical researchers lacking 39–68% of the information about a figure as compared to having complete figure comprehension; adding the title and abstract improved the situation, but still left biomedical researchers missing 30% of the information. When the full-text article is available, figure comprehension increased to 86–97%; this indicates that researchers felt that only 3–14% of the necessary information for full figure comprehension was missing when full text was available to them. Clearly there is information in the abstract and in the full text that biomedical scientists deem important for understanding the figures that appear in full-text biomedical articles. Conclusion We conclude that the texts that appear in full-text biomedical articles are useful for understanding the meaning of a figure, and an effective figure-mining system needs to unlock the information beyond figure legend. Our work provides important guidance to the figure mining systems that extract information only from figure and figure legend. PMID:19126221
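To make the ROUGE-based summary evaluation mentioned above concrete, here is a minimal hand-rolled ROUGE-1 recall; the study reports ROUGE scores computed with the standard tooling, and the reference and candidate sentences below are invented.

```python
# Minimal illustration of the ROUGE idea: unigram recall of a candidate
# summary against a reference. The study used the standard ROUGE tooling;
# this hand-rolled version is only for intuition.
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / max(1, sum(ref.values()))

reference = "the figure shows protein expression increasing after treatment"
candidate = "protein expression increases after the treatment"
print(f"ROUGE-1 recall: {rouge1_recall(candidate, reference):.2f}")
```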
ERIC Educational Resources Information Center
Armbruster, Bonnie B.; Anderson, Thomas H.
Idea-mapping (i-mapping), a way of representing ideas from a text in the form of a diagram, is defined and illustrated in this document as a way to help students "see" how the ideas they read are linked to each other. The first portion of the document discusses the fundamental relationships found in texts (A is a characteristic of B, A…
Clicks versus Citations: Click Count as a Metric in High Energy Physics Publishing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bitton, Ayelet; /UC, San Diego /SLAC
2011-06-22
High-energy physicists worldwide rely on online resources such as SPIRES and arXiv to gather research and share their own publications. SPIRES is a tool designed to search the literature within high-energy physics, while arXiv provides the actual full-text documents of this literature. In high-energy physics, papers are often ranked according to the number of citations they acquire - meaning the number of times a later paper references the original. This paper investigates the correlation between the number of times a paper is clicked in order to be downloaded and the number of citations it receives following the click. It explores how physicists truly read what they cite.
Language Classification using N-grams Accelerated by FPGA-based Bloom Filters
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jacob, A; Gokhale, M
N-Gram (n-character sequences in text documents) counting is a well-established technique used in classifying the language of text in a document. In this paper, n-gram processing is accelerated through the use of reconfigurable hardware on the XtremeData XD1000 system. Our design employs parallelism at multiple levels, with parallel Bloom Filters accessing on-chip RAM, parallel language classifiers, and parallel document processing. In contrast to another hardware implementation (HAIL algorithm) that uses off-chip SRAM for lookup, our highly scalable implementation uses only on-chip memory blocks. Our implementation of end-to-end language classification runs at 85x the speed of comparable software and 1.45x the speed of the competing hardware design.
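A software analogue may help make the approach concrete; the paper's implementation is FPGA hardware, so the sketch below only mirrors the idea: character n-grams of training text are hashed into one Bloom filter per language, and a document is assigned to the language whose filter matches most of its n-grams. The training strings are toy data.

```python
# Software analogue of the idea above (the paper's implementation is FPGA
# hardware): per-language Bloom filters over character n-grams.
import hashlib

class BloomFilter:
    def __init__(self, size=4096, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, bytearray(size)

    def _positions(self, item: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p] for p in self._positions(item))

def ngrams(text: str, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Toy training data; a real system trains on large corpora per language.
training = {"english": "the quick brown fox jumps over the lazy dog",
            "german": "der schnelle braune fuchs springt ueber den faulen hund"}
filters = {}
for lang, text in training.items():
    bf = BloomFilter()
    for g in ngrams(text):
        bf.add(g)
    filters[lang] = bf

doc = "the brown dog jumps"
scores = {lang: sum(g in bf for g in ngrams(doc)) for lang, bf in filters.items()}
print(max(scores, key=scores.get), scores)
```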
Katritsis, Demosthenes G; Boriani, Giuseppe; Cosio, Francisco G; Jais, Pierre; Hindricks, Gerhard; Josephson, Mark E; Keegan, Roberto; Knight, Bradley P; Kuck, Karl-Heinz; Lane, Deirdre A; Lip, Gregory Yh; Malmborg, Helena; Oral, Hakan; Pappone, Carlo; Themistoclakis, Sakis; Wood, Kathryn A; Young-Hoon, Kim; Lundqvist, Carina Blomström
2016-01-01
This paper is an executive summary of the full European Heart Rhythm Association (EHRA) consensus document on the management of supraventricular arrhythmias, published in Europace. It summarises developments in the field and provides recommendations for patient management, with particular emphasis on new advances since the previous European Society of Cardiology guidelines. The EHRA consensus document is available to read in full at http://europace.oxfordjournals.org.
Full-text publication of abstract-presented work in sport and exercise psychology
Warden, Stuart
2018-01-01
Objectives Meetings promote information sharing, but do not enable full dissemination of details. A systematic search was conducted for abstracts presented at the 2010 and 2011 Association of Applied Sport Psychology Annual Conferences to determine the full-text dissemination rate of work presented in abstract form and investigate factors influencing this rate. Methods Systematic searches were sequentially conducted to determine whether the abstract-presented work had been published in full-text format in the 5 years following presentation. If a potential full-text publication was identified, information from the conference abstract (eg, results, number of participants in the sample(s), measurement tools used and so on) was compared with the full text to ensure the two entities represented the same body of work. Abstract factors of interest were assessed using logistic regression. Results Ninety-four out of 423 presented abstracts (22.2%) were published in full text. Odds of full-text publication increased if the abstract was from an international institution, presented in certain conference sections or presented as a lecture. Conclusion Those attending professional conferences should be cautious when translating data presented at conferences into their applied work because of the low rate of peer-reviewed and full-text publication of the information. PMID:29629187
Text, photo, and line extraction in scanned documents
NASA Astrophysics Data System (ADS)
Erkilinc, M. Sezer; Jaber, Mustafa; Saber, Eli; Bauer, Peter; Depalov, Dejan
2012-07-01
We propose a page layout analysis algorithm to classify a scanned document into different regions such as text, photo, or strong lines. The proposed scheme consists of five modules. The first module performs several image preprocessing techniques such as image scaling, filtering, color space conversion, and gamma correction to enhance the scanned image quality and reduce the computation time in later stages. Text detection is applied in the second module wherein wavelet transform and run-length encoding are employed to generate and validate text regions, respectively. The third module uses a Markov random field based block-wise segmentation that employs a basis vector projection technique with maximum a posteriori probability optimization to detect photo regions. In the fourth module, methods for edge detection, edge linking, line-segment fitting, and Hough transform are utilized to detect strong edges and lines. In the last module, the resultant text, photo, and edge maps are combined to generate a page layout map using K-Means clustering. The proposed algorithm has been tested on several hundred documents that contain simple and complex page layout structures and contents such as articles, magazines, business cards, dictionaries, and newsletters, and compared against state-of-the-art page-segmentation techniques with benchmark performance. The results indicate that our methodology achieves an average of ˜89% classification accuracy in text, photo, and background regions.
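Of the five modules, only the strong-line detection step lends itself to a compact illustration. The sketch below uses OpenCV's Canny edge detector and probabilistic Hough transform; it is not the authors' pipeline (the wavelet text detection, MRF photo segmentation, and K-means fusion stages are omitted), and "page.png" is a placeholder input path.

```python
# Sketch of only the strong-line detection step (module four above) using
# OpenCV; "page.png" is a placeholder path for a scanned page image.
import cv2
import numpy as np

image = cv2.imread("page.png")
if image is None:
    raise SystemExit("page.png not found")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

edges = cv2.Canny(gray, 50, 150)                                  # edge detection
lines = cv2.HoughLinesP(edges, 1, np.pi / 180, 120,
                        minLineLength=100, maxLineGap=5)          # strong-line candidates

line_map = np.zeros_like(gray)
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        cv2.line(line_map, (x1, y1), (x2, y2), 255, 2)            # draw detected line segments

cv2.imwrite("line_map.png", line_map)
```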
Support Vector Machines: Relevance Feedback and Information Retrieval.
ERIC Educational Resources Information Center
Drucker, Harris; Shahrary, Behzad; Gibbon, David C.
2002-01-01
Compares support vector machines (SVMs) to Rocchio, Ide regular and Ide dec-hi algorithms in information retrieval (IR) of text documents using relevancy feedback. If the preliminary search is so poor that one has to search through many documents to find at least one relevant document, then SVM is preferred. Includes nine tables. (Contains 24…
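The Rocchio baseline compared against SVMs in this study follows a simple vector update. A minimal sketch, with toy vectors and commonly used default coefficients:

```python
# Minimal sketch of the Rocchio relevance-feedback update, one of the
# baselines compared against SVMs above. Vectors are toy values; the
# alpha/beta/gamma coefficients follow common defaults.
import numpy as np

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward relevant documents and away from non-relevant ones."""
    q = alpha * np.asarray(query, dtype=float)
    if len(relevant):
        q += beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q -= gamma * np.mean(nonrelevant, axis=0)
    return np.clip(q, 0, None)          # negative term weights are usually dropped

query = [1.0, 0.0, 0.5, 0.0]
relevant = np.array([[0.9, 0.1, 0.7, 0.0], [0.8, 0.0, 0.6, 0.1]])
nonrelevant = np.array([[0.0, 0.9, 0.0, 0.8]])
print(rocchio(query, relevant, nonrelevant))
```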
17 CFR 4.1 - Requirements as to form.
Code of Federal Regulations, 2013 CFR
2013-04-01
... table of contents is required, the electronic document must either include page numbers in the text or... as to form. (a) Each document distributed pursuant to this part 4 must be: (1) Clear and legible; (2...” disclosed under this part 4 must be displayed in capital letters and in boldface type. (c) Where a document...
17 CFR 4.1 - Requirements as to form.
Code of Federal Regulations, 2010 CFR
2010-04-01
... table of contents is required, the electronic document must either include page numbers in the text or... as to form. (a) Each document distributed pursuant to this part 4 must be: (1) Clear and legible; (2...” disclosed under this part 4 must be displayed in capital letters and in boldface type. (c) Where a document...
17 CFR 4.1 - Requirements as to form.
Code of Federal Regulations, 2012 CFR
2012-04-01
... table of contents is required, the electronic document must either include page numbers in the text or... as to form. (a) Each document distributed pursuant to this part 4 must be: (1) Clear and legible; (2...” disclosed under this part 4 must be displayed in capital letters and in boldface type. (c) Where a document...
17 CFR 4.1 - Requirements as to form.
Code of Federal Regulations, 2011 CFR
2011-04-01
... table of contents is required, the electronic document must either include page numbers in the text or... as to form. (a) Each document distributed pursuant to this part 4 must be: (1) Clear and legible; (2...” disclosed under this part 4 must be displayed in capital letters and in boldface type. (c) Where a document...
17 CFR 4.1 - Requirements as to form.
Code of Federal Regulations, 2014 CFR
2014-04-01
... table of contents is required, the electronic document must either include page numbers in the text or... as to form. (a) Each document distributed pursuant to this part 4 must be: (1) Clear and legible; (2...” disclosed under this part 4 must be displayed in capital letters and in boldface type. (c) Where a document...
EFL Learners' Multiple Documents Literacy: Effects of a Strategy-Directed Intervention Program
ERIC Educational Resources Information Center
Karimi, Mohammad Nabi
2015-01-01
There is a substantial body of L2 research documenting the central role of strategy instruction in reading comprehension. However, this line of research has been conducted mostly within the single text paradigm of reading research. With reading literacy undergoing a marked shift from single source reading to multiple documents literacy, little is…
American Catholic Higher Education. Essential Documents, 1967-1990.
ERIC Educational Resources Information Center
Gallin, Alice, Ed.
This reference volume contains the texts of documents pertinent to the development of Catholic higher education during the years from 1967 to 1990. The documents reveal church officials' and university presidents' collaborative efforts to address the questions of what it means to be a university or college and what it means for such an…
Business Documents Don't Have to Be Boring
ERIC Educational Resources Information Center
Schultz, Benjamin
2006-01-01
With business documents, visuals can serve to enhance the written word in conveying the message. Images can be especially effective when used subtly, on part of the page, on successive pages to provide continuity, or even set as watermarks over the entire page. A main reason given for traditional text-only business documents is that they are…
Different Words for the Same Concept: Learning Collaboratively from Multiple Documents
ERIC Educational Resources Information Center
Jucks, Regina; Paus, Elisabeth
2013-01-01
This study investigated how varying the lexical encodings of technical terms in multiple texts influences learners' dyadic processing of scientific-related information. Fifty-seven pairs of college students read journalistic texts on depression. Each partner in a dyad received one text; for half of the dyads the partner's text contained different…
Automatic system for computer program documentation
NASA Technical Reports Server (NTRS)
Simmons, D. B.; Elliott, R. W.; Arseven, S.; Colunga, D.
1972-01-01
A study was conducted, as part of a project to design an automatic system for computer program documentation aids, to determine what existing programs could be used effectively to document computer programs. Results of the study are included in the form of an extensive bibliography and working papers on appropriate operating systems, text editors, program editors, data structures, standards, decision tables, flowchart systems, and proprietary documentation aids. The preliminary design for an automated documentation system is also included. An actual program has been documented in detail to demonstrate the types of output that can be produced by the proposed system.
2010-01-01
Background In the United States, the Health Insurance Portability and Accountability Act (HIPAA) protects the confidentiality of patient data and requires the informed consent of the patient and approval of the Internal Review Board to use data for research purposes, but these requirements can be waived if data is de-identified. For clinical data to be considered de-identified, the HIPAA "Safe Harbor" technique requires 18 data elements (called PHI: Protected Health Information) to be removed. The de-identification of narrative text documents is often realized manually, and requires significant resources. Well aware of these issues, several authors have investigated automated de-identification of narrative text documents from the electronic health record, and a review of recent research in this domain is presented here. Methods This review focuses on recently published research (after 1995), and includes relevant publications from bibliographic queries in PubMed, conference proceedings, the ACM Digital Library, and interesting publications referenced in already included papers. Results The literature search returned more than 200 publications. The majority focused only on structured data de-identification instead of narrative text, on image de-identification, or described manual de-identification, and were therefore excluded. Finally, 18 publications describing automated text de-identification were selected for detailed analysis of the architecture and methods used, the types of PHI detected and removed, the external resources used, and the types of clinical documents targeted. All text de-identification systems aimed to identify and remove person names, and many included other types of PHI. Most systems used only one or two specific clinical document types, and were mostly based on two different groups of methodologies: pattern matching and machine learning. Many systems combined both approaches for different types of PHI, but the majority relied only on pattern matching, rules, and dictionaries. Conclusions In general, methods based on dictionaries performed better with PHI that is rarely mentioned in clinical text, but are more difficult to generalize. Methods based on machine learning tend to perform better, especially with PHI that is not mentioned in the dictionaries used. Finally, the issues of anonymization, sufficient performance, and "over-scrubbing" are discussed in this publication. PMID:20678228
Meystre, Stephane M; Friedlin, F Jeffrey; South, Brett R; Shen, Shuying; Samore, Matthew H
2010-08-02
In the United States, the Health Insurance Portability and Accountability Act (HIPAA) protects the confidentiality of patient data and requires the informed consent of the patient and approval of the Internal Review Board to use data for research purposes, but these requirements can be waived if data is de-identified. For clinical data to be considered de-identified, the HIPAA "Safe Harbor" technique requires 18 data elements (called PHI: Protected Health Information) to be removed. The de-identification of narrative text documents is often realized manually, and requires significant resources. Well aware of these issues, several authors have investigated automated de-identification of narrative text documents from the electronic health record, and a review of recent research in this domain is presented here. This review focuses on recently published research (after 1995), and includes relevant publications from bibliographic queries in PubMed, conference proceedings, the ACM Digital Library, and interesting publications referenced in already included papers. The literature search returned more than 200 publications. The majority focused only on structured data de-identification instead of narrative text, on image de-identification, or described manual de-identification, and were therefore excluded. Finally, 18 publications describing automated text de-identification were selected for detailed analysis of the architecture and methods used, the types of PHI detected and removed, the external resources used, and the types of clinical documents targeted. All text de-identification systems aimed to identify and remove person names, and many included other types of PHI. Most systems used only one or two specific clinical document types, and were mostly based on two different groups of methodologies: pattern matching and machine learning. Many systems combined both approaches for different types of PHI, but the majority relied only on pattern matching, rules, and dictionaries. In general, methods based on dictionaries performed better with PHI that is rarely mentioned in clinical text, but are more difficult to generalize. Methods based on machine learning tend to perform better, especially with PHI that is not mentioned in the dictionaries used. Finally, the issues of anonymization, sufficient performance, and "over-scrubbing" are discussed in this publication.
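A tiny illustration of the pattern-matching-plus-dictionary family of methods the review describes is given below; real de-identification systems cover all 18 HIPAA PHI categories and are far more thorough, and the clinical note and name list here are invented.

```python
# Tiny illustration of pattern-matching + dictionary de-identification, the
# family of methods described in the review above. The note and name list are
# invented; real systems handle all 18 HIPAA PHI categories.
import re

NAME_DICTIONARY = ["John", "Smith", "Maria", "Garcia"]
NAME_PATTERN = re.compile(r"\b(" + "|".join(NAME_DICTIONARY) + r")\b", re.IGNORECASE)

PATTERNS = {
    "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "MRN":   re.compile(r"\bMRN:?\s*\d+\b"),
}

def deidentify(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)        # rule/pattern matching
    return NAME_PATTERN.sub("[NAME]", text)           # dictionary lookup for names

note = "Patient John Smith, MRN 445512, seen on 08/02/2010. Call 801-555-0199."
print(deidentify(note))
```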
Onboard shuttle on-line software requirements system: Prototype
NASA Technical Reports Server (NTRS)
Kolkhorst, Barbara; Ogletree, Barry
1989-01-01
The prototype discussed here was developed as a proof of concept for a system that could support high volumes of requirements documents with integrated text and graphics; the solution proposed here could be extended to other projects whose goal is to place paper documents in an electronic system for viewing and printing purposes. The technical problems (such as conversion of documentation between word processors, management of a variety of graphics file formats, and difficulties involved in scanning integrated text and graphics) would be very similar for other systems of this type. Indeed, technological advances in areas such as scanning hardware and software and display terminals ensure that some of the problems encountered here will be solved in the near term (less than five years). Examples of these solvable problems include automated input of integrated text and graphics, errors in the recognition process, and the loss of image information which results from the digitization process. The solution developed for the Online Software Requirements System is modular and allows hardware and software components to be upgraded or replaced as industry solutions mature. The extensive commercial software content allows the NASA customer to apply resources to solving the problem and maintaining documents.
ERIC Educational Resources Information Center
Tauchert, Wolfgang; And Others
1991-01-01
Describes the PADOK-II project in Germany, which was designed to give information on the effects of linguistic algorithms on retrieval in a full-text database, the German Patent Information System (GPI). Relevance assessments are discussed, statistical evaluations are described, and searches are compared for the full-text section versus the…
Comparative Analysis of Document level Text Classification Algorithms using R
NASA Astrophysics Data System (ADS)
Syamala, Maganti; Nalini, N. J., Dr; Maguluri, Lakshamanaphaneendra; Ragupathy, R., Dr.
2017-08-01
Over the past few decades, tremendous volumes of data have become available on the Internet in structured or unstructured form. Information on the Internet is also growing exponentially, so there is an urgent need for text classifiers. Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. To handle this situation, a wide range of supervised learning algorithms has been introduced. Among these, K-Nearest Neighbor (KNN) is an efficient and simple classifier in the text classification family. However, KNN suffers from imbalanced class distributions and noisy term features. To cope with this challenge, we use document-based centroid dimensionality reduction (CentroidDR) implemented in R. By combining these two text classification techniques, the KNN and centroid classifiers, we propose a scalable and effective flat classifier, called MCenKNN, which performs substantially better than CenKNN.
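The article's implementation is in R and its exact CentroidDR/MCenKNN algorithm is not reproduced here; the following Python sketch only conveys the general idea of projecting TF-IDF document vectors onto per-class centroid directions and then running KNN in that reduced space, using invented toy documents.

```python
# Rough sketch of the general idea only (not the article's CentroidDR/MCenKNN
# algorithm, which is implemented in R): project TF-IDF vectors onto per-class
# centroids, then classify with KNN in the reduced space. Toy documents.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

docs = ["cheap flights and hotel deals", "book your holiday flights now",
        "parliament passes new budget law", "senate debates the budget bill"]
labels = np.array([0, 0, 1, 1])   # 0 = travel, 1 = politics

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()

# Centroid-based dimensionality reduction: one dimension per class,
# each coordinate = similarity of the document to that class centroid.
centroids = np.vstack([X[labels == c].mean(axis=0) for c in np.unique(labels)])
X_reduced = X @ centroids.T          # shape: (n_docs, n_classes)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_reduced, labels)
query = vec.transform(["discount hotel and flight offers"]).toarray()
print(knn.predict(query @ centroids.T))   # expected: travel class (0)
```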
PuReD-MCL: a graph-based PubMed document clustering methodology.
Theodosiou, T; Darzentas, N; Angelis, L; Ouzounis, C A
2008-09-01
Biomedical literature is the principal repository of biomedical knowledge, with PubMed being the most complete database collecting, organizing and analyzing such textual knowledge. There are numerous efforts that attempt to exploit this information by using text mining and machine learning techniques. We developed a novel approach, called PuReD-MCL (PubMed Related Documents-MCL), which is based on the graph clustering algorithm MCL and relevant resources from PubMed. PuReD-MCL avoids using natural language processing (NLP) techniques directly; instead, it takes advantage of existing resources available from PubMed. PuReD-MCL then clusters documents efficiently using the MCL graph clustering algorithm, which is based on graph flow simulation. This process allows users to analyse the results by highlighting important clues, and finally to visualize the clusters and all relevant information using an interactive graph layout algorithm, for instance BioLayout Express 3D. The methodology was applied to two different datasets, previously used for the validation of the document clustering tool TextQuest. The first dataset involves the organisms Escherichia coli and yeast, whereas the second is related to Drosophila development. PuReD-MCL successfully reproduces the annotated results obtained from TextQuest, while at the same time providing additional insights into the clusters and the corresponding documents. Source code in Perl and R is available from http://tartara.csd.auth.gr/~theodos/
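A minimal Markov-cluster (MCL) pass on a toy document-relatedness graph is sketched below; PuReD-MCL itself builds its graph from PubMed related-document links and uses the full MCL implementation, which this small version only approximates.

```python
# Minimal MCL sketch on a toy document-relatedness graph; PuReD-MCL builds its
# graph from PubMed "related documents" links and uses the full MCL software.
import numpy as np

def mcl(adjacency, expansion=2, inflation=2.0, iterations=50):
    M = adjacency + np.eye(len(adjacency))         # add self-loops
    M = M / M.sum(axis=0)                          # column-normalize
    for _ in range(iterations):
        M = np.linalg.matrix_power(M, expansion)   # expansion
        M = M ** inflation                         # inflation
        M = M / M.sum(axis=0)                      # re-normalize columns
    # Read clusters from rows of attractor nodes (non-zero diagonal).
    clusters = set()
    for i in range(len(M)):
        if M[i, i] > 1e-6:
            clusters.add(frozenset(int(j) for j in np.nonzero(M[i] > 1e-6)[0]))
    return clusters

# Toy relatedness graph: documents 0-2 form one topic, 3-5 another.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
print(mcl(A))   # expected: roughly {{0, 1, 2}, {3, 4, 5}}
```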
Data Compression in Full-Text Retrieval Systems.
ERIC Educational Resources Information Center
Bell, Timothy C.; And Others
1993-01-01
Describes compression methods for components of full-text systems such as text databases on CD-ROM. Topics discussed include storage media; structures for full-text retrieval, including indexes, inverted files, and bitmaps; compression tools; memory requirements during retrieval; and ranking and information retrieval. (Contains 53 references.)…
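One of the compression techniques surveyed in this literature, gap encoding of an inverted list followed by variable-byte coding, can be illustrated as follows; this is a generic textbook-style sketch, not code from the cited work.

```python
# Generic illustration (not the cited work's code): an inverted list of
# document numbers stored as gaps and compressed with variable-byte coding.
def vbyte_encode(numbers):
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.append(n % 128)
            n //= 128
            if n == 0:
                break
        chunk[0] += 128                    # continuation bit marks the low-order byte
        out.extend(reversed(chunk))
    return bytes(out)

def vbyte_decode(data):
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = n * 128 + byte
        else:
            numbers.append(n * 128 + (byte - 128))
            n = 0
    return numbers

postings = [3, 7, 11, 23, 29, 127, 4000]           # document numbers for one term
gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
encoded = vbyte_encode(gaps)
print(len(encoded), "bytes for", len(postings), "postings")

decoded_gaps = vbyte_decode(encoded)
docs = [sum(decoded_gaps[:i + 1]) for i in range(len(decoded_gaps))]
assert docs == postings
```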
43 CFR 4.612 - What documentation of fees and expenses must I provide?
Code of Federal Regulations, 2010 CFR
2010-10-01
... 43 Public Lands: Interior 1 2010-10-01 2010-10-01 false What documentation of fees and expenses... Proceedings Information Required from Applicants § 4.612 What documentation of fees and expenses must I provide? (a) Your application must be accompanied by full documentation of the fees and expenses for which...
Smith, Heather D.; Bogenschutz, Elizabeth D.; Bayliss, Amy J.; Altenburger, Peter A.
2011-01-01
Background and Objective Professional meetings, such as the American Physical Therapy Association's (APTA's) Combined Sections Meeting (CSM), provide forums for sharing information relevant to physical therapy. An indicator of whether therapists fully disseminate their work is the number of full-text peer-reviewed publications that result. The purposes of this study were: (1) to determine the full-text publication rate of work presented in abstract form at CSM and (2) to investigate factors influencing this rate. Methods A systematic search was undertaken to locate full-text publications of work presented in abstract form within the Orthopaedic and Sports Physical Therapy sections at CSM between 2000 and 2004. Eligible publications were published within 5 years following abstract presentation. The influences of APTA section, year of abstract presentation, institution of origin, study design, sample size, study significance, reporting of a funding source, and presentation type on full-text publication rate were assessed. Characteristics of full-text publications were explored. Results Work presented in 1 out of 4 abstracts (25.4%) progressed to full-text publication. Odds of full-text publication increased if the abstract originated from a doctorate-granting or “other” institution, reported findings of an experimental study, reported a statistically significant finding, included a larger sample size, disclosed a funding source, or was presented as a platform presentation. More than one third (37.8%) of full-text publications were published in the Journal of Orthopaedic and Sports Physical Therapy or Physical Therapy, and 4 out of 10 full-text publications (39.2%) contained at least one major change from information presented in abstract form. Conclusions The full-text publication rate for information presented in abstract form within the Orthopaedic and Sports Physical Therapy sections at CSM is low relative to comparative disciplines. Caution should be exercised when translating information presented at CSM into practice. PMID:21169423
Going, going, still there: using the WebCite service to permanently archive cited web pages.
Eysenbach, Gunther; Trudel, Mathieu
2005-12-30
Scholars are increasingly citing electronic "web references" which are not preserved in libraries or full text archives. WebCite is a new standard for citing web references. To "webcite" a document involves archiving the cited Web page through www.webcitation.org and citing the WebCite permalink instead of (or in addition to) the unstable live Web page. This journal has amended its "instructions for authors" accordingly, asking authors to archive cited Web pages before submitting a manuscript. Almost 200 other journals are already using the system. We discuss the rationale for WebCite, its technology, and how scholars, editors, and publishers can benefit from the service. Citing scholars initiate an archiving process of all cited Web references, ideally before they submit a manuscript. Authors of online documents and websites which are expected to be cited by others can ensure that their work is permanently available by creating an archived copy using WebCite and providing the citation information including the WebCite link on their Web document(s). Editors should ask their authors to cache all cited Web addresses (Uniform Resource Locators, or URLs) "prospectively" before submitting their manuscripts to their journal. Editors and publishers should also instruct their copyeditors to cache cited Web material if the author has not done so already. Finally, WebCite can process publisher submitted "citing articles" (submitted for example as eXtensible Markup Language [XML] documents) to automatically archive all cited Web pages shortly before or on publication. Finally, WebCite can act as a focussed crawler, caching retrospectively references of already published articles. Copyright issues are addressed by honouring respective Internet standards (robot exclusion files, no-cache and no-archive tags). Long-term preservation is ensured by agreements with libraries and digital preservation organizations. The resulting WebCite Index may also have applications for research assessment exercises, being able to measure the impact of Web services and published Web documents through access and Web citation metrics.
Calvert, Melanie; Kyte, Derek; Duffy, Helen; Gheorghe, Adrian; Mercieca-Bebber, Rebecca; Ives, Jonathan; Draper, Heather; Brundage, Michael; Blazeby, Jane; King, Madeleine
2014-01-01
Background Evidence suggests there are inconsistencies in patient-reported outcome (PRO) assessment and reporting in clinical trials, which may limit the use of these data to inform patient care. For trials with a PRO endpoint, routine inclusion of key PRO information in the protocol may help improve trial conduct and the reporting and appraisal of PRO results; however, it is currently unclear exactly what PRO-specific information should be included. The aim of this review was to summarize the current PRO-specific guidance for clinical trial protocol developers. Methods and Findings We searched the MEDLINE, EMBASE, CINAHL and Cochrane Library databases (inception to February 2013) for PRO-specific guidance regarding trial protocol development. Further guidance documents were identified via Google, Google Scholar, requests to members of the UK Clinical Research Collaboration registered clinical trials units and international experts. Two independent investigators undertook title/abstract screening, full text review and data extraction, with a third involved in the event of disagreement. 21,175 citations were screened and 54 met the inclusion criteria. Guidance documents were difficult to access: electronic database searches identified just 8 documents, with the remaining 46 sourced elsewhere (5 from citation tracking, 27 from hand searching, 7 from the grey literature review and 7 from experts). 162 unique PRO-specific protocol recommendations were extracted from included documents. A further 10 PRO recommendations were identified relating to supporting trial documentation. Only 5/162 (3%) recommendations appeared in ≥50% of guidance documents reviewed, indicating a lack of consistency. Conclusions PRO-specific protocol guidelines were difficult to access, lacked consistency and may be challenging to implement in practice. There is a need to develop easily accessible consensus-driven PRO protocol guidance. Guidance should be aimed at ensuring key PRO information is routinely included in appropriate trial protocols, in order to facilitate rigorous collection/reporting of PRO data, to effectively inform patient care. PMID:25333995
Blanc, Xavier; Collet, Tinh-Hai; Auer, Reto; Iriarte, Pablo; Krause, Jan; Légaré, France; Cornuz, Jacques; Clair, Carole
2015-04-07
Full-text searches of articles increase the recall, defined by the proportion of relevant publications that are retrieved. However, this method is rarely used in medical research due to resource constraints. For the purpose of a systematic review of publications addressing shared decision making, a full-text search method was required to retrieve publications where shared decision making does not appear in the title or abstract. The objective of our study was to assess the efficiency and reliability of full-text searches in major medical journals for identifying shared decision making publications. A full-text search was performed on the websites of 15 high-impact journals in general internal medicine to look up publications of any type from 1996-2011 containing the phrase "shared decision making". The search method was compared with a PubMed search of titles and abstracts only. The full-text search was further validated by requesting all publications from the same time period from the individual journal publishers and searching through the collected dataset. The full-text search for "shared decision making" on journal websites identified 1286 publications in 15 journals compared to 119 through the PubMed search. The search within the publisher-provided publications of 6 journals identified 613 publications compared to 646 with the full-text search on the respective journal websites. The concordance rate was 94.3% between both full-text searches. Full-text searching on medical journal websites is an efficient and reliable way to identify relevant articles in the field of shared decision making for review or other purposes. It may be more widely used in biomedical research in other fields in the future, with the collaboration of publishers and journals toward open-access data.
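As a rough illustration of the concordance comparison described above, the sketch below computes the share of retrieved article identifiers on which two search strategies agree; the identifiers and counts are invented for the example and are not the study's data.

```python
# Sketch: comparing two search strategies by overlap of retrieved article IDs.

def concordance(set_a: set, set_b: set) -> float:
    """Share of articles found by either method on which both methods agree."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

website_hits = {"pmid:101", "pmid:102", "pmid:103", "pmid:104"}    # full-text search on journal sites
publisher_hits = {"pmid:101", "pmid:102", "pmid:103", "pmid:105"}  # search over publisher-provided files

print(f"concordance: {concordance(website_hits, publisher_hits):.1%}")
```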
ERIC Educational Resources Information Center
Congress of the U.S., Washington, DC. House Committee on the Judiciary.
This document contains witnesses' testimonies and prepared statements from the Congressional hearing called to consider enactment of H.R. 2673, a bill to facilitate implementation of the 1980 Hague Convention on the Civil Aspects of International Child Abduction. The text of H.R. 2673 is included in the document as is the text of H.R. 3971, a bill…
Graphics-based intelligent search and abstracting using Data Modeling
NASA Astrophysics Data System (ADS)
Jaenisch, Holger M.; Handley, James W.; Case, Carl T.; Songy, Claude G.
2002-11-01
This paper presents an autonomous text and context-mining algorithm that converts text documents into point clouds for visual search cues. This algorithm is applied to the task of data-mining a scriptural database comprised of the Old and New Testaments from the Bible and the Book of Mormon, Doctrine and Covenants, and the Pearl of Great Price. Results are generated which graphically show the scripture that represents the average concept of the database and the mining of the documents down to the verse level.
National Wind Technology Center sitewide, Golden, CO: Environmental assessment
DOE Office of Scientific and Technical Information (OSTI.GOV)
NONE
1996-11-01
The National Renewable Energy Laboratory (NREL), the nation's primary solar and renewable energy research laboratory, proposes to expand its wind technology research and development program activities at its National Wind Technology Center (NWTC) near Golden, Colorado. NWTC is an existing wind energy research facility operated by NREL for the US Department of Energy (DOE). Proposed activities include the construction and reuse of buildings and facilities, installation of up to 20 wind turbine test sites, improvements in infrastructure, and subsequent research activities, technology testing, and site operations. In addition to wind turbine test activities, NWTC may be used to support other NREL program activities and small-scale demonstration projects. This document assesses potential consequences to resources within the physical, biological, and human environment, including potential impacts to: air quality, geology and soils, water resources, biological resources, cultural and historic resources, socioeconomic resources, land use, visual resources, noise environment, hazardous materials and waste management, and health and safety conditions. Comment letters were received from several agencies in response to the scoping and predecisional draft reviews. The comments have been incorporated as appropriate into the document with full text of the letters contained in the Appendices. Additionally, information from the Rocky Flats Environmental Technology Site's ongoing sitewide assessment of potential environmental impacts has been reviewed and discussed by representatives of both parties and incorporated into the document as appropriate.
Term Familiarity to indicate Perceived and Actual Difficulty of Text in Medical Digital Libraries.
Leroy, Gondy; Endicott, James E
2011-10-01
With increasing text digitization, digital libraries can personalize materials for individuals with different education levels and language skills. To this end, documents need meta-information describing their difficulty level. Previous attempts at such labeling used readability formulas but the formulas have not been validated with modern texts and their outcome is seldom associated with actual difficulty. We focus on medical texts and are developing new, evidence-based meta-tags that are associated with perceived and actual text difficulty. This work describes a first tag, term familiarity, which is based on term frequency in the Google corpus. We evaluated its feasibility to serve as a tag by looking at a document corpus (N=1,073) and found that terms in blogs or journal articles displayed unexpected but significantly different scores. Term familiarity was then applied to texts and results from a previous user study (N=86) and could better explain differences for perceived and actual difficulty.
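A minimal sketch of how a term-familiarity style tag could be computed from corpus frequencies is shown below; the frequency table and document terms are hypothetical stand-ins for counts drawn from a large web corpus, not the authors' implementation.

```python
import math

# Hypothetical frequency table standing in for counts from a large web corpus.
corpus_frequency = {"pain": 250_000_000, "analgesic": 4_200_000, "nociceptive": 310_000}

def term_familiarity(terms, freq, default=1):
    """Average log10 corpus frequency of a document's terms; higher = more familiar."""
    scores = [math.log10(freq.get(t.lower(), default)) for t in terms]
    return sum(scores) / len(scores) if scores else 0.0

blog_terms = ["pain", "pain", "analgesic"]
journal_terms = ["nociceptive", "analgesic", "nociceptive"]
print(term_familiarity(blog_terms, corpus_frequency))     # higher score, easier text
print(term_familiarity(journal_terms, corpus_frequency))  # lower score, harder text
```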
NASA Astrophysics Data System (ADS)
de Andrade Lopes, Alneu; Minghim, Rosane; Melo, Vinícius; Paulovich, Fernando V.
2006-01-01
The sheer volume of information available today often impairs the tasks of searching, browsing, and analyzing information pertinent to a topic of interest. This paper presents a methodology to create a meaningful graphical representation of document corpora targeted at supporting exploration of correlated documents. The purpose of such an approach is to produce a map from a document body on a research topic or field based on the analysis of their contents, and similarities amongst articles. The document map is generated, after text pre-processing, by projecting the data in two dimensions using Latent Semantic Indexing. The projection is followed by hierarchical clustering to support sub-area identification. The map can be interactively explored, helping to narrow down the search for relevant articles. Tests were performed using a collection of documents pre-classified into three research subject classes: Case-Based Reasoning, Information Retrieval, and Inductive Logic Programming. The map produced was capable of separating the main areas and approaching documents by their similarity, revealing possible topics, and identifying boundaries between them. The tool supports the exploration of inter-topic and intra-topic relationships and is useful in many contexts that require deciding which articles to read, such as scientific research, education, and training.
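The pipeline sketched below mirrors the described sequence (vectorize, project to two dimensions with an LSI-style decomposition, then cluster hierarchically) using scikit-learn; the toy documents and parameter choices are illustrative assumptions, not the authors' code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import AgglomerativeClustering

docs = [
    "case based reasoning retrieval of prior cases",
    "inverted index ranking for information retrieval",
    "inductive logic programming learns clauses from examples",
    "query expansion improves information retrieval recall",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)  # LSI-style 2-D projection
labels = AgglomerativeClustering(n_clusters=2).fit_predict(coords)          # hierarchical clustering

for (x, y), label, doc in zip(coords, labels, docs):
    print(f"cluster {label}  ({x:+.2f}, {y:+.2f})  {doc[:40]}")
```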
ERIC Educational Resources Information Center
Larsen, Kent S., Ed.
Materials in this resource document were compiled for use in a Washington seminar directed to the interests of state and local government to develop strategies for privacy protection. Included are the texts of issue papers and supporting documents in the following subject areas: (1) criminal justice information; (2) public employee records; (3)…
AFT-QuEST Consortium Yearbook. Proceedings of the AFT-QuEST Consortium (April 22-26, 1973).
ERIC Educational Resources Information Center
American Federation of Teachers, Washington, DC.
This document is a report on the proceedings of the 1973 American Federation of Teachers-Quality Educational Standards in Teaching (AFT-QuEST) consortium sponsored by the AFT. Included in this document are the texts of speeches and outlines of workshops and discussions. The document is divided into the following sections: goals, major proposals,…
A Comparison of Product Realization Frameworks
1993-10-01
software (integrated FrameMaker). Also included are BOLD for on-line documentation delivery, printer/plotter support, and network licensing support. AMPLE ... are built with DSS. Documentation tools include an on-line information system (BOLD), text editing (Notepad), word processing (integrated FrameMaker) ... within an application. FrameMaker is fully integrated with the Falcon Framework to provide consistent documentation capabilities within engineering
Conjunctive Cohesion in English Language EU Documents--A Corpus-Based Analysis and Its Implications
ERIC Educational Resources Information Center
Trebits, Anna
2009-01-01
This paper reports the findings of a study which forms part of a larger-scale research project investigating the use of English in the documents of the European Union (EU). The documents of the EU show various features of texts written for legal, business and other specific purposes. Moreover, the translation services of the EU institutions often…
Automatic reconstruction of a bacterial regulatory network using Natural Language Processing
Rodríguez-Penagos, Carlos; Salgado, Heladia; Martínez-Flores, Irma; Collado-Vides, Julio
2007-01-01
Background Manual curation of biological databases, an expensive and labor-intensive process, is essential for high quality integrated data. In this paper we report the implementation of a state-of-the-art Natural Language Processing system that creates computer-readable networks of regulatory interactions directly from different collections of abstracts and full-text papers. Our major aim is to understand how automatic annotation using Text-Mining techniques can complement manual curation of biological databases. We implemented a rule-based system to generate networks from different sets of documents dealing with regulation in Escherichia coli K-12. Results Performance evaluation is based on the most comprehensive transcriptional regulation database for any organism, the manually-curated RegulonDB, 45% of which we were able to recreate automatically. From our automated analysis we were also able to find some new interactions from papers not already curated, or that were missed in the manual filtering and review of the literature. We also put forward a novel Regulatory Interaction Markup Language better suited than SBML for simultaneously representing data of interest for biologists and text miners. Conclusion Manual curation of the output of automatic processing of text is a good way to complement a more detailed review of the literature, either for validating the results of what has been already annotated, or for discovering facts and information that might have been overlooked at the triage or curation stages. PMID:17683642
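A toy illustration of rule-based interaction extraction in this spirit appears below; the single regular-expression rule and example sentences are hypothetical, whereas the actual system relies on many curated rules and entity dictionaries.

```python
import re

# Toy rule in the spirit of pattern-based extraction; real systems use many
# hand-crafted rules plus entity dictionaries, which are assumed away here.
RULE = re.compile(r"\b([A-Z][A-Za-z0-9]+)\s+(activates|represses|regulates)\s+([a-zA-Z0-9]+)\b")

sentences = [
    "CRP activates araBAD in the absence of glucose.",
    "LexA represses recA under normal growth conditions.",
]

network = [(m.group(1), m.group(2), m.group(3))
           for s in sentences for m in RULE.finditer(s)]
print(network)  # [('CRP', 'activates', 'araBAD'), ('LexA', 'represses', 'recA')]
```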
Supporting the education evidence portal via text mining
Ananiadou, Sophia; Thompson, Paul; Thomas, James; Mu, Tingting; Oliver, Sandy; Rickinson, Mark; Sasaki, Yutaka; Weissenbacher, Davy; McNaught, John
2010-01-01
The UK Education Evidence Portal (eep) provides a single, searchable, point of access to the contents of the websites of 33 organizations relating to education, with the aim of revolutionizing work practices for the education community. Use of the portal alleviates the need to spend time searching multiple resources to find relevant information. However, the combined content of the websites of interest is still very large (over 500 000 documents and growing). This means that searches using the portal can produce very large numbers of hits. As users often have limited time, they would benefit from enhanced methods of performing searches and viewing results, allowing them to drill down to information of interest more efficiently, without having to sift through potentially long lists of irrelevant documents. The Joint Information Systems Committee (JISC)-funded ASSIST project has produced a prototype web interface to demonstrate the applicability of integrating a number of text-mining tools and methods into the eep, to facilitate an enhanced searching, browsing and document-viewing experience. New features include automatic classification of documents according to a taxonomy, automatic clustering of search results according to similar document content, and automatic identification and highlighting of key terms within documents. PMID:20643679
NASA Astrophysics Data System (ADS)
Tirupattur, Naveen; Lapish, Christopher C.; Mukhopadhyay, Snehasis
2011-06-01
Text mining, sometimes alternately referred to as text analytics, refers to the process of extracting high-quality knowledge from the analysis of textual data. Text mining has a wide variety of applications in areas such as biomedical science, news analysis, and homeland security. In this paper, we describe an approach and some relatively small-scale experiments which apply text mining to neuroscience research literature to find novel associations among a diverse set of entities. Neuroscience is a discipline which encompasses an exceptionally wide range of experimental approaches and rapidly growing interest. This combination results in an overwhelmingly large and often diffuse literature which makes a comprehensive synthesis difficult. Understanding the relations or associations among the entities appearing in the literature not only improves researchers' current understanding of recent advances in their field, but also provides an important computational tool to formulate novel hypotheses and thereby assist in scientific discoveries. We describe a methodology to automatically mine the literature and form novel associations through direct analysis of published texts. The method first retrieves a set of documents from databases such as PubMed using a set of relevant domain terms. In the current study these terms yielded document sets ranging from 160,909 to 367,214 documents. Each document is then represented in a numerical vector form from which an Association Graph is computed which represents relationships between all pairs of domain terms, based on co-occurrence. Association graphs can then be subjected to various graph theoretic algorithms such as transitive closure and cycle (circuit) detection to derive additional information, and can also be visually presented to a human researcher for understanding. In this paper, we present three relatively small-scale problem-specific case studies to demonstrate that such an approach is very successful in replicating a neuroscience expert's mental model of object-object associations entirely by means of text mining. These preliminary results provide the confidence that this type of text mining based research approach provides an extremely powerful tool to better understand the literature and drive novel discovery for the neuroscience community.
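The sketch below builds a small co-occurrence association graph with networkx and inspects its cycles, as a simplified stand-in for the described pipeline; the term sets and weighting scheme are assumptions made for the example.

```python
from itertools import combinations
from collections import Counter
import networkx as nx

# Hypothetical per-document term sets; real inputs would come from PubMed abstracts.
doc_terms = [
    {"dopamine", "prefrontal cortex", "reward"},
    {"dopamine", "reward", "addiction"},
    {"prefrontal cortex", "working memory", "dopamine"},
]

cooc = Counter()
for terms in doc_terms:
    cooc.update(combinations(sorted(terms), 2))  # count pairwise co-occurrences per document

G = nx.Graph()
for (a, b), count in cooc.items():
    G.add_edge(a, b, weight=count)

print(sorted(G.edges(data="weight")))
print(nx.cycle_basis(G))  # cycles point to groups of mutually associated entities
```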
The Council of Europe and Sport, 1966-1998. Volume III: Texts of the Anti-Doping Convention.
ERIC Educational Resources Information Center
Council of Europe, Strasbourg (France).
This document presents texts in the field of sports and doping that were adopted by various committees of the Council of Europe. The seven sections present: (1) "Texts Adopted by the Committee of Ministers, 1966-1988"; (2) "Texts Adopted at the Conferences of European Ministers Responsible for Sport Since 1978" and…
A Study of Readability of Texts in Bangla through Machine Learning Approaches
ERIC Educational Resources Information Center
Sinha, Manjira; Basu, Anupam
2016-01-01
In this work, we have investigated text readability in the Bangla language. Text readability is an indicator of the suitability of a given document with respect to a target reader group. Therefore, text readability has a huge impact on educational content preparation. The advances in the field of natural language processing have enabled the automatic…
A Survey of Text Materials Used in Aviation Maintenance Technician Schools. Final Report.
ERIC Educational Resources Information Center
Allen, David; Bowers, William K.
The report documents the results of a national survey of book publishing firms and aviation maintenance technician schools to (1) identify the text materials used in the training of aviation mechanics; (2) appraise the suitability and availability of identified text materials; and (3) determine the adequacy of the text materials in meeting the…
Automation for System Safety Analysis
NASA Technical Reports Server (NTRS)
Malin, Jane T.; Fleming, Land; Throop, David; Thronesbery, Carroll; Flores, Joshua; Bennett, Ted; Wennberg, Paul
2009-01-01
This presentation describes work to integrate a set of tools to support early model-based analysis of failures and hazards due to system-software interactions. The tools perform and assist analysts in the following tasks: 1) extract model parts from text for architecture and safety/hazard models; 2) combine the parts with library information to develop the models for visualization and analysis; 3) perform graph analysis and simulation to identify and evaluate possible paths from hazard sources to vulnerable entities and functions, in nominal and anomalous system-software configurations and scenarios; and 4) identify resulting candidate scenarios for software integration testing. There has been significant technical progress in model extraction from Orion program text sources, architecture model derivation (components and connections) and documentation of extraction sources. Models have been derived from Internal Interface Requirements Documents (IIRDs) and FMEA documents. Linguistic text processing is used to extract model parts and relationships, and the Aerospace Ontology also aids automated model development from the extracted information. Visualizations of these models assist analysts in requirements overview and in checking consistency and completeness.
Correcting geometric and photometric distortion of document images on a smartphone
NASA Astrophysics Data System (ADS)
Simon, Christian; Williem; Park, In Kyu
2015-01-01
A set of document image processing algorithms for improving the optical character recognition (OCR) capability of smartphone applications is presented. The scope of the problem covers the geometric and photometric distortion correction of document images. The proposed framework was developed to satisfy industrial requirements. It is implemented on an off-the-shelf smartphone with limited resources in terms of speed and memory. Geometric distortions, i.e., skew and perspective distortion, are corrected by sending horizontal and vertical vanishing points toward infinity in a downsampled image. Photometric distortion includes image degradation from moiré pattern noise and specular highlights. Moiré pattern noise is removed using low-pass filters with different sizes independently applied to the background and text region. The contrast of the text in a specular highlighted area is enhanced by locally enlarging the intensity difference between the background and text while the noise is suppressed. Intensive experiments indicate that the proposed methods show a consistent and robust performance on a smartphone with a runtime of less than 1 s.
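A rough OpenCV sketch of the two correction steps is given below, assuming the four page corners have already been detected; the corner coordinates, filter sizes, and thresholding parameters are illustrative, not the values used in the paper.

```python
import cv2
import numpy as np

img = cv2.imread("page.jpg")  # hypothetical input photo of a document

# Geometric correction: map the four detected page corners (assumed known here)
# to an upright rectangle, which amounts to pushing the vanishing points to infinity.
src = np.float32([[40, 60], [980, 90], [1010, 1400], [20, 1380]])
dst = np.float32([[0, 0], [1000, 0], [1000, 1414], [0, 1414]])
H = cv2.getPerspectiveTransform(src, dst)
rectified = cv2.warpPerspective(img, H, (1000, 1414))

# Photometric correction (very rough stand-in): low-pass filter to suppress
# moire-like noise, then rebinarize so text contrast is preserved.
gray = cv2.cvtColor(rectified, cv2.COLOR_BGR2GRAY)
smoothed = cv2.GaussianBlur(gray, (5, 5), 0)
binary = cv2.adaptiveThreshold(smoothed, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 31, 15)
cv2.imwrite("page_clean.png", binary)
```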
Boost OCR accuracy using iVector based system combination approach
NASA Astrophysics Data System (ADS)
Peng, Xujun; Cao, Huaigu; Natarajan, Prem
2015-01-01
Optical character recognition (OCR) is a challenging task because most existing preprocessing approaches are sensitive to writing style, writing material, noises and image resolution. Thus, a single recognition system cannot address all factors of real document images. In this paper, we describe an approach to combine diverse recognition systems by using iVector based features, which is a newly developed method in the field of speaker verification. Prior to system combination, document images are preprocessed and text line images are extracted with different approaches for each system, where iVector is transformed from a high-dimensional supervector of each text line and is used to predict the accuracy of OCR. We merge hypotheses from multiple recognition systems according to the overlap ratio and the predicted OCR score of text line images. We present evaluation results on an Arabic document database where the proposed method is compared against the single best OCR system using word error rate (WER) metric.
High-Reproducibility and High-Accuracy Method for Automated Topic Classification
NASA Astrophysics Data System (ADS)
Lancichinetti, Andrea; Sirer, M. Irmak; Wang, Jane X.; Acuna, Daniel; Körding, Konrad; Amaral, Luís A. Nunes
2015-01-01
Much of human knowledge sits in large databases of unstructured text. Leveraging this knowledge requires algorithms that extract and record metadata on unstructured text documents. Assigning topics to documents will enable intelligent searching, statistical characterization, and meaningful classification. Latent Dirichlet allocation (LDA) is the state of the art in topic modeling. Here, we perform a systematic theoretical and numerical analysis that demonstrates that current optimization techniques for LDA often yield results that are not accurate in inferring the most suitable model parameters. Adapting approaches from community detection in networks, we propose a new algorithm that displays high reproducibility and high accuracy and also has high computational efficiency. We apply it to a large set of documents in the English Wikipedia and reveal its hierarchical structure.
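For readers unfamiliar with LDA, the short scikit-learn sketch below fits a two-topic model to toy documents; it illustrates standard LDA rather than the reproducibility-oriented algorithm proposed in the paper, and the documents and parameters are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "neurons synapse cortex spiking activity",
    "stock market prices trading volatility",
    "cortex neurons plasticity learning",
    "market trading risk portfolio returns",
]

counts = CountVectorizer().fit(docs)
X = counts.transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-4:][::-1]]
    print(f"topic {k}: {top}")
print(lda.transform(X).round(2))  # per-document topic mixtures
```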
Liu, Yuanchao; Liu, Ming; Wang, Xin
2015-01-01
The objective of text clustering is to divide document collections into clusters based on the similarity between documents. In this paper, an extension-based feature modeling approach towards semantically sensitive text clustering is proposed along with the corresponding feature space construction and similarity computation method. By combining the similarity in traditional feature space and that in extension space, the adverse effects of the complexity and diversity of natural language can be addressed and clustering semantic sensitivity can be improved correspondingly. The generated clusters can be organized using different granularities. The experimental evaluations on well-known clustering algorithms and datasets have verified the effectiveness of our approach. PMID:25794172
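A minimal sketch of blending similarity across two feature spaces is shown below; the cosine measure, the fixed mixing weight alpha, and the toy vectors are assumptions for illustration, not the authors' extension-space construction.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def combined_similarity(doc_a, doc_b, alpha=0.6):
    """Weighted blend of similarity in the surface (term) space and an extension
    (e.g., expanded/semantic) space; alpha is a tunable mixing weight."""
    return alpha * cosine(doc_a["terms"], doc_b["terms"]) + \
           (1 - alpha) * cosine(doc_a["extension"], doc_b["extension"])

a = {"terms": np.array([1.0, 0.0, 2.0]), "extension": np.array([0.4, 0.9])}
b = {"terms": np.array([0.0, 1.0, 2.0]), "extension": np.array([0.5, 0.8])}
print(round(combined_similarity(a, b), 3))
```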
Use of Co-occurrences for Temporal Expressions Annotation
NASA Astrophysics Data System (ADS)
Craveiro, Olga; Macedo, Joaquim; Madeira, Henrique
The annotation or extraction of temporal information from text documents is becoming increasingly important in many natural language processing applications such as text summarization, information retrieval, question answering, etc. This paper presents an original method for easy recognition of temporal expressions in text documents. The method creates semantically classified temporal patterns, using word co-occurrences obtained from training corpora and a pre-defined set of seed keywords derived from the temporal references of the language in use. Participation in a Portuguese named-entity evaluation contest showed promising effectiveness and efficiency results. This approach can be adapted to recognize other types of expressions or languages, within other contexts, by defining suitable word sets and training corpora.
ASM Based Synthesis of Handwritten Arabic Text Pages
Dinges, Laslo; Al-Hamadi, Ayoub; Elzobi, Moftah; El-Etriby, Sherif; Ghoneim, Ahmed
2015-01-01
Document analysis tasks such as text recognition, word spotting, or segmentation are highly dependent on comprehensive and suitable databases for training and validation. However, their generation is expensive in terms of labor and time. As a matter of fact, there is a lack of such databases, which complicates research and development. This is especially true for the case of Arabic handwriting recognition, which involves different preprocessing, segmentation, and recognition methods that have individual demands on samples and ground truth. To bypass this problem, we present an efficient system that automatically turns Arabic Unicode text into synthetic images of handwritten documents and detailed ground truth. Active Shape Models (ASMs) based on 28046 online samples were used for character synthesis and statistical properties were extracted from the IESK-arDB database to simulate baselines and word slant or skew. In the synthesis step ASM based representations are composed into words and text pages, smoothed by B-Spline interpolation and rendered considering writing speed and pen characteristics. Finally, we use the synthetic data to validate a segmentation method. An experimental comparison with the IESK-arDB database encourages training and testing document analysis methods on synthetic samples whenever sufficient natural ground-truthed data is unavailable. PMID:26295059
Document similarity measures and document browsing
NASA Astrophysics Data System (ADS)
Ahmadullin, Ildus; Fan, Jian; Damera-Venkata, Niranjan; Lim, Suk Hwan; Lin, Qian; Liu, Jerry; Liu, Sam; O'Brien-Strain, Eamonn; Allebach, Jan
2011-03-01
Managing large document databases is an important task today. Being able to automatically compare document layouts and classify and search documents with respect to their visual appearance proves to be desirable in many applications. We measure single page documents' similarity with respect to distance functions between three document components: background, text, and saliency. Each document component is represented as a Gaussian mixture distribution; and distances between different documents' components are calculated as probabilistic similarities between corresponding distributions. The similarity measure between documents is represented as a weighted sum of the components' distances. Using this document similarity measure, we propose a browsing mechanism operating on a document dataset. For these purposes, we use a hierarchical browsing environment which we call the document similarity pyramid. It allows the user to browse a large document dataset and to search for documents in the dataset that are similar to the query. The user can browse the dataset on different levels of the pyramid, and zoom into the documents that are of interest.
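The weighted-sum idea can be sketched as below; for brevity each component is reduced to a single feature vector compared with a Euclidean distance, whereas the paper models components as Gaussian mixtures and uses probabilistic distances, so the weights and vectors here are purely illustrative.

```python
import numpy as np

WEIGHTS = {"background": 0.2, "text": 0.5, "saliency": 0.3}  # illustrative weights

def document_distance(doc_a, doc_b, weights=WEIGHTS):
    """Weighted sum of per-component distances (simplified to Euclidean here)."""
    return sum(w * float(np.linalg.norm(doc_a[c] - doc_b[c]))
               for c, w in weights.items())

doc1 = {"background": np.array([0.9, 0.1]), "text": np.array([0.3, 0.7]), "saliency": np.array([0.5, 0.5])}
doc2 = {"background": np.array([0.8, 0.2]), "text": np.array([0.4, 0.6]), "saliency": np.array([0.1, 0.9])}
print(round(document_distance(doc1, doc2), 3))
```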
Pankau, Thomas; Wichmann, Gunnar; Neumuth, Thomas; Preim, Bernhard; Dietz, Andreas; Stumpp, Patrick; Boehm, Andreas
2015-10-01
Many treatment approaches are available for head and neck cancer (HNC), leading to challenges for a multidisciplinary medical team in matching each patient with an appropriate regimen. In this effort, primary diagnostics and its reliable documentation are indispensable. A three-dimensional (3D) documentation system was developed and tested to determine its influence on interpretation of these data, especially for TNM classification. A total of 42 HNC patient data sets were available, including primary diagnostics such as panendoscopy, performed and evaluated by an experienced head and neck surgeon. In addition to the conventional panendoscopy form and report, a 3D representation was generated with the "Tumor Therapy Manager" (TTM) software. These cases were randomly re-evaluated by 11 experienced otolaryngologists from five hospitals, half with and half without the TTM data. The accuracy of tumor staging was assessed by pre-post comparison of the TNM classification. TNM staging showed no significant differences in tumor classification (T) with and without 3D from TTM. However, there was a significant decrease in standard deviation from 0.86 to 0.63 via TTM ([Formula: see text]). In nodal staging without TTM, the lymph nodes (N) were significantly underestimated with [Formula: see text] classes compared with [Formula: see text] with TTM ([Formula: see text]). Likewise, the standard deviation was reduced from 0.79 to 0.69 ([Formula: see text]). There was no influence of TTM results on the evaluation of distant metastases (M). TNM staging was more reproducible and nodal staging more accurate when 3D documentation of HNC primary data was available to experienced otolaryngologists. The more precise assessment of the tumor classification with TTM should provide improved decision-making concerning therapy, especially within the interdisciplinary tumor board.
On the Reconstruction of Text Phylogeny Trees: Evaluation and Analysis of Textual Relationships
Marmerola, Guilherme D.; Dias, Zanoni; Goldenstein, Siome; Rocha, Anderson
2016-01-01
Over the history of mankind, textual records change. Sometimes due to mistakes during transcription, sometimes on purpose, as a way to rewrite facts and reinterpret history. There are several classical cases, such as the logarithmic tables, and the transmission of antique and medieval scholarship. Today, text documents are largely edited and redistributed on the Web. Articles on news portals and collaborative platforms (such as Wikipedia), source code, posts on social networks, and even scientific publications or literary works are some examples in which textual content can be subject to changes in an evolutionary process. In this scenario, given a set of near-duplicate documents, it is worthwhile to find which one is the original and the history of changes that created the whole set. Such functionality would have immediate applications on news tracking services, detection of plagiarism, textual criticism, and copyright enforcement, for instance. However, this is not an easy task, as textual features pointing to the documents’ evolutionary direction may not be evident and are often dataset dependent. Moreover, side information, such as time stamps, are neither always available nor reliable. In this paper, we propose a framework for reliably reconstructing text phylogeny trees, and seamlessly exploring new approaches on a wide range of scenarios of text reusage. We employ and evaluate distinct combinations of dissimilarity measures and reconstruction strategies within the proposed framework, and evaluate each approach with extensive experiments, including a set of artificial near-duplicate documents with known phylogeny, and from documents collected from Wikipedia, whose modifications were made by Internet users. We also present results from qualitative experiments in two different applications: text plagiarism and reconstruction of evolutionary trees for manuscripts (stemmatology). PMID:27992446
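As a simplified stand-in for phylogeny reconstruction, the sketch below computes pairwise dissimilarities between near-duplicate texts and extracts a minimum spanning tree as an undirected skeleton of their relationships; the framework in the paper additionally infers edge directions and evaluates several dissimilarity measures, so the toy versions and measure here are assumptions.

```python
import difflib
import networkx as nx

versions = {
    "v0": "the quick brown fox jumps over the lazy dog",
    "v1": "the quick brown fox leaps over the lazy dog",
    "v2": "a quick brown fox leaps over a lazy dog",
    "v3": "the quick brown fox jumps over the sleepy dog",
}

def dissimilarity(a, b):
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()

G = nx.Graph()
for u in versions:
    for v in versions:
        if u < v:
            G.add_edge(u, v, weight=dissimilarity(versions[u], versions[v]))

tree = nx.minimum_spanning_tree(G)  # undirected skeleton of the phylogeny
print(sorted(tree.edges(data="weight")))
```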
NASA Astrophysics Data System (ADS)
Boling, M. E.
1989-09-01
Prototypes were assembled pursuant to recommendations made in report K/DSRD-96, Issues and Approaches for Electronic Document Approval and Transmittal Using Digital Signatures and Text Authentication, and to examine and discover the possibilities for integrating available hardware and software to provide cost effective systems for digital signatures and text authentication. These prototypes show that on a LAN, a multitasking, windowed, mouse/keyboard menu-driven interface can be assembled to provide easy and quick access to bit-mapped images of documents, electronic forms and electronic mail messages with a means to sign, encrypt, deliver, receive or retrieve and authenticate text and signatures. In addition they show that some of this same software may be used in a classified environment using host to terminal transactions to accomplish these same operations. Finally, a prototype was developed demonstrating that binary files may be signed electronically and sent by point to point communication and over ARPANET to remote locations where the authenticity of the code and signature may be verified. Related studies on the subject of electronic signatures and text authentication using public key encryption were done within the Department of Energy. These studies include timing studies of public key encryption software and hardware and testing of experimental user-generated host resident software for public key encryption. This software used commercially available command-line source code. These studies are responsive to an initiative within the Office of the Secretary of Defense (OSD) for the protection of unclassified but sensitive data. It is notable that these related studies are all built around the same commercially available public key encryption products from the private sector and that the software selection was made independently by each study group.
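A present-day sketch of the sign-and-verify workflow those prototypes implemented is shown below, using the Python cryptography package rather than the commercial public key products mentioned in the report; the document bytes and key size are illustrative.

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

document = b"Contents of the electronic form to be approved."

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
signature = private_key.sign(
    document,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)

# The recipient verifies with the sender's public key; any alteration of the
# document or signature raises InvalidSignature.
private_key.public_key().verify(
    signature,
    document,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)
print("signature verified")
```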
Visualizing the semantic content of large text databases using text maps
NASA Technical Reports Server (NTRS)
Combs, Nathan
1993-01-01
A methodology for generating text map representations of the semantic content of text databases is presented. Text maps provide a graphical metaphor for conceptualizing and visualizing the contents and data interrelationships of large text databases. Described are a set of experiments conducted against the TIPSTER corpora of Wall Street Journal articles. These experiments provide an introduction to current work in the representation and visualization of documents by way of their semantic content.
Enriching a document collection by integrating information extraction and PDF annotation
NASA Astrophysics Data System (ADS)
Powley, Brett; Dale, Robert; Anisimoff, Ilya
2009-01-01
Modern digital libraries offer all the hyperlinking possibilities of the World Wide Web: when a reader finds a citation of interest, in many cases she can now click on a link to be taken to the cited work. This paper presents work aimed at providing the same ease of navigation for legacy PDF document collections that were created before the possibility of integrating hyperlinks into documents was ever considered. To achieve our goal, we need to carry out two tasks: first, we need to identify and link citations and references in the text with high reliability; and second, we need the ability to determine physical PDF page locations for these elements. We demonstrate the use of a high-accuracy citation extraction algorithm which significantly improves on earlier reported techniques, and a technique for integrating PDF processing with a conventional text-stream based information extraction pipeline. We demonstrate these techniques in the context of a particular document collection, this being the ACL Anthology; but the same approach can be applied to other document sets.
Munkhdalai, Tsendsuren; Li, Meijing; Batsuren, Khuyagbaatar; Park, Hyeon Ah; Choi, Nak Hyeon; Ryu, Keun Ho
2015-01-01
Chemical and biomedical Named Entity Recognition (NER) is an essential prerequisite task before effective text mining can begin for biochemical-text data. Exploiting unlabeled text data to leverage system performance has been an active and challenging research topic in text mining due to the recent growth in the amount of biomedical literature. We present a semi-supervised learning method that efficiently exploits unlabeled data in order to incorporate domain knowledge into a named entity recognition model and to leverage system performance. The proposed method includes Natural Language Processing (NLP) tasks for text preprocessing, learning word representation features from a large amount of text data for feature extraction, and conditional random fields for token classification. Other than the free text in the domain, the proposed method does not rely on any lexicon nor any dictionary in order to keep the system applicable to other NER tasks in bio-text data. We extended BANNER, a biomedical NER system, with the proposed method. This yields an integrated system that can be applied to chemical and drug NER or biomedical NER. We call our branch of the BANNER system BANNER-CHEMDNER, which is scalable over millions of documents, processing about 530 documents per minute, is configurable via XML, and can be plugged into other systems by using the BANNER Unstructured Information Management Architecture (UIMA) interface. BANNER-CHEMDNER achieved an 85.68% and an 86.47% F-measure on the testing sets of CHEMDNER Chemical Entity Mention (CEM) and Chemical Document Indexing (CDI) subtasks, respectively, and achieved an 87.04% F-measure on the official testing set of the BioCreative II gene mention task, showing remarkable performance in both chemical and biomedical NER. BANNER-CHEMDNER system is available at: https://bitbucket.org/tsendeemts/banner-chemdner.
Federal Register 2010, 2011, 2012, 2013, 2014
2012-08-01
... restrictions has been revised. The Designated List and the regulatory text in that document contain language which is inadvertently not consistent with the rest of the document as to the historical period that the...
10 CFR 961.11 - Text of the contract.
Code of Federal Regulations, 2014 CFR
2014-01-01
... program including information on cost projections, project plans and progress reports. 5. (a) Beginning on...-type documents or computer software (including computer programs, computer software data bases, and computer software documentation). Examples of technical data include research and engineering data...
10 CFR 961.11 - Text of the contract.
Code of Federal Regulations, 2013 CFR
2013-01-01
... program including information on cost projections, project plans and progress reports. 5. (a) Beginning on...-type documents or computer software (including computer programs, computer software data bases, and computer software documentation). Examples of technical data include research and engineering data...
Creating a Gold Standard for the Readability Measurement of Health Texts
Kandula, Sasikiran; Zeng-Treitler, Qing
2008-01-01
Developing easy-to-read health texts for consumers continues to be a challenge in health communication. Though readability formulae such as Flesch-Kincaid Grade Level have been used in many studies, they were found to be inadequate to estimate the difficulty of some types of health texts. One impediment to the development of new readability assessment techniques is the absence of a gold standard that can be used to validate them. To overcome this deficiency, we have compiled a corpus of 324 health documents consisting of six different types of texts. These documents were manually reviewed and assigned a readability level (1-7 Likert scale) by a panel of five health literacy experts. The expert assigned ratings were found to be highly correlated with a patient representative’s readability ratings (r = 0.81, p<0.0001). PMID:18999150
TES: A Text Extraction System.
ERIC Educational Resources Information Center
Goh, A.; Hui, S. C.
1996-01-01
Describes how TES, a text extraction system, is able to electronically retrieve a set of sentences from a document to form an indicative abstract. Discusses various text abstraction techniques and related work in the area, provides an overview of the TES system, and compares system results against manually produced abstracts. (LAM)
The Relative Ease of Writing Narrative Text.
ERIC Educational Resources Information Center
Kellogg, Ronald T.; And Others
A study investigated whether the narrative writing task is more compatible with the structure of conscious thought than are other writing tasks. If so, composing a narrative text should demand less cognitive effort, occur more fluently, and yield a more coherent document than composing persuasive or descriptive texts. Sixteen college students were…
Death as Insight into Life: Adolescents' Gothic Text Encounters
ERIC Educational Resources Information Center
Del Nero, Jennifer
2017-01-01
This qualitative case study explores adolescents' responses to texts containing death and destruction, a seminal trope of the Gothic literary genre. Participants read both classic and popular culture texts featuring characters grappling with death in their seventh grade reading classroom. Observations, interviews, and documents were collected and…
Analysing Representations of Otherness Using Different Text-Types.
ERIC Educational Resources Information Center
Murphy-LeJeune, Elizabeth; And Others
1996-01-01
Demonstrates how the teacher can use texts to confront learners with cultural representations. Four texts are used to represent a literary extract, a student essay, an advertising document, and a newspaper article. The article illustrates approaches that borrow from stylistics, linguistics, and discourse analysis. (21 references) (Author/CK)
An Overall Perspective of Machine Translation with Its Shortcomings
ERIC Educational Resources Information Center
Akbari, Alireza
2014-01-01
The petition for language translation has strikingly augmented recently due to cross-cultural communication and exchange of information. In order to communicate well, text should be translated correctly and completely in each field such as legal documents, technical texts, scientific texts, publicity leaflets, and instructional materials. In this…
Semantic Metadata for Heterogeneous Spatial Planning Documents
NASA Astrophysics Data System (ADS)
Iwaniak, A.; Kaczmarek, I.; Łukowicz, J.; Strzelecki, M.; Coetzee, S.; Paluszyński, W.
2016-09-01
Spatial planning documents contain information about the principles and rights of land use in different zones of a local authority. They are the basis for administrative decision making in support of sustainable development. In Poland these documents are published on the Web according to a prescribed non-extendable XML schema, designed for optimum presentation to humans in HTML web pages. There is no document standard, and limited functionality exists for adding references to external resources. The text in these documents is discoverable and searchable by general-purpose web search engines, but the semantics of the content cannot be discovered or queried. The spatial information in these documents is geographically referenced but not machine-readable. Major manual efforts are required to integrate such heterogeneous spatial planning documents from various local authorities for analysis, scenario planning and decision support. This article presents results of an implementation using machine-readable semantic metadata to identify relationships among regulations in the text, spatial objects in the drawings and links to external resources. A spatial planning ontology was used to annotate different sections of spatial planning documents with semantic metadata in the Resource Description Framework in Attributes (RDFa). The semantic interpretation of the content, links between document elements and links to external resources were embedded in XHTML pages. An example and use case from the spatial planning domain in Poland is presented to evaluate its efficiency and applicability. The solution enables the automated integration of spatial planning documents from multiple local authorities to assist decision makers with understanding and interpreting spatial planning information. The approach is equally applicable to legal documents from other countries and domains, such as cultural heritage and environmental management.
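The sketch below shows how one regulation paragraph might be wrapped in RDFa attributes from Python; the vocabulary URI, property names, and zone identifiers are hypothetical examples, not the ontology used in the Polish implementation.

```python
# Hypothetical vocabulary URI and property names, for illustration only.
PLAN_NS = "http://example.org/spatial-plan#"

def rdfa_regulation(zone_id: str, zone_uri: str, regulation_text: str) -> str:
    """Wrap one regulation paragraph in RDFa so machines can link it to its zone."""
    return (
        f'<div xmlns:plan="{PLAN_NS}" about="{zone_uri}" typeof="plan:Zone">\n'
        f'  <span property="plan:zoneIdentifier">{zone_id}</span>\n'
        f'  <p property="plan:regulationText">{regulation_text}</p>\n'
        f'</div>'
    )

print(rdfa_regulation(
    "MW-3",
    "http://example.org/plans/wroclaw/zones/MW-3",
    "Residential buildings may not exceed four storeys in this zone.",
))
```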
Recent Experiments with INQUERY
1995-11-01
were conducted with a version of the INQUERY information retrieval system. INQUERY is based on the Bayesian inference network retrieval model. It is ... corpus-based query expansion. For TREC, a subset of the ad hoc document set was used to build the InFinder database. None of the ... experiments that showed significant improvements in retrieval effectiveness when document rankings based on the entire document text are combined with
2010-11-05
The Food and Drug Administration (FDA) is announcing the reclassification of the full-field digital mammography (FFDM) system from class III (premarket approval) to class II (special controls). The device type is intended to produce planar digital x-ray images of the entire breast; this generic type of device may include digital mammography acquisition software, full-field digital image receptor, acquisition workstation, automatic exposure control, image processing and reconstruction programs, patient and equipment supports, component parts, and accessories. The special control that will apply to the device is the guidance document entitled "Class II Special Controls Guidance Document: Full-Field Digital Mammography System." FDA is reclassifying the device into class II (special controls) because general controls along with special controls will provide a reasonable assurance of safety and effectiveness of the device. Elsewhere in this issue of the Federal Register, FDA is announcing the availability of the guidance document that will serve as the special control for this device.
Text Mining the History of Medicine
Thompson, Paul; Batista-Navarro, Riza Theresa; Kontonatsios, Georgios; Carter, Jacob; Toon, Elizabeth; McNaught, John; Timmermann, Carsten; Worboys, Michael; Ananiadou, Sophia
2016-01-01
Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, owing to differences and evolution in vocabulary, terminology, language structure and style compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid 19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible semantically-oriented search system. The novel resources are available for research purposes, while the processing pipeline and its modules may be used and configured within the Argo TM platform. PMID:26734936
NASA Astrophysics Data System (ADS)
Martínez-Álvarez, Patricia
2017-09-01
The field of bilingual special education is currently plagued with contradictions resulting in a serious underrepresentation of emergent bilinguals with learning disabilities in professional science fields. This underrepresentation is due in large part to the fact that educational systems around the world are inadequately prepared to address the educational needs of these children; this inadequacy is rooted in a lack of understanding of the linguistic and cultural factors impacting learning. Accepting such a premise and assuming that children learn in unexpected ways when instructional practices attend to culture and language, this study documents a place-based learning experience integrating geoscience and literacy in a fourth-grade dual language classroom. Data sources include transcribed audio-taped conversations from learning experience sessions and interviews that took place as six focus children, who had been identified as having specific learning disabilities, read published science texts (i.e. texts unaltered linguistically or conceptually to meet the needs of the readers). My analysis revealed that participants generated responses that were often unexpected if solely analyzed from those Western scientific perspectives traditionally valued in school contexts. However, these responses were also full of purposeful and rich understandings that revealed opportunities for expansive learning. Adopting a cultural historical activity theory perspective, instructional tools such as texts, visuals, and questions were found to act as mediators impacting the learning in both activity systems: (a) teacher-researcher learning from children, and (b) children learning from teachers. I conclude by suggesting that there is a need to understand students' ways of knowing to their full complexity, and to deliberately recognize teachers as learners, researchers, and means to expansive learning patterns that span beyond traditional learning boundaries.
Interactive publications: creation and usage
NASA Astrophysics Data System (ADS)
Thoma, George R.; Ford, Glenn; Chung, Michael; Vasudevan, Kirankumar; Antani, Sameer
2006-02-01
As envisioned here, an "interactive publication" has similarities to multimedia documents that have been in existence for a decade or more, but possesses specific differentiating characteristics. In common usage, the latter refers to online entities that, in addition to text, consist of files of images and video clips residing separately in databases, rarely providing immediate context to the document text. While an interactive publication has many media objects as does the "traditional" multimedia document, it is a self-contained document, either as a single file with media files embedded within it, or as a "folder" containing tightly linked media files. The main characteristic that differentiates an interactive publication from a traditional multimedia document is that the reader would be able to reuse the media content for analysis and presentation, and to check the underlying data and possibly derive alternative conclusions leading, for example, to more in-depth peer reviews. We have created prototype publications containing paginated text and several media types encountered in the biomedical literature: 3D animations of anatomic structures; graphs, charts and tabular data; cell development images (video sequences); and clinical images such as CT, MRI and ultrasound in the DICOM format. This paper presents developments to date including: a tool to convert static tables or graphs into interactive entities, authoring procedures followed to create prototypes, and advantages and drawbacks of each of these platforms. It also outlines future work including meeting the challenge of network distribution for these large files.
Analysis of Nature of Science Included in Recent Popular Writing Using Text Mining Techniques
NASA Astrophysics Data System (ADS)
Jiang, Feng; McComas, William F.
2014-09-01
This study examined the inclusion of nature of science (NOS) in popular science writing to determine whether it could serve as a supplementary resource for teaching NOS and to evaluate the accuracy of text mining and classification as a viable research tool in science education research. Four groups of documents published from 2001 to 2010 were analyzed: Scientific American, Discover magazine, winners of the Royal Society Winton Prize for Science Books, and books from NSTA's list of Outstanding Science Trade Books. Computer analysis categorized passages in the selected documents based on their inclusion of NOS. Human analysis assessed the frequency, context, coverage, and accuracy of the inclusions of NOS within computer-identified NOS passages. NOS was rarely addressed in the selected document sets but somewhat more frequently addressed in the letters section of the two magazines. This result suggests that readers seem interested in the discussion of NOS-related themes. In the popular science books analyzed, NOS presentations were more likely to be aggregated at the beginning and the end of the book, rather than scattered throughout. The most commonly addressed NOS elements in the analyzed documents are science and society and empiricism in science. Only one inaccurate presentation of NOS was identified in all analyzed documents. The text mining technique demonstrated exciting performance, which invites more applications of the technique to analyze other aspects of science textbooks, popular science writing, or other materials involved in science teaching and learning.
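A minimal sketch of the kind of supervised passage classification referred to above is shown below with scikit-learn; the tiny training passages and labels are invented for illustration and are not the study's classifier or data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; a real study would use many labeled passages.
passages = [
    "scientists revised the model when new evidence contradicted it",
    "the reaction releases oxygen and water as products",
    "peer review and replication shape what counts as scientific knowledge",
    "the mitochondrion is the site of cellular respiration",
]
labels = ["NOS", "not-NOS", "NOS", "not-NOS"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(passages, labels)

print(clf.predict(["theories change as the scientific community gathers new data"]))
```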
ERIC Educational Resources Information Center
Happel, Sue; Loeb, Joyce
Although the activities in this unit are designed primarily for students in the intermediate grades, the document's text, illustrations, and bibliographic references are suitable for anyone interested in learning about Africa. Following a brief introduction and map work, the document is arranged into six sections. Section 1 traces Africa's history…
10 CFR 961.11 - Text of the contract.
Code of Federal Regulations, 2012 CFR
2012-01-01
... characteristic, of a specific or technical nature. It may, for example, document research, experimental... computer software documentation). Examples of technical data include research and engineering data... repository, to take title to the spent nuclear fuel or high-level radioactive waste involved as expeditiously...
10 CFR 961.11 - Text of the contract.
Code of Federal Regulations, 2011 CFR
2011-01-01
... characteristic, of a specific or technical nature. It may, for example, document research, experimental... computer software documentation). Examples of technical data include research and engineering data... repository, to take title to the spent nuclear fuel or high-level radioactive waste involved as expeditiously...
48 CFR 752.7003 - Documentation for payment.
Code of Federal Regulations, 2011 CFR
2011-10-01
.... 752.7003 Section 752.7003 Federal Acquisition Regulations System AGENCY FOR INTERNATIONAL DEVELOPMENT CLAUSES AND FORMS SOLICITATION PROVISIONS AND CONTRACT CLAUSES Texts of USAID Contract Clauses 752.7003 Documentation for payment. The following clause is required in all USAID direct contracts, excluding fixed price...