Sample records for text document classification

  1. A Semi-supervised Heat Kernel Pagerank MBO Algorithm for Data Classification

    DTIC Science & Technology

    2016-07-01

    financial predictions, etc. and is finding growing use in text mining studies. In this paper, we present an efficient algorithm for classification of high...video data, set of images, hyperspectral data, medical data, text data, etc. Moreover, the framework provides a way to analyze data whose different...also be incorporated. For text classification, one can use tfidf (term frequency inverse document frequency) to form feature vectors for each document
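
    As the snippet above notes, tf-idf (term frequency-inverse document frequency) vectors are a standard way to turn documents into feature vectors for classification. A minimal sketch, assuming scikit-learn is available; the toy corpus is illustrative, not from the paper:

      from sklearn.feature_extraction.text import TfidfVectorizer

      # Toy corpus; each string stands in for one document.
      docs = ["graph based semi supervised learning",
              "heat kernel pagerank on data graphs",
              "spam filtering with text features"]

      vectorizer = TfidfVectorizer()      # tf-idf weighting with the default tokenizer
      X = vectorizer.fit_transform(docs)  # sparse (n_docs x n_terms) feature matrix

      print(X.shape)
      print(vectorizer.get_feature_names_out()[:5])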

  2. Bibliographic Classification Theory and Text Linguistics: Aboutness Analysis, Intertextuality and the Cognitive Act of Classifying Documents.

    ERIC Educational Resources Information Center

    Beghtol, Clare

    1986-01-01

    Explicates a definition and theory of "aboutness" and aboutness analysis developed by text linguist van Dijk; explores implications of text linguistics for bibliographic classification theory; suggests the elements that a theory of the cognitive process of classifying documents needs to encompass; and delineates how people identify…

  3. Classification of forensic autopsy reports through conceptual graph-based document representation model.

    PubMed

    Mujtaba, Ghulam; Shuib, Liyana; Raj, Ram Gopal; Rajandram, Retnagowri; Shaikh, Khairunisa; Al-Garadi, Mohammed Ali

    2018-06-01

    Text categorization has been used extensively in recent years to classify plain-text clinical reports. This study employs text categorization techniques for the classification of open narrative forensic autopsy reports. One of the key steps in text classification is document representation. In document representation, a clinical report is transformed into a format that is suitable for classification. The traditional document representation technique for text categorization is the bag-of-words (BoW) technique. In this study, the traditional BoW technique is ineffective in classifying forensic autopsy reports because it merely extracts frequent terms rather than discriminative features from clinical reports. Moreover, this technique fails to capture word inversion, as well as word-level synonymy and polysemy, when classifying autopsy reports. Hence, the BoW technique suffers from low accuracy and low robustness unless it is improved with contextual and application-specific information. To overcome the aforementioned limitations of the BoW technique, this research aims to develop an effective conceptual graph-based document representation (CGDR) technique to classify 1500 forensic autopsy reports from four (4) manners of death (MoD) and sixteen (16) causes of death (CoD). Term-based and Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) based conceptual features were extracted and represented through graphs. These features were then used to train a two-level text classifier. The first level classifier was responsible for predicting MoD, and the second level classifier was responsible for predicting CoD using the proposed conceptual graph-based document representation technique. To demonstrate the significance of the proposed technique, its results were compared with those of six (6) state-of-the-art document representation techniques. Lastly, this study compared the effects of one-level classification and two-level classification on the experimental results. The experimental results indicated that the CGDR technique achieved 12% to 15% improvement in accuracy compared with fully automated document representation baseline techniques. Moreover, two-level classification obtained better results compared with one-level classification. The promising results of the proposed conceptual graph-based document representation technique suggest that pathologists can adopt the proposed system as a basis for a second opinion, thereby supporting them in effectively determining CoD.
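
    The two-level scheme described above (predict the manner of death first, then the cause of death within it) can be sketched generically: one classifier assigns the coarse label, and a separate classifier per coarse label assigns the fine label. A hedged illustration with scikit-learn; plain tf-idf stands in for the paper's conceptual graph features, and all reports are synthetic:

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.svm import LinearSVC

      # Synthetic reports: (text, coarse label, fine label).
      data = [("blunt force trauma from fall", "accident", "fall"),
              ("drowning water in lungs", "accident", "drowning"),
              ("ligature mark around neck", "suicide", "hanging"),
              ("drug overdose toxicology positive", "suicide", "poisoning"),
              ("gunshot wound to chest", "homicide", "firearm"),
              ("stab wounds to abdomen", "homicide", "sharp force")]
      texts, coarse, fine = zip(*data)

      vec = TfidfVectorizer().fit(texts)
      X = vec.transform(texts)

      level1 = LinearSVC().fit(X, coarse)  # first level: manner of death

      # Second level: one classifier per manner-of-death value.
      level2 = {}
      for label in set(coarse):
          idx = [i for i, c in enumerate(coarse) if c == label]
          level2[label] = LinearSVC().fit(X[idx], [fine[i] for i in idx])

      def predict(report):
          x = vec.transform([report])
          manner = level1.predict(x)[0]
          return manner, level2[manner].predict(x)[0]

      print(predict("gunshot wound to torso"))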

  4. Text Classification for Organizational Researchers

    PubMed Central

    Kobayashi, Vladimer B.; Mol, Stefan T.; Berkers, Hannah A.; Kismihók, Gábor; Den Hartog, Deanne N.

    2017-01-01

    Organizations are increasingly interested in classifying texts or parts thereof into categories, as this enables more effective use of their information. Manual procedures for text classification work well for up to a few hundred documents. However, when the number of documents is larger, manual procedures become laborious, time-consuming, and potentially unreliable. Techniques from text mining facilitate the automatic assignment of text strings to categories, making classification expedient, fast, and reliable, which creates potential for its application in organizational research. The purpose of this article is to familiarize organizational researchers with text mining techniques from machine learning and statistics. We describe the text classification process in several roughly sequential steps, namely training data preparation, preprocessing, transformation, application of classification techniques, and validation, and provide concrete recommendations at each step. To help researchers develop their own text classifiers, the R code associated with each step is presented in a tutorial. The tutorial draws from our own work on job vacancy mining. We end the article by discussing how researchers can validate a text classification model and the associated output. PMID:29881249
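
    The roughly sequential steps listed above (training data preparation, preprocessing, transformation, classification, validation) map naturally onto a pipeline. The article's tutorial is in R; a rough Python equivalent with scikit-learn, on placeholder job-vacancy data, might look like:

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import Pipeline

      # Placeholder training data: vacancy texts and their categories.
      texts = ["sales manager retail experience", "python developer backend",
               "nurse hospital night shifts", "java engineer microservices",
               "account executive b2b sales", "icu nurse critical care"]
      labels = ["sales", "it", "care", "it", "sales", "care"]

      pipe = Pipeline([
          ("transform", TfidfVectorizer(lowercase=True, stop_words="english")),  # preprocessing + transformation
          ("classify", LogisticRegression(max_iter=1000)),                       # classification technique
      ])

      # Validation step: k-fold cross-validated accuracy.
      scores = cross_val_score(pipe, texts, labels, cv=2)
      print(scores.mean())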

  5. Automatic document classification of biological literature

    PubMed Central

    Chen, David; Müller, Hans-Michael; Sternberg, Paul W

    2006-01-01

    Background Document classification is a widespread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature, which marks up full text according to a shallow ontology that includes terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis elegans literature. Results We present a two-step text categorization algorithm to classify a corpus of C. elegans papers. Our classification method first uses a support vector machine-trained classifier, followed by a novel, phrase-based clustering algorithm. This clustering step autonomously creates cluster labels that are descriptive and understandable by humans. This clustering engine performed better on a standard test set (Reuters-21578) compared to previously published results (F-value of 0.55 vs. 0.49), while producing cluster descriptions that appear more useful. A web interface allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept. Conclusion We have demonstrated a simple method to classify biological documents that embodies an improvement over current methods. While the classification results are currently optimized for Caenorhabditis elegans papers by human-created rules, the classification engine can be adapted to different types of documents. We have demonstrated this by presenting a web interface that allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept. PMID:16893465

  6. Prediction of cause of death from forensic autopsy reports using text classification techniques: A comparative study.

    PubMed

    Mujtaba, Ghulam; Shuib, Liyana; Raj, Ram Gopal; Rajandram, Retnagowri; Shaikh, Khairunisa

    2018-07-01

    Automatic text classification techniques are useful for classifying plain-text medical documents. This study aims to automatically predict the cause of death from free-text forensic autopsy reports by comparing various schemes for feature extraction, term weighting or feature value representation, text classification, and feature reduction. For the experiments, autopsy reports belonging to eight different causes of death were collected, preprocessed and converted into 43 master feature vectors using various schemes for feature extraction, representation, and reduction. Six different text classification techniques were applied to these 43 master feature vectors to construct classification models that can predict the cause of death. Finally, classification model performance was evaluated using four performance measures, i.e. overall accuracy, macro precision, macro F-measure, and macro recall. From the experiments, it was found that unigram features obtained the highest performance compared to bigram, trigram, and hybrid-gram features. Furthermore, among the feature representation schemes, term frequency and term frequency with inverse document frequency obtained similar, and better, results when compared with binary frequency and normalized term frequency with inverse document frequency. The chi-square feature reduction approach outperformed the Pearson correlation and information gain approaches. Finally, among the text classification algorithms, the support vector machine classifier outperformed random forest, Naive Bayes, k-nearest neighbor, decision tree, and ensemble-voted classifiers. Our results and comparisons hold practical importance and serve as references for future work. Moreover, the comparison outputs can act as state-of-the-art baselines against which future proposals are compared.
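
    Taken together, the study's findings (unigram features, tf or tf-idf weighting, chi-square feature reduction, an SVM on top) describe a compact pipeline. A sketch of that combination with scikit-learn, with synthetic snippets standing in for the autopsy reports:

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.feature_selection import SelectKBest, chi2
      from sklearn.pipeline import Pipeline
      from sklearn.svm import LinearSVC

      texts = ["myocardial infarction occlusion", "asphyxia ligature neck",
               "multiple blunt injuries collision", "coronary artery thrombosis",
               "hanging ligature furrow", "road traffic accident polytrauma"]
      causes = ["cardiac", "asphyxia", "trauma", "cardiac", "asphyxia", "trauma"]

      pipe = Pipeline([
          ("unigrams", TfidfVectorizer(ngram_range=(1, 1))),  # unigram features, tf-idf weighting
          ("chi2", SelectKBest(chi2, k=10)),                  # chi-square feature reduction
          ("svm", LinearSVC()),                               # best classifier in the study
      ])
      pipe.fit(texts, causes)
      print(pipe.predict(["ligature mark on neck"]))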

  7. Page layout analysis and classification for complex scanned documents

    NASA Astrophysics Data System (ADS)

    Erkilinc, M. Sezer; Jaber, Mustafa; Saber, Eli; Bauer, Peter; Depalov, Dejan

    2011-09-01

    A framework for region/zone classification in color and gray-scale scanned documents is proposed in this paper. The algorithm includes modules for extracting text, photo, and strong edge/line regions. First, a text detection module based on wavelet analysis and the Run Length Encoding (RLE) technique is employed. Local and global energy maps in high-frequency bands of the wavelet domain are generated and used as initial text maps. Further analysis using RLE yields a final text map. The second module is developed to detect image/photo and pictorial regions in the input document. A block-based classifier using basis vector projections is employed to identify photo candidate regions. A final photo map is then obtained by applying a probabilistic Markov random field (MRF) model with maximum a posteriori (MAP) optimization using iterated conditional modes (ICM). The final module detects lines and strong edges using the Hough transform and edge-linkage analysis, respectively. The text, photo, and strong edge/line maps are combined to generate a page layout classification of the scanned target document. Experimental results and objective evaluation show that the proposed technique performs very effectively on a variety of simple and complex scanned document types from the MediaTeam Oulu document database. The proposed page layout classifier can be used in systems for efficient document storage, content-based document retrieval, optical character recognition, mobile phone imagery, and augmented reality.

  8. PDF text classification to leverage information extraction from publication reports.

    PubMed

    Bui, Duy Duc An; Del Fiol, Guilherme; Jonnalagadda, Siddhartha

    2016-06-01

    Data extraction from original study reports is a time-consuming, error-prone process in systematic review development. Information extraction (IE) systems have the potential to assist humans in the extraction task; however, the majority of IE systems were not designed to work on Portable Document Format (PDF) documents, an important and common extraction source for systematic reviews. In a PDF document, narrative content is often mixed with publication metadata or semi-structured text, which adds challenges to the underlying natural language processing algorithm. Our goal is to categorize PDF texts for strategic use by IE systems. We used an open-source tool to extract raw texts from a PDF document and developed a text classification algorithm that follows a multi-pass sieve framework to automatically classify PDF text snippets (for brevity, texts) into TITLE, ABSTRACT, BODYTEXT, SEMISTRUCTURE, and METADATA categories. To validate the algorithm, we developed a gold standard of PDF reports that were included in the development of previous systematic reviews by the Cochrane Collaboration. In a two-step procedure, we evaluated (1) classification performance, comparing it with machine learning classifiers, and (2) the effects of the algorithm on an IE system that extracts clinical outcome mentions. The multi-pass sieve algorithm achieved an accuracy of 92.6%, which was 9.7% (p<0.001) higher than the best performing machine learning classifier, which used a logistic regression algorithm. F-measure improvements were observed in the classification of TITLE (+15.6%), ABSTRACT (+54.2%), BODYTEXT (+3.7%), SEMISTRUCTURE (+34%), and METADATA (+14.2%). In addition, use of the algorithm to filter semi-structured texts and publication metadata improved performance of the outcome extraction system (F-measure +4.1%, p=0.002). It also reduced the number of sentences to be processed by 44.9% (p<0.001), which corresponds to a processing time reduction of 50% (p=0.005). The rule-based multi-pass sieve framework can be used effectively in categorizing texts extracted from PDF documents. Text classification is an important prerequisite step to leverage information extraction from PDF documents.
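
    A multi-pass sieve applies ordered, increasingly general rules and lets the first rule that fires assign the category. A greatly simplified sketch of the idea in Python; the rules below are invented stand-ins, not the published ones:

      import re

      # Each sieve is (category, predicate); earlier sieves take priority.
      SIEVES = [
          ("METADATA",      lambda t: re.search(r"doi:|©|downloaded from", t, re.I)),
          ("TITLE",         lambda t: len(t.split()) < 15 and t.istitle()),
          ("ABSTRACT",      lambda t: t.lower().startswith("abstract")),
          ("SEMISTRUCTURE", lambda t: t.count(";") + t.count(":") > 3),
      ]

      def classify_snippet(text):
          for category, matches in SIEVES:
              if matches(text):
                  return category
          return "BODYTEXT"  # default pass: everything else is narrative body text

      print(classify_snippet("Abstract: We propose ..."))
      print(classify_snippet("Randomized trial of aspirin versus placebo."))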

  9. Simple-random-sampling-based multiclass text classification algorithm.

    PubMed

    Liu, Wuying; Wang, Lin; Yi, Mianzhu

    2014-01-01

    Multiclass text classification (MTC) is a challenging issue and the corresponding MTC algorithms can be used in many applications. The space-time overhead of the algorithms must be concerned about the era of big data. Through the investigation of the token frequency distribution in a Chinese web document collection, this paper reexamines the power law and proposes a simple-random-sampling-based MTC (SRSMTC) algorithm. Supported by a token level memory to store labeled documents, the SRSMTC algorithm uses a text retrieval approach to solve text classification problems. The experimental results on the TanCorp data set show that SRSMTC algorithm can achieve the state-of-the-art performance at greatly reduced space-time requirements.

  10. Biomedical literature classification using encyclopedic knowledge: a Wikipedia-based bag-of-concepts approach.

    PubMed

    Mouriño García, Marcos Antonio; Pérez Rodríguez, Roberto; Anido Rifón, Luis E

    2015-01-01

    Automatic classification of text documents into a set of categories has many applications. Among those applications, the automatic classification of biomedical literature stands out as an important application for automatic document classification strategies. Biomedical staff and researchers have to deal with a lot of literature in their daily activities, so a system that allows documents of interest to be accessed in a simple and effective way would be useful; for this, the documents must be sorted based on some criteria, that is to say, they have to be classified. Documents to classify are usually represented following the bag-of-words (BoW) paradigm. Features are words in the text (thus suffering from synonymy and polysemy) and their weights are just based on their frequency of occurrence. This paper presents an empirical study of the efficiency of a classifier that leverages encyclopedic background knowledge (concretely, Wikipedia) in order to create bag-of-concepts (BoC) representations of documents, understanding a concept as a "unit of meaning", and thus tackling synonymy and polysemy. Besides, the weighting of concepts is based on their semantic relevance in the text. For the evaluation of the proposal, empirical experiments were conducted with one of the corpora commonly used for evaluating classification and retrieval of biomedical information, OHSUMED, and also with a purpose-built corpus of MEDLINE biomedical abstracts, UVigoMED. The results obtained show that the Wikipedia-based bag-of-concepts representation outperforms the classical bag-of-words representation by up to 157% in the single-label classification problem and up to 100% in the multi-label problem for the OHSUMED corpus, and by up to 122% in the single-label classification problem and up to 155% in the multi-label problem for the UVigoMED corpus.
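
    The bag-of-concepts idea replaces surface words with concept identifiers, so synonyms collapse onto a single feature before vectorization. A toy illustration in Python; the synonym-to-concept map is a hypothetical stand-in for the Wikipedia-derived one:

      from sklearn.feature_extraction.text import CountVectorizer

      # Hypothetical concept dictionary: several surface forms, one concept id.
      CONCEPTS = {"myocardial": "C_HEART", "cardiac": "C_HEART", "heart": "C_HEART",
                  "tumor": "C_NEOPLASM", "neoplasm": "C_NEOPLASM", "cancer": "C_NEOPLASM"}

      def to_concepts(text):
          return " ".join(CONCEPTS.get(tok, tok) for tok in text.lower().split())

      docs = ["cardiac failure after infarction", "heart disease risk factors",
              "malignant tumor growth", "cancer screening programme"]

      vec = CountVectorizer(lowercase=False)  # keep concept ids uppercase
      X = vec.fit_transform(to_concepts(d) for d in docs)
      print(vec.get_feature_names_out())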

  11. Using complex networks for text classification: Discriminating informative and imaginative documents

    NASA Astrophysics Data System (ADS)

    de Arruda, Henrique F.; Costa, Luciano da F.; Amancio, Diego R.

    2016-01-01

    Statistical methods have been widely employed in recent years to grasp many language properties. The application of such techniques has allowed an improvement of several linguistic applications, such as machine translation and document classification. In the latter, many approaches have emphasised the semantic content of texts, as is the case of bag-of-words language models. These approaches have certainly yielded reasonable performance. However, some potential features such as the structural organization of texts have been used only in a few studies. In this context, we probe how features derived from textual structure analysis can be effectively employed in a classification task. More specifically, we performed a supervised classification aiming at discriminating informative from imaginative documents. Using a networked model that describes the local topological/dynamical properties of function words, we achieved an accuracy rate of up to 95%, which is much higher than similar networked approaches. A systematic analysis of feature relevance revealed that symmetry and accessibility measurements are among the most prominent network measurements. Our results suggest that these measurements could be used in related language applications, as they play a complementary role in characterising texts.
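
    A hedged sketch of the networked representation: build a word co-occurrence graph and read off topological measurements as classification features. Plain degree and clustering (via networkx, assumed installed) are used below; the paper's symmetry and accessibility measurements are more involved:

      import networkx as nx

      def cooccurrence_graph(text, window=2):
          """Link words that appear within `window` positions of each other."""
          tokens = text.lower().split()
          g = nx.Graph()
          for i, w in enumerate(tokens):
              for j in range(i + 1, min(i + window + 1, len(tokens))):
                  g.add_edge(w, tokens[j])
          return g

      g = cooccurrence_graph("the cat sat on the mat and the dog sat on the rug")
      features = {
          "n_nodes": g.number_of_nodes(),
          "avg_degree": sum(d for _, d in g.degree()) / g.number_of_nodes(),
          "avg_clustering": nx.average_clustering(g),
      }
      print(features)  # feature vector for one document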

  12. Building a common pipeline for rule-based document classification.

    PubMed

    Patterson, Olga V; Ginter, Thomas; DuVall, Scott L

    2013-01-01

    Instance-based classification of clinical text is a widely used natural language processing task employed as a step for patient classification, document retrieval, or information extraction. Rule-based approaches rely on concept identification and context analysis in order to determine the appropriate class. We propose a five-step process that enables even small research teams to develop simple but powerful rule-based NLP systems by taking advantage of a common UIMA AS-based pipeline for classification. Our proposed methodology, coupled with the general-purpose solution, provides researchers with access to the data locked in clinical text in cases of limited human resources and tight timelines.

  13. Ensemble methods with simple features for document zone classification

    NASA Astrophysics Data System (ADS)

    Obafemi-Ajayi, Tayo; Agam, Gady; Xie, Bingqing

    2012-01-01

    Document layout analysis is of fundamental importance for document image understanding and information retrieval. It requires the identification of blocks extracted from a document image via feature extraction and block classification. In this paper, we focus on the classification of the extracted blocks into five classes: text (machine printed), handwriting, graphics, images, and noise. We propose a new set of features for efficient classification of these blocks. We present a comparative evaluation of three ensemble-based classification algorithms (boosting, bagging, and combined model trees) in addition to other known learning algorithms. Experimental results are demonstrated for a set of 36,503 zones extracted from 416 document images randomly selected from the tobacco legacy document collection. The results obtained verify the robustness and effectiveness of the proposed set of features in comparison to the commonly used Ocropus recognition features. When used in conjunction with the Ocropus feature set, we further improve the performance of the block classification system to obtain a classification accuracy of 99.21%.

  14. Vietnamese Document Representation and Classification

    NASA Astrophysics Data System (ADS)

    Nguyen, Giang-Son; Gao, Xiaoying; Andreae, Peter

    Vietnamese is very different from English and little research has been done on Vietnamese document classification, or indeed on any kind of Vietnamese language processing, and only a few small corpora are available for research. We created a large Vietnamese text corpus of about 18,000 documents and manually classified them based on different criteria such as topics and styles, giving several classification tasks of different difficulty levels. This paper introduces a new syllable-based document representation at the morphological level of the language for efficient classification. We tested the representation on our corpus with different classification tasks using six classification algorithms and two feature selection techniques. Our experiments show that the new representation is effective for Vietnamese categorization, and suggest that the best performance can be achieved using a syllable-pair document representation, an SVM with a polynomial kernel as the learning algorithm, and information gain with an external dictionary for feature selection.

  15. Investigation into Text Classification With Kernel Based Schemes

    DTIC Science & Technology

    2010-03-01

    ...Document Matrix; TDMs, Term-Document Matrices; TMG, Text to Matrix Generator; TN, True Negative; TP, True Positive; VSM, Vector Space Model...are represented as a term-document matrix, common evaluation metrics, and the software package Text to Matrix Generator (TMG). The classifier...This chapter introduces the indexing capabilities of the Text to Matrix Generator (TMG) Toolbox. Specific attention is placed on the

  16. New Framework for Cross-Domain Document Classification

    DTIC Science & Technology

    2011-03-01

    classification. The following paragraphs will introduce these related works in more detail. Wang et al. attempted to improve the accuracy of text document...of using Wikipedia to develop a thesaurus [20]. Gabrilovich et al. had an approach that is more elaborate in its use of Wikipedia text [21]. The...did show a modest improvement when it is performed using the Wikipedia information. Wang et al. improved on the results of the co-clustering algorithm [24

  17. Research on aviation unsafe incidents classification with improved TF-IDF algorithm

    NASA Astrophysics Data System (ADS)

    Wang, Yanhua; Zhang, Zhiyuan; Huo, Weigang

    2016-05-01

    The text content of Aviation Safety Confidential Reports contains a large amount of valuable information. The term frequency-inverse document frequency (TF-IDF) algorithm is commonly used in text analysis, but it does not take into account the sequential relationships of the words in a text and their role in semantic expression. Targeting the seven category labels of civil aviation unsafe incidents, and aiming to solve these problems of the TF-IDF algorithm, this paper improves TF-IDF based on a co-occurrence network and establishes feature-word extraction and word sequence relations for classified incidents. An aviation-domain lexicon was used to improve the classification accuracy rate. A feature-word network model was designed for multi-document classification of unsafe incidents and used in the experiments. Finally, the classification accuracy of the improved algorithm was verified experimentally.

  18. Relevance popularity: A term event model based feature selection scheme for text classification.

    PubMed

    Feng, Guozhong; An, Baiguo; Yang, Fengqin; Wang, Han; Zhang, Libiao

    2017-01-01

    Feature selection is a practical approach for improving the performance of text classification methods by optimizing the feature subsets input to classifiers. In traditional feature selection methods such as information gain and chi-square, the number of documents that contain a particular term (i.e. the document frequency) is often used. However, the frequency with which a given term appears in each document has not been fully investigated, even though it is a promising feature for producing accurate classifications. In this paper, we propose a new feature selection scheme based on a term-event multinomial naive Bayes probabilistic model. According to the model assumptions, the matching score function, which is based on the prediction probability ratio, can be factorized. Finally, we derive a feature selection measurement for each term after replacing the inner parameters by their estimators. On a benchmark English text dataset (20 Newsgroups) and a Chinese text dataset (MPH-20), our numerical experiment results obtained using two widely used text classifiers (naive Bayes and support vector machine) demonstrate that our method outperformed the representative feature selection methods.
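
    The paper's measurement comes from factorizing a multinomial naive Bayes prediction-probability ratio. As a rough proxy for the underlying quantity, one can fit an MNB model and score each term by the spread of its estimated log P(term | class) across classes; this is a simplified illustration, not the paper's exact derivation:

      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.naive_bayes import MultinomialNB

      texts = ["goal match striker score", "election vote parliament",
               "striker transfer football club", "minister policy parliament vote"]
      labels = ["sport", "politics", "sport", "politics"]

      vec = CountVectorizer()
      X = vec.fit_transform(texts)
      nb = MultinomialNB().fit(X, labels)

      # feature_log_prob_[c, t] = log P(term t | class c); a large spread across
      # classes means the term discriminates between them.
      spread = nb.feature_log_prob_.max(axis=0) - nb.feature_log_prob_.min(axis=0)
      ranked = sorted(zip(vec.get_feature_names_out(), spread), key=lambda p: -p[1])
      print(ranked[:5])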

  19. Detection and Evaluation of Cheating on College Exams Using Supervised Classification

    ERIC Educational Resources Information Center

    Cavalcanti, Elmano Ramalho; Pires, Carlos Eduardo; Cavalcanti, Elmano Pontes; Pires, Vládia Freire

    2012-01-01

    Text mining has been used for various purposes, such as document classification and extraction of domain-specific information from text. In this paper we present a study in which text mining methodology and algorithms were properly employed for academic dishonesty (cheating) detection and evaluation on open-ended college exams, based on document…

  20. Comparative Analysis of Document level Text Classification Algorithms using R

    NASA Astrophysics Data System (ADS)

    Syamala, Maganti; Nalini, N. J., Dr; Maguluri, Lakshamanaphaneendra; Ragupathy, R., Dr.

    2017-08-01

    Over the past few decades, tremendous volumes of data have become available on the Internet, in either structured or unstructured form. With this exponential growth of information on the Internet, there is an urgent need for text classifiers. Text mining is an interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics and computational linguistics. To handle this situation, a wide range of supervised learning algorithms have been introduced. Among these, K-Nearest Neighbor (KNN) is the simplest and one of the most efficient classifiers in the text classification family. But KNN suffers from imbalanced class distributions and noisy term features. To cope with this challenge, we use document-based centroid dimensionality reduction (CentroidDR) implemented in R. By combining these two text classification techniques, KNN and centroid classifiers, we propose a scalable and effective flat classifier, called MCenKNN, which works substantially better than CenKNN.
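
    The hybrid above pairs a centroid classifier with KNN. scikit-learn ships both base pieces, so the ingredients can be compared on toy data; MCenKNN itself, the paper's combination, is not reproduced here:

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.neighbors import KNeighborsClassifier, NearestCentroid

      texts = ["stock market shares fall", "team wins championship final",
               "shares rally on earnings", "coach praises midfield players",
               "bond yields and inflation", "injury rules out goalkeeper"]
      labels = ["finance", "sport", "finance", "sport", "finance", "sport"]

      X = TfidfVectorizer().fit_transform(texts)

      # Centroid classifier vs. plain KNN on the same tf-idf features.
      for model in (NearestCentroid(), KNeighborsClassifier(n_neighbors=3)):
          model.fit(X, labels)
          print(type(model).__name__, model.predict(X[:2]))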

  21. A Feature Mining Based Approach for the Classification of Text Documents into Disjoint Classes.

    ERIC Educational Resources Information Center

    Nieto Sanchez, Salvador; Triantaphyllou, Evangelos; Kraft, Donald

    2002-01-01

    Proposes a new approach for classifying text documents into two disjoint classes. Highlights include a brief overview of document clustering; a data mining approach called the One Clause at a Time (OCAT) algorithm which is based on mathematical logic; vector space model (VSM); and comparing the OCAT to the VSM. (Author/LRW)

  22. Use of Headings and Classifications by Physicians in Medical Narratives of EHRs

    PubMed Central

    Häyrinen, K.; Harno, K.; Nykänen, P.

    2011-01-01

    Objective The purpose of this study was to describe and evaluate patient care documentation by hospital physicians in EHRs, and especially the use of national headings and classifications in this documentation. Material and Methods The initial material consisted of a random sample of 3,481 medical narratives documented in EHRs during the period 2004-2005 in one department of a Finnish central hospital. The final material comprised a subset of 1,974 medical records with a focus on consultation requests and consultation responses by two specialist groups from 871 patients. This electronic documentation was analyzed using deductive content analyses and descriptive statistics. Results The physicians documented patient care in EHRs principally as narrative text. The medical narratives recorded by specialists were structured with headings in less than half of the patient cases. Consultation responses in general were more often structured with headings than consultation requests. The use of classifications was otherwise insignificant, but diagnoses were documented as ICD-10 codes in over 50% of consultation responses by both medical specialties. Conclusion There is an obvious need to improve the structuring of narrative text with national headings and classifications. According to the findings of this study, reason for care, patient history, health status, follow-up care plan and diagnosis are meaningful headings in physicians’ documentation. The existing list of headings needs to be analyzed within a consistent unified terminology system as a basis for further development. Adhering to headings and classifications in EHR documentation enables patient data to be shared and aggregated. The secondary use of data is expected to improve care management and quality of care. PMID:23616866

  23. A Cognitive Computing Approach for Classification of Complaints in the Insurance Industry

    NASA Astrophysics Data System (ADS)

    Forster, J.; Entrup, B.

    2017-10-01

    In this paper we present and evaluate a cognitive computing approach for the classification of dissatisfaction and of four complaint-specific classes in correspondence documents between insurance clients and an insurance company. A cognitive computing approach combines classical natural language processing methods, machine learning algorithms and the evaluation of hypotheses. The approach combines a MaxEnt machine learning algorithm with language modelling, tf-idf and sentiment analytics to create a multi-label text classification model. The model is trained and tested with a set of 2,500 original insurance communication documents written in German, which have been manually annotated by the partnering insurance company. With an F1-score of 0.9, a reliable text classification component has been implemented and evaluated. A final outlook towards a cognitive computing insurance assistant is given at the end.

  24. Language Classification using N-grams Accelerated by FPGA-based Bloom Filters

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jacob, A; Gokhale, M

    N-gram (n-character sequences in text documents) counting is a well-established technique used in classifying the language of text in a document. In this paper, n-gram processing is accelerated through the use of reconfigurable hardware on the XtremeData XD1000 system. Our design employs parallelism at multiple levels, with parallel Bloom filters accessing on-chip RAM, parallel language classifiers, and parallel document processing. In contrast to another hardware implementation (the HAIL algorithm) that uses off-chip SRAM for lookup, our highly scalable implementation uses only on-chip memory blocks. Our implementation of end-to-end language classification runs at 85x the speed of comparable software and 1.45x that of the competing hardware design.
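
    Independent of the FPGA acceleration, the core software idea is testing a document's character n-grams for membership in per-language Bloom filters and classifying by hit count. A small pure-Python sketch; the hash construction and filter sizes are arbitrary choices:

      import hashlib

      class BloomFilter:
          def __init__(self, size=8192, hashes=3):
              self.size, self.hashes, self.bits = size, hashes, bytearray(size)

          def _positions(self, item):
              for i in range(self.hashes):
                  h = hashlib.md5(f"{i}:{item}".encode()).digest()
                  yield int.from_bytes(h[:4], "big") % self.size

          def add(self, item):
              for p in self._positions(item):
                  self.bits[p] = 1

          def __contains__(self, item):
              return all(self.bits[p] for p in self._positions(item))

      def ngrams(text, n=3):
          return [text[i:i + n] for i in range(len(text) - n + 1)]

      # Train one filter per language on sample text; classify by n-gram hit count.
      filters = {}
      for lang, sample in {"en": "the quick brown fox jumps over the lazy dog",
                           "de": "der schnelle braune fuchs springt über den hund"}.items():
          filters[lang] = BloomFilter()
          for g in ngrams(sample):
              filters[lang].add(g)

      query = "the dog jumps"
      print(max(filters, key=lambda lang: sum(g in filters[lang] for g in ngrams(query))))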

  25. 3D model-based documentation with the Tumor Therapy Manager (TTM) improves TNM staging of head and neck tumor patients.

    PubMed

    Pankau, Thomas; Wichmann, Gunnar; Neumuth, Thomas; Preim, Bernhard; Dietz, Andreas; Stumpp, Patrick; Boehm, Andreas

    2015-10-01

    Many treatment approaches are available for head and neck cancer (HNC), leading to challenges for a multidisciplinary medical team in matching each patient with an appropriate regimen. In this effort, primary diagnostics and its reliable documentation are indispensable. A three-dimensional (3D) documentation system was developed and tested to determine its influence on interpretation of these data, especially for TNM classification. A total of 42 HNC patient data sets were available, including primary diagnostics such as panendoscopy, performed and evaluated by an experienced head and neck surgeon. In addition to the conventional panendoscopy form and report, a 3D representation was generated with the "Tumor Therapy Manager" (TTM) software. These cases were randomly re-evaluated by 11 experienced otolaryngologists from five hospitals, half with and half without the TTM data. The accuracy of tumor staging was assessed by pre-post comparison of the TNM classification. TNM staging showed no significant differences in tumor classification (T) with and without 3D from TTM. However, there was a significant decrease in standard deviation from 0.86 to 0.63 via TTM. In nodal staging without TTM, the lymph nodes (N) were significantly underestimated compared with TTM. Likewise, the standard deviation was reduced from 0.79 to 0.69. There was no influence of TTM results on the evaluation of distant metastases (M). TNM staging was more reproducible and nodal staging more accurate when 3D documentation of HNC primary data was available to experienced otolaryngologists. The more precise assessment of the tumor classification with TTM should provide improved decision-making concerning therapy, especially within the interdisciplinary tumor board.

  7. "What is relevant in a text document?": An interpretable machine learning approach

    PubMed Central

    Arras, Leila; Horn, Franziska; Montavon, Grégoire; Müller, Klaus-Robert

    2017-01-01

    Text documents can be described by a number of abstract concepts such as semantic category, writing style, or sentiment. Machine learning (ML) models have been trained to automatically map documents to these abstract concepts, allowing very large text collections to be annotated, more than could be processed by a human in a lifetime. Besides predicting the text’s category very accurately, it is also highly desirable to understand how and why the categorization process takes place. In this paper, we demonstrate that such understanding can be achieved by tracing the classification decision back to individual words using layer-wise relevance propagation (LRP), a recently developed technique for explaining predictions of complex non-linear classifiers. We train two word-based ML models, a convolutional neural network (CNN) and a bag-of-words SVM classifier, on a topic categorization task and adapt the LRP method to decompose the predictions of these models onto words. The resulting scores indicate how much individual words contribute to the overall classification decision. This enables one to distill relevant information from text documents without an explicit semantic information extraction step. We further use the word-wise relevance scores to generate novel vector-based document representations which capture semantic information. Based on these document vectors, we introduce a measure of model explanatory power and show that, although the SVM and CNN models perform similarly in terms of classification accuracy, the latter exhibits a higher level of explainability which makes it more comprehensible for humans and potentially more useful for other applications. PMID:28800619
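
    LRP itself propagates relevance backwards through a deep network, but for the linear bag-of-words SVM also studied in the paper, the decomposition onto words is direct: a word's contribution is its feature value times the learned coefficient. A sketch of that linear special case (toy data, scikit-learn assumed):

      import numpy as np
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.svm import LinearSVC

      texts = ["great acting wonderful plot", "boring script terrible pacing",
               "wonderful visuals great score", "terrible dialogue boring scenes"]
      labels = [1, 0, 1, 0]

      vec = TfidfVectorizer()
      X = vec.fit_transform(texts)
      svm = LinearSVC().fit(X, labels)

      # Per-word relevance for one document: coefficient * tf-idf value.
      doc = vec.transform(["great plot but terrible pacing"])
      relevance = doc.multiply(svm.coef_).toarray()[0]
      words = vec.get_feature_names_out()
      for i in np.argsort(-np.abs(relevance))[:4]:
          if relevance[i] != 0:
              print(words[i], round(relevance[i], 3))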

  27. High-Reproducibility and High-Accuracy Method for Automated Topic Classification

    NASA Astrophysics Data System (ADS)

    Lancichinetti, Andrea; Sirer, M. Irmak; Wang, Jane X.; Acuna, Daniel; Körding, Konrad; Amaral, Luís A. Nunes

    2015-01-01

    Much of human knowledge sits in large databases of unstructured text. Leveraging this knowledge requires algorithms that extract and record metadata on unstructured text documents. Assigning topics to documents will enable intelligent searching, statistical characterization, and meaningful classification. Latent Dirichlet allocation (LDA) is the state of the art in topic modeling. Here, we perform a systematic theoretical and numerical analysis that demonstrates that current optimization techniques for LDA often yield results that are not accurate in inferring the most suitable model parameters. Adapting approaches from community detection in networks, we propose a new algorithm that displays high reproducibility and high accuracy and also has high computational efficiency. We apply it to a large set of documents in the English Wikipedia and reveal its hierarchical structure.
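
    For orientation, fitting the standard LDA baseline that the paper improves upon takes only a few lines with scikit-learn; the authors' network-based inference algorithm is not available in standard libraries:

      from sklearn.decomposition import LatentDirichletAllocation
      from sklearn.feature_extraction.text import CountVectorizer

      docs = ["genome sequencing dna analysis", "stocks bonds market trading",
              "dna mutation gene expression", "market volatility trading risk"]

      vec = CountVectorizer()
      X = vec.fit_transform(docs)

      lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
      terms = vec.get_feature_names_out()
      for k, topic in enumerate(lda.components_):
          top = topic.argsort()[::-1][:3]       # highest-weight terms per topic
          print(f"topic {k}:", [terms[i] for i in top])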

  28. Text Mining in Biomedical Domain with Emphasis on Document Clustering.

    PubMed

    Renganathan, Vinaitheerthan

    2017-07-01

    With the exponential increase in the number of articles published every year in the biomedical domain, there is a need to build automated systems to extract unknown information from the articles published. Text mining techniques enable the extraction of unknown knowledge from unstructured documents. This paper reviews text mining processes in detail and the software tools available to carry out text mining. It also reviews the roles and applications of text mining in the biomedical domain. Text mining processes, such as search and retrieval of documents, pre-processing of documents, natural language processing, methods for text clustering, and methods for text classification are described in detail. Text mining techniques can facilitate the mining of vast amounts of knowledge on a given topic from published biomedical research articles and draw meaningful conclusions that are not possible otherwise.

  29. Mining protein function from text using term-based support vector machines

    PubMed Central

    Rice, Simon B; Nenadic, Goran; Stapley, Benjamin J

    2005-01-01

    Background Text mining has spurred huge interest in the domain of biology. The goal of the BioCreAtIvE exercise was to evaluate the performance of current text mining systems. We participated in Task 2, which addressed assigning Gene Ontology terms to human proteins and selecting relevant evidence from full-text documents. We approached it as a modified form of the document classification task. We used a supervised machine-learning approach (based on support vector machines) to assign protein function and select passages that support the assignments. As classification features, we used a protein's co-occurring terms that were automatically extracted from documents. Results The results evaluated by curators were modest, and quite variable for different problems: in many cases we have relatively good assignment of GO terms to proteins, but the selected supporting text was typically non-relevant (precision spanning from 3% to 50%). The method appears to work best when a substantial set of relevant documents is obtained, while it works poorly on single documents and/or short passages. The initial results suggest that our approach can also mine annotations from text even when an explicit statement relating a protein to a GO term is absent. Conclusion A machine learning approach to mining protein function predictions from text can yield good performance only if sufficient training data is available, and significant amount of supporting data is used for prediction. The most promising results are for combined document retrieval and GO term assignment, which calls for the integration of methods developed in BioCreAtIvE Task 1 and Task 2. PMID:15960835

  30. Improving Classification of Protein Interaction Articles Using Context Similarity-Based Feature Selection.

    PubMed

    Chen, Yifei; Sun, Yuxing; Han, Bing-Qing

    2015-01-01

    Protein interaction article classification is a text classification task in the biological domain to determine which articles describe protein-protein interactions. Since the feature space in text classification is high-dimensional, feature selection is widely used for reducing the dimensionality of features to speed up computation without sacrificing classification performance. Many existing feature selection methods are based on the statistical measures of document frequency and term frequency. One potential drawback of these methods is that they treat features separately. Hence, we first design a similarity measure over context information that takes word co-occurrences and phrase chunks around the features into account. Then we introduce this context similarity into the importance measure of the features, substituting for document and term frequency, and thereby propose new context similarity-based feature selection methods. Their performance is evaluated on two protein interaction article collections and compared against the frequency-based methods. The experimental results reveal that the context similarity-based methods perform better in terms of the F1 measure and the dimension reduction rate. Benefiting from the context information surrounding the features, the proposed methods can select distinctive features effectively for protein interaction article classification.

  31. Information Gain Based Dimensionality Selection for Classifying Text Documents

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Dumidu Wijayasekara; Milos Manic; Miles McQueen

    2013-06-01

    Selecting the optimal dimensions for various knowledge extraction applications is an essential component of data mining. Dimensionality selection techniques are utilized in classification applications to increase the classification accuracy and reduce the computational complexity. In text classification, where the dimensionality of the dataset is extremely high, dimensionality selection is even more important. This paper presents a novel genetic algorithm based methodology for dimensionality selection in text mining applications that utilizes information gain. The presented methodology uses the information gain of each dimension to change the mutation probability of chromosomes dynamically. Since the information gain is calculated a priori, the computational complexity is not affected. The presented method was tested on a specific text classification problem and compared with conventional genetic algorithm based dimensionality selection. The results show an improvement of 3% in the true positives and 1.6% in the true negatives over conventional dimensionality selection methods.

  32. Classification of document page images based on visual similarity of layout structures

    NASA Astrophysics Data System (ADS)

    Shin, Christian K.; Doermann, David S.

    1999-12-01

    Searching for documents by their type or genre is a natural way to enhance the effectiveness of document retrieval. The layout of a document contains a significant amount of information that can be used to classify a document's type in the absence of domain-specific models. A document type or genre can be defined by the user based primarily on layout structure. Our classification approach is based on 'visual similarity' of the layout structure, building a supervised classifier given examples of the class. We use image features such as the percentages of text and non-text (graphics, image, table, and ruling) content regions, column structures, variations in the point size of fonts, the density of content area, and various statistics on features of connected components, which can be derived from class samples without class knowledge. In order to obtain class labels for training samples, we conducted a user relevance test where subjects ranked UW-I document images with respect to the 12 representative images. We implemented our classification scheme using OC1, a decision tree classifier, and report our findings.

  33. Document-Level Classification of CT Pulmonary Angiography Reports based on an Extension of the ConText Algorithm

    PubMed Central

    Chapman, Brian E.; Lee, Sean; Kang, Hyunseok Peter; Chapman, Wendy W.

    2011-01-01

    In this paper we describe an application called peFinder for document-level classification of CT pulmonary angiography reports. peFinder is based on a generalized version of the ConText algorithm, a simple text processing algorithm for identifying features in clinical report documents. peFinder was used to answer questions about the disease state (pulmonary emboli present or absent), the certainty state of the diagnosis (uncertainty present or absent), the temporal state of an identified pulmonary embolus (acute or chronic), and the technical quality state of the exam (diagnostic or not diagnostic). Gold standard answers for each question were determined from the consensus classifications of three human annotators. peFinder results were compared to naive Bayes’ classifiers using unigrams and bigrams. The sensitivities (and positive predictive values) for peFinder were 0.98(0.83), 0.86(0.96), 0.94(0.93), and 0.60(0.90) for disease state, quality state, certainty state, and temporal state respectively, compared to 0.68(0.77), 0.67(0.87), 0.62(0.82), and 0.04(0.25) for the naive Bayes’ classifier using unigrams, and 0.75(0.79), 0.52(0.69), 0.59(0.84), and 0.04(0.25) for the naive Bayes’ classifier using bigrams. PMID:21459155
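
    ConText-style processing assigns a finding's status from trigger terms in a window around its mention. A greatly simplified sketch of that idea; the trigger lists and window size are invented for illustration, and the real algorithm also handles scope termination and bidirectional triggers:

      NEGATION = {"no", "without", "absent"}
      UNCERTAINTY = {"possible", "equivocal", "cannot"}

      def finding_status(report, finding="embolus", window=5):
          tokens = report.lower().replace(".", " ").split()
          if finding not in tokens:
              return "not mentioned"
          i = tokens.index(finding)
          context = tokens[max(0, i - window):i]  # words preceding the finding
          if NEGATION & set(context):
              return "absent"
          if UNCERTAINTY & set(context):
              return "uncertain"
          return "present"

      print(finding_status("No evidence of pulmonary embolus."))       # absent
      print(finding_status("Filling defect consistent with embolus.")) # present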

  34. Machine printed text and handwriting identification in noisy document images.

    PubMed

    Zheng, Yefeng; Li, Huiping; Doermann, David

    2004-03-01

    In this paper, we address the problem of the identification of text in noisy document images. We are especially focused on segmenting and distinguishing between handwriting and machine printed text because: 1) handwriting in a document often indicates corrections, additions, or other supplemental information that should be treated differently from the main content and 2) the segmentation and recognition techniques required for machine printed and handwritten text are significantly different. A novel aspect of our approach is that we treat noise as a separate class and model noise based on selected features. Trained Fisher classifiers are used to identify machine printed text and handwriting from noise and we further exploit context to refine the classification. A Markov Random Field-based (MRF) approach is used to model the geometrical structure of the printed text, handwriting, and noise to rectify misclassifications. Experimental results show that our approach is robust and can significantly improve page segmentation in noisy document collections.

  35. Semantic Classification of Diseases in Discharge Summaries Using a Context-aware Rule-based Classifier

    PubMed Central

    Solt, Illés; Tikk, Domonkos; Gál, Viktor; Kardkovács, Zsolt T.

    2009-01-01

    Objective Automated and disease-specific classification of textual clinical discharge summaries is of great importance in human life science, as it helps physicians to make medical studies by providing statistically relevant data for analysis. This can be further facilitated if, at the labeling of discharge summaries, semantic labels are also extracted from text, such as whether a given disease is present, absent, questionable in a patient, or is unmentioned in the document. The authors present a classification technique that successfully solves the semantic classification task. Design The authors introduce a context-aware rule-based semantic classification technique for use on clinical discharge summaries. The classification is performed in subsequent steps. First, some misleading parts are removed from the text; then the text is partitioned into positive, negative, and uncertain context segments, and a sequence of binary classifiers is applied to assign the appropriate semantic labels. Measurement For evaluation the authors used the documents of the i2b2 Obesity Challenge and adopted its evaluation measures, F1-macro and F1-micro. Results On the two subtasks of the Obesity Challenge (textual and intuitive classification) the system performed very well, achieving an F1-macro of 0.80 for the textual and 0.67 for the intuitive task, and obtained second place on the textual and first place on the intuitive subtask of the challenge. Conclusions The authors show in the paper that a simple rule-based classifier can tackle the semantic classification task more successfully than machine learning techniques, if the training data are limited and some semantic labels are very sparse. PMID:19390101

  36. Comparisons and Selections of Features and Classifiers for Short Text Classification

    NASA Astrophysics Data System (ADS)

    Wang, Ye; Zhou, Zhi; Jin, Shan; Liu, Debin; Lu, Mi

    2017-10-01

    Short text is considerably different from traditional long text documents due to its shortness and conciseness, which hinders the application of conventional machine learning and data mining algorithms to short text classification. Following traditional artificial intelligence methods, we divide short text classification into three steps, namely preprocessing, feature selection and classifier comparison. In this paper, we illustrate step by step how we approach our goals. Specifically, in feature selection, we compared the performance and robustness of the four methods of one-hot encoding, tf-idf weighting, word2vec and paragraph2vec, and in the classification part, we deliberately chose and compared Naive Bayes, Logistic Regression, Support Vector Machine, K-nearest Neighbor and Decision Tree as our classifiers. Then, we compared and analysed the classifiers horizontally with each other and vertically with the feature selections. Regarding the datasets, we crawled more than 400,000 short text files from the Shanghai and Shenzhen Stock Exchanges and manually labeled them with two label sets, big and small: there are eight labels in the big set, and 59 labels in the small set.
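
    The classifier-comparison step described above reduces to a loop over models sharing one feature representation. A compact sketch with scikit-learn; toy English headlines stand in for the Chinese stock-exchange corpus:

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score
      from sklearn.naive_bayes import MultinomialNB
      from sklearn.neighbors import KNeighborsClassifier
      from sklearn.pipeline import make_pipeline
      from sklearn.svm import LinearSVC
      from sklearn.tree import DecisionTreeClassifier

      texts = ["profit warning issued", "dividend raised again", "ceo resigns suddenly",
               "record quarterly earnings", "shares suspended pending news",
               "strong revenue growth reported"]
      labels = ["bad", "good", "bad", "good", "bad", "good"]

      # Same tf-idf features for every classifier under comparison.
      for clf in (MultinomialNB(), LogisticRegression(max_iter=1000),
                  LinearSVC(), KNeighborsClassifier(n_neighbors=3),
                  DecisionTreeClassifier()):
          pipe = make_pipeline(TfidfVectorizer(), clf)
          acc = cross_val_score(pipe, texts, labels, cv=2).mean()
          print(f"{type(clf).__name__:24s} {acc:.2f}")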

  37. Probabilistic topic modeling for the analysis and classification of genomic sequences

    PubMed Central

    2015-01-01

    Background Studies on genomic sequences for classification and taxonomic identification have a leading role in the biomedical field and in the analysis of biodiversity. These studies focus on the so-called barcode genes, representing a well-defined region of the whole genome. Recently, alignment-free techniques have been gaining importance because they are able to overcome the drawbacks of sequence alignment techniques. In this paper a new alignment-free method for DNA sequence clustering and classification is proposed. The method is based on k-mer representations and text mining techniques. Methods The presented method is based on probabilistic topic modeling, a statistical technique originally proposed for text documents. Probabilistic topic models are able to find in a document corpus the topics (recurrent themes) characterizing classes of documents. This technique, applied to DNA sequences representing the documents, exploits the frequency of fixed-length k-mers and builds a generative model for a training group of sequences. This generative model, obtained through the Latent Dirichlet Allocation (LDA) algorithm, is then used to classify a large set of genomic sequences. Results and conclusions We performed classification of over 7,000 16S DNA barcode sequences taken from the Ribosomal Database Project (RDP) repository, training probabilistic topic models. The proposed method is compared to the RDP tool and the Support Vector Machine (SVM) classification algorithm in an extensive set of trials using both complete sequences and short sequence snippets (from 400 bp to 25 bp). Our method reaches very similar results to the RDP classifier and SVM for complete sequences. The most interesting results are obtained when short sequence snippets are considered. Under these conditions the proposed method outperforms RDP and SVM with ultra-short sequences and exhibits a smooth decrease of performance, at every taxonomic level, when the sequence length is decreased. PMID:25916734

  38. Sentiment analysis of feature ranking methods for classification accuracy

    NASA Astrophysics Data System (ADS)

    Joseph, Shashank; Mugauri, Calvin; Sumathy, S.

    2017-11-01

    Text pre-processing and feature selection are important and critical steps in text mining. Text pre-processing of large volumes of data is a difficult task, as unstructured raw data are converted into a structured format. Traditional methods of processing and weighting took much time and were less accurate. To overcome this challenge, feature ranking techniques have been devised. A feature set from text pre-processing is fed as input for feature selection. Feature selection helps improve text classification accuracy. Of the three feature selection categories available, the filter category is the focus here. Five feature ranking methods, namely document frequency, standard deviation, information gain, chi-square, and weighted log-likelihood ratio, are analyzed.

  39. Ensemble of classifiers for ontology enrichment

    NASA Astrophysics Data System (ADS)

    Semenova, A. V.; Kureichik, V. M.

    2018-05-01

    A classifier is the basis of ontology learning systems. Classification of text documents is used in many applications, such as information retrieval, information extraction, and spam detection. A new ensemble of classifiers based on SVM (support vector machines), LSTM (a recurrent neural network) and word embeddings is suggested. An experiment was conducted on open data, which allows us to conclude that the proposed classification method is promising. The implementation of the proposed classifier was performed in MATLAB using functions from the Text Analytics Toolbox. The principal advantage of the proposed ensemble of classifiers is the high quality of data classification at acceptable time cost.

  2. Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD).

    PubMed

    Jiang, Xiangying; Ringwald, Martin; Blake, Judith; Shatkay, Hagit

    2017-01-01

    The Gene Expression Database (GXD) is a comprehensive online database within the Mouse Genome Informatics resource, aiming to provide available information about endogenous gene expression during mouse development. The information stems primarily from many thousands of biomedical publications that database curators must go through and read. Given the very large number of biomedical papers published each year, automatic document classification plays an important role in biomedical research. Specifically, an effective and efficient document classifier is needed to support the GXD annotation workflow. We present here an effective yet relatively simple classification scheme, which uses readily available tools while employing feature selection, aiming to assist curators in identifying publications relevant to GXD. We examine the performance of our method over a large manually curated dataset consisting of more than 25 000 PubMed abstracts, of which about half are curated as relevant to GXD and the other half as irrelevant. In addition to text from title-and-abstract, we also consider image captions, an important information source that we integrate into our method. We apply a captions-based classifier to a subset of about 3300 documents for which the full text of the curated articles is available. The results demonstrate that our proposed approach is robust and effectively addresses the GXD document classification task. Moreover, using information obtained from image captions clearly improves performance compared to title and abstract alone, affirming the utility of image captions as a substantial evidence source for automatically determining the relevance of biomedical publications to a specific subject area. www.informatics.jax.org. © The Author(s) 2017. Published by Oxford University Press.

  3. Document-level classification of CT pulmonary angiography reports based on an extension of the ConText algorithm.

    PubMed

    Chapman, Brian E; Lee, Sean; Kang, Hyunseok Peter; Chapman, Wendy W

    2011-10-01

    In this paper we describe an application called peFinder for document-level classification of CT pulmonary angiography reports. peFinder is based on a generalized version of the ConText algorithm, a simple text processing algorithm for identifying features in clinical report documents. peFinder was used to answer questions about the disease state (pulmonary emboli present or absent), the certainty state of the diagnosis (uncertainty present or absent), the temporal state of an identified pulmonary embolus (acute or chronic), and the technical quality state of the exam (diagnostic or not diagnostic). Gold standard answers for each question were determined from the consensus classifications of three human annotators. peFinder results were compared to naive Bayes classifiers using unigrams and bigrams. The sensitivities (and positive predictive values) for peFinder were 0.98(0.83), 0.86(0.96), 0.94(0.93), and 0.60(0.90) for disease state, quality state, certainty state, and temporal state respectively, compared to 0.68(0.77), 0.67(0.87), 0.62(0.82), and 0.04(0.25) for the naive Bayes classifier using unigrams, and 0.75(0.79), 0.52(0.69), 0.59(0.84), and 0.04(0.25) for the naive Bayes classifier using bigrams. Copyright © 2011 Elsevier Inc. All rights reserved.
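
    The ConText family of algorithms marks findings with contextual modifiers (negation, uncertainty, temporality) based on trigger terms near a target concept. The following is a minimal sketch of that idea with invented trigger lexicons and a fixed look-back window; it is not the published ConText lexicon or peFinder itself.

        # Sketch: label a finding as negated/uncertain from nearby trigger terms.
        NEGATION = ("no evidence of", "without", "negative for")
        UNCERTAINTY = ("cannot exclude", "possible", "equivocal for")

        def annotate(report, finding="pulmonary embolus", window=40):
            # Look back a fixed character window before the finding for triggers.
            text = report.lower()
            hit = text.find(finding)
            if hit < 0:
                return None
            context = text[max(0, hit - window):hit]
            return {"negated": any(t in context for t in NEGATION),
                    "uncertain": any(t in context for t in UNCERTAINTY)}

        print(annotate("CT angiogram shows no evidence of pulmonary embolus."))
        # -> {'negated': True, 'uncertain': False}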

  4. Model-based document categorization employing semantic pattern analysis and local structure clustering

    NASA Astrophysics Data System (ADS)

    Fume, Kosei; Ishitani, Yasuto

    2008-01-01

    We propose a document categorization method based on a document model that can be defined externally for each task and that categorizes Web content or business documents into a target category in accordance with their similarity to the model. The main feature of the proposed method is two-fold semantics extraction from an input document: the semantics of terms are extracted by semantic pattern analysis, and implicit meanings of document substructure are identified by a bottom-up text clustering technique focusing on the similarity of text line attributes. We have constructed a system based on the proposed method for trial purposes. The experimental results show that the system achieves more than 80% classification accuracy in categorizing Web content and business documents into 15 or 70 categories.

  5. Detecting experimental techniques and selecting relevant documents for protein-protein interactions from biomedical literature.

    PubMed

    Wang, Xinglong; Rak, Rafal; Restificar, Angelo; Nobata, Chikashi; Rupp, C J; Batista-Navarro, Riza Theresa B; Nawaz, Raheel; Ananiadou, Sophia

    2011-10-03

    The selection of relevant articles for curation, and the linking of those articles to the experimental techniques confirming the findings, became one of the primary subjects of the recent BioCreative III contest. The contest's Protein-Protein Interaction (PPI) task consisted of two sub-tasks: Article Classification Task (ACT) and Interaction Method Task (IMT). ACT aimed to automatically select relevant documents for PPI curation, whereas the goal of IMT was to recognise the methods used in experiments for identifying the interactions in full-text articles. We proposed and compared several classification-based methods for both tasks, employing rich contextual features as well as features extracted from external knowledge sources. For IMT, a new method that classifies pair-wise relations between every text phrase and candidate interaction method obtained promising results with an F1 score of 64.49%, as tested on the task's development dataset. We also explored ways to combine this new approach and more conventional, multi-label document classification methods. For ACT, our classifiers exploited automatically detected named entities and other linguistic information. The evaluation results on the BioCreative III PPI test datasets showed that our systems were very competitive: one of our IMT methods yielded the best performance among all participants, as measured by F1 score, Matthews correlation coefficient and AUC iP/R; whereas for ACT, our best classifier was ranked second as measured by AUC iP/R, and also competitive according to other metrics. Our novel approach that converts the multi-class, multi-label classification problem to a binary classification problem showed much promise in IMT. Nevertheless, on the test dataset the best performance was achieved by taking the union of the output of this method and that of a multi-class, multi-label document classifier, which indicates that the two types of systems complement each other in terms of recall. For ACT, our system exploited a rich set of features and also obtained encouraging results. We examined the features with respect to their contributions to the classification results, and concluded that contextual words surrounding named entities, as well as the MeSH headings associated with the documents, were among the main contributors to the performance.

  6. Towards Automatic Classification of Wikipedia Content

    NASA Astrophysics Data System (ADS)

    Szymański, Julian

    Wikipedia - the Free Encyclopedia - encounters the problem of properly classifying new articles every day. The process of assigning articles to categories is performed manually, and it is a time-consuming task. It requires knowledge about Wikipedia's structure that is beyond typical editor competence, which leads to human-caused mistakes such as omitted or incorrect assignments of articles to categories. The article presents the application of an SVM classifier for automatic classification of documents from The Free Encyclopedia. The classifier was tested using two text representations: inter-document connections (hyperlinks) and word content. The results of the experiments, evaluated on hand-crafted data, show that the Wikipedia classification process can be partially automated. The proposed approach can be used for building a decision support system which suggests to editors the categories that best fit new content entered into Wikipedia.

  7. Automated compound classification using a chemical ontology.

    PubMed

    Bobach, Claudia; Böhme, Timo; Laube, Ulf; Püschel, Anett; Weber, Lutz

    2012-12-29

    Classification of chemical compounds into compound classes by using structure-derived descriptors is a well-established method to aid the evaluation and abstraction of compound properties in chemical compound databases. MeSH and, more recently, ChEBI are examples of chemical ontologies that provide a hierarchical classification of compounds into general compound classes of biological interest based on their structural as well as property or use features. In these ontologies, compounds have been assigned manually to their respective classes. However, with the ever increasing possibilities to extract new compounds from text documents using name-to-structure tools, and considering the large number of compounds deposited in databases, automated and comprehensive chemical classification methods are needed to avoid the error-prone and time-consuming manual classification of compounds. In the present work we implement principles and methods to construct a chemical ontology of classes that shall support automated, high-quality compound classification in chemical databases or text documents. While SMARTS expressions have already been used to define chemical structure class concepts, in the present work we have extended the expressive power of such class definitions by expanding their structure-based reasoning logic. Thus, to achieve the required precision and granularity of chemical class definitions, sets of SMARTS class definitions are connected by OR and NOT logical operators. In addition, AND logic has been implemented to allow the concomitant use of flexible atom lists and stereochemistry definitions. The resulting chemical ontology is a multi-hierarchical taxonomy of concept nodes connected by directed, transitive relationships. A proposal for a rule-based definition of chemical classes has been made that allows chemical compound classes to be defined more precisely than before. The proposed structure-based reasoning logic allows chemistry expert knowledge to be translated into a computer-interpretable form, preventing erroneous compound assignments and allowing automatic compound classification. The automated assignment of compounds in databases, compound structure files or text documents to their related ontology classes is possible through integration with a chemical structure search engine. As an application example, the annotation of chemical structure files with a prototypic ontology is demonstrated.
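
    To illustrate the logical combination of SMARTS class definitions described above, here is a hedged sketch using the open-source RDKit toolkit rather than the authors' system; the "primary aliphatic alcohol, excluding carboxylic acids" class and its patterns are invented for the example.

        # Sketch: a compound class as (OR over SMARTS) AND NOT (exclusion SMARTS).
        from rdkit import Chem

        POSITIVE = [Chem.MolFromSmarts("[CX4][OX2H]")]   # aliphatic C-OH
        NEGATIVE = [Chem.MolFromSmarts("C(=O)[OX2H]")]   # carboxylic acid

        def in_class(smiles):
            mol = Chem.MolFromSmiles(smiles)
            if mol is None:
                return False
            pos = any(mol.HasSubstructMatch(p) for p in POSITIVE)  # OR logic
            neg = any(mol.HasSubstructMatch(n) for n in NEGATIVE)  # NOT logic
            return pos and not neg

        print(in_class("CCO"))      # ethanol -> True
        print(in_class("CC(=O)O"))  # acetic acid -> False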

  8. Automated compound classification using a chemical ontology

    PubMed Central

    2012-01-01

    Background Classification of chemical compounds into compound classes by using structure-derived descriptors is a well-established method to aid the evaluation and abstraction of compound properties in chemical compound databases. MeSH and, more recently, ChEBI are examples of chemical ontologies that provide a hierarchical classification of compounds into general compound classes of biological interest based on their structural as well as property or use features. In these ontologies, compounds have been assigned manually to their respective classes. However, with the ever increasing possibilities to extract new compounds from text documents using name-to-structure tools, and considering the large number of compounds deposited in databases, automated and comprehensive chemical classification methods are needed to avoid the error-prone and time-consuming manual classification of compounds. Results In the present work we implement principles and methods to construct a chemical ontology of classes that shall support automated, high-quality compound classification in chemical databases or text documents. While SMARTS expressions have already been used to define chemical structure class concepts, in the present work we have extended the expressive power of such class definitions by expanding their structure-based reasoning logic. Thus, to achieve the required precision and granularity of chemical class definitions, sets of SMARTS class definitions are connected by OR and NOT logical operators. In addition, AND logic has been implemented to allow the concomitant use of flexible atom lists and stereochemistry definitions. The resulting chemical ontology is a multi-hierarchical taxonomy of concept nodes connected by directed, transitive relationships. Conclusions A proposal for a rule-based definition of chemical classes has been made that allows chemical compound classes to be defined more precisely than before. The proposed structure-based reasoning logic allows chemistry expert knowledge to be translated into a computer-interpretable form, preventing erroneous compound assignments and allowing automatic compound classification. The automated assignment of compounds in databases, compound structure files or text documents to their related ontology classes is possible through integration with a chemical structure search engine. As an application example, the annotation of chemical structure files with a prototypic ontology is demonstrated. PMID:23273256

  9. Semantic classification of business images

    NASA Astrophysics Data System (ADS)

    Erol, Berna; Hull, Jonathan J.

    2006-01-01

    Digital cameras are becoming increasingly common for capturing information in business settings. In this paper, we describe a novel method for classifying images into the following semantic classes: document, whiteboard, business card, slide, and regular images. Our method is based on combining low-level image features, such as text color, layout, and handwriting features, with high-level OCR output analysis. Several support vector machine classifiers are combined for multi-class classification of input images. The system yields 95% accuracy in classification.

  10. Global and Local Features Based Classification for Bleed-Through Removal

    NASA Astrophysics Data System (ADS)

    Hu, Xiangyu; Lin, Hui; Li, Shutao; Sun, Bin

    2016-12-01

    The text on one side of a historical document often seeps through and appears on the other side, making bleed-through a common problem in historical document images. It makes the document images hard to read and the text difficult to recognize. To improve image quality and readability, the bleed-through has to be removed. This paper proposes a bleed-through removal method based on the extraction of global and local features. A Gaussian mixture model is used to obtain the global features of the images, while local features are extracted from the patch around each pixel. An extreme learning machine classifier is then used to classify the scanned images into the foreground text and the bleed-through component. Experimental results on real document image datasets show that the proposed method outperforms state-of-the-art bleed-through removal methods and preserves the text strokes well.

  11. Text Classification for Intelligent Portfolio Management

    DTIC Science & Technology

    2002-05-01

    years including nearest neighbor classification [15], naive Bayes with EM (Expectation Maximization) [11] [13], Winnow with active learning [10... Active Learning and Expectation Maximization (EM). In particular, active learning is used to actively select documents for labeling, then EM assigns...generalization with active learning. Machine Learning, 15(2):201–221, 1994. [3] I. Dagan and P. Engelson. Committee-based sampling for training

  12. Natural Language Processing Based Instrument for Classification of Free Text Medical Records

    PubMed Central

    2016-01-01

    According to the Ministry of Labor, Health and Social Affairs of Georgia, a new health management system is to be introduced in the near future. In this context arises the problem of structuring and classifying documents containing the full history of medical services provided. The present work introduces an instrument for the classification of medical records in the Georgian language. It is the first attempt at such classification of Georgian-language medical records. In total, 24,855 examination records were studied. The documents were classified into three main groups (ultrasonography, endoscopy, and X-ray) and 13 subgroups using two well-known methods: Support Vector Machine (SVM) and K-Nearest Neighbor (KNN). The results obtained demonstrated that both machine learning methods performed successfully, with SVM performing slightly better. In the process of classification a “shrink” method, based on feature selection, was introduced and applied. At the first stage of classification the results of the “shrink” case were better; however, at the second stage of classification into subclasses, 23% of all documents could not be linked to a single definite subclass (liver or biliary system) due to common features characterizing these subclasses. The overall results of the study were successful. PMID:27668260

  13. Automated Classification of Selected Data Elements from Free-text Diagnostic Reports for Clinical Research.

    PubMed

    Löpprich, Martin; Krauss, Felix; Ganzinger, Matthias; Senghas, Karsten; Riezler, Stefan; Knaup, Petra

    2016-08-05

    In the Multiple Myeloma clinical registry at Heidelberg University Hospital, most data are extracted from discharge letters. Our aim was to analyze whether it is possible to make the manual documentation process more efficient by using natural language processing methods for multiclass classification of free-text diagnostic reports, in order to automatically document the diagnosis and state of disease of myeloma patients. The first objective was to create a corpus consisting of free-text diagnosis paragraphs of patients with multiple myeloma from German diagnostic reports, together with its manual annotation of relevant data elements by documentation specialists. The second objective was to construct and evaluate a framework using different NLP methods to enable automatic multiclass classification of relevant data elements from free-text diagnostic reports. The main diagnoses paragraph was extracted from the clinical reports of one third of the patients, randomly selected from the multiple myeloma research database of Heidelberg University Hospital (737 patients in total). An electronic data capture (EDC) system was set up, and two data entry specialists independently performed manual documentation of at least nine specific data elements for multiple myeloma characterization. Both data entries were compared and assessed by a third specialist, and an annotated text corpus was created. A framework was constructed, consisting of a self-developed package to split multiple diagnosis sequences into several subsequences, four different preprocessing steps to normalize the input data, and two classifiers: a maximum entropy classifier (MEC) and a support vector machine (SVM). In total, 15 different pipelines were examined and assessed by ten-fold cross-validation, repeated 100 times. As quality indicators, the average error rate and the average F1-score were computed. For significance testing, the approximate randomization test was used. The created annotated corpus consists of 737 diagnosis paragraphs with a total of 865 coded diagnoses. The dataset is publicly available in the supplementary online files for training and testing of further NLP methods. Both classifiers showed low average error rates (MEC: 1.05; SVM: 0.84) and high F1-scores (MEC: 0.89; SVM: 0.92). However, the results varied widely depending on the classified data element. Preprocessing methods increased this effect and had a significant impact on the classification, both positive and negative. The automatic diagnosis splitter increased the average error rate significantly, even though the F1-score decreased only slightly. The low average error rates and high average F1-scores of each pipeline demonstrate the suitability of the investigated NLP methods. However, it was also shown that there is no single best practice for the automatic classification of data elements from free-text diagnostic reports.

  14. A survey on the geographic scope of textual documents

    NASA Astrophysics Data System (ADS)

    Monteiro, Bruno R.; Davis, Clodoveu A.; Fonseca, Fred

    2016-11-01

    Recognizing references to places in texts is needed in many applications, such as search engines, location-based social media and document classification. In this paper we present a survey of methods and techniques for the recognition and identification of places referenced in texts. We discuss concepts and terminology, and propose a classification of the solutions given in the literature. We introduce a definition of the Geographic Scope Resolution (GSR) problem, dividing it into three steps: geoparsing, reference resolution, and grounding references. Solutions to the first two steps are organized according to the method used, and solutions to the third step are organized according to the type of output produced. We found that it is difficult to compare existing solutions directly to one another, because they often create their own benchmarking data, targeted to their own problem.

  15. An implementation of support vector machine on sentiment classification of movie reviews

    NASA Astrophysics Data System (ADS)

    Yulietha, I. M.; Faraby, S. A.; Adiwijaya; Widyaningtyas, W. C.

    2018-03-01

    With technological advances, a wealth of information about movies is available on the internet; if processed properly, this information can yield valuable insight. This research proposes to classify the sentiment of movie review documents. It uses the Support Vector Machine (SVM) method because SVM can handle the high-dimensional data that arises when working with text. Support Vector Machine is a popular machine learning technique for text classification because it can learn from a collection of previously classified documents and provides good results. Among the tested train-test compositions, the 90-10 split gave the best result, 85.6%. Among the SVM kernels, the linear kernel with constant C = 1 gave the best result, 84.9%.
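
    A minimal sketch of the reported setup (tf-idf features, a 90-10 train-test split, and a linear-kernel SVM with C = 1) using scikit-learn; the reviews and labels are toy assumptions, and "constant 1" is read here as the SVM regularization parameter C.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.model_selection import train_test_split
        from sklearn.svm import SVC
        from sklearn.metrics import accuracy_score

        reviews = ["great acting and plot", "boring and far too long",
                   "a moving masterpiece", "a waste of time"]
        labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

        X = TfidfVectorizer().fit_transform(reviews)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, labels, test_size=0.1, random_state=0)  # 90-10 composition
        clf = SVC(kernel="linear", C=1).fit(X_tr, y_tr)
        print(accuracy_score(y_te, clf.predict(X_te)))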

  16. [The electronic use of the NANDA-, NOC- and NIC- classifications and implications for nursing practice].

    PubMed

    Bernhart-Just, Alexandra; Hillewerth, Kathrin; Holzer-Pruss, Christina; Paprotny, Monika; Zimmermann Heinrich, Heidi

    2009-12-01

    The data model developed on behalf of the Nursing Service Commission of the Canton of Zurich (Pflegedienstkommission des Kantons Zürich) is based on the NANDA nursing diagnoses, the Nursing Outcome Classification, and the Nursing Intervention Classification (NNN Classifications). It also includes integrated functions for cost-centered accounting, service recording, and the Swiss Nursing Minimum Data Set. The data model uses the NNN classifications to map a possible form of the nursing process in the electronic patient health record, where the nurse can choose nursing diagnoses, outcomes, and interventions relevant to the patient situation. The nurses' choice is guided both by the different classifications and their linkages, and by the use of specific text components pre-defined for each classification and accessible through the respective linkages. This article describes the developed data model and illustrates its clinical application in a specific patient's situation. Preparatory work required for the implementation of NNN classifications in practical nursing, such as content filtering and the creation of linkages between the NNN classifications, is described. Against the background of documentation of the nursing process based on the DAPEP(1) data model, possible changes and requirements are deduced. The article provides a contribution to the discussion of a change in documentation of the nursing process by implementing nursing classifications in electronic patient records.

  17. Detecting experimental techniques and selecting relevant documents for protein-protein interactions from biomedical literature

    PubMed Central

    2011-01-01

    Background The selection of relevant articles for curation, and the linking of those articles to the experimental techniques confirming the findings, became one of the primary subjects of the recent BioCreative III contest. The contest’s Protein-Protein Interaction (PPI) task consisted of two sub-tasks: Article Classification Task (ACT) and Interaction Method Task (IMT). ACT aimed to automatically select relevant documents for PPI curation, whereas the goal of IMT was to recognise the methods used in experiments for identifying the interactions in full-text articles. Results We proposed and compared several classification-based methods for both tasks, employing rich contextual features as well as features extracted from external knowledge sources. For IMT, a new method that classifies pair-wise relations between every text phrase and candidate interaction method obtained promising results with an F1 score of 64.49%, as tested on the task’s development dataset. We also explored ways to combine this new approach and more conventional, multi-label document classification methods. For ACT, our classifiers exploited automatically detected named entities and other linguistic information. The evaluation results on the BioCreative III PPI test datasets showed that our systems were very competitive: one of our IMT methods yielded the best performance among all participants, as measured by F1 score, Matthews correlation coefficient and AUC iP/R; whereas for ACT, our best classifier was ranked second as measured by AUC iP/R, and also competitive according to other metrics. Conclusions Our novel approach that converts the multi-class, multi-label classification problem to a binary classification problem showed much promise in IMT. Nevertheless, on the test dataset the best performance was achieved by taking the union of the output of this method and that of a multi-class, multi-label document classifier, which indicates that the two types of systems complement each other in terms of recall. For ACT, our system exploited a rich set of features and also obtained encouraging results. We examined the features with respect to their contributions to the classification results, and concluded that contextual words surrounding named entities, as well as the MeSH headings associated with the documents, were among the main contributors to the performance. PMID:22151769

  18. Using ontology network structure in text mining.

    PubMed

    Berndt, Donald J; McCart, James A; Luther, Stephen L

    2010-11-13

    Statistical text mining treats documents as bags of words, with a focus on term frequencies within documents and across document collections. Unlike natural language processing (NLP) techniques that rely on an engineered vocabulary or a full-featured ontology, statistical approaches do not make use of domain-specific knowledge. The freedom from biases can be an advantage, but at the cost of ignoring potentially valuable knowledge. The approach proposed here investigates a hybrid strategy based on computing graph measures of term importance over an entire ontology and injecting the measures into the statistical text mining process. As a starting point, we adapt existing search engine algorithms such as PageRank and HITS to determine term importance within an ontology graph. The graph-theoretic approach is evaluated using a smoking data set from the i2b2 National Center for Biomedical Computing, cast as a simple binary classification task for categorizing smoking-related documents, demonstrating consistent improvements in accuracy.
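
    As a sketch of the hybrid strategy described above, the following computes PageRank over a toy ontology fragment with NetworkX and injects the resulting term importance into a term weight; the graph, terms, and boosting formula are illustrative assumptions.

        import networkx as nx

        # Toy ontology fragment: edges point from broader to narrower concepts.
        G = nx.DiGraph([("disease", "lung disease"), ("lung disease", "copd"),
                        ("behavior", "smoking"), ("smoking", "copd")])

        importance = nx.pagerank(G)  # graph-theoretic term importance

        def boosted_weight(term, tf):
            # Inject ontology-derived importance into a plain term frequency.
            return tf * (1.0 + importance.get(term, 0.0))

        print(boosted_weight("copd", 3))  # boosted weight for a smoking-related term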

  19. Assessment of text documentation accompanying uncoded diagnoses in computerized health insurance claims in Japan.

    PubMed

    Tanihara, Shinichi

    2015-01-01

    Uncoded diagnoses in health insurance claims (HICs) may introduce bias into Japanese health statistics dependent on computerized HICs. This study's aim was to identify the causes and characteristics of uncoded diagnoses. Uncoded diagnoses from computerized HICs (outpatient, inpatient, and the diagnosis procedure-combination per-diem payment system [DPC/PDPS]) submitted to the National Health Insurance Organization of Kumamoto Prefecture in May 2010 were analyzed. The text documentation accompanying the uncoded diagnoses was used to classify diagnoses in accordance with the International Classification of Diseases-10 (ICD-10). The text documentation was also classified into four categories using the standard descriptions of diagnoses defined in the master files of the computerized HIC system: 1) standard descriptions of diagnoses, 2) standard descriptions with a modifier, 3) non-standard descriptions of diagnoses, and 4) unclassifiable text documentation. Using these classifications, the proportions of uncoded diagnoses by ICD-10 disease category were calculated. Of the uncoded diagnoses analyzed (n = 363 753), non-standard descriptions of diagnoses for outpatient, inpatient, and DPC/PDPS HICs comprised 12.1%, 14.6%, and 1.0% of uncoded diagnoses, respectively. The proportion of uncoded diagnoses with standard descriptions with a modifier for Diseases of the eye and adnexa was significantly higher than the overall proportion of uncoded diagnoses among every HIC type. The pattern of uncoded diagnoses differed by HIC type and disease category. Evaluating the proportion of uncoded diagnoses in all medical facilities and developing effective coding methods for diagnoses with modifiers, prefixes, and suffixes should reduce the number of uncoded diagnoses in computerized HICs and improve the quality of HIC databases.

  20. Analysis of Nature of Science Included in Recent Popular Writing Using Text Mining Techniques

    ERIC Educational Resources Information Center

    Jiang, Feng; McComas, William F.

    2014-01-01

    This study examined the inclusion of nature of science (NOS) in popular science writing to determine whether it could serve as a supplementary resource for teaching NOS and to evaluate the accuracy of text mining and classification as a viable research tool in science education research. Four groups of documents published from 2001 to 2010 were…

  1. Content Abstract Classification Using Naive Bayes

    NASA Astrophysics Data System (ADS)

    Latif, Syukriyanto; Suwardoyo, Untung; Aldrin Wihelmus Sanadi, Edwin

    2018-03-01

    This study aims to classify abstract content based on the most frequently used words in the abstracts of English-language journals. The research uses text mining technology, which extracts text data to find information in a set of documents. Abstract content from 120 documents was downloaded from www.computer.org. The data are grouped into three categories: DM (Data Mining), ITS (Intelligent Transport System) and MM (Multimedia). The system was built using the naive Bayes algorithm to classify journal abstracts, with a feature selection process using term weighting to assign a weight to each word. Dimensionality reduction techniques were applied to prune words that rarely appear in each document, with reduction parameters tested from 10% to 90% of the 5,344 words. The performance of the classification system was tested using a confusion matrix comparing the training and test data. The results showed that the best classification was obtained with 75% training data and 25% test data. Accuracy rates for the DM, ITS and MM categories were 100%, 100% and 86%, respectively, with a dimension reduction parameter of 30% and a learning rate between 0.1 and 0.5.
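
    A hedged sketch of the abstract-classification setup described above (term weighting, rare-word pruning, naive Bayes, 75/25 split), using scikit-learn rather than the authors' system; the abstracts and labels are toy assumptions, and min_df stands in for the dimensionality reduction step.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import confusion_matrix

        abstracts = ["mining frequent itemsets in large databases",
                     "adaptive routing for vehicle traffic networks",
                     "low-latency video streaming and codecs",
                     "association rule discovery in transaction data"]
        labels = ["DM", "ITS", "MM", "DM"]  # toy stand-ins for the three categories

        vec = TfidfVectorizer(min_df=1)  # raise min_df to prune rare words
        X = vec.fit_transform(abstracts)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, labels, test_size=0.25, random_state=0)  # 75/25 composition
        model = MultinomialNB().fit(X_tr, y_tr)
        print(confusion_matrix(y_te, model.predict(X_te)))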

  2. Improved document image segmentation algorithm using multiresolution morphology

    NASA Astrophysics Data System (ADS)

    Bukhari, Syed Saqib; Shafait, Faisal; Breuel, Thomas M.

    2011-01-01

    Page segmentation into text and non-text elements is an essential preprocessing step before optical character recognition (OCR). In the case of poor segmentation, an OCR classification engine produces garbage characters due to the presence of non-text elements. This paper describes modifications to the text/non-text segmentation algorithm presented by Bloomberg, which is also available in his open-source Leptonica library. The modifications result in significant improvements, achieving better segmentation accuracy than the original algorithm on the UW-III, UNLV, and ICDAR 2009 page segmentation competition test images and on circuit diagram datasets.

  3. Determining Fuzzy Membership for Sentiment Classification: A Three-Layer Sentiment Propagation Model

    PubMed Central

    Zhao, Chuanjun; Wang, Suge; Li, Deyu

    2016-01-01

    Enormous quantities of review documents exist in forums, blogs, Twitter accounts, and shopping web sites. Analysis of the sentiment information hidden in these review documents is very useful for consumers and manufacturers. The sentiment orientation and sentiment intensity of a review can be described in more detail by a sentiment score than by bipolar sentiment polarity. Existing methods for calculating review sentiment scores frequently use a sentiment lexicon or the locations of features in a sentence, a paragraph, and a document. In order to achieve more accurate sentiment scores for review documents, a three-layer sentiment propagation model (TLSPM) is proposed that uses three kinds of interrelations: those among documents, topics, and words. First, we use nine pairwise relationship matrices between documents, topics, and words. In TLSPM, we assume that sentiment neighbors tend to have the same sentiment polarity and similar sentiment intensity in the sentiment propagation network. Then, we implement the sentiment propagation processes among the documents, topics, and words in turn. Finally, we obtain the steady sentiment scores of documents by a continuous iteration process. Intuition suggests that documents with strong sentiment intensity make larger contributions to classification than those with weak sentiment intensity. Therefore, we use the fuzzy membership of documents obtained by TLSPM as the text weight to train a fuzzy support vector machine model (FSVM). Compared with a support vector machine (SVM) and four other fuzzy membership determination methods, the results show that FSVM trained with TLSPM can enhance the effectiveness of sentiment classification. In addition, FSVM trained with TLSPM can reduce the mean square error (MSE) on seven sentiment rating prediction data sets. PMID:27846225
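
    The final step above (training an SVM in which each document's fuzzy membership weights its influence) can be approximated with per-sample weights; the sketch below uses scikit-learn's sample_weight argument with invented features and memberships, and is not the authors' FSVM formulation.

        import numpy as np
        from sklearn.svm import SVC

        # Toy review features; memberships stand in for TLSPM outputs.
        X = np.array([[0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.1, 0.7]])
        y = np.array([1, 1, 0, 0])
        membership = np.array([0.95, 0.40, 0.90, 0.35])

        # Low-membership documents contribute less to the decision boundary.
        clf = SVC(kernel="linear").fit(X, y, sample_weight=membership)
        print(clf.predict([[0.85, 0.2]]))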

  4. Automated Patent Categorization and Guided Patent Search using IPC as Inspired by MeSH and PubMed.

    PubMed

    Eisinger, Daniel; Tsatsaronis, George; Bundschus, Markus; Wieneke, Ulrich; Schroeder, Michael

    2013-04-15

    Document search on PubMed, the pre-eminent database for biomedical literature, relies on the annotation of its documents with relevant terms from the Medical Subject Headings ontology (MeSH) for improving recall through query expansion. Patent documents are another important information source, though they are considerably less accessible. One option to expand patent search beyond pure keywords is the inclusion of classification information: Since every patent is assigned at least one class code, it should be possible for these assignments to be automatically used in a similar way as the MeSH annotations in PubMed. In order to develop a system for this task, it is necessary to have a good understanding of the properties of both classification systems. This report describes our comparative analysis of MeSH and the main patent classification system, the International Patent Classification (IPC). We investigate the hierarchical structures as well as the properties of the terms/classes respectively, and we compare the assignment of IPC codes to patents with the annotation of PubMed documents with MeSH terms.Our analysis shows a strong structural similarity of the hierarchies, but significant differences of terms and annotations. The low number of IPC class assignments and the lack of occurrences of class labels in patent texts imply that current patent search is severely limited. To overcome these limits, we evaluate a method for the automated assignment of additional classes to patent documents, and we propose a system for guided patent search based on the use of class co-occurrence information and external resources.

  5. Automated Patent Categorization and Guided Patent Search using IPC as Inspired by MeSH and PubMed

    PubMed Central

    2013-01-01

    Document search on PubMed, the pre-eminent database for biomedical literature, relies on the annotation of its documents with relevant terms from the Medical Subject Headings ontology (MeSH) for improving recall through query expansion. Patent documents are another important information source, though they are considerably less accessible. One option to expand patent search beyond pure keywords is the inclusion of classification information: Since every patent is assigned at least one class code, it should be possible for these assignments to be automatically used in a similar way as the MeSH annotations in PubMed. In order to develop a system for this task, it is necessary to have a good understanding of the properties of both classification systems. This report describes our comparative analysis of MeSH and the main patent classification system, the International Patent Classification (IPC). We investigate the hierarchical structures as well as the properties of the terms/classes respectively, and we compare the assignment of IPC codes to patents with the annotation of PubMed documents with MeSH terms. Our analysis shows a strong structural similarity of the hierarchies, but significant differences of terms and annotations. The low number of IPC class assignments and the lack of occurrences of class labels in patent texts imply that current patent search is severely limited. To overcome these limits, we evaluate a method for the automated assignment of additional classes to patent documents, and we propose a system for guided patent search based on the use of class co-occurrence information and external resources. PMID:23734562

  6. Using clustering and a modified classification algorithm for automatic text summarization

    NASA Astrophysics Data System (ADS)

    Aries, Abdelkrime; Oufaida, Houda; Nouali, Omar

    2013-01-01

    In this paper we describe a modified classification method intended for extractive summarization. The classification in this method does not need a training corpus; it uses the input text itself. First, we cluster the document's sentences to exploit the diversity of topics; then we apply a learning algorithm (here, Naive Bayes) to each cluster, treating it as a class. After obtaining the classification model, we calculate the score of each sentence in each class, using a scoring model derived from the classification algorithm. These scores are then used to reorder the sentences and extract the top-ranked ones as the output summary. We conducted experiments using a corpus of scientific papers and compared our results to another summarization system, UNIS. We also examined the impact of tuning the clustering threshold on the resulting summary, as well as the impact of adding more features to the classifier. We found this method promising: it gives good performance, and the addition of new features (which is simple with this method) can improve the summary's accuracy.
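
    A minimal sketch of the cluster-then-score idea described above, using scikit-learn; the original feature set and scoring model are not reproduced, and the sentences, cluster count, and summary length are toy assumptions.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.cluster import KMeans
        from sklearn.naive_bayes import MultinomialNB

        sentences = ["Topic models uncover recurring themes.",
                     "LDA is a widely used topic model.",
                     "SVMs separate classes with a margin.",
                     "Kernel functions map features implicitly."]

        X = TfidfVectorizer().fit_transform(sentences)
        clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

        # Treat each cluster as a class; score sentences by class likelihood.
        nb = MultinomialNB().fit(X, clusters)
        scores = nb.predict_proba(X).max(axis=1)
        ranked = sorted(zip(scores, sentences), reverse=True)
        summary = [s for _, s in ranked[:2]]  # top sentences form the summary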

  7. Unsupervised Biomedical Named Entity Recognition: Experiments with Clinical and Biological Texts

    PubMed Central

    Zhang, Shaodian; Elhadad, Noémie

    2013-01-01

    Named entity recognition is a crucial component of biomedical natural language processing, enabling information extraction and ultimately reasoning over and knowledge discovery from text. Much progress has been made in the design of rule-based and supervised tools, but they are often genre and task dependent. As such, adapting them to different genres of text or identifying new types of entities requires major effort in re-annotation or rule development. In this paper, we propose an unsupervised approach to extracting named entities from biomedical text. We describe a stepwise solution to tackle the challenges of entity boundary detection and entity type classification without relying on any handcrafted rules, heuristics, or annotated data. A noun phrase chunker followed by a filter based on inverse document frequency extracts candidate entities from free text. Classification of candidate entities into categories of interest is carried out by leveraging principles from distributional semantics. Experiments show that our system, especially the entity classification step, yields competitive results on two popular biomedical datasets of clinical notes and biological literature, and outperforms a baseline dictionary match approach. Detailed error analysis provides a road map for future work. PMID:23954592

  8. Automated ancillary cancer history classification for mesothelioma patients from free-text clinical reports

    PubMed Central

    Wilson, Richard A.; Chapman, Wendy W.; DeFries, Shawn J.; Becich, Michael J.; Chapman, Brian E.

    2010-01-01

    Background: Clinical records are often unstructured, free-text documents that create information extraction challenges and costs. Healthcare delivery and research organizations, such as the National Mesothelioma Virtual Bank, require the aggregation of both structured and unstructured data types. Natural language processing offers techniques for automatically extracting information from unstructured, free-text documents. Methods: Five hundred and eight history and physical reports from mesothelioma patients were split into development (208) and test sets (300). A reference standard was developed and each report was annotated by experts with regard to the patient’s personal history of ancillary cancer and family history of any cancer. The Hx application was developed to process reports, extract relevant features, perform reference resolution and classify them with regard to cancer history. Two methods, Dynamic-Window and ConText, for extracting information were evaluated. Hx’s classification responses using each of the two methods were measured against the reference standard. The average Cohen’s weighted kappa served as the human benchmark in evaluating the system. Results: Hx had a high overall accuracy, with each method scoring 96.2%. F-measures using the Dynamic-Window and ConText methods were 91.8% and 91.6%, which were comparable to the human benchmark of 92.8%. For the personal history classification, Dynamic-Window scored highest with 89.2%, and for the family history classification, ConText scored highest with 97.6%; both methods were comparable to the human benchmarks of 88.3% and 97.2%, respectively. Conclusion: We evaluated an automated application’s performance in classifying a mesothelioma patient’s personal and family history of cancer from clinical reports. To do so, the Hx application must process reports, identify cancer concepts, distinguish the known mesothelioma from ancillary cancers, recognize negation, perform reference resolution and determine the experiencer. Results indicated that both information extraction methods tested were dependent on the domain-specific lexicon and negation extraction. We showed that the more general method, ConText, performed as well as our task-specific method. Although Dynamic-Window could be modified to retrieve other concepts, ConText is more robust and performs better on inconclusive concepts. Hx could greatly improve and expedite the process of extracting data from free-text, clinical records for a variety of research or healthcare delivery organizations. PMID:21031012

  9. Critical Linguistics: A Starting Point for Oppositional Reading.

    ERIC Educational Resources Information Center

    Janks, Hilary

    This document focuses on specific linguistic features that serve ideological functions in texts written in South Africa from 1985 to 1988. The features examined include: naming; metaphors; old words with new meanings; words becoming tainted; renaming or lexicalization; overlexicalization; strategies for resisting classification; tense and aspect;…

  10. A classification of errors in lay comprehension of medical documents.

    PubMed

    Keselman, Alla; Smith, Catherine Arnott

    2012-12-01

    Emphasis on participatory medicine requires that patients and consumers participate in tasks traditionally reserved for healthcare providers. This includes reading and comprehending medical documents, often but not necessarily in the context of interacting with Personal Health Records (PHRs). Research suggests that while giving patients access to medical documents has many benefits (e.g., improved patient-provider communication), lay people often have difficulty understanding medical information. Informatics can address the problem by developing tools that support comprehension; this requires in-depth understanding of the nature and causes of errors that lay people make when comprehending clinical documents. The objective of this study was to develop a classification scheme of comprehension errors, based on lay individuals' retellings of two documents containing clinical text: a description of a clinical trial and a typical office visit note. While not comprehensive, the scheme can serve as a foundation of further development of a taxonomy of patients' comprehension errors. Eighty participants, all healthy volunteers, read and retold two medical documents. A data-driven content analysis procedure was used to extract and classify retelling errors. The resulting hierarchical classification scheme contains nine categories and 23 subcategories. The most common error made by the participants involved incorrectly recalling brand names of medications. Other common errors included misunderstanding clinical concepts, misreporting the objective of a clinical research study and physician's findings during a patient's visit, and confusing and misspelling clinical terms. A combination of informatics support and health education is likely to improve the accuracy of lay comprehension of medical documents. Published by Elsevier Inc.

  11. Application of diffusion maps to identify human factors of self-reported anomalies in aviation.

    PubMed

    Andrzejczak, Chris; Karwowski, Waldemar; Mikusinski, Piotr

    2012-01-01

    A study was conducted to investigate the factors that lead pilots to submit voluntary anomaly reports regarding their flight performance. Diffusion Maps (DM) were selected as the method of choice for performing dimensionality reduction on the text records in this study; Diffusion Maps have seen successful use in other domains such as image classification and pattern recognition. High-dimensionality data in the form of narrative text reports from the NASA Aviation Safety Reporting System (ASRS) were clustered and categorized by way of dimensionality reduction. Supervised analyses were performed to create a baseline document clustering system. Dimensionality reduction techniques identified concepts or keywords within records, and allowed the creation of a framework for an unsupervised document classification system. The unsupervised clustering algorithm performed similarly to the supervised methods outlined in the study. The dimensionality reduction was performed on the 100 most commonly occurring words within 126,000 text records describing commercial aviation incidents. This study demonstrates that unsupervised machine clustering and organization of incident reports is possible based on unbiased inputs. Findings from this study reinforced traditional views on what factors contribute to civil aviation anomalies; however, new associations between previously unrelated factors and conditions were also found.
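
    For readers unfamiliar with the technique, here is a minimal, self-contained diffusion-map sketch (Gaussian affinities, row-normalized to a Markov matrix, embedding from the leading non-trivial eigenvectors); the random data stands in for term-count vectors, and the kernel bandwidth is an assumption.

        import numpy as np

        def diffusion_map(X, n_components=2, eps=1.0):
            # Pairwise squared distances and Gaussian affinities.
            d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
            K = np.exp(-d2 / eps)
            P = K / K.sum(axis=1, keepdims=True)  # row-stochastic Markov matrix
            vals, vecs = np.linalg.eig(P)
            order = np.argsort(-vals.real)
            keep = order[1:n_components + 1]      # skip the trivial eigenvector
            return vecs.real[:, keep] * vals.real[keep]

        X = np.random.default_rng(0).random((10, 5))  # stand-in for text features
        embedding = diffusion_map(X)  # low-dimensional coordinates for clustering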

  12. Scalable Kernel Methods and Algorithms for General Sequence Analysis

    ERIC Educational Resources Information Center

    Kuksa, Pavel

    2011-01-01

    Analysis of large-scale sequential data has become an important task in machine learning and pattern recognition, inspired in part by numerous scientific and technological applications such as the document and text classification or the analysis of biological sequences. However, current computational methods for sequence comparison still lack…

  13. 10 CFR 1016.32 - Classification and preparation of documents.

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ... 10 Energy 4 2013-01-01 2013-01-01 false Classification and preparation of documents. 1016.32... of Information § 1016.32 Classification and preparation of documents. (a) Classification. Restricted... he is not positive is not within that definition and CG-UF-3 does not provide positive classification...

  14. 10 CFR 1016.32 - Classification and preparation of documents.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... 10 Energy 4 2014-01-01 2014-01-01 false Classification and preparation of documents. 1016.32... of Information § 1016.32 Classification and preparation of documents. (a) Classification. Restricted... he is not positive is not within that definition and CG-UF-3 does not provide positive classification...

  15. 10 CFR 1016.32 - Classification and preparation of documents.

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ... 10 Energy 4 2012-01-01 2012-01-01 false Classification and preparation of documents. 1016.32... of Information § 1016.32 Classification and preparation of documents. (a) Classification. Restricted... he is not positive is not within that definition and CG-UF-3 does not provide positive classification...

  16. Protein classification based on text document classification techniques.

    PubMed

    Cheng, Betty Yee Man; Carbonell, Jaime G; Klein-Seetharaman, Judith

    2005-03-01

    The need for accurate, automated protein classification methods continues to increase as advances in biotechnology uncover new proteins. G-protein coupled receptors (GPCRs) are a particularly difficult superfamily of proteins to classify due to extreme diversity among its members. Previous comparisons of BLAST, k-nearest neighbor (k-NN), hidden Markov model (HMM) and support vector machine (SVM) using alignment-based features have suggested that classifiers at the complexity of SVM are needed to attain high accuracy. Here, analogous to document classification, we applied Decision Tree and Naive Bayes classifiers with chi-square feature selection on counts of n-grams (i.e. short peptide sequences of length n) to this classification task. Using the GPCR dataset and evaluation protocol from the previous study, the Naive Bayes classifier attained an accuracy of 93.0 and 92.4% in level I and level II subfamily classification respectively, while SVM has a reported accuracy of 88.4 and 86.3%. This is a 39.7 and 44.5% reduction in residual error for level I and level II subfamily classification, respectively. The Decision Tree, while inferior to SVM, outperforms HMM in both level I and level II subfamily classification. For those GPCR families whose profiles are stored in the Protein FAMilies database of alignments and HMMs (PFAM), our method performs comparably to a search against those profiles. Finally, our method can be generalized to other protein families by applying it to the superfamily of nuclear receptors with 94.5, 97.8 and 93.6% accuracy in family, level I and level II subfamily classification respectively. Copyright 2005 Wiley-Liss, Inc.
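
    A hedged sketch of the document-classification analogy described above (n-gram counts, chi-square feature selection, Naive Bayes), using scikit-learn's character n-grams over toy peptide fragments; the sequences, labels, n-gram range and k are all illustrative assumptions.

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.feature_selection import SelectKBest, chi2
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline

        seqs = ["MTEYKLVVVG", "MKTAYIAKQR", "MTEYKLAVLG", "MKTAFIAKQR"]
        families = ["A", "B", "A", "B"]  # toy subfamily labels

        # Count short peptide n-grams, keep the most class-discriminative ones,
        # and classify with Naive Bayes.
        clf = make_pipeline(
            CountVectorizer(analyzer="char", ngram_range=(2, 3), lowercase=False),
            SelectKBest(chi2, k=10),
            MultinomialNB(),
        ).fit(seqs, families)
        print(clf.predict(["MTEYRLVVVG"]))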

  17. 10 CFR 1016.32 - Classification and preparation of documents.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 10 Energy 4 2010-01-01 2010-01-01 false Classification and preparation of documents. 1016.32 Section 1016.32 Energy DEPARTMENT OF ENERGY (GENERAL PROVISIONS) SAFEGUARDING OF RESTRICTED DATA Control of Information § 1016.32 Classification and preparation of documents. (a) Classification. Restricted...

  18. 10 CFR 1016.32 - Classification and preparation of documents.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 10 Energy 4 2011-01-01 2011-01-01 false Classification and preparation of documents. 1016.32 Section 1016.32 Energy DEPARTMENT OF ENERGY (GENERAL PROVISIONS) SAFEGUARDING OF RESTRICTED DATA Control of Information § 1016.32 Classification and preparation of documents. (a) Classification. Restricted...

  19. Automating document classification for the Immune Epitope Database

    PubMed Central

    Wang, Peng; Morgan, Alexander A; Zhang, Qing; Sette, Alessandro; Peters, Bjoern

    2007-01-01

    Background The Immune Epitope Database contains information on immune epitopes curated manually from the scientific literature. Like similar projects in other knowledge domains, significant effort is spent on identifying which articles are relevant for this purpose. Results We here report our experience in automating this process using Naïve Bayes classifiers trained on 20,910 abstracts classified by domain experts. Improvements on the basic classifier performance were made by a) utilizing information stored in PubMed beyond the abstract itself, b) applying standard feature selection criteria, and c) extracting domain-specific feature patterns that, for example, identify peptide sequences. We have integrated the classifier into the curation process, determining whether abstracts are clearly relevant, clearly irrelevant, or whether no certain classification can be made, in which case the abstracts are classified manually. Testing this classification scheme on an independent dataset, we achieved 95% sensitivity and specificity on the 51.1% of abstracts that were automatically classified. Conclusion By implementing text classification, we have sped up the reference selection process without sacrificing the sensitivity or specificity of human expert classification. This study provides both practical recommendations for users of text classification tools and a large dataset that can serve as a benchmark for tool developers. PMID:17655769
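
    The triage logic described above (auto-accept, auto-reject, or defer to manual curation) can be sketched as probability thresholding on a classifier's output; the thresholds, corpus and labels below are invented for illustration and are not the published operating points.

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline

        abstracts = ["epitope binding assay in mice", "galaxy cluster survey",
                     "T cell epitope mapping results", "stock market forecasting"]
        relevant = [1, 0, 1, 0]

        clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(abstracts, relevant)

        def triage(abstract, hi=0.95, lo=0.05):
            # Route confident predictions automatically; defer the rest.
            p = clf.predict_proba([abstract])[0, 1]
            if p >= hi:
                return "clearly relevant"
            if p <= lo:
                return "clearly irrelevant"
            return "manual classification"

        print(triage("epitope prediction for vaccine design"))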

  20. A systematic literature review of automated clinical coding and classification systems

    PubMed Central

    Williams, Margaret; Fenton, Susan H; Jenders, Robert A; Hersh, William R

    2010-01-01

    Clinical coding and classification processes transform natural language descriptions in clinical text into data that can subsequently be used for clinical care, research, and other purposes. This systematic literature review examined studies that evaluated all types of automated coding and classification systems to determine the performance of such systems. Studies indexed in Medline or other relevant databases prior to March 2009 were considered. The 113 studies included in this review show that automated tools exist for a variety of coding and classification purposes, focus on various healthcare specialties, and handle a wide variety of clinical document types. Automated coding and classification systems themselves are not generalizable, nor are the results of the studies evaluating them. Published research shows these systems hold promise, but these data must be considered in context, with performance relative to the complexity of the task and the desired outcome. PMID:20962126

  1. A systematic literature review of automated clinical coding and classification systems.

    PubMed

    Stanfill, Mary H; Williams, Margaret; Fenton, Susan H; Jenders, Robert A; Hersh, William R

    2010-01-01

    Clinical coding and classification processes transform natural language descriptions in clinical text into data that can subsequently be used for clinical care, research, and other purposes. This systematic literature review examined studies that evaluated all types of automated coding and classification systems to determine the performance of such systems. Studies indexed in Medline or other relevant databases prior to March 2009 were considered. The 113 studies included in this review show that automated tools exist for a variety of coding and classification purposes, focus on various healthcare specialties, and handle a wide variety of clinical document types. Automated coding and classification systems themselves are not generalizable, nor are the results of the studies evaluating them. Published research shows these systems hold promise, but these data must be considered in context, with performance relative to the complexity of the task and the desired outcome.

  2. 10 CFR 95.37 - Classification and preparation of documents.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... classification decisions. (c) Markings required on face of documents. (1) For derivative classification of... to a document must be placed in a conspicuous fashion in letters at the top and bottom of the outside... on the face of the document: Reproduction or Further Dissemination Requires Approval of If any...

  3. 10 CFR 95.37 - Classification and preparation of documents.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... classification decisions. (c) Markings required on face of documents. (1) For derivative classification of... to a document must be placed in a conspicuous fashion in letters at the top and bottom of the outside... on the face of the document: Reproduction or Further Dissemination Requires Approval of If any...

  4. Automatic Classification Using Supervised Learning in a Medical Document Filtering Application.

    ERIC Educational Resources Information Center

    Mostafa, J.; Lam, W.

    2000-01-01

    Presents a multilevel model of the information filtering process that permits document classification. Evaluates a document classification approach based on a supervised learning algorithm, measures the accuracy of the algorithm in a neural network that was trained to classify medical documents on cell biology, and discusses filtering…

  5. Exploiting salient semantic analysis for information retrieval

    NASA Astrophysics Data System (ADS)

    Luo, Jing; Meng, Bo; Quan, Changqin; Tu, Xinhui

    2016-11-01

    Recently, many Wikipedia-based methods have been proposed to improve the performance of different natural language processing (NLP) tasks, such as semantic relatedness computation, text classification and information retrieval. Among these methods, salient semantic analysis (SSA) has been proven to be an effective way to generate conceptual representations for words or documents. However, its feasibility and effectiveness in information retrieval are mostly unknown. In this paper, we study how to use SSA efficiently to improve information retrieval performance, and propose an SSA-based retrieval method under the language model framework. First, the SSA model is adopted to build conceptual representations for documents and queries. Then, these conceptual representations and the bag-of-words (BOW) representations are used in combination to estimate the language models of queries and documents. Experimental results on several standard Text REtrieval Conference (TREC) collections show that the proposed models consistently outperform the existing Wikipedia-based retrieval methods.
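
    One plausible reading of the combination step is a per-term linear interpolation of a word-level and a concept-derived language model, as in the sketch below; the mixing weight and smoothing are assumptions rather than the paper's exact estimators.

    ```python
    # Sketch: interpolating word-level and concept-derived language models.
    # lam is an assumed mixing weight; both models map terms to probabilities.
    import math

    def term_prob(term, word_lm, concept_lm, lam=0.7):
        return lam * word_lm.get(term, 0.0) + (1.0 - lam) * concept_lm.get(term, 0.0)

    def query_log_likelihood(query_terms, word_lm, concept_lm, lam=0.7, eps=1e-12):
        """Score a document's language models against a query; higher is better."""
        return sum(math.log(term_prob(t, word_lm, concept_lm, lam) + eps)
                   for t in query_terms)
    ```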

  6. TEXTINFO: a tool for automatic determination of patient clinical profiles using text analysis.

    PubMed Central

    Borst, F.; Lyman, M.; Nhàn, N. T.; Tick, L. J.; Sager, N.; Scherrer, J. R.

    1991-01-01

    The clinical data contained in narrative patient documents is made available via grammatical and semantic processing. Retrievals from the resulting relational database tables are matched against a set of clinical descriptors to obtain clinical profiles of the patients in terms of the descriptors present in the documents. Discharge summaries of 57 Dept. of Digestive Surgery patients were processed in this manner. Factor analysis and discriminant analysis procedures were then applied, showing the profiles to be useful for diagnosis definitions (by establishing relations between diagnoses and clinical findings), for diagnosis assessment (by viewing the match between a definition and observed events recorded in a patient text), and potentially for outcome evaluation based on the classification abilities of clinical signs. PMID:1807679

  7. Extracting biomedical events from pairs of text entities

    PubMed Central

    2015-01-01

    Background Huge amounts of electronic biomedical documents, such as molecular biology reports or genomic papers, are generated daily. Nowadays, these documents are mainly available in the form of unstructured free text, which requires heavy processing for registration into organized databases. This organization is instrumental for information retrieval, enabling researchers and practitioners in biology, medicine, and related fields to answer advanced queries. Hence, the massive data flow calls for efficient automatic methods of text mining that extract high-level information, such as biomedical events, from biomedical text. The usual computational tools of natural language processing cannot be readily applied to extract these biomedical events, due to the peculiarities of the domain. Indeed, biomedical documents contain highly domain-specific jargon and syntax. These documents also describe distinctive dependencies, making text mining in molecular biology a specific discipline. Results We address biomedical event extraction as the classification of pairs of text entities into the classes corresponding to event types. The candidate pairs of text entities are recursively provided to a multiclass classifier relying on support vector machines. This recursive process extracts events involving other events as arguments. Compared to joint models based on Markov random fields, our model simplifies inference and hence requires shorter training and prediction times along with lower memory capacity. Compared to usual pipeline approaches, our model passes over a complex intermediate problem, while making more extensive use of sophisticated joint features between text entities. Our method focuses on the core event extraction of the Genia task of the BioNLP challenges, yielding the best result reported so far on the 2013 edition. PMID:26201478
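
    A schematic sketch of the recursive pair-classification loop described above; classify_pair stands in for the paper's multiclass SVM, and the event representation is hypothetical.

    ```python
    # Sketch: extract events by classifying candidate pairs; newly found events
    # re-enter the candidate pool so events can take other events as arguments.
    # Assumes classify_pair is deterministic, so the loop saturates and stops.
    def extract_events(entities, classify_pair):
        events, candidates = [], list(entities)
        changed = True
        while changed:
            changed = False
            pool = list(candidates)                        # snapshot for this round
            for trigger in pool:
                for argument in pool:
                    if trigger is argument:
                        continue
                    label = classify_pair(trigger, argument)  # e.g. 'Binding' or None
                    if label is None:
                        continue
                    event = {"type": label, "trigger": trigger, "argument": argument}
                    if event not in events:
                        events.append(event)
                        candidates.append(event)           # recursion: event as argument
                        changed = True
        return events
    ```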

  8. Document cards: a top trumps visualization for documents.

    PubMed

    Strobelt, Hendrik; Oelke, Daniela; Rohrdantz, Christian; Stoffel, Andreas; Keim, Daniel A; Deussen, Oliver

    2009-01-01

    Finding suitable, less space-consuming views for a document's main content is crucial to providing convenient access to large document collections on display devices of different sizes. We present a novel compact visualization which represents the document's key semantics as a mixture of images and important key terms, similar to cards in a Top Trumps game. The key terms are extracted using an advanced text mining approach based on fully automatic document structure extraction. The images and their captions are extracted using a graphical heuristic, and the captions are used for a semi-semantic image weighting. Furthermore, we use the image color histogram for classification and show at least one representative from each non-empty image class. The approach is demonstrated for the IEEE InfoVis publications of a complete year. The method can easily be applied to other publication collections and sets of documents which contain images.

  9. Supporting the education evidence portal via text mining

    PubMed Central

    Ananiadou, Sophia; Thompson, Paul; Thomas, James; Mu, Tingting; Oliver, Sandy; Rickinson, Mark; Sasaki, Yutaka; Weissenbacher, Davy; McNaught, John

    2010-01-01

    The UK Education Evidence Portal (eep) provides a single, searchable, point of access to the contents of the websites of 33 organizations relating to education, with the aim of revolutionizing work practices for the education community. Use of the portal alleviates the need to spend time searching multiple resources to find relevant information. However, the combined content of the websites of interest is still very large (over 500 000 documents and growing). This means that searches using the portal can produce very large numbers of hits. As users often have limited time, they would benefit from enhanced methods of performing searches and viewing results, allowing them to drill down to information of interest more efficiently, without having to sift through potentially long lists of irrelevant documents. The Joint Information Systems Committee (JISC)-funded ASSIST project has produced a prototype web interface to demonstrate the applicability of integrating a number of text-mining tools and methods into the eep, to facilitate an enhanced searching, browsing and document-viewing experience. New features include automatic classification of documents according to a taxonomy, automatic clustering of search results according to similar document content, and automatic identification and highlighting of key terms within documents. PMID:20643679

  10. 49 CFR 1177.1 - Definitions and classifications of documents.

    Code of Federal Regulations, 2011 CFR

    2011-10-01

    ... 49 Transportation 8 2011-10-01 2011-10-01 false Definitions and classifications of documents. 1177.1 Section 1177.1 Transportation Other Regulations Relating to Transportation (Continued) SURFACE... Definitions and classifications of documents. (a) A “primary document” is a mortgage (excluding those under...

  11. Transfer Learning beyond Text Classification

    NASA Astrophysics Data System (ADS)

    Yang, Qiang

    Transfer learning is a new machine learning and data mining framework that allows the training and test data to come from different distributions or feature spaces. We can find many novel applications of machine learning and data mining where transfer learning is necessary. While much has been done in transfer learning in text classification and reinforcement learning, there has been a lack of documented success stories of novel applications of transfer learning in other areas. In this invited article, I will argue that transfer learning is in fact quite ubiquitous in many real world applications. In this article, I will illustrate this point through an overview of a broad spectrum of applications of transfer learning that range from collaborative filtering to sensor based location estimation and logical action model learning for AI planning. I will also discuss some potential future directions of transfer learning.

  12. Electronic Derivative Classifier/Reviewing Official

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Harris, Joshua C; McDuffie, Gregory P; Light, Ken L

    2017-02-17

    The electronic Derivative Classifier/Reviewing Official (eDC/RO) is a web-based document management and routing system that reduces security risks and increases workflow efficiencies. The system automates the upload, review-request notification, and status tracking of documents submitted for classification review on a secure server. It supports a variety of document formats (i.e., pdf, doc, docx, xls, xlsx, xlsm, ppt, pptx, vsd, vsdx and txt), and allows for the dynamic placement of classification markings, such as the classification level, category, and caveats, on the document, in addition to a document footer and digital signature.

  13. Semiotic indexing of digital resources

    DOEpatents

    Parker, Charles T; Garrity, George M

    2014-12-02

    A method of classifying a plurality of documents. The method includes steps of providing a first set of classification terms and a second set of classification terms, the second set of classification terms being different from the first set of classification terms; generating a first frequency array of a number of occurrences of each term from the first set of classification terms in each document; generating a second frequency array of a number of occurrences of each term from the second set of classification terms in each document; generating a first similarity matrix from the first frequency array; generating a second similarity matrix from the second frequency array; determining an entrywise combination of the first similarity matrix and the second similarity matrix; and clustering the plurality of documents based on the result of the entrywise combination.
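
    A minimal sketch of the claimed pipeline, assuming cosine similarity for the similarity matrices and elementwise multiplication for the entrywise combination (the text above does not fix these choices):

    ```python
    # Sketch: two term-set frequency arrays -> two similarity matrices ->
    # entrywise (Hadamard) combination -> clustering. Cosine similarity and
    # elementwise multiplication are assumed concrete choices.
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.cluster import AgglomerativeClustering

    def count_terms(docs, terms):
        """Rows = documents, columns = occurrence counts of each classification term."""
        return np.array([[doc.lower().count(t) for t in terms] for doc in docs])

    def cluster_documents(docs, terms_a, terms_b, n_clusters=3):
        freq_a = count_terms(docs, terms_a)       # first frequency array
        freq_b = count_terms(docs, terms_b)       # second frequency array
        sim_a = cosine_similarity(freq_a)         # first similarity matrix
        sim_b = cosine_similarity(freq_b)         # second similarity matrix
        combined = sim_a * sim_b                  # entrywise combination
        model = AgglomerativeClustering(n_clusters=n_clusters,
                                        metric="precomputed",  # sklearn >= 1.2 name
                                        linkage="average")
        return model.fit_predict(1.0 - combined)  # cluster on combined dissimilarity
    ```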

  14. Logo detection and classification in a sport video: video indexing for sponsorship revenue control

    NASA Astrophysics Data System (ADS)

    Kovar, Bohumil; Hanjalic, Alan

    2001-12-01

    This paper presents a novel approach to detecting and classifying a trademark logo in frames of a sport video. In view of the fact that we attempt to detect and recognize a logo in a natural scene, the algorithm developed in this paper differs from traditional techniques for logo detection and classification, which are applicable either to well-structured general text documents (e.g. invoices, memos, bank cheques) or to specialized trademark logo databases, where logos appear isolated on a clear background and where their detection and classification is not disturbed by the surrounding visual detail. Although the development of our algorithm is still at an early stage, the experiments performed so far on a set of soccer TV broadcasts yield very encouraging results.

  15. Text, photo, and line extraction in scanned documents

    NASA Astrophysics Data System (ADS)

    Erkilinc, M. Sezer; Jaber, Mustafa; Saber, Eli; Bauer, Peter; Depalov, Dejan

    2012-07-01

    We propose a page layout analysis algorithm to classify a scanned document into different regions such as text, photo, or strong lines. The proposed scheme consists of five modules. The first module performs several image preprocessing techniques, such as image scaling, filtering, color space conversion, and gamma correction, to enhance the scanned image quality and reduce the computation time in later stages. Text detection is applied in the second module, wherein wavelet transform and run-length encoding are employed to generate and validate text regions, respectively. The third module uses a Markov random field based block-wise segmentation that employs a basis vector projection technique with maximum a posteriori probability optimization to detect photo regions. In the fourth module, methods for edge detection, edge linking, line-segment fitting, and the Hough transform are utilized to detect strong edges and lines. In the last module, the resultant text, photo, and edge maps are combined to generate a page layout map using K-means clustering. The proposed algorithm has been tested on several hundred documents that contain simple and complex page layout structures and contents, such as articles, magazines, business cards, dictionaries, and newsletters, and compared against state-of-the-art page-segmentation techniques with benchmark performance. The results indicate that our methodology achieves an average of approximately 89% classification accuracy in text, photo, and background regions.
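
    The text-validation step relies on run-length statistics. The sketch below shows a generic horizontal run-length smoothing pass of the kind such modules build on; the gap threshold is an illustrative assumption, not the paper's tuned value.

    ```python
    # Sketch: horizontal run-length smoothing, a classic primitive for grouping
    # characters into candidate text regions in a binarized page image.
    import numpy as np

    def rlsa_horizontal(binary, max_gap=20):
        """binary: 2-D array, 1 = ink, 0 = background; max_gap is illustrative."""
        out = np.asarray(binary).copy()
        for row in out:
            last_ink = -1
            for j, value in enumerate(row):
                if value == 1:
                    gap = j - last_ink - 1
                    if last_ink >= 0 and 0 < gap <= max_gap:
                        row[last_ink + 1:j] = 1   # bridge the short background gap
                    last_ink = j
        return out
    ```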

  16. Improving imbalanced scientific text classification using sampling strategies and dictionaries.

    PubMed

    Borrajo, L; Romero, R; Iglesias, E L; Redondo Marey, C M

    2011-09-15

    Many real applications have the imbalanced class distribution problem, where one of the classes is represented by a very small number of cases compared to the other classes. Among the systems affected are those related to the retrieval and classification of scientific documentation. Sampling strategies such as oversampling and subsampling are popular in tackling the problem of class imbalance. In this work, we study their effects on three types of classifiers (kNN, SVM and Naive Bayes) when they are applied to searches on the PubMed scientific database. Another purpose of this paper is to study the use of dictionaries in the classification of biomedical texts. Experiments are conducted with three different dictionaries (BioCreative, NLPBA, and an ad-hoc subset of the UniProt database named Protein) using the mentioned classifiers and sampling strategies. The best results were obtained with the NLPBA and Protein dictionaries and the SVM classifier using the subsampling balancing technique. These results were compared with those obtained by other authors using the TREC Genomics 2005 public corpus. Copyright 2011 The Author(s). Published by Journal of Integrative Bioinformatics.
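
    A minimal sketch of the subsampling strategy that performed best above: randomly discard majority-class documents until the classes are balanced, then train the SVM. The 1:1 target ratio is an assumption for illustration.

    ```python
    # Sketch: random subsampling of the majority class before training an SVM.
    # Assumes y is a binary numpy array with class 1 as the minority class.
    import numpy as np
    from sklearn.svm import LinearSVC

    def subsample_majority(X, y, random_state=0):
        rng = np.random.default_rng(random_state)
        pos = np.flatnonzero(y == 1)             # minority class indices
        neg = np.flatnonzero(y == 0)             # majority class indices
        keep = rng.choice(neg, size=len(pos), replace=False)
        idx = np.concatenate([pos, keep])
        rng.shuffle(idx)
        return X[idx], y[idx]

    # Usage: X is a document-term matrix, y the binary relevance labels.
    # X_bal, y_bal = subsample_majority(X, y)
    # clf = LinearSVC().fit(X_bal, y_bal)
    ```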

  17. Experiments on Supervised Learning Algorithms for Text Categorization

    NASA Technical Reports Server (NTRS)

    Namburu, Setu Madhavi; Tu, Haiying; Luo, Jianhui; Pattipati, Krishna R.

    2005-01-01

    Modern information society is facing the challenge of handling massive volumes of online documents, news, intelligence reports, and so on. How to use this information accurately and in a timely manner has become a major concern in many areas. While general information may also include images and voice, we focus on the categorization of text data in this paper. We provide a brief overview of the information processing flow for text categorization, and discuss two supervised learning algorithms, viz., support vector machines (SVM) and partial least squares (PLS), which have been successfully applied in other domains, e.g., fault diagnosis [9]. While SVM has been well explored for binary classification and was reported as an efficient algorithm for text categorization, PLS has not yet been applied to text categorization. Our experiments are conducted on three data sets: the Reuters-21578 dataset (the corporate mergers and acquisitions category, ACQ), WebKB, and the 20-Newsgroups. Results show that the performance of PLS is comparable to SVM in text categorization. A major drawback of SVM for multi-class categorization is that it requires a voting scheme based on the results of pair-wise classification. PLS does not have this drawback and could be a better candidate for multi-class text categorization.
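
    The pairwise voting scheme cited as SVM's multi-class drawback can be sketched as follows, with a linear SVM as an illustrative base classifier:

    ```python
    # Sketch: one-vs-one multi-class SVM with majority voting over all class pairs.
    from itertools import combinations
    import numpy as np
    from sklearn.svm import LinearSVC

    def train_one_vs_one(X, y):
        """Train one binary SVM per unordered pair of classes."""
        models = {}
        for a, b in combinations(np.unique(y), 2):
            mask = (y == a) | (y == b)
            models[(a, b)] = LinearSVC().fit(X[mask], y[mask])
        return models

    def predict_vote(models, x):
        """x: a single sample with shape (1, n_features)."""
        votes = {}
        for clf in models.values():
            winner = clf.predict(x)[0]
            votes[winner] = votes.get(winner, 0) + 1
        return max(votes, key=votes.get)
    ```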

  18. Documenting Community Engagement Practices and Outcomes: Insights from Recipients of the 2010 Carnegie Community Engagement Classification

    ERIC Educational Resources Information Center

    Noel, Jana; Earwicker, David P.

    2015-01-01

    This study was performed to document the strategies and methods used by successful applicants for the 2010 Carnegie Community Engagement Classification and to document the cultural shifts connected with the application process and receipt of the Classification. Four major findings emerged: (1) Applicants benefited from a team approach; (2)…

  19. Texture for script identification.

    PubMed

    Busch, Andrew; Boles, Wageeh W; Sridharan, Sridha

    2005-11-01

    The problem of determining the script and language of a document image has a number of important applications in the field of document analysis, such as indexing and sorting of large collections of such images, or as a precursor to optical character recognition (OCR). In this paper, we investigate the use of texture as a tool for determining the script of a document image, based on the observation that text has a distinct visual texture. An experimental evaluation of a number of commonly used texture features is conducted on a newly created script database, providing a qualitative measure of which features are most appropriate for this task. Strategies for improving classification results in situations with limited training data and multiple font types are also proposed.

  1. A Bag of Concepts Approach for Biomedical Document Classification Using Wikipedia Knowledge. Spanish-English Cross-language Case Study.

    PubMed

    Mouriño-García, Marcos A; Pérez-Rodríguez, Roberto; Anido-Rifón, Luis E

    2017-10-26

    The ability to efficiently review the existing literature is essential for the rapid progress of research. This paper describes a classifier of text documents, represented as vectors in spaces of Wikipedia concepts, and analyses its suitability for the classification of Spanish biomedical documents when only English documents are available for training. We propose the cross-language concept matching (CLCM) technique, which relies on Wikipedia interlanguage links to convert concept vectors from the Spanish to the English space. The performance of the classifier is compared to several baselines: a classifier based on machine translation, a classifier that represents documents after performing Explicit Semantic Analysis (ESA), and a classifier that uses a domain-specific semantic annotator (MetaMap). The corpus used for the experiments (Cross-Language UVigoMED) was purpose-built for this study, and it is composed of 12,832 English and 2,184 Spanish MEDLINE abstracts. The performance of our approach is superior to every other state-of-the-art classifier in the benchmark, with performance increases of up to 124% over classical machine translation, 332% over MetaMap, and a factor of 60 over the classifier based on ESA. The results are statistically significant, with p-values < 0.0001. Using knowledge mined from Wikipedia to represent documents as vectors in a space of Wikipedia concepts and translating vectors between language-specific concept spaces, a cross-language classifier can be built, and it performs better than several state-of-the-art classifiers.
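
    At its core, the CLCM step is a re-mapping of concept vectors through interlanguage links. The sketch below uses a hypothetical two-entry link table; the real mapping is mined from Wikipedia.

    ```python
    # Sketch: cross-language concept matching via Wikipedia interlanguage links.
    # The link table below is a hypothetical stand-in for the real interlanguage data.
    ES_TO_EN = {
        "Cáncer de pulmón": "Lung cancer",
        "Insulina": "Insulin",
    }

    def to_english_space(es_vector):
        """Map a {Spanish concept: weight} vector into the English concept space."""
        en_vector = {}
        for concept, weight in es_vector.items():
            target = ES_TO_EN.get(concept)
            if target is not None:            # drop concepts with no interlanguage link
                en_vector[target] = en_vector.get(target, 0.0) + weight
        return en_vector
    ```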

  2. Scalable Solutions for Processing and Searching Very Large Document Collections

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Crossno, Patricia Joyce; Dunlavy, Daniel M.; Stanton, Eric T.

    This report is a summary of the accomplishments of the 'Scalable Solutions for Processing and Searching Very Large Document Collections' LDRD, which ran from FY08 through FY10. Our goal was to investigate scalable text analysis; specifically, methods for information retrieval and visualization that could scale to extremely large document collections. Towards that end, we designed, implemented, and demonstrated a scalable framework for text analysis - ParaText - as a major project deliverable. Further, we demonstrated the benefits of using visual analysis in text analysis algorithm development, improved performance of heterogeneous ensemble models in data classification problems, and the advantages of information theoretic methods in user analysis and interpretation in cross-language information retrieval. The project involved 5 members of the technical staff and 3 summer interns (including one who worked two summers). It resulted in a total of 14 publications, 3 new software libraries (2 open source and 1 internal to Sandia), several new end-user software applications, and over 20 presentations. Several follow-on projects have already begun or will start in FY11, with additional projects currently in proposal.

  3. Combining dictionary techniques with extensible markup language (XML)--requirements to a new approach towards flexible and standardized documentation.

    PubMed Central

    Altmann, U.; Tafazzoli, A. G.; Noelle, G.; Huybrechts, T.; Schweiger, R.; Wächter, W.; Dudeck, J. W.

    1999-01-01

    In oncology, various international and national standards exist for the documentation of different aspects of a disease. Since elements of these standards are repeated in different contexts, a common data dictionary could support consistent representation in any context. For the construction of such a dictionary, existing documents have to be worked up in a complex procedure that considers aspects of the hierarchical decomposition of documents and of domain control, as well as aspects of user presentation and of the underlying model of patient data. In contrast to other thesauri, text chunks like definitions or explanations are very important and have to be preserved, since oncologic documentation often means coding and classification at an aggregate level, and the safe use of coding systems is an important precondition for the comparability of data. This paper discusses the potential of the use of XML in combination with a dictionary for the promotion and development of standard-conformant applications for tumor documentation. PMID:10566311

  4. Hierarchic Agglomerative Clustering Methods for Automatic Document Classification.

    ERIC Educational Resources Information Center

    Griffiths, Alan; And Others

    1984-01-01

    Considers classifications produced by application of single linkage, complete linkage, group average, and word clustering methods to Keen and Cranfield document test collections, and studies structure of hierarchies produced, extent to which methods distort input similarity matrices during classification generation, and retrieval effectiveness…
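
    In the spirit of the comparison above, the sketch below builds hierarchies over a toy corpus with the single-linkage, complete-linkage, and group-average methods; the TF-IDF representation and the cut level are assumptions.

    ```python
    # Sketch: comparing linkage strategies for hierarchic agglomerative
    # document clustering on a toy corpus.
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["index terms and retrieval", "retrieval effectiveness tests",
            "cluster analysis of documents", "hierarchies of document clusters"]
    X = TfidfVectorizer().fit_transform(docs).toarray()
    dists = pdist(X, metric="cosine")

    for method in ("single", "complete", "average"):   # the linkage methods compared
        tree = linkage(dists, method=method)
        labels = fcluster(tree, t=2, criterion="maxclust")
        print(method, labels)
    ```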

  5. 10 CFR 1045.40 - Marking requirements.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... DEPARTMENT OF ENERGY (GENERAL PROVISIONS) NUCLEAR CLASSIFICATION AND DECLASSIFICATION Generation and Review... holder that it contains RD or FRD information, the level of classification assigned, and the additional... classification level of the document, the following notices shall appear on the front of the document, as...

  6. 10 CFR 1045.40 - Marking requirements.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... DEPARTMENT OF ENERGY (GENERAL PROVISIONS) NUCLEAR CLASSIFICATION AND DECLASSIFICATION Generation and Review... holder that it contains RD or FRD information, the level of classification assigned, and the additional... classification level of the document, the following notices shall appear on the front of the document, as...

  7. Analysis of Nature of Science Included in Recent Popular Writing Using Text Mining Techniques

    NASA Astrophysics Data System (ADS)

    Jiang, Feng; McComas, William F.

    2014-09-01

    This study examined the inclusion of nature of science (NOS) in popular science writing to determine whether it could serve as a supplementary resource for teaching NOS, and to evaluate the accuracy of text mining and classification as a viable research tool in science education research. Four groups of documents published from 2001 to 2010 were analyzed: Scientific American, Discover magazine, winners of the Royal Society Winton Prize for Science Books, and books from NSTA's list of Outstanding Science Trade Books. Computer analysis categorized passages in the selected documents based on their inclusion of NOS. Human analysis assessed the frequency, context, coverage, and accuracy of the inclusions of NOS within the computer-identified NOS passages. NOS was rarely addressed in the selected document sets, but was somewhat more frequently addressed in the letters sections of the two magazines. This result suggests that readers seem interested in the discussion of NOS-related themes. In the popular science books analyzed, NOS presentations were more likely to be aggregated at the beginning and the end of the book, rather than scattered throughout. The most commonly addressed NOS elements in the analyzed documents are science and society and empiricism in science. Only one inaccurate presentation of NOS was identified in all analyzed documents. The text mining technique demonstrated exciting performance, which invites more applications of the technique to analyze other aspects of science textbooks, popular science writing, or other materials involved in science teaching and learning.

  8. A Feature Selection Method Based on Fisher's Discriminant Ratio for Text Sentiment Classification

    NASA Astrophysics Data System (ADS)

    Wang, Suge; Li, Deyu; Wei, Yingjie; Li, Hongxia

    With the rapid growth of e-commerce, product reviews on the Web have become an important information source for customers' decision making when they intend to buy some product. As the reviews are often too many for customers to go through, how to automatically classify them into different sentiment orientation categories (i.e. positive/negative) has become a research problem. In this paper, based on Fisher's discriminant ratio, an effective feature selection method is proposed for product review text sentiment classification. In order to validate the proposed method, we compared it with other methods based on information gain and mutual information, respectively, with a support vector machine adopted as the classifier. Six sub-experiments are conducted by combining the different feature selection methods with two kinds of candidate feature sets. On 1,006 car review documents, the experimental results indicate that Fisher's discriminant ratio based on word frequency estimation has the best performance, with an F value of 83.3%, when the candidate features are the words that appear in both positive and negative texts.
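
    A sketch of the scoring step, assuming the common form of Fisher's discriminant ratio, (μ1 − μ2)² / (σ1² + σ2²), computed over per-document term frequencies; the paper's exact estimator may differ.

    ```python
    # Sketch: scoring terms by Fisher's discriminant ratio for sentiment
    # feature selection; higher scores mean more discriminative terms.
    import numpy as np

    def fisher_ratio(X, y, eps=1e-12):
        """X: docs x terms frequency matrix (numpy); y: 0/1 sentiment labels."""
        X0, X1 = X[y == 0], X[y == 1]
        mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
        var0, var1 = X0.var(axis=0), X1.var(axis=0)
        return (mu0 - mu1) ** 2 / (var0 + var1 + eps)

    # Usage: keep the k highest-scoring terms as features for the SVM classifier.
    # top_k = np.argsort(fisher_ratio(X, y))[::-1][:2000]
    ```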

  9. 17 CFR 200.507 - Declassification dates on derivative documents.

    Code of Federal Regulations, 2011 CFR

    2011-04-01

    ... derivative documents. 200.507 Section 200.507 Commodity and Securities Exchanges SECURITIES AND EXCHANGE... of National Security Information and Material § 200.507 Declassification dates on derivative... derivative document that derives its classification from the approved use of the classification guide of...

  10. 32 CFR 2001.24 - Additional requirements.

    Code of Federal Regulations, 2012 CFR

    2012-07-01

    ...,” “Secret,” and “Confidential” shall not be used to identify classified national security information. (b) Transmittal documents. A transmittal document shall indicate on its face the highest classification level of... Removed or Upon Removal of Attachments, This Document is (Classification Level) (c) Foreign government...

  11. A Review of Equation of State Models, Chemical Equilibrium Calculations and CERV Code Requirements for SHS Detonation Modelling

    DTIC Science & Technology

    2009-10-01

    parameters for a large number of species. These authors provide many sample calculations with the JCZS database incorporated in CHEETAH 2.0, including…

  12. Development and Validation of a Natural Language Processing Tool to Identify Patients Treated for Pneumonia across VA Emergency Departments.

    PubMed

    Jones, B E; South, B R; Shao, Y; Lu, C C; Leng, J; Sauer, B C; Gundlapalli, A V; Samore, M H; Zeng, Q

    2018-01-01

    Identifying pneumonia using diagnosis codes alone may be insufficient for research on clinical decision making. Natural language processing (NLP) may enable the inclusion of cases missed by diagnosis codes. This article (1) develops an NLP tool that identifies the clinical assertion of pneumonia from physician emergency department (ED) notes, and (2) compares classification methods using diagnosis codes versus NLP against a gold standard of manual chart review to identify patients initially treated for pneumonia. Among a national population of ED visits occurring between 2006 and 2012 across the Veterans Affairs health system, we extracted 811 physician documents containing search terms for pneumonia for training, and 100 random documents for validation. Two reviewers annotated span- and document-level classifications of the clinical assertion of pneumonia. An NLP tool using a support vector machine was trained on the enriched documents. We extracted diagnosis codes assigned in the ED and upon hospital discharge and calculated performance characteristics for diagnosis codes, NLP, and NLP plus diagnosis codes against manual review in the training and validation sets. Among the training documents, 51% contained clinical assertions of pneumonia; in the validation set, 9% were classified with pneumonia, of which 100% contained pneumonia search terms. After enriching with search terms, the NLP system alone demonstrated a recall/sensitivity of 0.72 (training) and 0.55 (validation), and a precision/positive predictive value (PPV) of 0.89 (training) and 0.71 (validation). ED-assigned diagnostic codes demonstrated lower recall/sensitivity (0.48 and 0.44) but higher precision/PPV (0.95 in training, 1.0 in validation); the NLP system identified more "possible-treated" cases than diagnostic coding. An approach combining NLP and ED-assigned diagnostic coding classification achieved the best performance (sensitivity 0.89 and PPV 0.80). System-wide application of NLP to clinical text can increase the capture of initial diagnostic hypotheses, an important inclusion when studying diagnosis and clinical decision making under uncertainty. Schattauer GmbH Stuttgart.
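
    The best-performing combined approach amounts to a disjunction of the two classifiers. A minimal sketch, with an assumed set of ICD-9 pneumonia code prefixes and a placeholder NLP callable:

    ```python
    # Sketch: flag a visit as pneumonia if either the NLP system or the
    # ED diagnosis codes say so. Code prefixes and names are illustrative.
    PNEUMONIA_ICD9 = {"480", "481", "482", "483", "484", "485", "486"}  # assumed

    def classify_visit(note_text, icd_codes, nlp_asserts_pneumonia):
        by_codes = any(code.split(".")[0] in PNEUMONIA_ICD9 for code in icd_codes)
        by_nlp = nlp_asserts_pneumonia(note_text)   # e.g. an SVM over note features
        return by_codes or by_nlp
    ```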

  13. Research on Classification of Chinese Text Data Based on SVM

    NASA Astrophysics Data System (ADS)

    Lin, Yuan; Yu, Hongzhi; Wan, Fucheng; Xu, Tao

    2017-09-01

    Data mining has important application value in today's industry and academia, and text classification is a very important technology within it. At present, there are many mature algorithms for text classification: kNN, NB, AB, SVM, decision trees, and other classification methods all show good performance. The support vector machine (SVM) is a well-performing classifier in machine learning research. This paper studies the classification performance of the SVM method on Chinese text data and applies SVM to classify Chinese text, aiming to connect academic research with practical application.
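
    A typical pipeline for this task is sketched below: word segmentation, TF-IDF features, and a linear SVM. The jieba segmenter and the toy corpus are assumptions; the paper does not specify its exact toolchain here.

    ```python
    # Sketch: Chinese text classification -- segment, vectorize, train an SVM.
    import jieba
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    def segment(text):
        return " ".join(jieba.cut(text))   # Chinese has no spaces; segment first

    train_texts = ["这部手机的屏幕很好", "今天股市大幅下跌"]   # toy corpus
    train_labels = ["tech", "finance"]

    vec = TfidfVectorizer()
    X = vec.fit_transform(segment(t) for t in train_texts)
    clf = LinearSVC().fit(X, train_labels)

    print(clf.predict(vec.transform([segment("屏幕显示效果不错")])))
    ```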

  14. Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations.

    PubMed

    Munkhdalai, Tsendsuren; Li, Meijing; Batsuren, Khuyagbaatar; Park, Hyeon Ah; Choi, Nak Hyeon; Ryu, Keun Ho

    2015-01-01

    Chemical and biomedical Named Entity Recognition (NER) is an essential prerequisite task before effective text mining can begin for biochemical-text data. Exploiting unlabeled text data to improve system performance has been an active and challenging research topic in text mining due to the recent growth in the amount of biomedical literature. We present a semi-supervised learning method that efficiently exploits unlabeled data in order to incorporate domain knowledge into a named entity recognition model and to improve system performance. The proposed method includes Natural Language Processing (NLP) tasks for text preprocessing, learning word representation features from a large amount of text data for feature extraction, and conditional random fields for token classification. Other than the free text in the domain, the proposed method does not rely on any lexicon or dictionary, in order to keep the system applicable to other NER tasks in bio-text data. We extended BANNER, a biomedical NER system, with the proposed method. This yields an integrated system that can be applied to chemical and drug NER or biomedical NER. We call our branch of the BANNER system BANNER-CHEMDNER; it is scalable over millions of documents, processing about 530 documents per minute, is configurable via XML, and can be plugged into other systems by using the BANNER Unstructured Information Management Architecture (UIMA) interface. BANNER-CHEMDNER achieved an 85.68% and an 86.47% F-measure on the testing sets of the CHEMDNER Chemical Entity Mention (CEM) and Chemical Document Indexing (CDI) subtasks, respectively, and achieved an 87.04% F-measure on the official testing set of the BioCreative II gene mention task, showing remarkable performance in both chemical and biomedical NER. The BANNER-CHEMDNER system is available at: https://bitbucket.org/tsendeemts/banner-chemdner.

  15. 10 CFR 95.37 - Classification and preparation of documents.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... Information must contain the identity of the source document or the classification guide, including the agency.../Exemption) Classifier: (Name/Title/Number) (2) For Restricted Data documents: (i) Identity of the classifier. The identity of the classifier must be shown by completion of the “Derivative Classifier” line. The...

  16. 10 CFR 95.37 - Classification and preparation of documents.

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ... Information must contain the identity of the source document or the classification guide, including the agency.../Exemption) Classifier: (Name/Title/Number) (2) For Restricted Data documents: (i) Identity of the classifier. The identity of the classifier must be shown by completion of the “Derivative Classifier” line. The...

  17. 10 CFR 95.37 - Classification and preparation of documents.

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ... Information must contain the identity of the source document or the classification guide, including the agency.../Exemption) Classifier: (Name/Title/Number) (2) For Restricted Data documents: (i) Identity of the classifier. The identity of the classifier must be shown by completion of the “Derivative Classifier” line. The...

  18. Detection of interaction articles and experimental methods in biomedical literature.

    PubMed

    Schneider, Gerold; Clematide, Simon; Rinaldi, Fabio

    2011-10-03

    This article describes the approaches taken by the OntoGene group at the University of Zurich in dealing with two tasks of the BioCreative III competition: classification of articles which contain curatable protein-protein interactions (PPI-ACT) and extraction of experimental methods (PPI-IMT). Two main achievements are described in this paper: (a) a system for document classification which crucially relies on the results of an advanced pipeline of natural language processing tools; (b) a system which is capable of detecting all experimental methods mentioned in scientific literature, and listing them with a competitive ranking (AUC iP/R > 0.5). The results of the BioCreative III shared evaluation clearly demonstrate that significant progress has been achieved in the domain of biomedical text mining in the past few years. Our own contribution, together with the results of other participants, provides evidence that natural language processing techniques have become by now an integral part of advanced text mining approaches.

  20. SYRIAC: The SYstematic Review Information Automated Collection System A Data Warehouse for Facilitating Automated Biomedical Text Classification

    PubMed Central

    Yang, Jianji J.; Cohen, Aaron M.; McDonagh, Marian S.

    2008-01-01

    Automatic document classification can be valuable in increasing the efficiency in updating systematic reviews (SR). In order for the machine learning process to work well, it is critical to create and maintain high-quality training datasets consisting of expert SR inclusion/exclusion decisions. This task can be laborious, especially when the number of topics is large and source data format is inconsistent. To approach this problem, we build an automated system to streamline the required steps, from initial notification of update in source annotation files to loading the data warehouse, along with a web interface to monitor the status of each topic. In our current collection of 26 SR topics, we were able to standardize almost all of the relevance judgments and recovered PMIDs for over 80% of all articles. Of those PMIDs, over 99% were correct in a manual random sample study. Our system performs an essential function in creating training and evaluation datasets for SR text mining research. PMID:18999194

  1. Computing symmetrical strength of N-grams: a two pass filtering approach in automatic classification of text documents.

    PubMed

    Agnihotri, Deepak; Verma, Kesari; Tripathi, Priyanka

    2016-01-01

    The contiguous sequences of terms (N-grams) in documents are symmetrically distributed among different classes. The symmetrical distribution of the N-grams raises uncertainty about the N-grams' belonging to a class. In this paper, we focus on the selection of the most discriminating N-grams by reducing the effects of this symmetrical distribution. In this context, a new text feature selection method, named the symmetrical strength of the N-grams (SSNG), is proposed using a two pass filtering based feature selection (TPF) approach. Initially, in the first pass of the TPF, the SSNG method chooses various informative N-grams from the entire set of N-grams extracted from the corpus. Subsequently, in the second pass, the well-known chi-square (χ2) method is used to select the few most informative N-grams. Further, to classify the documents, two standard classifiers, multinomial Naive Bayes and linear support vector machine, were applied to ten standard text datasets. On most of the datasets, the experimental results show that the performance and success rate of the SSNG method using the TPF approach are superior to state-of-the-art methods, viz. mutual information, information gain, odds ratio, discriminating feature selection, and χ2.
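
    The second filtering pass can be sketched as chi-square ranking over candidate N-grams; the first (SSNG) pass is abstracted away, and the corpus and k are illustrative.

    ```python
    # Sketch: chi-square ranking of candidate n-grams, as in the second TPF pass.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    docs = ["support vector machine text", "naive bayes text classifier",
            "vector machine kernel", "bayes rule for spam"]
    labels = [0, 1, 0, 1]

    vec = CountVectorizer(ngram_range=(1, 2))        # candidate n-grams
    X = vec.fit_transform(docs)
    selector = SelectKBest(chi2, k=5).fit(X, labels) # keep the 5 best n-grams
    best = [vec.get_feature_names_out()[i]
            for i in selector.get_support(indices=True)]
    print(best)
    ```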

  2. 10 CFR 1045.39 - Challenging classification and declassification determinations.

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ... holder of an RD or FRD document who, in good faith, believes that the RD or FRD document has an improper... classified the document. (b) Agencies shall establish procedures under which authorized holders of RD and FRD... involving RD or FRD may be appealed to the Director of Classification. In the case of FRD and RD related...

  3. 10 CFR 1045.39 - Challenging classification and declassification determinations.

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ... holder of an RD or FRD document who, in good faith, believes that the RD or FRD document has an improper... classified the document. (b) Agencies shall establish procedures under which authorized holders of RD and FRD... involving RD or FRD may be appealed to the Director of Classification. In the case of FRD and RD related...

  4. 10 CFR 1045.39 - Challenging classification and declassification determinations.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... holder of an RD or FRD document who, in good faith, believes that the RD or FRD document has an improper... classified the document. (b) Agencies shall establish procedures under which authorized holders of RD and FRD... involving RD or FRD may be appealed to the Director of Classification. In the case of FRD and RD related...

  5. Sentiment classification technology based on Markov logic networks

    NASA Astrophysics Data System (ADS)

    He, Hui; Li, Zhigang; Yao, Chongchong; Zhang, Weizhe

    2016-07-01

    With diverse online media emerging, there is growing concern with the sentiment classification problem. At present, text sentiment classification mainly relies on supervised machine learning methods, which exhibit a degree of domain dependency. On the basis of Markov logic networks (MLNs), this study proposed a cross-domain, multi-task text sentiment classification method rooted in transfer learning. Through many-to-one knowledge transfer, labeled text sentiment classification knowledge was successfully transferred into other domains, and the precision of sentiment classification in the target domain was improved. The experimental results revealed the following: (1) the MLN-based model demonstrated higher precision than the single individual learning plan model; (2) multi-task transfer learning based on Markov logic networks could acquire more knowledge than self-domain learning. The cross-domain text sentiment classification model could significantly improve the precision and efficiency of text sentiment classification.

  6. Vessel classification in overhead satellite imagery using weighted "bag of visual words"

    NASA Astrophysics Data System (ADS)

    Parameswaran, Shibin; Rainey, Katie

    2015-05-01

    Vessel type classification in maritime imagery is a challenging problem with relevance to many military and surveillance applications. The ability to classify a vessel correctly varies significantly depending on its appearance, which in turn is affected by external factors such as lighting or weather conditions, viewing geometry, and sea state. The difficulty of classifying vessels also varies among ship types, as some types of vessels show more within-class variation than others. In our previous work, we showed that the "bag of visual words" (V-BoW) is an effective feature representation for this classification task in the maritime domain. The V-BoW feature representation is analogous to the "bag of words" (BoW) representation used in information retrieval (IR) applications in the text and natural language processing (NLP) domain. It has been shown in textual IR applications that the performance of the BoW feature representation can be improved significantly by applying appropriate term weighting, such as log term frequency, inverse document frequency, etc. Given the close correspondence between textual BoW (T-BoW) and V-BoW feature representations, we propose to apply several well-known term weighting schemes from the text IR domain to the V-BoW feature representation to increase its ability to discriminate between ship types.
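
    Two of the standard IR weighting schemes mentioned above, applied to visual-word histograms, might look as follows; the exact variants the paper evaluates are not reproduced here.

    ```python
    # Sketch: log-TF x IDF weighting of bag-of-visual-words histograms,
    # transferring text-IR term weighting to image features.
    import numpy as np

    def tfidf_weight_vbow(H):
        """H: images x visual-words count matrix. Returns log-TF x IDF weights."""
        n_images = H.shape[0]
        df = np.count_nonzero(H, axis=0)          # images containing each visual word
        idf = np.log(1.0 + n_images / (1.0 + df)) # smoothed, kept positive
        tf = np.log1p(H)                          # log term frequency
        return tf * idf                           # weighted V-BoW features
    ```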

  7. MedEx/J: A One-Scan Simple and Fast NLP Tool for Japanese Clinical Texts.

    PubMed

    Aramaki, Eiji; Yano, Ken; Wakamiya, Shoko

    2017-01-01

    Because of the recent replacement of physical documents with electronic medical records (EMR), the importance of information processing in the medical field has increased. In light of this trend, we have been developing MedEx/J, which retrieves important Japanese-language information from medical reports. MedEx/J executes two tasks simultaneously: (1) term extraction, and (2) positive and negative event classification. We designate this approach a one-scan approach, providing simplicity of design and reasonable accuracy. MedEx/J's performance on the two tasks is described herein: (1) term extraction (Fβ=1 = 0.87) and (2) positive-negative classification (Fβ=1 = 0.63). This paper also presents discussion and explains remaining issues in the medical natural language processing field.

  8. Advanced Fuel Properties; A Computer Program for Estimating Property Values

    DTIC Science & Technology

    1993-05-01

    …found in fuels. Subject terms: fuel properties, physical properties, thermodynamics, predictions.

  9. Review of chart recognition in document images

    NASA Astrophysics Data System (ADS)

    Liu, Yan; Lu, Xiaoqing; Qin, Yeyang; Tang, Zhi; Xu, Jianbo

    2013-01-01

    As an effective way of transmitting information, charts are widely used to represent scientific statistics in books, research papers, newspapers, etc. Though textual information is still the major source of data, there has been an increasing trend of introducing graphs, pictures, and figures into the information pool. Text recognition in documents has been accomplished using optical character recognition (OCR) software, but chart recognition, a necessary supplement to OCR for document images, remains an unsolved problem due to the great subjectiveness and variety of chart styles. This paper reviews the development of chart recognition techniques over the past decades and presents the focuses of current research. The whole process of chart recognition is presented systematically, comprising three main parts: chart segmentation, chart classification, and chart interpretation. In each part, the latest research work is introduced. Finally, the paper concludes with a summary and promising future research directions.

  10. Design of Automatic Extraction Algorithm of Knowledge Points for MOOCs

    PubMed Central

    Chen, Haijian; Han, Dongmei; Zhao, Lina

    2015-01-01

    In recent years, Massive Open Online Courses (MOOCs) have become very popular among college students and have a powerful impact on academic institutions. In the MOOCs environment, knowledge discovery and knowledge sharing are very important, and they are currently often achieved by ontology techniques. In building ontologies, automatic extraction technology is crucial. Because general text mining algorithms do not perform well on online course materials, we designed an automatic extraction of course knowledge points (AECKP) algorithm for online courses. It includes document classification, Chinese word segmentation, and POS tagging for each document. The Vector Space Model (VSM) is used to calculate similarity, and weights are designed to optimize the TF-IDF output values; the terms with higher scores are selected as knowledge points. Course documents for "C programming language" were selected for the experiment in this study. The results show that the proposed approach achieves satisfactory accuracy and recall rates. PMID:26448738

  11. Multi-font printed Mongolian document recognition system

    NASA Astrophysics Data System (ADS)

    Peng, Liangrui; Liu, Changsong; Ding, Xiaoqing; Wang, Hua; Jin, Jianming

    2009-01-01

    Mongolian is one of the major ethnic languages in China, and large numbers of printed Mongolian documents need to be digitized for digital libraries and various applications. Traditional Mongolian script has a unique writing style and multi-font-type variations, which bring challenges to Mongolian OCR research. Because traditional Mongolian script has particular characteristics (for example, one character may be part of another character), we define the character set for recognition according to the segmented components, and the components are combined into characters by a rule-based post-processing module. For character recognition, a method based on visual directional features and multi-level classifiers is presented. For character segmentation, a scheme is used to find segmentation points by analyzing the properties of projections and connected components. As Mongolian has different font types, which are categorized into two major groups, the segmentation parameters are adjusted for each group, and a font-type classification method for the two groups is introduced. For the recognition of Mongolian text mixed with Chinese and English, language identification and the relevant character recognition kernels are integrated. Experiments show that the presented methods are effective: the text recognition rate is 96.9% on test samples from practical documents with multiple font types and mixed scripts.

  12. Monitoring nanotechnology using patent classifications: an overview and comparison of nanotechnology classification schemes

    NASA Astrophysics Data System (ADS)

    Jürgens, Björn; Herrero-Solana, Victor

    2017-04-01

    Patents are an essential information source used to monitor, track, and analyze nanotechnology. When it comes to searching for nanotechnology-related patents, a keyword search is often incomplete and struggles to cover such an interdisciplinary discipline. Patent classification schemes can yield far better results, since they are assigned by experts who classify the patent documents according to their technology. In this paper, we present the most important classifications for searching nanotechnology patents and analyze how nanotechnology is covered in the main patent classification systems used in search systems nowadays: the International Patent Classification (IPC), the United States Patent Classification (USPC), and the Cooperative Patent Classification (CPC). We conclude that nanotechnology has significantly better patent coverage in the CPC, since considerably more nanotechnology documents were retrieved than by using the other classifications, and we thus recommend its use for all professionals involved in nanotechnology patent searches.

  13. 21 CFR 880.5440 - Intravascular administration set.

    Code of Federal Regulations, 2012 CFR

    2012-04-01

    ...) Classification. Class II (special controls). The special control for pharmacy compounding systems within this classification is the FDA guidance document entitled “Class II Special Controls Guidance Document: Pharmacy Compounding Systems; Final Guidance for Industry and FDA Reviewers.” Pharmacy compounding systems classified...

  14. 21 CFR 880.5440 - Intravascular administration set.

    Code of Federal Regulations, 2011 CFR

    2011-04-01

    ...) Classification. Class II (special controls). The special control for pharmacy compounding systems within this classification is the FDA guidance document entitled “Class II Special Controls Guidance Document: Pharmacy Compounding Systems; Final Guidance for Industry and FDA Reviewers.” Pharmacy compounding systems classified...

  15. 21 CFR 880.5440 - Intravascular administration set.

    Code of Federal Regulations, 2013 CFR

    2013-04-01

    ...) Classification. Class II (special controls). The special control for pharmacy compounding systems within this classification is the FDA guidance document entitled “Class II Special Controls Guidance Document: Pharmacy Compounding Systems; Final Guidance for Industry and FDA Reviewers.” Pharmacy compounding systems classified...

  16. 21 CFR 880.5440 - Intravascular administration set.

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    ...) Classification. Class II (special controls). The special control for pharmacy compounding systems within this classification is the FDA guidance document entitled “Class II Special Controls Guidance Document: Pharmacy Compounding Systems; Final Guidance for Industry and FDA Reviewers.” Pharmacy compounding systems classified...

  17. Different approaches for identifying important concepts in probabilistic biomedical text summarization.

    PubMed

    Moradi, Milad; Ghadiri, Nasser

    2018-01-01

    Automatic text summarization tools help users in the biomedical domain to acquire their intended information from various textual resources more efficiently. Some biomedical text summarization systems base their sentence selection approach on the frequency of concepts extracted from the input text. However, exploring measures other than raw frequency for identifying valuable content within an input document, or considering the correlations existing between concepts, may be more useful for this type of summarization. In this paper, we describe a Bayesian summarization method for biomedical text documents. The Bayesian summarizer initially maps the input text to Unified Medical Language System (UMLS) concepts; then it selects the important ones to be used as classification features. We introduce six different feature selection approaches to identify the most important concepts of the text and to select the most informative content according to the distribution of these concepts. We show that, with an appropriate feature selection approach, the Bayesian summarizer can improve the performance of biomedical summarization. Using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) toolkit, we perform extensive evaluations on a corpus of scientific papers in the biomedical domain. The results show that when the Bayesian summarizer utilizes feature selection methods that do not use the raw frequency, it can outperform biomedical summarizers that rely on the frequency of concepts, as well as domain-independent and baseline methods. Copyright © 2017 Elsevier B.V. All rights reserved.

  18. LDA boost classification: boosting by topics

    NASA Astrophysics Data System (ADS)

    Lei, La; Qiao, Guo; Qimin, Cao; Qitao, Li

    2012-12-01

    AdaBoost is an efficacious classification algorithm, especially in text categorization (TC) tasks. The methodology of setting up a classifier committee and voting on the documents for classification can achieve high categorization precision. However, the traditional Vector Space Model can easily lead to the curse of dimensionality and feature sparsity problems, which seriously affect classification performance. This article proposes a novel classification algorithm called LDABoost, based on the boosting ideology, which uses Latent Dirichlet Allocation (LDA) to model the feature space. Instead of using words or phrases, LDABoost uses latent topics as the features, so the feature dimension is significantly reduced. An improved Naïve Bayes (NB) is designed as the weak classifier, which keeps the efficiency advantage of the classic NB algorithm while achieving higher precision. Moreover, a two-stage iterative weighting method called Cute Integration is proposed for improving accuracy by integrating weak classifiers into a strong classifier in a more rational way. Mutual Information is used as the metric for weight allocation, and the voting information and categorization decisions made by the basis classifiers are fully utilized when generating the strong classifier. Experimental results reveal that LDABoost performs categorization in a low-dimensional space and achieves higher accuracy than traditional AdaBoost algorithms and many other classic classification algorithms. Moreover, its runtime consumption is lower than that of different versions of AdaBoost and of TC algorithms based on support vector machines and neural networks.
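
    The pipeline below is an assumed reconstruction of the LDABoost idea using scikit-learn and toy data, not the authors' implementation: LDA topic proportions replace the sparse term space, and AdaBoost boosts a Naïve Bayes weak learner over the low-dimensional topic features.

        # Sketch: topics-as-features plus boosted Naive Bayes (assumed form).
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation
        from sklearn.ensemble import AdaBoostClassifier
        from sklearn.naive_bayes import MultinomialNB

        docs = ["stocks fell on weak earnings", "the team won the final match",
                "central bank raised interest rates", "striker scored two goals"]
        labels = ["finance", "sports", "finance", "sports"]

        counts = CountVectorizer().fit_transform(docs)
        # 10 latent topics stand in for thousands of word features.
        lda = LatentDirichletAllocation(n_components=10, random_state=0)
        topics = lda.fit_transform(counts)

        # Topic proportions are non-negative, so MultinomialNB is usable as the
        # weak learner; AdaBoost reweights training documents between rounds.
        clf = AdaBoostClassifier(MultinomialNB(), n_estimators=25, random_state=0)
        clf.fit(topics, labels)
        print(clf.predict(topics[:1]))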

  19. Text mining and natural language processing approaches for automatic categorization of lay requests to web-based expert forums.

    PubMed

    Himmel, Wolfgang; Reincke, Ulrich; Michelmann, Hans Wilhelm

    2009-07-22

    Both healthy and sick people increasingly use electronic media to obtain medical information and advice. For example, Internet users may send requests to Web-based expert forums, or so-called "ask the doctor" services. Our objective was to automatically classify lay requests to an Internet medical expert forum using a combination of different text-mining strategies. We first manually classified a sample of 988 requests directed to an involuntary childlessness forum on the German website "Rund ums Baby" ("Everything about Babies") into one or more of 38 categories belonging to two dimensions ("subject matter" and "expectations"). After creating start and synonym lists, we calculated the average Cramer's V statistic for the association of each word with each category. We also used principal component analysis and singular value decomposition as further text-mining strategies. With these measures we trained regression models and determined, on the basis of the best regression models, the probability of any request belonging to each of the 38 different categories, with a cutoff of 50%. Recall and precision on a test sample were calculated as a measure of quality for the automatic classification. According to the manual classification of 988 documents, 102 (10%) documents fell into the category "in vitro fertilization (IVF)," 81 (8%) into the category "ovulation," 79 (8%) into "cycle," and 57 (6%) into "semen analysis." These were the four most frequent categories in the subject matter dimension (consisting of 32 categories). The expectation dimension comprised six categories; we classified 533 documents (54%) as "general information" and 351 (36%) as a wish for "treatment recommendations." The generation of indicator variables based on the chi-square analysis and Cramer's V proved to be the best approach for automatic classification in about half of the categories. In combination with the two other approaches, 100% precision and 100% recall were realized in 18 (47%) out of the 38 categories in the test sample. For 35 (92%) categories, precision and recall were better than 80%. For some categories, the input variables (ie, "words") also included variables from other categories, most often with a negative sign. For example, absence of words predictive for "menstruation" was a strong indicator for the category "pregnancy test." Our approach suggests a way of automatically classifying and analyzing unstructured information in Internet expert forums. The technique can perform a preliminary categorization of new requests and help Internet medical experts to better handle the mass of information and to give professional feedback.
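
    A hedged sketch of the core association measure used above: for one word and one category, build the 2x2 presence/absence contingency table over documents and turn the chi-square statistic into Cramer's V (all data below are toy values, not the study's corpus).

        # Cramer's V for a word-category association from a 2x2 table.
        import numpy as np
        from scipy.stats import chi2_contingency

        def cramers_v(word_in_doc, doc_in_category):
            """Both arguments are boolean arrays over the same documents."""
            table = np.array([
                [np.sum(word_in_doc & doc_in_category),
                 np.sum(word_in_doc & ~doc_in_category)],
                [np.sum(~word_in_doc & doc_in_category),
                 np.sum(~word_in_doc & ~doc_in_category)],
            ])
            chi2, _, _, _ = chi2_contingency(table, correction=False)
            n = table.sum()
            k = min(table.shape) - 1  # = 1 for a 2x2 table
            return np.sqrt(chi2 / (n * k))

        # Toy usage: association of one word with one request category.
        word = np.array([True, True, False, False, True])
        cat = np.array([True, True, False, False, False])
        print(round(cramers_v(word, cat), 3))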

  20. 6 CFR 7.26 - Derivative classification.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... already classified, and marking the newly developed material consistent with the classification markings... classification decisions and carry forward to any newly created documents the pertinent classification markings. (d) Information classified derivatively from other classified information shall be classified and...

  1. Capturing patient information at nursing shift changes: methodological evaluation of speech recognition and information extraction

    PubMed Central

    Suominen, Hanna; Johnson, Maree; Zhou, Liyuan; Sanchez, Paula; Sirel, Raul; Basilakis, Jim; Hanlen, Leif; Estival, Dominique; Dawson, Linda; Kelly, Barbara

    2015-01-01

    Objective We study the use of speech recognition and information extraction to generate drafts of Australian nursing-handover documents. Methods Speech recognition correctness and clinicians’ preferences were evaluated using 15 recorder–microphone combinations, six documents, three speakers, Dragon Medical 11, and five survey/interview participants. Information extraction correctness evaluation used 260 documents, six-class classification for each word, two annotators, and the CRF++ conditional random field toolkit. Results A noise-cancelling lapel microphone with a digital voice recorder gave the best correctness (79%). This microphone was also the option preferred by all but one participant. Although the participants liked the small size of this recorder, their preference was for tablets that can also be used for document proofing and sign-off, among other tasks. Accented speech was harder to recognize than native speech, and a male speaker was recognized better than a female speaker. Information extraction was excellent at filtering out irrelevant text (85% F1) and at identifying text relevant to two classes (87% and 70% F1). Similarly to the annotators’ disagreements, there was confusion between the remaining three classes, which explains the modest 62% macro-averaged F1. Discussion We present evidence for the feasibility of speech recognition and information extraction to support clinicians in entering text and to unlock its content for computerized decision-making and surveillance in healthcare. Conclusions The benefits of this automation include storing all information; making the drafts available and accessible almost instantly to everyone with authorized access; and avoiding the information loss, delays, and misinterpretations inherent in using a ward clerk or transcription services. PMID:25336589
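
    The sketch below is not the paper's CRF++ configuration: it substitutes a per-word logistic-regression classifier over dictionary features to illustrate the word-level, multi-class formulation of handover information extraction; the labels and features are invented.

        # Simplified per-word classification stand-in for the CRF setup.
        from sklearn.feature_extraction import DictVectorizer
        from sklearn.linear_model import LogisticRegression

        def word_feats(words, i):
            w = words[i]
            return {"w": w.lower(), "is_digit": w.isdigit(),
                    "prev": words[i - 1].lower() if i else "<s>"}

        sent = "Patient John pain score 7 stable".split()
        tags = ["O", "PATIENT", "OBS", "OBS", "OBS", "STATUS"]  # toy label set

        X = [word_feats(sent, i) for i in range(len(sent))]
        vec = DictVectorizer()
        clf = LogisticRegression(max_iter=200).fit(vec.fit_transform(X), tags)
        print(clf.predict(vec.transform([word_feats(sent, 1)])))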

  2. An Automatic Multidocument Text Summarization Approach Based on Naïve Bayesian Classifier Using Timestamp Strategy

    PubMed Central

    Ramanujam, Nedunchelian; Kaliappan, Manivannan

    2016-01-01

    Nowadays, automatic multidocument text summarization systems can successfully retrieve summary sentences from input documents. However, they still have many limitations, such as inaccurate extraction of essential sentences, low coverage, poor coherence among sentences, and redundancy. This paper introduces a new timestamp approach combined with a Naïve Bayesian classification approach for multidocument text summarization. The timestamp gives the summary an ordered look, which yields a more coherent summary, and it extracts the more relevant information from the multiple documents. A scoring strategy is also used to calculate scores for words to obtain their frequencies. The higher linguistic quality is estimated in terms of readability and comprehensibility. To show the efficiency of the proposed method, this paper presents a comparison between the proposed method and the existing MEAD algorithm. The timestamp procedure is also applied to the MEAD algorithm, and the results are compared with the proposed method. The results show that the proposed method requires less time than the existing MEAD algorithm to execute the summarization process. Moreover, the proposed method achieves better precision, recall, and F-score than the existing clustering with lexical chaining approach. PMID:27034971
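
    A minimal reconstruction of the timestamp strategy under stated assumptions (word-frequency scoring, top-k extraction per document, chronological ordering of the picks); it is not the paper's implementation.

        # Sketch: frequency-scored extraction ordered by document timestamp.
        from collections import Counter

        def summarize(docs, k=1):
            """docs: list of (timestamp, [sentence, ...]) pairs."""
            words = Counter(w for _, sents in docs for s in sents
                            for w in s.lower().split())
            picked = []
            for ts, sents in docs:
                scored = sorted(sents,
                                key=lambda s: sum(words[w] for w in s.lower().split()),
                                reverse=True)
                picked.extend((ts, s) for s in scored[:k])
            # Timestamp ordering gives the summary a chronological, coherent flow.
            return [s for ts, s in sorted(picked, key=lambda p: p[0])]

        docs = [("2016-01-02", ["Rescue teams arrived.", "Weather was clear."]),
                ("2016-01-01", ["An earthquake struck the region.", "Markets were closed."])]
        print(summarize(docs))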

  3. 46 CFR 503.59 - Safeguarding classified information.

    Code of Federal Regulations, 2014 CFR

    2014-10-01

    ... classification. (b) Whenever classified material is removed from a storage facility, such material shall not be... classification of the information; and (2) The prospective recipient requires access to the information in order... documents that have been destroyed. (k) An inventory of all documents classified higher than confidential...

  4. 46 CFR 503.59 - Safeguarding classified information.

    Code of Federal Regulations, 2013 CFR

    2013-10-01

    ... classification. (b) Whenever classified material is removed from a storage facility, such material shall not be... classification of the information; and (2) The prospective recipient requires access to the information in order... documents that have been destroyed. (k) An inventory of all documents classified higher than confidential...

  5. 46 CFR 503.59 - Safeguarding classified information.

    Code of Federal Regulations, 2012 CFR

    2012-10-01

    ... classification. (b) Whenever classified material is removed from a storage facility, such material shall not be... classification of the information; and (2) The prospective recipient requires access to the information in order... documents that have been destroyed. (k) An inventory of all documents classified higher than confidential...

  6. Word Frequency Analysis. MOS: 54E. Skill Levels 1 & 2.

    DTIC Science & Technology

    1981-05-01

    [Source record abstract is OCR-garbled report front matter; recoverable keywords: MOS vocabulary, readability, comprehension of text, curriculum development.]

  7. 77 FR 32010 - Applications (Classification, Advisory, and License) and Documentation

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-05-31

    ... DEPARTMENT OF COMMERCE Bureau of Industry and Security 15 CFR Part 748 Applications (Classification, Advisory, and License) and Documentation CFR Correction In Title 15 of the Code of Federal... fourth column of the table, the two entries for "National Semiconductor Hong Kong Limited" are removed...

  8. Validation of Case Finding Algorithms for Hepatocellular Cancer From Administrative Data and Electronic Health Records Using Natural Language Processing.

    PubMed

    Sada, Yvonne; Hou, Jason; Richardson, Peter; El-Serag, Hashem; Davila, Jessica

    2016-02-01

    Accurate identification of hepatocellular cancer (HCC) cases from automated data is needed for efficient and valid quality improvement initiatives and research. We validated HCC International Classification of Diseases, 9th Revision (ICD-9) codes, and evaluated whether natural language processing by the Automated Retrieval Console (ARC) for document classification improves HCC identification. We identified a cohort of patients with ICD-9 codes for HCC during 2005-2010 from Veterans Affairs administrative data. Pathology and radiology reports were reviewed to confirm HCC. The positive predictive value (PPV), sensitivity, and specificity of ICD-9 codes were calculated. A split validation study of pathology and radiology reports was performed to develop and validate ARC algorithms. Reports were manually classified as diagnostic of HCC or not. ARC generated document classification algorithms using the Clinical Text Analysis and Knowledge Extraction System. ARC performance was compared with manual classification. PPV, sensitivity, and specificity of ARC were calculated. A total of 1138 patients with HCC were identified by ICD-9 codes. On the basis of manual review, 773 had HCC. The HCC ICD-9 code algorithm had a PPV of 0.67, sensitivity of 0.95, and specificity of 0.93. For a random subset of 619 patients, we identified 471 pathology reports for 323 patients and 943 radiology reports for 557 patients. The pathology ARC algorithm had PPV of 0.96, sensitivity of 0.96, and specificity of 0.97. The radiology ARC algorithm had PPV of 0.75, sensitivity of 0.94, and specificity of 0.68. A combined approach of ICD-9 codes and natural language processing of pathology and radiology reports improves HCC case identification in automated data.
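
    The validation arithmetic reduces to three ratios over the confusion counts. The sketch below uses the abstract's flagged/confirmed totals for illustration only; the false-positive count is inferred rather than taken from the paper, whose exact chart-review denominators yield the reported PPV of 0.67.

        # PPV, sensitivity, and specificity from confusion counts.
        def ppv(tp, fp):          return tp / (tp + fp)
        def sensitivity(tp, fn):  return tp / (tp + fn)
        def specificity(tn, fp):  return tn / (tn + fp)

        # Illustrative: 773 confirmed HCC among 1138 flagged implies 365 false
        # positives, so PPV = 773 / 1138, about 0.68 under this assumption.
        print(round(ppv(773, 365), 2))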

  9. 17 CFR 200.506 - Derivative classification.

    Code of Federal Regulations, 2012 CFR

    2012-04-01

    ... 17 Commodity and Securities Exchanges 2 2012-04-01 2012-04-01 false Derivative classification. 200...; CONDUCT AND ETHICS; AND INFORMATION AND REQUESTS Classification and Declassification of National Security Information and Material § 200.506 Derivative classification. Any document that includes paraphrases...

  10. 17 CFR 200.506 - Derivative classification.

    Code of Federal Regulations, 2013 CFR

    2013-04-01

    ... 17 Commodity and Securities Exchanges 2 2013-04-01 2013-04-01 false Derivative classification. 200...; CONDUCT AND ETHICS; AND INFORMATION AND REQUESTS Classification and Declassification of National Security Information and Material § 200.506 Derivative classification. Any document that includes paraphrases...

  11. 17 CFR 200.506 - Derivative classification.

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    ... 17 Commodity and Securities Exchanges 3 2014-04-01 2014-04-01 false Derivative classification. 200...; CONDUCT AND ETHICS; AND INFORMATION AND REQUESTS Classification and Declassification of National Security Information and Material § 200.506 Derivative classification. Any document that includes paraphrases...

  12. 17 CFR 200.506 - Derivative classification.

    Code of Federal Regulations, 2011 CFR

    2011-04-01

    ... 17 Commodity and Securities Exchanges 2 2011-04-01 2011-04-01 false Derivative classification. 200...; CONDUCT AND ETHICS; AND INFORMATION AND REQUESTS Classification and Declassification of National Security Information and Material § 200.506 Derivative classification. Any document that includes paraphrases...

  13. Gene function prediction based on the Gene Ontology hierarchical structure.

    PubMed

    Cheng, Liangxi; Lin, Hongfei; Hu, Yuncui; Wang, Jian; Yang, Zhihao

    2014-01-01

    The information in Gene Ontology annotations is helpful in the explanation of life science phenomena and can provide great support for research in the biomedical field. The use of the Gene Ontology is gradually affecting the way people store and understand bioinformatic data. To facilitate the prediction of gene functions with the aid of text mining methods and existing resources, we transform the task into a multi-label top-down classification problem and develop a method that uses the hierarchical relationships in the Gene Ontology structure to relieve the quantitative imbalance between positive and negative training samples. Meanwhile, the method enhances the discriminating ability of classifiers by retaining and highlighting the key training samples. Additionally, the top-down classifier based on a tree structure takes the relationships among target classes into consideration and thus resolves the incompatibility between the classification results and the Gene Ontology structure. Our experiment on the Gene Ontology annotation corpus achieves an F-value of 50.7% (precision: 52.7%, recall: 48.9%). The experimental results demonstrate that when the size of the training set is small, it can be expanded via topological propagation of associated documents between the parent and child nodes in the tree structure. The top-down classification model applies to sets of texts in an ontology structure or with a hierarchical relationship.
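
    A compact sketch of top-down classification over a tree, with a toy ontology and keyword rules standing in for trained classifiers: a node's subtree is explored only if its local binary classifier accepts the document, which keeps predictions consistent with the hierarchy.

        # Top-down hierarchical classification over an ontology-like tree.
        def topdown_classify(doc, node, classifiers, children):
            labels = [node]
            for child in children.get(node, []):
                if classifiers[child](doc):  # local binary decision per node
                    labels.extend(topdown_classify(doc, child, classifiers, children))
            return labels

        # Toy ontology and keyword "classifiers" for illustration only.
        children = {"root": ["binding", "transport"], "binding": ["dna_binding"]}
        classifiers = {
            "binding":     lambda d: "binds" in d,
            "dna_binding": lambda d: "dna" in d,
            "transport":   lambda d: "transports" in d,
        }
        print(topdown_classify("protein binds dna", "root", classifiers, children))
        # -> ['root', 'binding', 'dna_binding']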

  14. 32 CFR 2001.22 - Derivative classification.

    Code of Federal Regulations, 2012 CFR

    2012-07-01

    ... 32 National Defense 6 2012-07-01 2012-07-01 false Derivative classification. 2001.22 Section 2001... Identification and Markings § 2001.22 Derivative classification. (a) General. Information classified derivatively on the basis of source documents or classification guides shall bear all markings prescribed in...

  15. 32 CFR 2001.22 - Derivative classification.

    Code of Federal Regulations, 2013 CFR

    2013-07-01

    ... 32 National Defense 6 2013-07-01 2013-07-01 false Derivative classification. 2001.22 Section 2001... Identification and Markings § 2001.22 Derivative classification. (a) General. Information classified derivatively on the basis of source documents or classification guides shall bear all markings prescribed in...

  16. 10 CFR 1045.37 - Classification guides.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... 10 Energy 4 2014-01-01 2014-01-01 false Classification guides. 1045.37 Section 1045.37 Energy DEPARTMENT OF ENERGY (GENERAL PROVISIONS) NUCLEAR CLASSIFICATION AND DECLASSIFICATION Generation and Review of Documents Containing Restricted Data and Formerly Restricted Data § 1045.37 Classification guides...

  17. 10 CFR 1045.37 - Classification guides.

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ... 10 Energy 4 2013-01-01 2013-01-01 false Classification guides. 1045.37 Section 1045.37 Energy DEPARTMENT OF ENERGY (GENERAL PROVISIONS) NUCLEAR CLASSIFICATION AND DECLASSIFICATION Generation and Review of Documents Containing Restricted Data and Formerly Restricted Data § 1045.37 Classification guides...

  18. 32 CFR 2001.22 - Derivative classification.

    Code of Federal Regulations, 2014 CFR

    2014-07-01

    ... 32 National Defense 6 2014-07-01 2014-07-01 false Derivative classification. 2001.22 Section 2001... Identification and Markings § 2001.22 Derivative classification. (a) General. Information classified derivatively on the basis of source documents or classification guides shall bear all markings prescribed in...

  19. 10 CFR 1045.37 - Classification guides.

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ... 10 Energy 4 2012-01-01 2012-01-01 false Classification guides. 1045.37 Section 1045.37 Energy DEPARTMENT OF ENERGY (GENERAL PROVISIONS) NUCLEAR CLASSIFICATION AND DECLASSIFICATION Generation and Review of Documents Containing Restricted Data and Formerly Restricted Data § 1045.37 Classification guides...

  20. 10 CFR 1045.37 - Classification guides.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 10 Energy 4 2010-01-01 2010-01-01 false Classification guides. 1045.37 Section 1045.37 Energy DEPARTMENT OF ENERGY (GENERAL PROVISIONS) NUCLEAR CLASSIFICATION AND DECLASSIFICATION Generation and Review of Documents Containing Restricted Data and Formerly Restricted Data § 1045.37 Classification guides...

  1. 10 CFR 1045.37 - Classification guides.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 10 Energy 4 2011-01-01 2011-01-01 false Classification guides. 1045.37 Section 1045.37 Energy DEPARTMENT OF ENERGY (GENERAL PROVISIONS) NUCLEAR CLASSIFICATION AND DECLASSIFICATION Generation and Review of Documents Containing Restricted Data and Formerly Restricted Data § 1045.37 Classification guides...

  2. 32 CFR 2001.22 - Derivative classification.

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... 32 National Defense 6 2011-07-01 2011-07-01 false Derivative classification. 2001.22 Section 2001... Identification and Markings § 2001.22 Derivative classification. (a) General. Information classified derivatively on the basis of source documents or classification guides shall bear all markings prescribed in...

  3. 19 CFR 177.8 - Issuance of rulings.

    Code of Federal Regulations, 2012 CFR

    2012-04-01

    .... Any person engaging in a Customs transaction with respect to which a binding tariff classification ruling letter (including pre-entry classification decisions) has been issued under this part shall... tariff classification of merchandise shall set forth such classification in the documents or information...

  4. 19 CFR 177.8 - Issuance of rulings.

    Code of Federal Regulations, 2011 CFR

    2011-04-01

    .... Any person engaging in a Customs transaction with respect to which a binding tariff classification ruling letter (including pre-entry classification decisions) has been issued under this part shall... tariff classification of merchandise shall set forth such classification in the documents or information...

  5. 19 CFR 177.8 - Issuance of rulings.

    Code of Federal Regulations, 2013 CFR

    2013-04-01

    .... Any person engaging in a Customs transaction with respect to which a binding tariff classification ruling letter (including pre-entry classification decisions) has been issued under this part shall... tariff classification of merchandise shall set forth such classification in the documents or information...

  6. 19 CFR 177.8 - Issuance of rulings.

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    .... Any person engaging in a Customs transaction with respect to which a binding tariff classification ruling letter (including pre-entry classification decisions) has been issued under this part shall... tariff classification of merchandise shall set forth such classification in the documents or information...

  7. 32 CFR 1636.2 - The claim of conscientious objection.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... CLASSIFICATION OF CONSCIENTIOUS OBJECTORS § 1636.2 The claim of conscientious objection. A claim to classification in Class 1-A-0 or Class 1-0, must be made by the registrant in writing. Claims and documents in... or after the Director has made a specific request for submission of such documents. All claims or...

  8. 32 CFR 1636.2 - The claim of conscientious objection.

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... CLASSIFICATION OF CONSCIENTIOUS OBJECTORS § 1636.2 The claim of conscientious objection. A claim to classification in Class 1-A-0 or Class 1-0, must be made by the registrant in writing. Claims and documents in... or after the Director has made a specific request for submission of such documents. All claims or...

  9. Government Classification: An Overview.

    ERIC Educational Resources Information Center

    Brown, Karen M.

    Classification of government documents (confidential, secret, top secret) is a system used by the executive branch to, in part, protect national security and foreign policy interests. The systematic use of classification markings with precise definitions was established during World War I, and since 1936 major changes in classification have…

  10. 22 CFR 41.12 - Classification symbols.

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    ... Foreign Relations DEPARTMENT OF STATE VISAS VISAS: DOCUMENTATION OF NONIMMIGRANTS UNDER THE IMMIGRATION AND NATIONALITY ACT, AS AMENDED Classification of Nonimmigrants § 41.12 Classification symbols. A visa... appropriate visa symbol to show the classification of the alien. The symbol shall be inserted in the space...

  11. 22 CFR 41.12 - Classification symbols.

    Code of Federal Regulations, 2013 CFR

    2013-04-01

    ... Foreign Relations DEPARTMENT OF STATE VISAS VISAS: DOCUMENTATION OF NONIMMIGRANTS UNDER THE IMMIGRATION AND NATIONALITY ACT, AS AMENDED Classification of Nonimmigrants § 41.12 Classification symbols. A visa... appropriate visa symbol to show the classification of the alien. The symbol shall be inserted in the space...

  12. 22 CFR 41.12 - Classification symbols.

    Code of Federal Regulations, 2012 CFR

    2012-04-01

    ... Foreign Relations DEPARTMENT OF STATE VISAS VISAS: DOCUMENTATION OF NONIMMIGRANTS UNDER THE IMMIGRATION AND NATIONALITY ACT, AS AMENDED Classification of Nonimmigrants § 41.12 Classification symbols. A visa... appropriate visa symbol to show the classification of the alien. The symbol shall be inserted in the space...

  13. Coding of procedures documented by general practitioners in Swedish primary care-an explorative study using two procedure coding systems

    PubMed Central

    2012-01-01

    Background Procedures documented by general practitioners in primary care have not been studied in relation to procedure coding systems. We aimed to describe procedures documented by Swedish general practitioners in electronic patient records and to compare them to the Swedish Classification of Health Interventions (KVÅ) and SNOMED CT. Methods Procedures in 200 record entries were identified, coded, assessed in relation to two procedure coding systems and analysed. Results 417 procedures found in the 200 electronic patient record entries were coded with 36 different Classification of Health Interventions categories and 148 different SNOMED CT concepts. 22.8% of the procedures could not be coded with any Classification of Health Interventions category and 4.3% could not be coded with any SNOMED CT concept. 206 procedure-concept/category pairs were assessed as a complete match in SNOMED CT compared to 10 in the Classification of Health Interventions. Conclusions Procedures documented by general practitioners were present in nearly all electronic patient record entries. Almost all procedures could be coded using SNOMED CT. Classification of Health Interventions covered the procedures to a lesser extent and with a much lower degree of concordance. SNOMED CT is a more flexible terminology system that can be used for different purposes for procedure coding in primary care. PMID:22230095

  14. Drug related webpages classification using images and text information based on multi-kernel learning

    NASA Astrophysics Data System (ADS)

    Hu, Ruiguang; Xiao, Liping; Zheng, Wenjuan

    2015-12-01

    In this paper, multi-kernel learning (MKL) is used for drug-related webpage classification. First, body text and image-label text are extracted through HTML parsing, and valid images are chosen by the FOCARSS algorithm. Second, a text-based bag-of-words (BOW) model is used to generate text representations, and an image-based BOW model is used to generate image representations. Finally, the text and image representations are fused by several methods. Experimental results demonstrate that the classification accuracy of MKL is higher than that of all other fusion methods at the decision level and feature level, and much higher than the accuracy of single-modal classification.
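
    A simplified, feature-level stand-in for MKL under stated assumptions: random placeholder features replace the two BOW modalities, and the kernel weights are fixed at 0.5 each rather than learned, which true MKL would optimize.

        # Sketch: sum of per-modality kernels fed to a precomputed-kernel SVM.
        import numpy as np
        from sklearn.metrics.pairwise import linear_kernel
        from sklearn.svm import SVC

        rng = np.random.default_rng(0)
        X_text = rng.random((8, 50))    # stand-in text BOW features
        X_image = rng.random((8, 20))   # stand-in image BOW features
        y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

        # True MKL learns the kernel weights; here they are fixed at 0.5 each.
        K = 0.5 * linear_kernel(X_text) + 0.5 * linear_kernel(X_image)
        clf = SVC(kernel="precomputed").fit(K, y)
        print(clf.predict(K[:2]))       # rows = kernel between test and train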

  15. An ensemble heterogeneous classification methodology for discovering health-related knowledge in social media messages.

    PubMed

    Tuarob, Suppawong; Tucker, Conrad S; Salathe, Marcel; Ram, Nilam

    2014-06-01

    The role of social media as a source of timely and massive information has become more apparent since the era of Web 2.0. Multiple studies have illustrated the use of information in social media to discover biomedical and health-related knowledge. Most methods proposed in the literature employ traditional document classification techniques that represent a document as a bag of words. These techniques work well when documents are rich in text and conform to standard English; however, they are not optimal for social media data, where sparsity and noise are the norm. This paper aims to address the limitations posed by traditional bag-of-words based methods and proposes using heterogeneous features in combination with ensemble machine learning techniques to discover health-related information, which could prove useful to multiple biomedical applications, especially those needing to discover health-related knowledge in large-scale social media data. Furthermore, the proposed methodology could be generalized to discover different types of information in various kinds of textual data. Social media data are characterized by an abundance of short, social-oriented messages that do not conform to standard languages, both grammatically and syntactically. The problem of discovering health-related knowledge in social media data streams is then transformed into a text classification problem, where a text is identified as positive if it is health-related and negative otherwise. We first identify the limitations of the traditional methods, which train machines with N-gram word features, then propose to overcome these limitations by utilizing a collaboration of machine-learning-based classifiers, each of which is trained to learn a semantically different aspect of the data. The parameter analysis for tuning each classifier is also reported. Three data sets are used in this research. The first data set comprises approximately 5000 hand-labeled tweets and is used for cross-validation of the classification models in the small-scale experiment and for training the classifiers in the real-world large-scale experiment. The second data set is a random sample of real-world Twitter data in the US. The third data set is a random sample of real-world Facebook Timeline posts. Two sets of evaluations are conducted to investigate the proposed model's ability to discover health-related information in the social media domain: small-scale and large-scale evaluations. The small-scale evaluation employs 10-fold cross-validation on the labeled data, and aims to tune the parameters of the proposed models and to compare with the state-of-the-art method. The large-scale evaluation tests the trained classification models on the native, real-world data sets, and is needed to verify the ability of the proposed model to handle the massive heterogeneity in real-world social media. The small-scale experiment reveals that the proposed method is able to mitigate the limitations of well-established techniques in the literature, resulting in a performance improvement of 18.61% (F-measure). The large-scale experiment further reveals that the baseline fails to perform well on larger data with higher degrees of heterogeneity, while the proposed method is able to yield reasonably good performance and outperform the baseline by 46.62% (F-measure) on average. Copyright © 2014 Elsevier Inc. All rights reserved.
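
    A minimal sketch of the ensemble idea with assumed feature views (word counts, character n-grams, tf-idf) standing in for the paper's heterogeneous features: three logistic-regression classifiers, each trained on a different view of the same messages, vote by simple majority.

        # Sketch: heterogeneous feature views combined by majority vote.
        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
        from sklearn.linear_model import LogisticRegression

        msgs = ["got the flu, fever all night", "new phone arrived today",
                "my asthma is acting up again", "great match last night"]
        y = np.array([1, 0, 1, 0])   # 1 = health-related (toy labels)

        views = [CountVectorizer(ngram_range=(1, 1)),                     # word view
                 CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)), # char view
                 TfidfVectorizer()]                                       # weighted-term view
        models = [LogisticRegression().fit(v.fit_transform(msgs), y) for v in views]

        def vote(text):
            ballots = [m.predict(v.transform([text]))[0] for v, m in zip(views, models)]
            return int(sum(ballots) >= 2)  # simple majority over the three views

        print(vote("caught a bad cold this weekend"))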

  16. Using binary classification to prioritize and curate articles for the Comparative Toxicogenomics Database.

    PubMed

    Vishnyakova, Dina; Pasche, Emilie; Ruch, Patrick

    2012-01-01

    We report on the original integration of an automatic text categorization pipeline, called ToxiCat (Toxicogenomic Categorizer), that we developed to perform biomedical document classification and prioritization in order to speed up the curation of the Comparative Toxicogenomics Database (CTD). The task can essentially be described as a binary classification task, where a scoring function is used to rank a selected set of articles. Components of a question-answering system are then used to extract CTD-specific annotations from the ranked list of articles. The ranking function is generated using a Support Vector Machine, which combines three main modules: an information retrieval engine for MEDLINE (EAGLi), a gene normalization service (NormaGene) developed for a previous BioCreative campaign, and a set of answering components and entity recognizers for diseases and chemicals. The main components of the pipeline are publicly available both as a web application and as web services. The specific integration performed for the BioCreative competition is available via a web user interface at http://pingu.unige.ch:8080/Toxicat.
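
    An assumed sketch of the prioritization step, with toy abstracts in place of MEDLINE: a linear SVM is trained on labeled relevant/irrelevant documents and its decision values serve as the ranking score, so likely-curatable articles surface first.

        # Sketch: SVM decision values as a curation-priority ranking score.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.svm import LinearSVC

        train = ["arsenic exposure alters gene expression", "weather report for today",
                 "cadmium induces liver toxicity", "football season opens"]
        y = [1, 0, 1, 0]   # 1 = relevant to CTD-style curation (toy labels)
        new = ["benzene affects marrow cells", "recipe for pancakes"]

        vec = TfidfVectorizer().fit(train)
        svm = LinearSVC().fit(vec.transform(train), y)

        scores = svm.decision_function(vec.transform(new))
        ranked = sorted(zip(scores, new), reverse=True)  # highest score first
        for s, doc in ranked:
            print(f"{s:+.2f}  {doc}")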

  17. A comparison study between MLP and convolutional neural network models for character recognition

    NASA Astrophysics Data System (ADS)

    Ben Driss, S.; Soua, M.; Kachouri, R.; Akil, M.

    2017-05-01

    Optical Character Recognition (OCR) systems have been designed to operate on text contained in scanned documents and images. They include text detection and character recognition, in which characters are described and then classified. In the classification step, characters are identified according to their features or template descriptions; a given classifier is then employed to identify characters. In this context, we previously proposed the unified character descriptor (UCD) to represent characters based on their features, with matching employed to perform the classification. This recognition scheme achieves good OCR accuracy on homogeneous scanned documents; however, it cannot discriminate characters with high font variation and distortion. To improve recognition, classifiers based on neural networks can be used. The multilayer perceptron (MLP) ensures high recognition accuracy when robust training is performed. Moreover, the convolutional neural network (CNN) is nowadays gaining a lot of popularity for its high performance. However, both CNN and MLP may suffer from the large amount of computation in the training phase. In this paper, we establish a comparison between MLP and CNN. We provide MLP with the UCD descriptor and the appropriate network configuration. For CNN, we employ the convolutional network designed for handwritten and machine-printed character recognition (LeNet-5) and adapt it to support 62 classes, including both digits and letters. In addition, GPU parallelization is studied to speed up both the MLP and CNN classifiers. Based on our experiments, we demonstrate that the real-time CNN performs twice as well as the MLP when classifying characters.
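
    A compact PyTorch sketch of the two model families being compared, with assumed 32x32 grayscale inputs and the 62-class output named in the abstract; the CNN follows LeNet-5 only loosely.

        # MLP vs. a LeNet-5-like CNN for 62-class character recognition.
        import torch
        import torch.nn as nn

        class MLP(nn.Module):
            def __init__(self, n_classes=62):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Flatten(),
                    nn.Linear(32 * 32, 256), nn.ReLU(),
                    nn.Linear(256, n_classes))
            def forward(self, x):
                return self.net(x)

        class LeNetLike(nn.Module):
            def __init__(self, n_classes=62):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv2d(1, 6, 5), nn.ReLU(), nn.MaxPool2d(2),   # 32 -> 14
                    nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2))  # 14 -> 5
                self.classifier = nn.Sequential(
                    nn.Flatten(), nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
                    nn.Linear(120, n_classes))
            def forward(self, x):
                return self.classifier(self.features(x))

        x = torch.randn(4, 1, 32, 32)  # a dummy batch of four glyphs
        print(MLP()(x).shape, LeNetLike()(x).shape)  # both: torch.Size([4, 62])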

  18. 12 CFR 403.3 - Classification principles and authority.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... to any document is limited as follows and is nondelegable: Classification Classifier CONFIDENTIAL... 12 Banks and Banking 5 2014-01-01 2014-01-01 false Classification principles and authority. 403.3 Section 403.3 Banks and Banking EXPORT-IMPORT BANK OF THE UNITED STATES CLASSIFICATION, DECLASSIFICATION...

  19. 12 CFR 403.3 - Classification principles and authority.

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ... to any document is limited as follows and is nondelegable: Classification Classifier CONFIDENTIAL... 12 Banks and Banking 5 2013-01-01 2013-01-01 false Classification principles and authority. 403.3 Section 403.3 Banks and Banking EXPORT-IMPORT BANK OF THE UNITED STATES CLASSIFICATION, DECLASSIFICATION...

  20. 12 CFR 403.3 - Classification principles and authority.

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ... to any document is limited as follows and is nondelegable: Classification Classifier CONFIDENTIAL... 12 Banks and Banking 5 2012-01-01 2012-01-01 false Classification principles and authority. 403.3 Section 403.3 Banks and Banking EXPORT-IMPORT BANK OF THE UNITED STATES CLASSIFICATION, DECLASSIFICATION...

  1. Railroad Classification Yard Technology Manual: Volume II : Yard Computer Systems

    DOT National Transportation Integrated Search

    1981-08-01

    This volume (Volume II) of the Railroad Classification Yard Technology Manual documents the railroad classification yard computer systems methodology. The subjects covered are: functional description of process control and inventory computer systems,...

  2. 77 FR 60475 - Draft of SWGDOC Standard Classification of Typewritten Text

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-10-03

    ... DEPARTMENT OF JUSTICE Office of Justice Programs [OJP (NIJ) Docket No. 1607] Draft of SWGDOC Standard Classification of Typewritten Text AGENCY: National Institute of Justice, DOJ. ACTION: Notice and..., ``SWGDOC Standard Classification of Typewritten Text''. The opportunity to provide comments on this...

  3. Motif-Based Text Mining of Microbial Metagenome Redundancy Profiling Data for Disease Classification.

    PubMed

    Wang, Yin; Li, Rudong; Zhou, Yuhua; Ling, Zongxin; Guo, Xiaokui; Xie, Lu; Liu, Lei

    2016-01-01

    Text data of 16S rRNA are informative for classification of microbiota-associated diseases. However, the raw text data need to be systematically processed so that features for classification can be defined and extracted; moreover, the high-dimensional feature spaces generated by the text data pose an additional difficulty. Here we present a Phylogenetic Tree-Based Motif Finding algorithm (PMF) to analyze 16S rRNA text data. By integrating phylogenetic rules and other statistical indexes for classification, we can effectively reduce the dimension of the large feature spaces generated by the text datasets. Using the retrieved motifs in combination with common classification methods, we can discriminate different samples of both pneumonia and dental caries better than other existing methods. We extend the phylogenetic approaches to perform supervised learning on microbiota text data to discriminate the pathological states of pneumonia and dental caries. The results show that PMF may enhance the efficiency and reliability of analyzing high-dimensional text data.

  4. Border Lakes land-cover classification

    Treesearch

    Marvin Bauer; Brian Loeffelholz; Doug Shinneman

    2009-01-01

    This document contains metadata and description of land-cover classification of approximately 5.1 million acres of land bordering Minnesota, U.S.A. and Ontario, Canada. The classification focused on the separation and identification of specific forest-cover types. Some separation of the nonforest classes also was performed. The classification was derived from multi-...

  5. Pedoinformatics Approach to Soil Text Analytics

    NASA Astrophysics Data System (ADS)

    Furey, J.; Seiter, J.; Davis, A.

    2017-12-01

    The several extant schemes for the classification of soils rely on differing criteria, but the major soil science taxonomies, including the United States Department of Agriculture (USDA) and the internationally harmonized World Reference Base for Soil Resources systems, are based principally on inferred pedogenic properties. These taxonomies largely result from compiled individual observations of soil morphologies within soil profiles, and the vast majority of this pedologic information is contained in qualitative text descriptions. We present text mining analyses of hundreds of gigabytes of parsed text and other data in the digitally available USDA soil taxonomy documentation, the Soil Survey Geographic (SSURGO) database, and the National Cooperative Soil Survey (NCSS) soil characterization database. These analyses implemented IPython calls to Gensim modules for topic modelling, with latent semantic indexing completed down to the lowest taxon level (soil series) paragraphs. Via a custom extension of the Natural Language Toolkit (NLTK), approximately one percent of the USDA soil series descriptions were used to train a classifier for the remainder of the documents, essentially by treating soil science words as comprising a novel language. While location-specific descriptors at the soil series level are amenable to geomatics methods, unsupervised clustering of the occurrence of other soil science words did not closely follow the usual hierarchy of soil taxa. We present preliminary phrasal analyses that may account for some of these effects.
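
    A toy Gensim run of the latent semantic indexing step described above, with invented soil-description snippets standing in for the SSURGO/NCSS corpora:

        # Sketch: LSI topic modelling over tokenized soil descriptions.
        from gensim import corpora, models

        texts = [["loamy", "sand", "well", "drained"],
                 ["clay", "poorly", "drained", "wet"],
                 ["sandy", "loam", "well", "drained"]]

        dictionary = corpora.Dictionary(texts)
        corpus = [dictionary.doc2bow(t) for t in texts]

        # Two latent "topics" over the soil vocabulary; real runs use far more.
        lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
        for topic in lsi.print_topics():
            print(topic)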

  6. Significance of clustering and classification applications in digital and physical libraries

    NASA Astrophysics Data System (ADS)

    Triantafyllou, Ioannis; Koulouris, Alexandros; Zervos, Spiros; Dendrinos, Markos; Giannakopoulos, Georgios

    2015-02-01

    Applications of clustering and classification techniques can prove very significant in both digital and physical (paper-based) libraries. The most essential application, document classification and clustering, is crucial for the content that is produced and maintained in digital libraries, repositories, databases, social media, blogs, etc., based on various tags and ontology elements, transcending traditional library-oriented classification schemes. Other applications with a very useful and beneficial role in the new digital library environment involve document routing, summarization and query expansion. Paper-based libraries can benefit as well, since classification combined with advanced material characterization techniques such as FTIR (Fourier Transform InfraRed spectroscopy) can be vital for the study and prevention of material deterioration. An improved two-level self-organizing clustering architecture is proposed in order to enhance the discrimination capacity of the learning space prior to classification, yielding promising results when applied to the above-mentioned library tasks.

  7. Machine learning approaches to diagnosis and laterality effects in semantic dementia discourse.

    PubMed

    Garrard, Peter; Rentoumi, Vassiliki; Gesierich, Benno; Miller, Bruce; Gorno-Tempini, Maria Luisa

    2014-06-01

    Advances in automatic text classification have been necessitated by the rapid increase in the availability of digital documents. Machine learning (ML) algorithms can 'learn' from data: for instance, an ML system can be trained on a set of features derived from written texts belonging to known categories, and learn to distinguish between them. Such a trained system can then be used to classify unseen texts. In this paper, we explore the potential of the technique to classify transcribed speech samples along clinical dimensions, using vocabulary data alone. We report the accuracy with which two related ML algorithms [naive Bayes Gaussian (NBG) and naive Bayes multinomial (NBM)] categorized picture descriptions produced by: 32 semantic dementia (SD) patients versus 10 healthy, age-matched controls; and SD patients with left- (n = 21) versus right-predominant (n = 11) patterns of temporal lobe atrophy. We used information gain (IG) to identify the vocabulary features that were most informative to each of these two distinctions. In the SD versus control classification task, both algorithms achieved accuracies of greater than 90%. In the right- versus left-temporal lobe predominant classification, NBM achieved a high level of accuracy (88%), but this was achieved by both NBM and NBG when the features used in the training set were restricted to those with high values of IG. The most informative features for the patient versus control task were low frequency content words, generic terms and components of metanarrative statements. For the right versus left task the number of informative lexical features was too small to support any specific inferences. An enriched feature set, including values derived from Quantitative Production Analysis (QPA), may shed further light on this little-understood distinction. Copyright © 2013 Elsevier Ltd. All rights reserved.
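
    A hedged sketch of the vocabulary-based setup on invented transcripts: information gain is approximated here by scikit-learn's mutual information, the feature set is reduced to the most informative words, and the Gaussian and multinomial Naïve Bayes variants are fit side by side.

        # Sketch: IG-style feature selection, then NBG and NBM on word counts.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.feature_selection import SelectKBest, mutual_info_classif
        from sklearn.naive_bayes import GaussianNB, MultinomialNB

        samples = ["the thing on the er thing", "a cat sits by the window",
                   "some stuff and er items there", "a dog plays in the garden"]
        y = [1, 0, 1, 0]   # 1 = patient, 0 = control (toy labels)

        X = CountVectorizer().fit_transform(samples)
        # Keep only the most informative vocabulary features, analogous to
        # restricting the training set to high-IG words.
        X_sel = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)

        nbm = MultinomialNB().fit(X_sel, y)          # multinomial NB on counts
        nbg = GaussianNB().fit(X_sel.toarray(), y)   # Gaussian NB needs dense input
        print(nbm.predict(X_sel[:1]), nbg.predict(X_sel[:1].toarray()))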

  8. Multi-Agent Information Classification Using Dynamic Acquaintance Lists.

    ERIC Educational Resources Information Center

    Mukhopadhyay, Snehasis; Peng, Shengquan; Raje, Rajeev; Palakal, Mathew; Mostafa, Javed

    2003-01-01

    Discussion of automated information services focuses on information classification and collaborative agents, i.e. intelligent computer programs. Highlights include multi-agent systems; distributed artificial intelligence; thesauri; document representation and classification; agent modeling; acquaintances, or remote agents discovered through…

  9. Methodological Issues in Predicting Pediatric Epilepsy Surgery Candidates Through Natural Language Processing and Machine Learning

    PubMed Central

    Cohen, Kevin Bretonnel; Glass, Benjamin; Greiner, Hansel M.; Holland-Bouley, Katherine; Standridge, Shannon; Arya, Ravindra; Faist, Robert; Morita, Diego; Mangano, Francesco; Connolly, Brian; Glauser, Tracy; Pestian, John

    2016-01-01

    Objective: We describe the development and evaluation of a system that uses machine learning and natural language processing techniques to identify potential candidates for surgical intervention for drug-resistant pediatric epilepsy. The data comprise free-text clinical notes extracted from the electronic health record (EHR). Both known clinical outcomes from the EHR and manual chart annotations provide gold standards for the patient’s status. The following hypotheses are then tested: 1) machine learning methods can identify epilepsy surgery candidates as well as physicians do, and 2) machine learning methods can identify candidates earlier than physicians do. These hypotheses are tested by systematically evaluating the effects of the data source, amount of training data, class balance, classification algorithm, and feature set on classifier performance. The results support both hypotheses, with F-measures ranging from 0.71 to 0.82. The feature set, classification algorithm, amount of training data, class balance, and gold standard all significantly affected classification performance. It was further observed that classification performance was better than the highest agreement between two annotators, even at one year before documented surgery referral. The results demonstrate that such machine learning methods can contribute to predicting pediatric epilepsy surgery candidates and to reducing lag time to surgery referral. PMID:27257386

  10. A framework for biomedical figure segmentation towards image-based document retrieval

    PubMed Central

    2013-01-01

    The figures included in many of the biomedical publications play an important role in understanding the biological experiments and facts described within. Recent studies have shown that it is possible to integrate the information that is extracted from figures in classical document classification and retrieval tasks in order to improve their accuracy. One important observation about the figures included in biomedical publications is that they are often composed of multiple subfigures or panels, each describing different methodologies or results. The use of these multimodal figures is a common practice in bioscience, as experimental results are graphically validated via multiple methodologies or procedures. Thus, for a better use of multimodal figures in document classification or retrieval tasks, as well as for providing the evidence source for derived assertions, it is important to automatically segment multimodal figures into subfigures and panels. This is a challenging task, however, as different panels can contain similar objects (e.g., bar charts and line charts) with multiple layouts. Also, certain types of biomedical figures are text-heavy (e.g., DNA sequence and protein sequence images) and differ from traditional images. As a result, classical image segmentation techniques based on low-level image features, such as edges or color, are not directly applicable to robustly partition multimodal figures into single-modal panels. In this paper, we describe a robust solution for automatically identifying and segmenting unimodal panels from a multimodal figure. Our framework starts by robustly harvesting figure-caption pairs from biomedical articles. We base our approach on the observation that the document layout can be used to identify encoded figures and figure boundaries within PDF files. Taking the document layout into consideration allows us to correctly extract figures from the PDF document and associate their corresponding captions. We combine pixel-level representations of the extracted images with information gathered from their corresponding captions to estimate the number of panels in the figure. Thus, our approach simultaneously identifies the number of panels and the layout of figures. In order to evaluate the approach described here, we applied our system on documents containing protein-protein interactions (PPIs) and compared the results against a gold standard that was annotated by biologists. Experimental results showed that our automatic figure segmentation approach surpasses pure caption-based and image-based approaches, achieving a 96.64% accuracy. To allow for efficient retrieval of information, as well as to provide the basis for integration into document classification and retrieval systems, among other uses, we further developed a web-based interface that lets users easily retrieve panels containing the terms specified in the user queries. PMID:24565394

  11. Text Mining and Natural Language Processing Approaches for Automatic Categorization of Lay Requests to Web-Based Expert Forums

    PubMed Central

    Reincke, Ulrich; Michelmann, Hans Wilhelm

    2009-01-01

    Background Both healthy and sick people increasingly use electronic media to obtain medical information and advice. For example, Internet users may send requests to Web-based expert forums, or so-called “ask the doctor” services. Objective To automatically classify lay requests to an Internet medical expert forum using a combination of different text-mining strategies. Methods We first manually classified a sample of 988 requests directed to an involuntary childlessness forum on the German website “Rund ums Baby” (“Everything about Babies”) into one or more of 38 categories belonging to two dimensions (“subject matter” and “expectations”). After creating start and synonym lists, we calculated the average Cramer’s V statistic for the association of each word with each category. We also used principal component analysis and singular value decomposition as further text-mining strategies. With these measures we trained regression models and determined, on the basis of the best regression models, the probability of any request belonging to each of the 38 different categories, with a cutoff of 50%. Recall and precision on a test sample were calculated as a measure of quality for the automatic classification. Results According to the manual classification of 988 documents, 102 (10%) documents fell into the category “in vitro fertilization (IVF),” 81 (8%) into the category “ovulation,” 79 (8%) into “cycle,” and 57 (6%) into “semen analysis.” These were the four most frequent categories in the subject matter dimension (consisting of 32 categories). The expectation dimension comprised six categories; we classified 533 documents (54%) as “general information” and 351 (36%) as a wish for “treatment recommendations.” The generation of indicator variables based on the chi-square analysis and Cramer’s V proved to be the best approach for automatic classification in about half of the categories. In combination with the two other approaches, 100% precision and 100% recall were realized in 18 (47%) out of the 38 categories in the test sample. For 35 (92%) categories, precision and recall were better than 80%. For some categories, the input variables (ie, “words”) also included variables from other categories, most often with a negative sign. For example, absence of words predictive for “menstruation” was a strong indicator for the category “pregnancy test.” Conclusions Our approach suggests a way of automatically classifying and analyzing unstructured information in Internet expert forums. The technique can perform a preliminary categorization of new requests and help Internet medical experts to better handle the mass of information and to give professional feedback. PMID:19632978

  12. 32 CFR 2700.12 - Criteria for and level of original classification.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... Criteria for and level of original classification. (a) General Policy. Documents or other material are to... authorized or shall have force. (d) Unnecessary classification, and classification at a level higher than is... 32 National Defense 6 2010-07-01 2010-07-01 false Criteria for and level of original...

  13. Prominent feature extraction for review analysis: an empirical study

    NASA Astrophysics Data System (ADS)

    Agarwal, Basant; Mittal, Namita

    2016-05-01

    Sentiment analysis (SA) research has increased tremendously in recent times. SA aims to determine the sentiment orientation of a given text as positive or negative polarity. The motivation for SA research is the industry's need to know users' opinions about its products from online portals, blogs, discussion boards, reviews and so on. Efficient features need to be extracted for machine-learning algorithms to achieve better sentiment classification. In this paper, various features are initially extracted from the text, such as unigrams, bi-grams and dependency features. In addition, new bi-tagged features are extracted that conform to predefined part-of-speech patterns. Furthermore, various composite features are created using these features. Information gain (IG) and minimum redundancy maximum relevancy (mRMR) feature selection methods are used to eliminate noisy and irrelevant features from the feature vector. Finally, machine-learning algorithms are used to classify the review document into the positive or negative class. The effects of different categories of features are investigated on four standard data-sets: movie review and product (book, DVD and electronics) review data-sets. Experimental results show that composite features created from prominent unigram and bi-tagged features perform better than other features for sentiment classification, and that mRMR is a better feature selection method than IG for sentiment classification. The Boolean Multinomial Naïve Bayes algorithm performs better than the support vector machine classifier for SA in terms of both accuracy and execution time.
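
    A sketch of the feature pipeline under stated assumptions: unigram-plus-bigram Boolean features on toy reviews, with univariate mutual information standing in for the paper's IG/mRMR selection, followed by a Boolean multinomial Naïve Bayes classifier.

        # Sketch: n-gram Boolean features, feature selection, Boolean MNB.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.feature_selection import SelectKBest, mutual_info_classif
        from sklearn.naive_bayes import MultinomialNB

        reviews = ["a truly great movie", "not good at all, boring plot",
                   "wonderful acting and story", "terrible, not worth watching"]
        y = [1, 0, 1, 0]   # 1 = positive, 0 = negative (toy labels)

        # binary=True gives the Boolean (presence/absence) feature variant.
        vec = CountVectorizer(ngram_range=(1, 2), binary=True)
        X = vec.fit_transform(reviews)
        X_sel = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)

        clf = MultinomialNB().fit(X_sel, y)
        print(clf.predict(X_sel[:1]))  # sanity check on a training row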

  14. Wing Classification in the Virtual Research Center

    NASA Technical Reports Server (NTRS)

    Campbell, William H.

    1999-01-01

    The Virtual Research Center (VRC) is a Web site that hosts a database of documents organized to allow teams of scientists and engineers to store and maintain documents. A number of other workgroup-related capabilities are provided. My tasks as a NASA/ASEE Summer Faculty Fellow included developing a scheme for classifying the workgroups that use the VRC according to the various Divisions within NASA's Enterprises. To this end I developed a plan to use several CGI Perl scripts to gather classification information from the leaders of the workgroups, and to display all the workgroups within a specified classification. I designed, implemented, and partially tested scripts which can be used to do the classification. I was also asked to consider directions for future development of the VRC. I think that the VRC can use XML to advantage. XML is a markup language with designer tags that can be used to build meaning into documents. An investigation as to how CORBA, an object-oriented object request broker included with JDK 1.2, might be used also seems justified.

  15. Topic Detection in Online Chat

    DTIC Science & Technology

    2009-09-01

    [Source record abstract is OCR-garbled report documentation and figure/table listings; recoverable references: LDA topic models built from textbook-author and author-author documents, radial-kernel classifier results.]

  16. Data Needs in Vocational Education. "The Development of a Minimal Information System to Satisfy the Needs of Selected User Groups." Final Report. Volume III. Project EDNEED Lexicon.

    ERIC Educational Resources Information Center

    Nerden, J. T.; And Others

    Designed for the exclusive purpose of accompanying the Project EDNEED (Empirical Determination of Nationally Essential Educational Data) classification document, this volume comprises the third of a five-volume final report. It provides uniform definitions for vocational education terms found in the EDNEED classification document, and aids in…

  17. 75 FR 69143 - Postal Rate and Classification Changes

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-11-10

    ...This document addresses a recently filed Postal Service request for three postal rate and classification changes. One change will affect certain senders of First-Class Mail Presort and Automation Letters. Another change will affect Standard Mail and High Density mailers. The third change affects the Move Update Charge threshold. This document provides details about the anticipated changes and addresses the procedural steps associated with this filing.

  18. Field sampling and data analysis methods for development of ecological land classifications: an application on the Manistee National Forest.

    Treesearch

    George E. Host; Carl W. Ramm; Eunice A. Padley; Kurt S. Pregitzer; James B. Hart; David T. Cleland

    1992-01-01

    Presents technical documentation for development of an Ecological Classification System for the Manistee National Forest in northwest Lower Michigan, and suggests procedures applicable to other ecological land classification projects. Includes discussion of sampling design, field data collection, data summarization and analyses, development of classification units,...

  19. Multi-dimensional classification of biomedical text: Toward automated, practical provision of high-utility text to diverse users

    PubMed Central

    Shatkay, Hagit; Pan, Fengxia; Rzhetsky, Andrey; Wilbur, W. John

    2008-01-01

    Motivation: Much current research in biomedical text mining is concerned with serving biologists by extracting certain information from scientific text. We note that there is no ‘average biologist’ client; different users have distinct needs. For instance, as noted in past evaluation efforts (BioCreative, TREC, KDD), database curators are often interested in sentences showing experimental evidence and methods. Conversely, lab scientists searching for known information about a protein may seek facts, typically stated with high confidence. Text-mining systems can target specific end-users and become more effective if the system can first identify text regions rich in the type of scientific content that is of interest to the user, retrieve documents that have many such regions, and focus on fact extraction from these regions. Here, we study the ability to characterize and classify such text automatically. We have recently introduced a multi-dimensional categorization and annotation scheme, developed to be applicable to a wide variety of biomedical documents and scientific statements, while intended to support specific biomedical retrieval and extraction tasks. Results: The annotation scheme was applied to a large corpus in a controlled effort by eight independent annotators, where three individual annotators independently tagged each sentence. We then trained and tested machine learning classifiers to automatically categorize sentence fragments based on the annotation. We discuss here the issues involved in this task, and present an overview of the results. The latter strongly suggest that automatic annotation along most of the dimensions is highly feasible, and that this new framework for scientific sentence categorization is applicable in practice. Contact: shatkay@cs.queensu.ca PMID:18718948
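
    To make the multi-dimensional idea concrete, here is a hedged sketch that trains one independent classifier per annotation dimension; the dimension names, fragments and labels below are invented for illustration and do not reproduce the paper's scheme or corpus.

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.pipeline import make_pipeline

      fragments = [
          "Western blot analysis confirmed binding of protein X.",
          "These results suggest a possible regulatory role.",
          "Gel shift assays demonstrated a direct interaction.",
          "The mechanism remains to be elucidated.",
      ]
      # One label vector per dimension of the annotation scheme.
      labels_by_dimension = {
          "evidence":  [1, 0, 1, 0],  # experimental evidence reported?
          "certainty": [1, 0, 1, 0],  # stated with high confidence?
      }

      models = {}
      for dim, y in labels_by_dimension.items():
          clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
          clf.fit(fragments, y)
          models[dim] = clf

      test = ["Co-immunoprecipitation confirmed the complex."]
      for dim, clf in models.items():
          print(dim, clf.predict(test))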

  20. Classification of Instructional Programs, 1990 Edition.

    ERIC Educational Resources Information Center

    Morgan, Robert L.; And Others

    This document, the Department of Education's standard educational program classification system for secondary and postsecondary schools, supersedes all previous editions. The manual is divided into seven chapters, each of which contains, in numerical order, the complete list of currently active Classification of Instructional Programs (CIP)…

  1. Free-Text Disease Classification

    DTIC Science & Technology

    2011-09-01

    [Only thesis front matter is legible in this record: a Naval Postgraduate School thesis, "Free-Text Disease Classification," by Craig Maxey, September 2011; thesis advisor Lyn R. Whitaker, second reader Samuel E. Buttrey.]

  2. Culto: AN Ontology-Based Annotation Tool for Data Curation in Cultural Heritage

    NASA Astrophysics Data System (ADS)

    Garozzo, R.; Murabito, F.; Santagati, C.; Pino, C.; Spampinato, C.

    2017-08-01

    This paper proposes CulTO, a software tool relying on a computational ontology for Cultural Heritage domain modelling, with a specific focus on religious historical buildings, for supporting cultural heritage experts in their investigations. It is specifically thought to support annotation, automatic indexing, classification and curation of photographic data and text documents of historical buildings. CULTO also serves as a useful tool for Historical Building Information Modeling (H-BIM) by enabling semantic 3D data modeling and further enrichment with non-geometrical information of historical buildings through the inclusion of new concepts about historical documents, images, decay or deformation evidence as well as decorative elements into BIM platforms. CulTO is the result of a joint research effort between the Laboratory of Surveying and Architectural Photogrammetry "Luigi Andreozzi" and the PeRCeiVe Lab (Pattern Recognition and Computer Vision Lab) of the University of Catania,

  3. A Method for Extracting Important Segments from Documents Using Support Vector Machines

    NASA Astrophysics Data System (ADS)

    Suzuki, Daisuke; Utsumi, Akira

    In this paper we propose an extraction-based method for automatic summarization. The proposed method consists of two processes: important segment extraction and sentence compaction. The process of important segment extraction classifies each segment in a document as important or not by Support Vector Machines (SVMs). The process of sentence compaction then determines grammatically appropriate portions of a sentence for a summary according to its dependency structure and the classification result by SVMs. To test the performance of our method, we conducted an evaluation experiment using the Text Summarization Challenge (TSC-1) corpus of human-prepared summaries. The result was that our method achieved better performance than a segment-extraction-only method and the Lead method, especially for sentences only a part of which was included in human summaries. Further analysis of the experimental results suggests that a hybrid method that integrates sentence extraction with segment extraction may generate better summaries.
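
    A minimal sketch of the segment-classification step, assuming scikit-learn; the segments, labels and features are toy stand-ins (the paper's SVMs use richer features, and the subsequent compaction step over dependency structures is omitted).

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.pipeline import make_pipeline
      from sklearn.svm import LinearSVC

      segments = [
          "the government announced a new policy on emissions",
          "as had long been expected by observers",
          "the policy takes effect next April",
          "in other news",
      ]
      important = [1, 0, 1, 0]  # 1 = include in summary

      svm = make_pipeline(TfidfVectorizer(), LinearSVC())
      svm.fit(segments, important)
      summary = [s for s in segments if svm.predict([s])[0] == 1]
      print(" / ".join(summary))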

  4. Land use and land cover digital data

    USGS Publications Warehouse

    Fegeas, Robin G.; Claire, Robert W.; Guptill, Stephen C.; Anderson, K. Eric; Hallam, Cheryl A.

    1983-01-01

    The discipline of cartography is undergoing a number of profound changes that center on the emerging influence of digital manipulation and analysis of data for the preparation of cartographic materials and for use in geographic information systems. Operational requirements have led to the development by the USGS National Mapping Division of several documents that establish in-house digital cartographic standards. In an effort to fulfill lead agency requirements for promulgation of Federal standards in the earth sciences, the documents have been edited and assembled with explanatory text into a USGS Circular. This Circular describes some of the pertinent issues relative to digital cartographic data standards, documents the digital cartographic data standards currently in use within the USGS, and details the efforts of the USGS related to the definition of national digital cartographic data standards. It consists of several chapters; the first is a general overview, and each succeeding chapter is made up from documents that establish in-house standards for one of the various types of digital cartographic data currently produced. This chapter, 895-E, describes the Geographic Information Retrieval and Analysis System that is used in conjunction with the USGS land use and land cover classification system to encode, edit, manipulate, and analyze land use and land cover digital data.

  5. Drug interaction databases in medical literature: transparency of ownership, funding, classification algorithms, level of documentation, and staff qualifications. A systematic review.

    PubMed

    Kongsholm, Gertrud Gansmo; Nielsen, Anna Katrine Toft; Damkier, Per

    2015-11-01

    It is well documented that drug-drug interaction databases (DIDs) differ substantially with respect to the classification of drug-drug interactions (DDIs). The aim of this study was to assess the online transparency of ownership, funding, information sources, classification algorithms, staff training, and underlying documentation of the five most commonly used open access English-language online DIDs and the three most commonly used subscription English-language online DIDs in the literature. We conducted a systematic literature search to identify the five most commonly used open access and the three most commonly used subscription DIDs in the medical literature. The following parameters were assessed for each of the databases: ownership, classification of interactions, primary information sources, and staff qualifications. We compared the overall proportion of yes/no answers from open access databases and subscription databases by Fisher's exact test, both prior to and after requesting missing information. Among open access DIDs, 20/60 items could be verified from the webpage directly, compared to 24/36 for the subscription DIDs (p = 0.0028). Following personal request, these numbers rose to 22/60 and 30/36, respectively (p < 0.0001). For items within the "classification of interaction" domain, the proportions were 3/25 versus 11/15 available from the webpage (p = 0.0001) and 3/25 versus 15/15 (p < 0.0001) available upon personal request. The transparency of ownership, funding, information sources, classifications, staff training, and underlying documentation available online varies substantially among DIDs. Open access DIDs scored statistically lower on the parameters assessed.

  6. Acute Radiation Sickness Amelioration Analysis

    DTIC Science & Technology

    1994-05-01

    [Only report documentation page fragments are legible in this record; the recoverable text notes that approximately 2000 documents relevant to the development of the candidate anti-emetic drugs ondansetron (Zofran, Glaxo Pharmaceuticals) and granisetron were reviewed.]

  7. Local Area Network (LAN) Compatibility Issues

    DTIC Science & Technology

    1991-09-01

    [Only thesis front matter is legible in this record: a September 1991 Naval Postgraduate School thesis on Local Area Network (LAN) compatibility issues, advised by Dr. Norman Schneidewind; approved for public release, distribution unlimited.]

  8. Glossary: Defense Acquisition Acronyms and Terms. Revision 2

    DTIC Science & Technology

    1987-07-01

    [Only report documentation page front matter is legible in this record for this glossary of defense acquisition acronyms and terms (Fort Belvoir, VA).]

  9. 10 CFR 1044.08 - Do you have to submit the documents for classification review before you give them to someone?

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... 10 Energy 4 2014-01-01 2014-01-01 false Do you have to submit the documents for classification review before you give them to someone? 1044.08 Section 1044.08 Energy DEPARTMENT OF ENERGY (GENERAL PROVISIONS) SECURITY REQUIREMENTS FOR PROTECTED DISCLOSURES UNDER SECTION 3164 OF THE NATIONAL DEFENSE AUTHORIZATION ACT FOR FISCAL YEAR 2000 § 1044.08...

  10. 10 CFR 1044.08 - Do you have to submit the documents for classification review before you give them to someone?

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 10 Energy 4 2011-01-01 2011-01-01 false Do you have to submit the documents for classification review before you give them to someone? 1044.08 Section 1044.08 Energy DEPARTMENT OF ENERGY (GENERAL PROVISIONS) SECURITY REQUIREMENTS FOR PROTECTED DISCLOSURES UNDER SECTION 3164 OF THE NATIONAL DEFENSE AUTHORIZATION ACT FOR FISCAL YEAR 2000 § 1044.08...

  11. 10 CFR 1044.08 - Do you have to submit the documents for classification review before you give them to someone?

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ... 10 Energy 4 2012-01-01 2012-01-01 false Do you have to submit the documents for classification review before you give them to someone? 1044.08 Section 1044.08 Energy DEPARTMENT OF ENERGY (GENERAL PROVISIONS) SECURITY REQUIREMENTS FOR PROTECTED DISCLOSURES UNDER SECTION 3164 OF THE NATIONAL DEFENSE AUTHORIZATION ACT FOR FISCAL YEAR 2000 § 1044.08...

  12. 10 CFR 1044.08 - Do you have to submit the documents for classification review before you give them to someone?

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 10 Energy 4 2010-01-01 2010-01-01 false Do you have to submit the documents for classification review before you give them to someone? 1044.08 Section 1044.08 Energy DEPARTMENT OF ENERGY (GENERAL PROVISIONS) SECURITY REQUIREMENTS FOR PROTECTED DISCLOSURES UNDER SECTION 3164 OF THE NATIONAL DEFENSE AUTHORIZATION ACT FOR FISCAL YEAR 2000 § 1044.08...

  13. 10 CFR 1044.08 - Do you have to submit the documents for classification review before you give them to someone?

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ... 10 Energy 4 2013-01-01 2013-01-01 false Do you have to submit the documents for classification review before you give them to someone? 1044.08 Section 1044.08 Energy DEPARTMENT OF ENERGY (GENERAL PROVISIONS) SECURITY REQUIREMENTS FOR PROTECTED DISCLOSURES UNDER SECTION 3164 OF THE NATIONAL DEFENSE AUTHORIZATION ACT FOR FISCAL YEAR 2000 § 1044.08...

  14. The effects of pre-processing strategies in sentiment analysis of online movie reviews

    NASA Astrophysics Data System (ADS)

    Zin, Harnani Mat; Mustapha, Norwati; Murad, Masrah Azrifah Azmi; Sharef, Nurfadhlina Mohd

    2017-10-01

    With the ever-increasing number of internet applications and social networking sites, people nowadays can easily express their feelings towards any product or service. These online reviews act as an important source for further analysis and improved decision making. The reviews are mostly unstructured by nature and thus need processing, such as sentiment analysis and classification, to provide meaningful information for future use. In text analysis tasks, the appropriate selection of words/features has a huge impact on the effectiveness of the classifier. Thus, this paper explores the effect of pre-processing strategies on the sentiment analysis of online movie reviews. Supervised machine learning was used to classify the reviews, with a support vector machine (SVM) using linear and non-linear kernels as the classifier. The performance of the classifier is critically examined based on precision, recall, f-measure, and accuracy. Two different feature representations were used: term frequency and term frequency-inverse document frequency. Results show that the pre-processing strategies have a significant impact on the classification process.
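
    An illustrative comparison of the two representations and kernels mentioned above, assuming scikit-learn; the corpus is a toy placeholder, and a real experiment would insert the pre-processing steps under study (e.g. stop-word removal, stemming) before vectorization.

      from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
      from sklearn.pipeline import make_pipeline
      from sklearn.svm import SVC

      reviews = ["loved every minute of this film",
                 "a tedious and forgettable movie",
                 "brilliant performances throughout",
                 "weak script and bad pacing"]
      labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

      # Term frequency vs. tf-idf, each with a linear and a non-linear (RBF) SVM.
      for vec_name, vec_cls in [("tf", CountVectorizer), ("tf-idf", TfidfVectorizer)]:
          for kernel in ("linear", "rbf"):
              model = make_pipeline(vec_cls(), SVC(kernel=kernel))
              model.fit(reviews, labels)
              print(vec_name, kernel, model.predict(["a forgettable script"]))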

  15. Clinical research data warehouse governance for distributed research networks in the USA: a systematic review of the literature

    PubMed Central

    Holmes, John H; Elliott, Thomas E; Brown, Jeffrey S; Raebel, Marsha A; Davidson, Arthur; Nelson, Andrew F; Chung, Annie; La Chance, Pierre; Steiner, John F

    2014-01-01

    Objective To review the published, peer-reviewed literature on clinical research data warehouse governance in distributed research networks (DRNs). Materials and methods Medline, PubMed, EMBASE, CINAHL, and INSPEC were searched for relevant documents published through July 31, 2013 using a systematic approach. Only documents relating to DRNs in the USA were included. Documents were analyzed using a classification framework consisting of 10 facets to identify themes. Results 6641 documents were retrieved. After screening for duplicates and relevance, 38 were included in the final review. A peer-reviewed literature on data warehouse governance is emerging, but is still sparse. Peer-reviewed publications on UK research network governance were more prevalent, although not reviewed for this analysis. All 10 classification facets were used, with some documents falling into two or more classifications. No document addressed costs associated with governance. Discussion Even though DRNs are emerging as vehicles for research and public health surveillance, understanding of DRN data governance policies and procedures is limited. This is expected to change as more DRN projects disseminate their governance approaches as publicly available toolkits and peer-reviewed publications. Conclusions While peer-reviewed, US-based DRN data warehouse governance publications have increased, DRN developers and administrators are encouraged to publish information about these programs. PMID:24682495

  16. Differential Topic Models.

    PubMed

    Chen, Changyou; Buntine, Wray; Ding, Nan; Xie, Lexing; Du, Lan

    2015-02-01

    In applications we may want to compare different document collections: they could have shared content but also aspects unique to particular collections. This task has been called comparative text mining or cross-collection modeling. We present a differential topic model for this application that models both topic differences and similarities. For this we use hierarchical Bayesian nonparametric models. Moreover, we found it important to properly model power-law phenomena in topic-word distributions, and thus we used the full Pitman-Yor process rather than just a Dirichlet process. Furthermore, we propose the transformed Pitman-Yor process (TPYP) to incorporate prior knowledge, such as vocabulary variations across collections, into the model. To deal with the non-conjugacy between the model prior and the likelihood in the TPYP, we propose an efficient sampling algorithm using a data augmentation technique based on the multinomial theorem. Experimental results show the model discovers interesting aspects of different collections. We also show the proposed MCMC-based algorithm achieves a dramatically reduced test perplexity compared to some existing topic models. Finally, we show our model outperforms the state-of-the-art for document classification/ideology prediction on a number of text collections.

  17. 32 CFR 732.25 - Accounting classifications for nonnaval medical and dental care expenses.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... and dental care expenses. 732.25 Section 732.25 National Defense Department of Defense (Continued) DEPARTMENT OF THE NAVY PERSONNEL NONNAVAL MEDICAL AND DENTAL CARE Accounting Classifications for Nonnaval Medical and Dental Care Expenses and Standard Document Numbers § 732.25 Accounting classifications for...

  18. 32 CFR 732.25 - Accounting classifications for nonnaval medical and dental care expenses.

    Code of Federal Regulations, 2012 CFR

    2012-07-01

    ... and dental care expenses. 732.25 Section 732.25 National Defense Department of Defense (Continued) DEPARTMENT OF THE NAVY PERSONNEL NONNAVAL MEDICAL AND DENTAL CARE Accounting Classifications for Nonnaval Medical and Dental Care Expenses and Standard Document Numbers § 732.25 Accounting classifications for...

  19. 32 CFR 732.25 - Accounting classifications for nonnaval medical and dental care expenses.

    Code of Federal Regulations, 2013 CFR

    2013-07-01

    ... and dental care expenses. 732.25 Section 732.25 National Defense Department of Defense (Continued) DEPARTMENT OF THE NAVY PERSONNEL NONNAVAL MEDICAL AND DENTAL CARE Accounting Classifications for Nonnaval Medical and Dental Care Expenses and Standard Document Numbers § 732.25 Accounting classifications for...

  20. 32 CFR 732.25 - Accounting classifications for nonnaval medical and dental care expenses.

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... and dental care expenses. 732.25 Section 732.25 National Defense Department of Defense (Continued) DEPARTMENT OF THE NAVY PERSONNEL NONNAVAL MEDICAL AND DENTAL CARE Accounting Classifications for Nonnaval Medical and Dental Care Expenses and Standard Document Numbers § 732.25 Accounting classifications for...

  1. 32 CFR 732.25 - Accounting classifications for nonnaval medical and dental care expenses.

    Code of Federal Regulations, 2014 CFR

    2014-07-01

    ... and dental care expenses. 732.25 Section 732.25 National Defense Department of Defense (Continued) DEPARTMENT OF THE NAVY PERSONNEL NONNAVAL MEDICAL AND DENTAL CARE Accounting Classifications for Nonnaval Medical and Dental Care Expenses and Standard Document Numbers § 732.25 Accounting classifications for...

  2. Department of Defense Federal Supply Classification Listing of DoD standardization Documents

    DTIC Science & Technology

    1989-07-01

    ...This mandatory provision requires that the Federal and Military specifications, standards, and related standardization documents be... fiche as follows: a. DODISS Alphabetic Listing - reflects all active documents alphabetically by nomenclature, cross-referenced to document number... document date, preparing activity and custodians. b. DODISS Numerical Listing - reflects all active documents. New, revised, amended, changed and...

  3. [Use of ionizing radiation sources in metallurgy: risk assessment].

    PubMed

    Giugni, U

    2012-01-01

    Use of ionizing radiation sources in the metallurgical industry: risk assessment. Radioactive sources and fixed or mobile X-ray equipment are used for both process and quality control. The use of ionizing radiation sources requires careful risk assessment. The text lists the characteristics of the sources and the legal requirements, and contains a description of the documentation required and the methods used for risk assessment. It describes how to estimate the doses to operators and the relevant classification criteria used for the purpose of radiation protection. Training programs must be organized in close collaboration between the radiation protection expert and the occupational physician.

  4. Methodology for the Evaluation of the Algorithms for Text Line Segmentation Based on Extended Binary Classification

    NASA Astrophysics Data System (ADS)

    Brodic, D.

    2011-01-01

    Text line segmentation is a key element of the optical character recognition process; hence, testing text line segmentation algorithms has substantial relevance. Previously proposed testing methods deal mainly with a text database used as a template, serving both for testing and for evaluating the text segmentation algorithm. In this manuscript, a methodology for evaluating text line segmentation algorithms based on extended binary classification is proposed. It is built on various multiline text samples linked with text segmentation, whose results are distributed according to binary classification; the final result is obtained by comparative analysis of the cross-linked data. Its suitability for different types of scripts represents its main advantage.

  5. Implementation of standardized nomenclature in the electronic medical record.

    PubMed

    Klehr, Joan; Hafner, Jennifer; Spelz, Leah Mylrea; Steen, Sara; Weaver, Kathy

    2009-01-01

    To describe a customized electronic medical record documentation system, Epic, which was implemented in December 2006 using standardized taxonomies for nursing documentation. Descriptive data are provided regarding the development, implementation, and evaluation processes for the electronic medical record system. Nurses used standardized nursing nomenclature, including NANDA-I diagnoses, Nursing Interventions Classification, and Nursing Outcomes Classification, in a measurable and user-friendly format using the care plan activity. Key factors in the success of the project included close collaboration among staff nurses and information technology staff, ongoing support and encouragement from the vice president/chief nursing officer, the ready availability of expert resources, and nursing ownership of the project. Use of this evidence-based documentation enhanced institutional leadership in clinical documentation.

  6. A Data Augmentation Approach to Short Text Classification

    ERIC Educational Resources Information Center

    Rosario, Ryan Robert

    2017-01-01

    Text classification typically performs best with large training sets, but short texts are very common on the World Wide Web. Can we use resampling and data augmentation to construct larger texts using similar terms? Several current methods exist for working with short text that rely on using external data and contexts, or workarounds. Our focus is…
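
    One simple way to read the question above is synonym-based resampling; the sketch below uses a hand-written synonym table purely for illustration (a real system might instead draw candidate terms from word embeddings or an external thesaurus, in line with the record's mention of external data).

      import random

      # Hand-written synonym table: a stand-in for any source of "similar terms".
      SYNONYMS = {
          "good": ["great", "fine"],
          "movie": ["film", "picture"],
          "boring": ["dull", "tedious"],
      }

      def augment(text, n_variants=3, seed=0):
          """Generate variants of a short text by randomly swapping in synonyms."""
          rng = random.Random(seed)
          variants = []
          for _ in range(n_variants):
              words = [rng.choice([w] + SYNONYMS.get(w, [])) for w in text.split()]
              variants.append(" ".join(words))
          return variants

      print(augment("good but boring movie"))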

  7. Validation of Case Finding Algorithms for Hepatocellular Cancer from Administrative Data and Electronic Health Records using Natural Language Processing

    PubMed Central

    Sada, Yvonne; Hou, Jason; Richardson, Peter; El-Serag, Hashem; Davila, Jessica

    2013-01-01

    Background Accurate identification of hepatocellular cancer (HCC) cases from automated data is needed for efficient and valid quality improvement initiatives and research. We validated HCC ICD-9 codes, and evaluated whether natural language processing (NLP) by the Automated Retrieval Console (ARC) for document classification improves HCC identification. Methods We identified a cohort of patients with ICD-9 codes for HCC during 2005–2010 from Veterans Affairs administrative data. Pathology and radiology reports were reviewed to confirm HCC. The positive predictive value (PPV), sensitivity, and specificity of ICD-9 codes were calculated. A split validation study of pathology and radiology reports was performed to develop and validate ARC algorithms. Reports were manually classified as diagnostic of HCC or not. ARC generated document classification algorithms using the Clinical Text Analysis and Knowledge Extraction System. ARC performance was compared to manual classification. PPV, sensitivity, and specificity of ARC were calculated. Results 1138 patients with HCC were identified by ICD-9 codes. Based on manual review, 773 had HCC. The HCC ICD-9 code algorithm had a PPV of 0.67, sensitivity of 0.95, and specificity of 0.93. For a random subset of 619 patients, we identified 471 pathology reports for 323 patients and 943 radiology reports for 557 patients. The pathology ARC algorithm had PPV of 0.96, sensitivity of 0.96, and specificity of 0.97. The radiology ARC algorithm had PPV of 0.75, sensitivity of 0.94, and specificity of 0.68. Conclusion A combined approach of ICD-9 codes and NLP of pathology and radiology reports improves HCC case identification in automated data. PMID:23929403
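
    For readers unfamiliar with the metrics reported above, the helper below shows how PPV, sensitivity and specificity derive from confusion-matrix counts; the numbers in the example call are invented and do not reproduce the study's data.

      def ppv_sens_spec(tp, fp, fn, tn):
          ppv = tp / (tp + fp)          # positive predictive value (precision)
          sensitivity = tp / (tp + fn)  # recall among true cases
          specificity = tn / (tn + fp)  # correct rejection of non-cases
          return ppv, sensitivity, specificity

      # Hypothetical counts for illustration only.
      print(ppv_sens_spec(tp=730, fp=360, fn=40, tn=4800))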

  8. Matrix frequency analysis and its applications to language classification of textual data for English and Hebrew

    NASA Astrophysics Data System (ADS)

    Uchill, Joseph H.; Assadi, Amir H.

    2003-01-01

    The advent of the internet has opened a host of new and exciting questions in the science and mathematics of information organization and data mining. In particular, a highly ambitious promise of the internet is to bring the bulk of human knowledge to everyone with access to a computer network, providing a democratic medium for sharing and communicating knowledge regardless of the language of the communication. The development of sharing and communication of knowledge via transfer of digital files is the first crucial achievement in this direction. Nonetheless, available solutions to numerous ancillary problems remain far from satisfactory. Among such outstanding problems are the first few fundamental questions that have been responsible for the emergence and rapid growth of the new field of Knowledge Engineering, namely, classification of forms of data, their effective organization, extraction of knowledge from massive distributed data sets, and the design of fast, effective search engines. The precision of machine learning algorithms in classification and recognition of image data (e.g. pages scanned from books and other printed documents) is still far from human performance and speed on similar tasks. Discriminating the many forms of ASCII data from each other is not as difficult, in view of the emerging universal standards for file format. Nonetheless, most past and relatively recent human knowledge has yet to be transformed and saved in such machine-readable formats. In particular, an outstanding problem in knowledge engineering is the organization and management, with precision comparable to human performance, of knowledge in the form of document images that broadly belong to text, image, or a blend of both. It has been shown that the effectiveness of OCR is intertwined with the success of language and font recognition.

  9. A vectorial semantics approach to personality assessment.

    PubMed

    Neuman, Yair; Cohen, Yochai

    2014-04-23

    Personality assessment and, specifically, the assessment of personality disorders have traditionally been indifferent to computational models. Computational personality is a new field that involves the automatic classification of individuals' personality traits that can be compared against gold-standard labels. In this context, we introduce a new vectorial semantics approach to personality assessment, which involves the construction of vectors representing personality dimensions and disorders, and the automatic measurements of the similarity between these vectors and texts written by human subjects. We evaluated our approach by using a corpus of 2468 essays written by students who were also assessed through the five-factor personality model. To validate our approach, we measured the similarity between the essays and the personality vectors to produce personality disorder scores. These scores and their correspondence with the subjects' classification of the five personality factors reproduce patterns well-documented in the psychological literature. In addition, we show that, based on the personality vectors, we can predict each of the five personality factors with high accuracy.

  10. A Vectorial Semantics Approach to Personality Assessment

    NASA Astrophysics Data System (ADS)

    Neuman, Yair; Cohen, Yochai

    2014-04-01

    Personality assessment and, specifically, the assessment of personality disorders have traditionally been indifferent to computational models. Computational personality is a new field that involves the automatic classification of individuals' personality traits that can be compared against gold-standard labels. In this context, we introduce a new vectorial semantics approach to personality assessment, which involves the construction of vectors representing personality dimensions and disorders, and the automatic measurements of the similarity between these vectors and texts written by human subjects. We evaluated our approach by using a corpus of 2468 essays written by students who were also assessed through the five-factor personality model. To validate our approach, we measured the similarity between the essays and the personality vectors to produce personality disorder scores. These scores and their correspondence with the subjects' classification of the five personality factors reproduce patterns well-documented in the psychological literature. In addition, we show that, based on the personality vectors, we can predict each of the five personality factors with high accuracy.

  11. A Vectorial Semantics Approach to Personality Assessment

    PubMed Central

    Neuman, Yair; Cohen, Yochai

    2014-01-01

    Personality assessment and, specifically, the assessment of personality disorders have traditionally been indifferent to computational models. Computational personality is a new field that involves the automatic classification of individuals' personality traits that can be compared against gold-standard labels. In this context, we introduce a new vectorial semantics approach to personality assessment, which involves the construction of vectors representing personality dimensions and disorders, and the automatic measurements of the similarity between these vectors and texts written by human subjects. We evaluated our approach by using a corpus of 2468 essays written by students who were also assessed through the five-factor personality model. To validate our approach, we measured the similarity between the essays and the personality vectors to produce personality disorder scores. These scores and their correspondence with the subjects' classification of the five personality factors reproduce patterns well-documented in the psychological literature. In addition, we show that, based on the personality vectors, we can predict each of the five personality factors with high accuracy. PMID:24755833

  12. 76 FR 69034 - Microbiology Devices; Classification of In Vitro Diagnostic Device for Yersinia Species Detection

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-11-07

    ... Drug Administration 21 CFR Part 866 Microbiology Devices; Classification of In Vitro Diagnostic Device... CFR Part 866 [Docket No. FDA-2011-N-0729] Microbiology Devices; Classification of In Vitro Diagnostic... of the Microbiology Devices Advisory Panel (the panel). FDA is publishing in this document the...

  13. Evaluation of Barrier Cable Impact Pad Materials

    DTIC Science & Technology

    1988-03-01

    [Only report documentation page front matter (DTIC form headers and classification markings) is legible in this record.]

  14. 10 CFR 95.43 - Authority to reproduce.

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ... CSA, Secret and Confidential documents may be reproduced. Reproduced copies of classified documents... material must be conspicuously marked with the same classification markings as the material being...

  15. 10 CFR 95.43 - Authority to reproduce.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... CSA, Secret and Confidential documents may be reproduced. Reproduced copies of classified documents... material must be conspicuously marked with the same classification markings as the material being...

  16. 10 CFR 95.43 - Authority to reproduce.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... CSA, Secret and Confidential documents may be reproduced. Reproduced copies of classified documents... material must be conspicuously marked with the same classification markings as the material being...

  17. 10 CFR 95.43 - Authority to reproduce.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... CSA, Secret and Confidential documents may be reproduced. Reproduced copies of classified documents... material must be conspicuously marked with the same classification markings as the material being...

  18. 10 CFR 95.43 - Authority to reproduce.

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ... CSA, Secret and Confidential documents may be reproduced. Reproduced copies of classified documents... material must be conspicuously marked with the same classification markings as the material being...

  19. EL68D Wasteway Watershed Land-Cover Generation

    USGS Publications Warehouse

    Ruhl, Sheila; Usery, E. Lynn; Finn, Michael P.

    2007-01-01

    Classification of land cover from Landsat Enhanced Thematic Mapper Plus (ETM+) for the EL68D Wasteway Watershed in the State of Washington is documented. The procedures for classification include use of two ETM+ scenes in a simultaneous unsupervised classification process supported by extensive field data collection using Global Positioning System receivers and digital photos. The procedure resulted in a detailed classification at the individual crop species level.

  20. Advances in Classification Research. Volume 10. Proceedings of the ASIS SIG/CR Classification Research Workshop (10th, Washington, DC, November 1-5, 1999). ASIST Monograph Series.

    ERIC Educational Resources Information Center

    Albrechtsen, Hanne, Ed.; Mai, Jens-Erik, Ed.

    This volume is a compilation of the papers presented at the 10th ASIS (American Society for Information Science) workshop on classification research. Major themes include the social and cultural informatics of classification and coding systems, subject access and indexing theory, genre analysis and the agency of documents in the ordering of…

  1. Trainer Engineering Report (Final) for MILES. Volume 2. Revision

    DTIC Science & Technology

    1981-04-22

    [Only front matter and a contents fragment are legible in this record; the recoverable text notes that this material was formerly a separate document (Data Item AOOX), prepared for NAVTRAEQUIPCEN, Orlando, FL, with contents beginning at an introduction to the 1980 MILES.]

  2. 10 CFR 1045.45 - Review of unmarked documents with potential restricted data or formerly restricted data.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... under the automatic or systematic review provisions of E.O. 12958 may come upon documents that they... 10 Energy 4 2010-01-01 2010-01-01 false Review of unmarked documents with potential restricted... PROVISIONS) NUCLEAR CLASSIFICATION AND DECLASSIFICATION Generation and Review of Documents Containing...

  3. Arabic Supervised Learning Method Using N-Gram

    ERIC Educational Resources Information Center

    Sanan, Majed; Rammal, Mahmoud; Zreik, Khaldoun

    2008-01-01

    Purpose: Classification of Arabic documents has recently become a real problem for juridical centers. In this case, some of the Lebanese official journal documents are already classified, and the center has to classify new documents based on these. This paper aims to study and explain the useful application of a supervised learning method to Arabic texts…
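
    A hedged sketch of n-gram supervised classification in this spirit, assuming scikit-learn: character n-grams sidestep much of Arabic's rich morphology. The two snippets and class names are invented placeholders, not Lebanese official journal data.

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.naive_bayes import MultinomialNB
      from sklearn.pipeline import make_pipeline

      docs = ["قرار وزاري بشأن الضرائب", "قانون تنظيم الانتخابات البلدية"]
      classes = ["tax", "election"]

      model = make_pipeline(
          # Character n-grams (within word boundaries), lengths 2-4
          TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
          MultinomialNB(),
      )
      model.fit(docs, classes)
      print(model.predict(["تعديل قانون الضرائب"]))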

  4. 10 CFR 1045.45 - Review of unmarked documents with potential restricted data or formerly restricted data.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... under the automatic or systematic review provisions of E.O. 12958 may come upon documents that they... 10 Energy 4 2011-01-01 2011-01-01 false Review of unmarked documents with potential restricted... PROVISIONS) NUCLEAR CLASSIFICATION AND DECLASSIFICATION Generation and Review of Documents Containing...

  5. Document boundary determination using structural and lexical analysis

    NASA Astrophysics Data System (ADS)

    Taghva, Kazem; Cartright, Marc-Allen

    2009-01-01

    The document boundary determination problem is the process of identifying individual documents in a stack of papers. In this paper, we report on a classification system for automation of this process. The system employs features based on document structure and lexical content. We also report on experimental results to support the effectiveness of this system.

  6. 32 CFR 1907.26 - Notification of decision and prohibition on adverse action.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... made to the Interagency Security Classification Appeals Panel (ISCAP) established pursuant to § 5.4 of... CENTRAL INTELLIGENCE AGENCY CHALLENGES TO CLASSIFICATION OF DOCUMENTS BY AUTHORIZED HOLDERS PURSUANT TO...

  7. 32 CFR 1907.26 - Notification of decision and prohibition on adverse action.

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... made to the Interagency Security Classification Appeals Panel (ISCAP) established pursuant to § 5.4 of... CENTRAL INTELLIGENCE AGENCY CHALLENGES TO CLASSIFICATION OF DOCUMENTS BY AUTHORIZED HOLDERS PURSUANT TO...

  8. A Three-Phase Decision Model of Computer-Aided Coding for the Iranian Classification of Health Interventions (IRCHI).

    PubMed

    Azadmanjir, Zahra; Safdari, Reza; Ghazisaeedi, Marjan; Mokhtaran, Mehrshad; Kameli, Mohammad Esmail

    2017-06-01

    Accurately coded data in healthcare are critical. Computer-Assisted Coding (CAC) is an effective tool for improving clinical coding, particularly when a new classification is being developed and implemented. But determining the appropriate development method requires considering the specifications of existing CAC systems, the requirements for each type, the available infrastructure and the classification scheme itself. The aim of the study was the development of a decision model for determining the accurate code of each medical intervention in the Iranian Classification of Health Interventions (IRCHI) that can be implemented as a suitable CAC system. First, a sample of existing CAC systems was reviewed. Then the feasibility of each type of CAC was examined with regard to the prerequisites for its implementation. In the next step, a suitable model was proposed according to the structure of the classification scheme and implemented as an interactive system. There is a significant relationship between the level of assistance a CAC system provides and its integration with electronic medical documents. Implementation of a fully automated CAC system was impossible due to the immature development of electronic medical records and problems in the language used in medical documentation. A model was therefore proposed to develop a semi-automated CAC system based on hierarchical relationships between entities in the classification scheme, with decision logic to specify the characters of the code step by step through a web-based interactive user interface. The model is composed of three phases, selecting the Target, Action and Means of an intervention, respectively. The proposed model suited the current status of clinical documentation and coding in Iran, as well as the structure of the new classification scheme, and our results show it is practical. However, the model needs to be evaluated in the next stage of the research.
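
    A toy rendering of the three-phase selection logic (Target, then Action, then Means); the hierarchy contents and code strings below are invented for illustration and are not IRCHI entries.

      # Invented hierarchy: Target -> Action -> Means -> intervention code.
      HIERARCHY = {
          "knee": {
              "excision": {"open approach": "KNE-EXC-OPN",
                           "arthroscopic": "KNE-EXC-ART"},
              "repair":   {"open approach": "KNE-REP-OPN"},
          },
      }

      def options(target=None, action=None):
          """Valid choices for the next phase, as a semi-automated
          coding interface would offer them step by step."""
          if target is None:
              return list(HIERARCHY)
          if action is None:
              return list(HIERARCHY[target])
          return list(HIERARCHY[target][action])

      def build_code(target, action, means):
          """Resolve the full code once all three phases are chosen."""
          return HIERARCHY[target][action][means]

      print(options())                                       # phase 1 choices
      print(options("knee"))                                 # phase 2 choices
      print(build_code("knee", "excision", "arthroscopic"))  # -> KNE-EXC-ART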

  9. Con-Text: Text Detection for Fine-grained Object Classification.

    PubMed

    Karaoglu, Sezer; Tao, Ran; van Gemert, Jan C; Gevers, Theo

    2017-05-24

    This work focuses on fine-grained object classification using recognized scene text in natural images. While the state-of-the-art relies on visual cues only, this paper is the first work to propose combining textual and visual cues. Another novelty is the textual cue extraction: unlike state-of-the-art text detection methods, we focus more on the background than on text regions. Once text regions are detected, they are further processed by two methods to perform text recognition, i.e. the ABBYY commercial OCR engine and a state-of-the-art character recognition algorithm. Then, to perform textual cue encoding, bi- and trigrams are formed between the recognized characters under the proposed spatial pairwise constraints. Finally, extracted visual and textual cues are combined for fine-grained classification. The proposed method is validated on four publicly available datasets: ICDAR03, ICDAR13, Con-Text and Flickr-logo. We improve state-of-the-art end-to-end character recognition by a large margin of 15% on ICDAR03. We show that textual cues are useful in addition to visual cues for fine-grained classification, and that they are also useful for logo retrieval: adding textual cues outperforms visual-only and textual-only approaches in fine-grained classification (from 60.3% to 70.7%) and in logo retrieval (from 54.8% to 57.4%).

  10. Teaching Classification To Fit a Modern and Sustainable LIS Curriculum: The Case of Croatia.

    ERIC Educational Resources Information Center

    Slavic, Aida

    Library classification at the Croatian library school of the Department of Information Sciences, University of Zagreb (Croatia) has an important place in the department's curriculum. This is due to the fact that classification is the most important indexing language in Croatian libraries and documentation centers and services, and its role has not…

  11. International Standard Classification of Education (ISCED) Three Stage Classification System: 1973; Part 2 - Definitions.

    ERIC Educational Resources Information Center

    United Nations Educational, Scientific, and Cultural Organization, Paris (France).

    The seven levels of education, as classified numerically by International Standard Classification of Education (ISCED), are defined along with courses, programs, and fields of education listed under each level. Also contained is an alphabetical subject index indicating appropriate code numbers. For related documents see TM003535 and TM003536. (RC)

  12. Identification of Long Bone Fractures in Radiology Reports Using Natural Language Processing to support Healthcare Quality Improvement.

    PubMed

    Grundmeier, Robert W; Masino, Aaron J; Casper, T Charles; Dean, Jonathan M; Bell, Jamie; Enriquez, Rene; Deakyne, Sara; Chamberlain, James M; Alpern, Elizabeth R

    2016-11-09

    Important information to support healthcare quality improvement is often recorded in free text documents such as radiology reports. Natural language processing (NLP) methods may help extract this information, but these methods have rarely been applied outside the research laboratories where they were developed. To implement and validate NLP tools to identify long bone fractures for pediatric emergency medicine quality improvement. Using freely available statistical software packages, we implemented NLP methods to identify long bone fractures from radiology reports. A sample of 1,000 radiology reports was used to construct three candidate classification models. A test set of 500 reports was used to validate the model performance. Blinded manual review of radiology reports by two independent physicians provided the reference standard. Each radiology report was segmented and word stem and bigram features were constructed. Common English "stop words" and rare features were excluded. We used 10-fold cross-validation to select optimal configuration parameters for each model. Accuracy, recall, precision and the F1 score were calculated. The final model was compared to the use of diagnosis codes for the identification of patients with long bone fractures. There were 329 unique word stems and 344 bigrams in the training documents. A support vector machine classifier with Gaussian kernel performed best on the test set with accuracy=0.958, recall=0.969, precision=0.940, and F1 score=0.954. Optimal parameters for this model were cost=4 and gamma=0.005. The three classification models that we tested all performed better than diagnosis codes in terms of accuracy, precision, and F1 score (diagnosis code accuracy=0.932, recall=0.960, precision=0.896, and F1 score=0.927). NLP methods using a corpus of 1,000 training documents accurately identified acute long bone fractures from radiology reports. Strategic use of straightforward NLP methods, implemented with freely available software, offers quality improvement teams new opportunities to extract information from narrative documents.
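
    A sketch of the best-performing configuration reported above (RBF-kernel SVM over word and bigram features with 10-fold cross-validation), assuming scikit-learn in place of the study's toolchain; the reports and labels are fabricated, and word stemming is omitted.

      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import make_pipeline
      from sklearn.svm import SVC

      # Fabricated reports, repeated only so that 10-fold CV has enough samples.
      reports = ["transverse fracture of the femoral shaft",
                 "no acute fracture or dislocation",
                 "displaced fracture of the distal radius",
                 "soft tissue swelling without fracture"] * 5
      labels = [1, 0, 1, 0] * 5  # 1 = long bone fracture present

      model = make_pipeline(
          CountVectorizer(ngram_range=(1, 2)),   # word + bigram features (no stemmer)
          SVC(kernel="rbf", C=4, gamma=0.005),   # parameters reported in the study
      )
      print(cross_val_score(model, reports, labels, cv=10).mean())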

  13. A classification of user-generated content into consumer decision journey stages.

    PubMed

    Vázquez, Silvia; Muñoz-García, Óscar; Campanella, Inés; Poch, Marc; Fisas, Beatriz; Bel, Nuria; Andreu, Gloria

    2014-10-01

    In the last decades, the availability of digital user-generated documents from social media has dramatically increased. This massive growth of user-generated content has also affected traditional shopping behaviour. Customers have embraced new communication channels such as microblogs and social networks that enable them not just to talk with friends and acquaintances about their shopping experience, but also to search for opinions expressed by complete strangers as part of their decision-making processes. Uncovering how customers feel about specific products or brands and detecting purchase habits and preferences has traditionally been a costly and highly time-consuming task, involving methods such as focus groups and surveys. However, the new scenario calls for a deep assessment of current market research techniques in order to better interpret and profit from this ever-growing stream of attitudinal data. With this purpose, we present a novel analysis and classification of user-generated content in terms of the stage of the Consumer Decision Journey (Court et al., 2009) to which it belongs (i.e. the purchase process from the moment a customer becomes aware of the product's existence to the moment when he or she buys, experiences and talks about it). Using a corpus of short texts written in English and Spanish and extracted from different social media, we identify a set of linguistic patterns for each purchase stage that are then used in a rule-based classifier. Additionally, we use machine learning algorithms to automatically identify business indicators such as the Marketing Mix elements (McCarthy and Brogowicz, 1981). The classification of the purchase stages achieves an average precision of 74%, and the proposed classification of texts by the Marketing Mix elements expressed achieves an average precision of 75% across all the elements analysed. Copyright © 2014 Elsevier Ltd. All rights reserved.
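
    A toy rule-based classifier in the spirit of the pattern approach described above; the stage names abbreviate Court et al.'s journey and the regular expressions are simplified inventions, not the study's patterns.

      import re

      STAGE_PATTERNS = {
          "consider": [r"\bthinking (of|about) buying\b", r"\bany recommendations\b"],
          "evaluate": [r"\bcompared? (it )?(to|with)\b", r"\bwhich (one )?is better\b"],
          "buy":      [r"\bjust (bought|ordered)\b", r"\bpicked (one|it) up\b"],
          "enjoy/advocate": [r"\blove (it|this)\b", r"\bhighly recommend\b"],
      }

      def classify_stage(text):
          """Return the first journey stage whose patterns match the text."""
          for stage, patterns in STAGE_PATTERNS.items():
              if any(re.search(p, text.lower()) for p in patterns):
                  return stage
          return "unknown"

      print(classify_stage("Just bought the new phone, loving the camera!"))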

  14. Railroad Classification Yard Technology Manual. Volume I : Yard Design Methods

    DOT National Transportation Integrated Search

    1981-02-01

    This volume documents the procedures and methods associated with the design of railroad classification yards. Subjects include: site location, economic analysis, yard capacity analysis, design of flat yards, overall configuration of hump yards, hump ...

  15. 5 CFR 9701.212 - Bands.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... performance, recognizing and rewarding employees, and other associated duties. (c) DHS must document in... Administrative Personnel DEPARTMENT OF HOMELAND SECURITY HUMAN RESOURCES MANAGEMENT SYSTEM (DEPARTMENT OF... MANAGEMENT SYSTEM Classification Classification Structure § 9701.212 Bands. (a) For purposes of identifying...

  16. 5 CFR 9701.212 - Bands.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... performance, recognizing and rewarding employees, and other associated duties. (c) DHS must document in... Administrative Personnel DEPARTMENT OF HOMELAND SECURITY HUMAN RESOURCES MANAGEMENT SYSTEM (DEPARTMENT OF... MANAGEMENT SYSTEM Classification Classification Structure § 9701.212 Bands. (a) For purposes of identifying...

  17. 5 CFR 9701.212 - Bands.

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ... performance, recognizing and rewarding employees, and other associated duties. (c) DHS must document in... Administrative Personnel DEPARTMENT OF HOMELAND SECURITY HUMAN RESOURCES MANAGEMENT SYSTEM (DEPARTMENT OF... MANAGEMENT SYSTEM Classification Classification Structure § 9701.212 Bands. (a) For purposes of identifying...

  18. Railroad Classification Yard Technology : A Survey and Assessment

    DOT National Transportation Integrated Search

    1977-01-01

    This report documents a survey and assessment of the current state of the art in rail freight-car classification yard technology. The major objective was the identification of research and development necessary for technological improvements in railr...

  19. 76 FR 68767 - Draft Guidance for Industry and Food and Drug Administration Staff; De Novo Classification...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-11-07

    ... and Radiological Health (CDRH) guidance documents is available at http://www.fda.gov/MedicalDevices... ``De Novo Classification Process (Evaluation of Automatic Class III Designation)'' from CDRH you may...

  20. Automated validation of patient safety clinical incident classification: macro analysis.

    PubMed

    Gupta, Jaiprakash; Patrick, Jon

    2013-01-01

    Patient safety is the buzzword in healthcare. An Incident Information Management System (IIMS) is electronic software that stores clinical mishap narratives at the places where patients are treated. It is estimated that in one state alone over one million electronic text documents are available in IIMS. In this paper we investigate the data density of the fields entered when notifying an incident, and the validity of the built-in classification used by clinicians to categorise incidents. The Waikato Environment for Knowledge Analysis (WEKA) software was used to test the classes. Four statistical classifiers based on the J48, Naïve Bayes (NB), Naïve Bayes Multinominal (NBM) and Support Vector Machine with radial basis function (SVM_RBF) algorithms were used to validate the classes. The data pool was 10,000 clinical incidents drawn from 7 hospitals in one Australian state. In the first part of the study, 1000 clinical incidents were selected to determine the type and number of fields worth investigating; in the second part, another 5448 clinical incidents were randomly selected to validate 13 clinical incident types. Results show that 74.6% of the cells were empty and only 23 fields had content over 70% of the time. The percentage of correctly classified instances across the four algorithms ranged from 42% to 49% using the categorical dataset, from 65% to 77% using the free-text datasets, and from 72% to 79% using both. The kappa statistic ranged from 0.36 to 0.40 for the categorical data, from 0.61 to 0.74 for the free text, and from 0.67 to 0.77 for both datasets. Similar performance increases across the three experiments were noted for true positive rate, precision, F-measure and the area under the curve (AUC) of the receiver operating characteristic (ROC). The study demonstrates that only 14 of 73 fields in IIMS contain data usable for machine learning experiments. Irrespective of the algorithm, performance was best when all datasets were used. The NBM classifier showed the best performance. We think the classifiers can be improved further by reclassifying the most confused classes, and there is scope to apply text mining tools to patient safety classifications.
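
    A sketch of combining free-text and categorical incident fields, assuming scikit-learn and pandas; the field names, narratives and labels are invented, and scikit-learn's MultinomialNB stands in for WEKA's NBM.

      import pandas as pd
      from sklearn.compose import ColumnTransformer
      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.naive_bayes import MultinomialNB
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import OneHotEncoder

      incidents = pd.DataFrame({
          "narrative": ["patient fell near bed", "wrong dose administered",
                        "slipped in bathroom", "medication chart mislabelled"],
          "ward": ["geriatric", "surgical", "geriatric", "surgical"],
          "label": ["fall", "medication", "fall", "medication"],
      })

      features = ColumnTransformer([
          ("text", CountVectorizer(), "narrative"),  # free-text field
          ("cat", OneHotEncoder(), ["ward"]),        # categorical field
      ])
      model = make_pipeline(features, MultinomialNB())
      model.fit(incidents[["narrative", "ward"]], incidents["label"])
      print(model.predict(pd.DataFrame({"narrative": ["fell beside the bed"],
                                        "ward": ["geriatric"]})))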

  1. Selecting a restoration technique to minimize OCR error.

    PubMed

    Cannon, M; Fugate, M; Hush, D R; Scovel, C

    2003-01-01

    This paper introduces a learning problem related to the task of converting printed documents to ASCII text files. The goal of the learning procedure is to produce a function that maps documents to restoration techniques in such a way that on average the restored documents have minimum optical character recognition error. We derive a general form for the optimal function and use it to motivate the development of a nonparametric method based on nearest neighbors. We also develop a direct method of solution based on empirical error minimization for which we prove a finite sample bound on estimation error that is independent of distribution. We show that this empirical error minimization problem is an extension of the empirical optimization problem for traditional M-class classification with general loss function and prove computational hardness for this problem. We then derive a simple iterative algorithm called generalized multiclass ratchet (GMR) and prove that it produces an optimal function asymptotically (with probability 1). To obtain the GMR algorithm we introduce a new data map that extends Kesler's construction for the multiclass problem and then apply an algorithm called Ratchet to this mapped data, where Ratchet is a modification of the Pocket algorithm. Finally, we apply these methods to a collection of documents and report on the experimental results.
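
    A hedged sketch of the nearest-neighbour idea: label each training document with the restoration technique that yielded the lowest OCR error, then map a new document to a technique through its nearest labelled neighbour. The feature columns and technique names below are invented.

      import numpy as np
      from sklearn.neighbors import KNeighborsClassifier

      # Columns: estimated noise level, stroke thickness, skew (all hypothetical).
      doc_features = np.array([[0.8, 1.2, 0.1], [0.1, 2.5, 0.0],
                               [0.7, 1.1, 0.3], [0.2, 2.4, 0.1]])
      # Technique that minimised OCR error on each training document.
      best_technique = ["median_filter", "none", "median_filter", "deskew_only"]

      selector = KNeighborsClassifier(n_neighbors=1)
      selector.fit(doc_features, best_technique)
      print(selector.predict([[0.75, 1.15, 0.2]]))  # technique to apply first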

  2. 36 CFR 1254.76 - What procedures do I follow to copy formerly national security-classified documents?

    Code of Federal Regulations, 2014 CFR

    2014-07-01

    ... CFR 2001.24. (b) You may not remove from the research room copies of documents bearing uncancelled... individual documents, the research room staff cancels the classification markings on each page of the copy... declassification authority to the guard or research room attendant when you remove copies of documents from the...

  3. 36 CFR 1254.76 - What procedures do I follow to copy formerly national security-classified documents?

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... CFR 2001.24. (b) You may not remove from the research room copies of documents bearing uncancelled... individual documents, the research room staff cancels the classification markings on each page of the copy... declassification authority to the guard or research room attendant when you remove copies of documents from the...

  4. 36 CFR 1254.76 - What procedures do I follow to copy formerly national security-classified documents?

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... CFR 2001.24. (b) You may not remove from the research room copies of documents bearing uncancelled... individual documents, the research room staff cancels the classification markings on each page of the copy... declassification authority to the guard or research room attendant when you remove copies of documents from the...

  5. 36 CFR 1254.76 - What procedures do I follow to copy formerly national security-classified documents?

    Code of Federal Regulations, 2013 CFR

    2013-07-01

    ... CFR 2001.24. (b) You may not remove from the research room copies of documents bearing uncancelled... individual documents, the research room staff cancels the classification markings on each page of the copy... declassification authority to the guard or research room attendant when you remove copies of documents from the...

  6. 36 CFR 1254.76 - What procedures do I follow to copy formerly national security-classified documents?

    Code of Federal Regulations, 2012 CFR

    2012-07-01

    ... CFR 2001.24. (b) You may not remove from the research room copies of documents bearing uncancelled... individual documents, the research room staff cancels the classification markings on each page of the copy... declassification authority to the guard or research room attendant when you remove copies of documents from the...

  7. Automated Outcome Classification of Computed Tomography Imaging Reports for Pediatric Traumatic Brain Injury.

    PubMed

    Yadav, Kabir; Sarioglu, Efsun; Choi, Hyeong Ah; Cartwright, Walter B; Hinds, Pamela S; Chamberlain, James M

    2016-02-01

    The authors have previously demonstrated highly reliable automated classification of free-text computed tomography (CT) imaging reports using a hybrid system that pairs linguistic (natural language processing) and statistical (machine learning) techniques. Previously performed for identifying the outcome of orbital fracture in unprocessed radiology reports from a clinical data repository, the performance has not been replicated for more complex outcomes. To validate automated outcome classification performance of a hybrid natural language processing (NLP) and machine learning system for brain CT imaging reports. The hypothesis was that our system would have adequate performance characteristics for identifying pediatric traumatic brain injury (TBI). This was a secondary analysis of a subset of 2,121 CT reports from the Pediatric Emergency Care Applied Research Network (PECARN) TBI study. For that project, radiologists dictated CT reports as free text, which were then deidentified and scanned as PDF documents. Trained data abstractors manually coded each report for TBI outcome. Text was extracted from the PDF files using optical character recognition. The data set was randomly split evenly for training and testing. Training patient reports were used as input to the Medical Language Extraction and Encoding (MedLEE) NLP tool to create structured output containing standardized medical terms and modifiers for negation, certainty, and temporal status. A random subset stratified by site was analyzed using descriptive quantitative content analysis to confirm identification of TBI findings based on the National Institute of Neurological Disorders and Stroke (NINDS) Common Data Elements project. Findings were coded for presence or absence, weighted by frequency of mentions, and past/future/indication modifiers were filtered. After combining with the manual reference standard, a decision tree classifier was created using the data mining tools WEKA 3.7.5 and Salford Predictive Miner 7.0. Performance of the decision tree classifier was evaluated on the test patient reports. The prevalence of TBI in the sampled population was 159 of 2,217 (7.2%). The automated classification for pediatric TBI is comparable to our prior results, with the notable exception of a lower positive predictive value. Manual review of misclassified reports, 95.5% of which were false positives, revealed that a sizable number of false-positive errors were due to differing outcome definitions between NINDS TBI findings and PECARN clinically important TBI findings, and to report ambiguity not meeting definition criteria. A hybrid NLP and machine learning automated classification system continues to show promise in coding free-text electronic clinical data. For complex outcomes, it can reliably identify negative reports, but manual review of positive reports may be required. As such, it can still streamline data collection for clinical research and performance improvement. © 2016 by the Society for Academic Emergency Medicine.
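
    The final modelling step lends itself to a compact sketch: a decision tree over counts of structured findings. The finding names, weights, and labels below are hypothetical stand-ins for the MedLEE output, not PECARN data.

        # Hedged sketch of the decision tree step, assuming finding counts already
        # extracted by an NLP tool. Findings and labels are invented.
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.feature_extraction import DictVectorizer

        reports = [{"skull_fracture": 2, "hemorrhage": 1},   # finding -> weighted mentions
                   {"normal_study": 1},
                   {"hemorrhage": 3}]
        tbi = [1, 0, 1]

        X = DictVectorizer().fit_transform(reports)
        clf = DecisionTreeClassifier(max_depth=3).fit(X, tbi)
        print(clf.predict(X))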

  8. Automated Outcome Classification of Computed Tomography Imaging Reports for Pediatric Traumatic Brain Injury

    PubMed Central

    Yadav, Kabir; Sarioglu, Efsun; Choi, Hyeong-Ah; Cartwright, Walter B.; Hinds, Pamela S.; Chamberlain, James M.

    2016-01-01

    Background The authors have previously demonstrated highly reliable automated classification of free-text computed tomography (CT) imaging reports using a hybrid system that pairs linguistic (natural language processing) and statistical (machine learning) techniques. Previously performed for identifying the outcome of orbital fracture in unprocessed radiology reports from a clinical data repository, the performance has not been replicated for more complex outcomes. Objectives To validate automated outcome classification performance of a hybrid natural language processing (NLP) and machine learning system for brain CT imaging reports. The hypothesis was that our system would have adequate performance characteristics for identifying pediatric traumatic brain injury (TBI). Methods This was a secondary analysis of a subset of 2,121 CT reports from the Pediatric Emergency Care Applied Research Network (PECARN) TBI study. For that project, radiologists dictated CT reports as free text, which were then de-identified and scanned as PDF documents. Trained data abstractors manually coded each report for TBI outcome. Text was extracted from the PDF files using optical character recognition. The dataset was randomly split evenly for training and testing. Training patient reports were used as input to the Medical Language Extraction and Encoding (MedLEE) NLP tool to create structured output containing standardized medical terms and modifiers for negation, certainty, and temporal status. A random subset stratified by site was analyzed using descriptive quantitative content analysis to confirm identification of TBI findings based upon the National Institute of Neurological Disorders and Stroke Common Data Elements project. Findings were coded for presence or absence, weighted by frequency of mentions, and past/future/indication modifiers were filtered. After combining with the manual reference standard, a decision tree classifier was created using the data mining tools WEKA 3.7.5 and Salford Predictive Miner 7.0. Performance of the decision tree classifier was evaluated on the test patient reports. Results The prevalence of TBI in the sampled population was 159 out of 2,217 (7.2%). The automated classification for pediatric TBI is comparable to our prior results, with the notable exception of a lower positive predictive value (PPV). Manual review of misclassified reports, 95.5% of which were false positives, revealed that a sizable number of false-positive errors were due to differing outcome definitions between NINDS TBI findings and PECARN clinically important TBI findings, and to report ambiguity not meeting definition criteria. Conclusions A hybrid NLP and machine learning automated classification system continues to show promise in coding free-text electronic clinical data. For complex outcomes, it can reliably identify negative reports, but manual review of positive reports may be required. As such, it can still streamline data collection for clinical research and performance improvement. PMID:26766600

  9. 10 CFR 1045.30 - Purpose and scope.

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ... with access to RD and FRD, describes authorities and procedures for RD and FRD document classification and declassification, provides for periodic or systematic review of RD and FRD documents, and describes procedures for the mandatory review of RD and FRD documents. This subpart applies to all RD and...

  10. 10 CFR 1045.30 - Purpose and scope.

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ... with access to RD and FRD, describes authorities and procedures for RD and FRD document classification and declassification, provides for periodic or systematic review of RD and FRD documents, and describes procedures for the mandatory review of RD and FRD documents. This subpart applies to all RD and...

  11. 10 CFR 1045.30 - Purpose and scope.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... with access to RD and FRD, describes authorities and procedures for RD and FRD document classification and declassification, provides for periodic or systematic review of RD and FRD documents, and describes procedures for the mandatory review of RD and FRD documents. This subpart applies to all RD and...

  12. Electronic Nursing Documentation: Patient Care Continuity Using the Clinical Care Classification System (CCC).

    PubMed

    Whittenburg, Luann; Meetim, Aunchisa

    2016-01-01

    An innovative nursing documentation project conducted at Bumrungrad International Hospital in Bangkok, Thailand demonstrated patient care continuity between nursing patient assessments and nursing Plans of Care using the Clinical Care Classification System (CCC). The project developed a new generation of interactive nursing Plans of Care using the six steps of the American Nurses Association (ANA) Nursing process and the MEDCIN® clinical knowledgebase to present CCC coded concepts as a natural by-product of a nurse's documentation process. The MEDCIN® clinical knowledgebase is a standardized point-of-care terminology intended for use in electronic health record systems. The CCC is an ANA recognized nursing terminology.

  13. A Handbook for Derivative Classifiers at Los Alamos National Laboratory

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sinkula, Barbara Jean

    The Los Alamos Classification Office (within the SAFE-IP group) prepared this handbook as a resource for the Laboratory’s derivative classifiers (DCs). It contains information about United States Government (USG) classification policy, principles, and authorities as they relate to the LANL Classification Program in general, and to the LANL DC program specifically. At a working level, DCs review Laboratory documents and material that are subject to classification review requirements, while the Classification Office provides the training and resources for DCs to perform that vital function.

  14. Railroad classification yard technology : computer system methodology : case study : Potomac Yard

    DOT National Transportation Integrated Search

    1981-08-01

    This report documents the application of the railroad classification yard computer system methodology to Potomac Yard of the Richmond, Fredericksburg, and Potomac Railroad Company (RF&P). This case study entailed evaluation of the yard traffic capaci...

  15. A case report of pornography addiction with dhat syndrome

    PubMed Central

    Darshan, M. S.; Sathyanarayana Rao, T. S.; Manickam, Sam; Tandon, Abhinav; Ram, Dushad

    2014-01-01

    A case of pornography addiction with dhat syndrome was diagnosed by applying the existing criteria for substance dependence in the International Classification of Diseases-10 and the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition, Text Revision. There is a lack of clear-cut criteria for identifying and defining such behavioural addictions, and also a lack of medical documentation on pornography addiction. A strategy applied along the lines of that for any substance addiction was used, and we found it helped our patient gradually de-addict and then completely quit watching pornography. This is one of the few such cases being reported scientifically, and we hope more work will be carried out on this ever-increasing problem of pornography addiction. PMID:25568482

  16. Logistics Composite Model (LCOM) Workbook

    DTIC Science & Technology

    1976-06-01

  17. SRI PUFF 8 Computer Program for One-Dimensional Stress Wave Propagation

    DTIC Science & Technology

    1980-03-01

    ... aspects of wave propagation calculations. ...

  18. Clinical research data warehouse governance for distributed research networks in the USA: a systematic review of the literature.

    PubMed

    Holmes, John H; Elliott, Thomas E; Brown, Jeffrey S; Raebel, Marsha A; Davidson, Arthur; Nelson, Andrew F; Chung, Annie; La Chance, Pierre; Steiner, John F

    2014-01-01

    To review the published, peer-reviewed literature on clinical research data warehouse governance in distributed research networks (DRNs). Medline, PubMed, EMBASE, CINAHL, and INSPEC were searched for relevant documents published through July 31, 2013 using a systematic approach. Only documents relating to DRNs in the USA were included. Documents were analyzed using a classification framework consisting of 10 facets to identify themes. 6641 documents were retrieved. After screening for duplicates and relevance, 38 were included in the final review. A peer-reviewed literature on data warehouse governance is emerging, but is still sparse. Peer-reviewed publications on UK research network governance were more prevalent, although not reviewed for this analysis. All 10 classification facets were used, with some documents falling into two or more classifications. No document addressed costs associated with governance. Even though DRNs are emerging as vehicles for research and public health surveillance, understanding of DRN data governance policies and procedures is limited. This is expected to change as more DRN projects disseminate their governance approaches as publicly available toolkits and peer-reviewed publications. While peer-reviewed, US-based DRN data warehouse governance publications have increased, DRN developers and administrators are encouraged to publish information about these programs. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.

  19. 10 CFR 1045.43 - Systematic review for declassification.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... ensure that FRD documents are periodically and systematically reviewed for declassification. The focus... declassification upon review. (b) Agencies with RD or FRD document holdings shall cooperate with the Director of Classification (and with the DoD for FRD) to ensure the systematic review of RD and FRD documents. (c) Review of...

  20. 10 CFR 1045.43 - Systematic review for declassification.

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ... ensure that FRD documents are periodically and systematically reviewed for declassification. The focus... declassification upon review. (b) Agencies with RD or FRD document holdings shall cooperate with the Director of Classification (and with the DoD for FRD) to ensure the systematic review of RD and FRD documents. (c) Review of...

  1. 10 CFR 1045.43 - Systematic review for declassification.

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ... ensure that FRD documents are periodically and systematically reviewed for declassification. The focus... declassification upon review. (b) Agencies with RD or FRD document holdings shall cooperate with the Director of Classification (and with the DoD for FRD) to ensure the systematic review of RD and FRD documents. (c) Review of...

  2. Approach for Text Classification Based on the Similarity Measurement between Normal Cloud Models

    PubMed Central

    Dai, Jin; Liu, Xin

    2014-01-01

    The similarity between objects is a core research area of data mining. In order to reduce the interference of the uncertainty of natural language, a similarity measurement between normal cloud models is applied to text classification research. On this basis, a novel text classifier based on cloud concept jumping up (CCJU-TC) is proposed. It can efficiently accomplish conversion between qualitative concepts and quantitative data. Through the conversion from a text set to a text information table based on the VSM model, the qualitative concept of the texts extracted from the same category is jumped up to a whole-category concept. According to the cloud similarity between the test text and each category concept, the test text is assigned to the most similar category. Comparison among different text classifiers over different feature selection sets fully demonstrates that not only does CCJU-TC have a strong ability to adapt to different text features, but its classification performance is also better than that of traditional classifiers. PMID:24711737
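
    As a rough, much-simplified stand-in for the classification rule (not the CCJU-TC algorithm itself): summarize each category by a per-feature expectation Ex and entropy En, then assign a test vector to the category with the highest average Gaussian membership. All vectors below are invented.

        # Simplified cloud-style classifier sketch: Ex and En estimated as mean and
        # standard deviation per feature. The real CCJU-TC method is more involved;
        # data are hypothetical.
        import numpy as np

        def concept(vectors):                     # backward-cloud-style estimate
            v = np.asarray(vectors, dtype=float)
            return v.mean(axis=0), v.std(axis=0) + 1e-9

        def membership(x, ex, en):
            return np.exp(-(x - ex) ** 2 / (2 * en ** 2)).mean()

        cats = {"sports": concept([[3, 0, 1], [4, 1, 0]]),
                "finance": concept([[0, 5, 2], [1, 4, 3]])}
        x = np.array([3.5, 0.5, 1.0])
        print(max(cats, key=lambda c: membership(x, *cats[c])))  # -> "sports"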

  3. Classification of hepatocellular carcinoma stages from free-text clinical and radiology reports

    PubMed Central

    Yim, Wen-wai; Kwan, Sharon W; Johnson, Guy; Yetisgen, Meliha

    2017-01-01

    Cancer stage information is important for clinical research. However, it is not always explicitly noted in electronic medical records. In this paper, we present our work on automatic classification of hepatocellular carcinoma (HCC) stages from free-text clinical and radiology notes. To accomplish this, we defined 11 stage parameters used in the three HCC staging systems: American Joint Committee on Cancer (AJCC), Barcelona Clinic Liver Cancer (BCLC), and Cancer of the Liver Italian Program (CLIP). After aggregating stage parameters to the patient level, the final stage classifications were achieved using an expert-created decision logic. Each stage parameter relevant for staging was extracted using several classification methods, e.g., sentence classification and automatic information structuring, to identify and normalize text as cancer stage parameter values. Stage parameter extraction for the test set performed at 0.81 F1. Cancer stage prediction for the AJCC, BCLC, and CLIP stage classifications performed at 0.55, 0.50, and 0.43 F1, respectively.
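
    The decision-logic layer can be pictured as a small rule function over the extracted, patient-level parameters. The rules and thresholds below are illustrative only and do not reproduce the study's actual AJCC/BCLC/CLIP logic.

        # Hypothetical sketch of an expert decision logic mapping extracted stage
        # parameters to a BCLC-style stage. Rules are invented for illustration.
        def bclc_stage(params):
            if params["performance_status"] > 2 or params["child_pugh"] == "C":
                return "D"
            if params["portal_invasion"] or params["extrahepatic_spread"]:
                return "C"
            if params["n_tumors"] > 3:
                return "B"
            return "A"

        print(bclc_stage({"performance_status": 0, "child_pugh": "A",
                          "portal_invasion": False, "extrahepatic_spread": False,
                          "n_tumors": 1}))  # -> "A"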

  4. TCP/IP Implementations and Vendors Guide,

    DTIC Science & Technology

    1986-02-01

    ... UNIX System V (5.2). IMPLEMENTATION-LANGUAGE: C. DISTRIBUTOR: UNIQ Digital Technologies, 28 S. Water St., Batavia, IL 60510, (312) 879-1008. CONTACT: ...

  5. Railroad classification yard design methodology study Elkhart Yard Rehabilitation : a case study

    DOT National Transportation Integrated Search

    1980-02-01

    This interim report documents the application of a railroad classification : yard design methodology to CONRAIL's Elkhart Yard Rehabilitation. This : case study effort represents Phase 2 of a larger effort to develop a yard : design methodology, and ...

  6. Family Traits of Galaxies: From the Tuning Fork to a Physical Classification in a Multi-Wavelength Context

    NASA Astrophysics Data System (ADS)

    Rampazzo, Roberto; D'Onofrio, Mauro; Zaggia, Simone; Elmegreen, Debra M.; Laurikainen, Eija; Duc, Pierre-Alain; Gallart, Carme; Fraix-Burnet, Didier

    At the time of the Great Debate, nebulæ were recognized to have different morphologies, and first classifications, sometimes only descriptive, were attempted. These early classification systems are well documented in Allan Sandage's 2005 review (Sandage 2005). That review emphasized the debt, in terms of the continuity of forms of spiral galaxies, owed by Hubble's classification scheme to the Reynolds system proposed in 1920 (Reynolds, 1920).

  7. Advances in Classification Methods for Military Munitions Response

    DTIC Science & Technology

    2010-12-01

    Objective of the course (Herb Nelson): provide an update on the sensors, methods, and status of the classification of military munitions using advanced EMI sensors. Module (Stephen Billings): Electromagnetics (EM) fundamentals and parameter extraction, covering how EMI sensors work and what they measure.

  8. U.S. Coast Guard Fleet Mix Planning: A Decision Support System Prototype

    DTIC Science & Technology

    1991-03-01

  9. Low-Level Wind Systems in the Warsaw Pact Countries.

    DTIC Science & Technology

    1985-03-01

    ... Approved for public release; distribution unlimited. USAFETAC/TN-85/001. ...

  10. Cumulated UDC Supplement, 1965-1975. Volume III: Classes 6/62 (61 Medical Sciences, 62 Engineering and Technology Generally, 621 Mechanical and Electrical Engineering, 622 Mining, 623 Military and Naval Engineering, 624 Civil and Structural Engineering, 625 Railway and Highway Engineering, 626/627 Hydraulic Engineering Works, 628 Public Health Engineering, 629 Transport (Vehicle) Engineering).

    ERIC Educational Resources Information Center

    International Federation for Documentation, The Hague (Netherlands). Committee on Classification Research.

    In continuation of the "Cumulated UDC Supplement - 1964" published by the International Federation for Documentation, this document provides a cumulative supplement to the Universal Decimal Classification for 1965-1975. This third of five volumes lists new classification subdivisions in the following subject areas: (1) medical sciences; (2)…

  11. Inert Reassessment Document for Poly(oxyethylene) adducts of mixed phytosterols

    EPA Pesticide Factsheets

    Poly(oxyethylene) adducts of mixed phytosterols is uncategorized as to list classification status. Based upon the reasonable certainty of no harm safety finding, the List 4B classification for poly(oxyethylene) adducts of mixed phytosterols is affirmed.

  12. On-Line Retrieval II.

    ERIC Educational Resources Information Center

    Kurtz, Peter; And Others

    This report is concerned with the implementation of two interrelated computer systems: an automatic document analysis and classification package, and an on-line interactive information retrieval system which utilizes the information gathered during the automatic classification phase. Well-known techniques developed by Salton and Dennis have been…

  13. 40 CFR 164.22 - Contents of document setting forth objections.

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... objections. 164.22 Section 164.22 Protection of Environment ENVIRONMENTAL PROTECTION AGENCY (CONTINUED... RODENTICIDE ACT, ARISING FROM REFUSALS TO REGISTER, CANCELLATIONS OF REGISTRATIONS, CHANGES OF CLASSIFICATIONS... registration, or change the classification of a pesticide, shall clearly and concisely set forth such...

  14. 40 CFR 164.22 - Contents of document setting forth objections.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... objections. 164.22 Section 164.22 Protection of Environment ENVIRONMENTAL PROTECTION AGENCY (CONTINUED... RODENTICIDE ACT, ARISING FROM REFUSALS TO REGISTER, CANCELLATIONS OF REGISTRATIONS, CHANGES OF CLASSIFICATIONS... registration, or change the classification of a pesticide, shall clearly and concisely set forth such...

  15. Railroad classification yard design methodology study : East Deerfield Yard, a case study

    DOT National Transportation Integrated Search

    1980-02-01

    This interim report documents the application of a railroad classification yard design methodology to Boston and Maine's East Deerfield Yard Rehabiliation. This case study effort represents Phase 2 of a larger effort to develop a yard design methodol...

  16. 7 CFR 1951.885 - Loan classifications.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 7 Agriculture 14 2011-01-01 2011-01-01 false Loan classifications. 1951.885 Section 1951.885 Agriculture Regulations of the Department of Agriculture (Continued) RURAL HOUSING SERVICE, RURAL BUSINESS... obtain proper documentation or any other deviations from prudent lending practices. Adverse trends in the...

  17. 7 CFR 1951.885 - Loan classifications.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 7 Agriculture 14 2010-01-01 2009-01-01 true Loan classifications. 1951.885 Section 1951.885 Agriculture Regulations of the Department of Agriculture (Continued) RURAL HOUSING SERVICE, RURAL BUSINESS... obtain proper documentation or any other deviations from prudent lending practices. Adverse trends in the...

  18. Literature-based concept profiles for gene annotation: the issue of weighting.

    PubMed

    Jelier, Rob; Schuemie, Martijn J; Roes, Peter-Jan; van Mulligen, Erik M; Kors, Jan A

    2008-05-01

    Text-mining has been used to link biomedical concepts, such as genes or biological processes, to each other for annotation purposes or the generation of new hypotheses. To relate two concepts to each other several authors have used the vector space model, as vectors can be compared efficiently and transparently. Using this model, a concept is characterized by a list of associated concepts, together with weights that indicate the strength of the association. The associated concepts in the vectors and their weights are derived from a set of documents linked to the concept of interest. An important issue with this approach is the determination of the weights of the associated concepts. Various schemes have been proposed to determine these weights, but no comparative studies of the different approaches are available. Here we compare several weighting approaches in a large scale classification experiment. Three different techniques were evaluated: (1) weighting based on averaging, an empirical approach; (2) the log likelihood ratio, a test-based measure; (3) the uncertainty coefficient, an information-theory based measure. The weighting schemes were applied in a system that annotates genes with Gene Ontology codes. As the gold standard for our study we used the annotations provided by the Gene Ontology Annotation project. Classification performance was evaluated by means of the receiver operating characteristics (ROC) curve using the area under the curve (AUC) as the measure of performance. All methods performed well with median AUC scores greater than 0.84, and scored considerably higher than a binary approach without any weighting. Especially for the more specific Gene Ontology codes excellent performance was observed. The differences between the methods were small when considering the whole experiment. However, the number of documents that were linked to a concept proved to be an important variable. When larger amounts of texts were available for the generation of the concepts' vectors, the performance of the methods diverged considerably, with the uncertainty coefficient then outperforming the two other methods.
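
    The vector-space comparison at the heart of the study reduces to a few lines: build a concept's vector by weighting its associated concepts (here with the empirical averaging scheme, one of the three compared), then relate two concepts by the cosine of their vectors. The association strengths below are invented.

        # Sketch of weighted concept profiles and cosine comparison. The averaging
        # weight is the empirical scheme mentioned above; numbers are hypothetical.
        import numpy as np

        def average_weights(doc_vectors):
            return np.mean(doc_vectors, axis=0)   # mean association strength per concept

        def cosine(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

        gene_a = average_weights(np.array([[0.9, 0.1, 0.0], [0.7, 0.3, 0.1]]))
        gene_b = average_weights(np.array([[0.8, 0.2, 0.0]]))
        print(round(cosine(gene_a, gene_b), 3))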

  19. Document Classification in Support of Automated Metadata Extraction from Heterogeneous Collections

    ERIC Educational Resources Information Center

    Flynn, Paul K.

    2014-01-01

    A number of federal agencies, universities, laboratories, and companies are placing their documents online and making them searchable via metadata fields such as author, title, and publishing organization. To enable this, every document in the collection must be catalogued using the metadata fields. Though time consuming, the task of identifying…

  20. Learning From Short Text Streams With Topic Drifts.

    PubMed

    Li, Peipei; He, Lu; Wang, Haiyan; Hu, Xuegang; Zhang, Yuhong; Li, Lei; Wu, Xindong

    2017-09-18

    Short text streams such as search snippets and microblogs have become popular on the Web with the emergence of social media. Unlike traditional normal text streams, these data present the characteristics of short length, weak signal, high volume, high velocity, topic drift, etc. Short text stream classification is hence a very challenging and significant task. However, this challenge has received little attention from the research community. Therefore, a new feature extension approach is proposed for short text stream classification with the help of a large-scale semantic network obtained from a Web corpus. It is built on an incremental ensemble classification model for efficiency. First, more semantic contexts based on the senses of terms in short texts are introduced to make up for the data sparsity, using the open semantic network, in which all terms are disambiguated by their semantics to reduce the noise impact. Second, a concept cluster-based topic drift detection method is proposed to effectively track hidden topic drifts. Finally, extensive studies demonstrate that, compared to several well-known concept drift detection methods for data streams, our approach can detect topic drifts effectively, and that it handles short text streams effectively while maintaining efficiency, compared to several state-of-the-art short text classification approaches.
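
    The drift-detection component can be caricatured as comparing topic (cluster) distributions across stream windows. This generic sketch uses total variation distance and is a stand-in, not the paper's concept-cluster method.

        # Hypothetical drift check: flag drift when the topic distribution of the
        # current window moves far from the previous window's distribution.
        import numpy as np

        def drifted(prev_dist, curr_dist, threshold=0.3):
            # total variation distance between topic distributions
            return 0.5 * np.abs(np.asarray(prev_dist) - np.asarray(curr_dist)).sum() > threshold

        print(drifted([0.7, 0.2, 0.1], [0.2, 0.3, 0.5]))  # -> True (topic mix changed)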

  1. Portable automatic text classification for adverse drug reaction detection via multi-corpus training.

    PubMed

    Sarker, Abeed; Gonzalez, Graciela

    2015-02-01

    Automatic detection of adverse drug reaction (ADR) mentions from text has recently received significant interest in pharmacovigilance research. Current research focuses on various sources of text-based information, including social media, where enormous amounts of user-posted data are available, which have the potential for use in pharmacovigilance if collected and filtered accurately. The aims of this study are: (i) to explore natural language processing (NLP) approaches for generating useful features from text, and utilizing them in optimized machine learning algorithms for automatic classification of ADR assertive text segments; (ii) to present two data sets that we prepared for the task of ADR detection from user-posted internet data; and (iii) to investigate if combining training data from distinct corpora can improve automatic classification accuracies. One of our three data sets contains annotated sentences from clinical reports, and the two other data sets, built in-house, consist of annotated posts from social media. Our text classification approach relies on generating a large set of features, representing semantic properties (e.g., sentiment, polarity, and topic), from short text nuggets. Importantly, using our expanded feature sets, we combine training data from different corpora in attempts to boost classification accuracies. Our feature-rich classification approach performs significantly better than previously published approaches, with ADR class F-scores of 0.812 (previously reported best: 0.770), 0.538 and 0.678 for the three data sets. Combining training data from multiple compatible corpora further improves the ADR F-scores for the in-house data sets to 0.597 (an improvement of 5.9 units) and 0.704 (an improvement of 2.6 units), respectively. Our research results indicate that using advanced NLP techniques for generating information-rich features from text can significantly improve classification accuracies over existing benchmarks. Our experiments illustrate the benefits of incorporating various semantic features such as topics, concepts, sentiments, and polarities. Finally, we show that integration of information from compatible corpora can significantly improve classification performance. This form of multi-corpus training may be particularly useful in cases where data sets are heavily imbalanced (e.g., social media data), and may reduce the time and costs associated with the annotation of data in the future. Copyright © 2014 The Authors. Published by Elsevier Inc. All rights reserved.
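
    The multi-corpus idea itself is simple to sketch: pool labelled segments from compatible corpora before training one classifier. The two toy corpora and plain bag-of-words features below stand in for the paper's much richer semantic feature set.

        # Hedged sketch of multi-corpus training: combine compatible ADR corpora,
        # then fit one classifier. Sentences and labels are invented.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.svm import LinearSVC

        clinical = [("severe rash after starting drug X", 1), ("dose increased to 20 mg", 0)]
        social   = [("this med gave me awful headaches", 1), ("picked up my refill today", 0)]

        texts, y = zip(*(clinical + social))        # pool the compatible corpora
        X = TfidfVectorizer().fit_transform(texts)
        clf = LinearSVC().fit(X, y)
        print(clf.predict(X[:1]))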

  2. Portable Automatic Text Classification for Adverse Drug Reaction Detection via Multi-corpus Training

    PubMed Central

    Sarker, Abeed; Gonzalez, Graciela

    2014-01-01

    Objective Automatic detection of Adverse Drug Reaction (ADR) mentions from text has recently received significant interest in pharmacovigilance research. Current research focuses on various sources of text-based information, including social media, where enormous amounts of user-posted data are available, which have the potential for use in pharmacovigilance if collected and filtered accurately. The aims of this study are: (i) to explore natural language processing approaches for generating useful features from text, and utilizing them in optimized machine learning algorithms for automatic classification of ADR assertive text segments; (ii) to present two data sets that we prepared for the task of ADR detection from user-posted internet data; and (iii) to investigate if combining training data from distinct corpora can improve automatic classification accuracies. Methods One of our three data sets contains annotated sentences from clinical reports, and the two other data sets, built in-house, consist of annotated posts from social media. Our text classification approach relies on generating a large set of features, representing semantic properties (e.g., sentiment, polarity, and topic), from short text nuggets. Importantly, using our expanded feature sets, we combine training data from different corpora in attempts to boost classification accuracies. Results Our feature-rich classification approach performs significantly better than previously published approaches, with ADR class F-scores of 0.812 (previously reported best: 0.770), 0.538 and 0.678 for the three data sets. Combining training data from multiple compatible corpora further improves the ADR F-scores for the in-house data sets to 0.597 (an improvement of 5.9 units) and 0.704 (an improvement of 2.6 units), respectively. Conclusions Our research results indicate that using advanced NLP techniques for generating information-rich features from text can significantly improve classification accuracies over existing benchmarks. Our experiments illustrate the benefits of incorporating various semantic features such as topics, concepts, sentiments, and polarities. Finally, we show that integration of information from compatible corpora can significantly improve classification performance. This form of multi-corpus training may be particularly useful in cases where data sets are heavily imbalanced (e.g., social media data), and may reduce the time and costs associated with the annotation of data in the future. PMID:25451103

  3. Web information retrieval for health professionals.

    PubMed

    Ting, S L; See-To, Eric W K; Tse, Y K

    2013-06-01

    This paper presents a Web Information Retrieval System (WebIRS), designed to assist healthcare professionals in obtaining up-to-date medical knowledge and information via the World Wide Web (WWW). The system leverages document classification and text summarization techniques to deliver highly correlated medical information to physicians. The system architecture of the proposed WebIRS is first discussed, and then a case study on an application of the proposed system in a Hong Kong medical organization is presented to illustrate the adoption process; a questionnaire was administered to collect feedback on the operation and performance of WebIRS in comparison with conventional information retrieval on the WWW. A prototype system has been constructed and implemented on a trial basis in a medical organization. It has proven to be of benefit to healthcare professionals through its automatic classification and summarization of the medical information that physicians need and are interested in. The results of the case study show that with the use of the proposed WebIRS, a significant reduction in searching time and effort, with retrieval of highly relevant materials, can be attained.

  4. Supervised Extraction of Diagnosis Codes from EMRs: Role of Feature Selection, Data Selection, and Probabilistic Thresholding.

    PubMed

    Rios, Anthony; Kavuluru, Ramakanth

    2013-09-01

    Extracting diagnosis codes from medical records is a complex task carried out by trained coders who read all the documents associated with a patient's visit. With the popularity of electronic medical records (EMRs), computational approaches to code extraction have been proposed in recent years. Machine learning approaches to multi-label text classification provide an important methodology for this task, given that each EMR can be associated with multiple codes. In this paper, we study the role of feature selection, training data selection, and probabilistic threshold optimization in improving different multi-label classification approaches. We conduct experiments based on two different datasets: a recent gold standard dataset used for this task, and a second larger and more complex EMR dataset we curated from the University of Kentucky Medical Center. While conventional approaches achieve results comparable to the state of the art on the gold standard dataset, on our complex in-house dataset we show that feature selection, training data selection, and probabilistic thresholding provide significant gains in performance.
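
    Of the three ingredients studied, probabilistic thresholding is the easiest to show in isolation: for each code, sweep a grid of probability cut-offs and keep the one that maximizes F1 on held-out data. The scores and labels below are invented.

        # Hedged sketch of per-label probabilistic threshold optimization, assuming
        # predicted probabilities from any multi-label classifier. Data are toy values.
        import numpy as np
        from sklearn.metrics import f1_score

        def best_threshold(y_true, y_prob):
            grid = np.linspace(0.05, 0.95, 19)
            return max(grid, key=lambda t: f1_score(y_true, y_prob >= t))

        y_true = np.array([1, 0, 1, 1, 0, 0])
        y_prob = np.array([0.9, 0.4, 0.35, 0.6, 0.2, 0.45])
        print(best_threshold(y_true, y_prob))  # threshold replacing the default 0.5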

  5. A nursing-specific model of EPR documentation: organizational and professional requirements.

    PubMed

    von Krogh, Gunn; Nåden, Dagfinn

    2008-01-01

    To present the Norwegian KPO documentation model (quality assurance, problem solving, and caring), and the requirements and multiple electronic patient record (EPR) functions the model is designed to address. The model's professional substance, a conceptual framework for nursing practice, is developed by examining, reorganizing, and completing existing frameworks. The model's methodology, an information management system, is developed using an expert group. Both model elements were clinically tested over a period of 1 year. The model is designed for nursing documentation in step with statutory, organizational, and professional requirements. Complete documentation is arranged for by incorporating the Nursing Minimum Data Set. Systematic and comprehensive documentation is arranged for by establishing categories as provided in the model's framework domains. Consistent documentation is arranged for by incorporating NANDA-I Nursing Diagnoses, the Nursing Intervention Classification, and the Nursing Outcome Classification. The model can be used as a tool in cooperation with vendors to ensure that the interests of the nursing profession are met when developing EPR solutions in healthcare. The model can provide clinicians with a framework for documentation in step with legal and organizational requirements, while retaining the ability to record all aspects of clinical nursing.

  6. Analysis of vehicle classification and truck weight data of the New England states

    DOT National Transportation Integrated Search

    1998-09-01

    This report is about a statistical analysis of 1995-96 classification and weigh in motion (WIM) data from 17 continuous traffic-monitoring sites in New England. It documents work performed by Oak Ridge National Laboratory in fulfillment of 'Analysis ...

  7. 10 CFR 1045.44 - Classification review prior to public release.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ....44 Classification review prior to public release. Any person with authorized access to RD or FRD who generates a document intended for public release in an RD or FRD subject area shall ensure that it is... organization (for FRD) prior to its release. ...

  8. 10 CFR 1045.44 - Classification review prior to public release.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ....44 Classification review prior to public release. Any person with authorized access to RD or FRD who generates a document intended for public release in an RD or FRD subject area shall ensure that it is... organization (for FRD) prior to its release. ...

  9. 10 CFR 1045.44 - Classification review prior to public release.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ....44 Classification review prior to public release. Any person with authorized access to RD or FRD who generates a document intended for public release in an RD or FRD subject area shall ensure that it is... organization (for FRD) prior to its release. ...

  10. 10 CFR 1045.44 - Classification review prior to public release.

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ....44 Classification review prior to public release. Any person with authorized access to RD or FRD who generates a document intended for public release in an RD or FRD subject area shall ensure that it is... organization (for FRD) prior to its release. ...

  11. 10 CFR 1045.44 - Classification review prior to public release.

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ....44 Classification review prior to public release. Any person with authorized access to RD or FRD who generates a document intended for public release in an RD or FRD subject area shall ensure that it is... organization (for FRD) prior to its release. ...

  12. 19 CFR 146.41 - Privileged foreign status.

    Code of Federal Regulations, 2010 CFR

    2010-04-01

    ... change in tariff classification will be given status as privileged foreign merchandise on proper application to the port director. (b) Application. Each application for this status will be made on Customs... which has effected a change in tariff classification. (c) Supporting documentation. Each applicant for...

  13. Digitally Controlled ’Programmable’ Active Filters.

    DTIC Science & Technology

    1985-12-01

    ... Advisor: Sherif Michael. Approved for public release; distribution is unlimited. ...

  14. Instrumentation for Linear and Nonlinear Optical Device Characterization

    DTIC Science & Technology

    2018-01-31

    ... hundreds of picoseconds. Figure 3 illustrates example data taken from the oscilloscope. [Figure 3(a): a screen shot; time axis 0 to 30 ns.] ...

  15. Interactive Electronic Circuit Simulation on Small Computer Systems

    DTIC Science & Technology

    1979-11-01

    ... Interactive-mode circuit simulation and batch-mode circuit simulation on minicomputers are compared ... on the circuit Q. For circuits with Q less than 1, this ratio is typically 10:1. ...

  16. A Three-Phase Decision Model of Computer-Aided Coding for the Iranian Classification of Health Interventions (IRCHI)

    PubMed Central

    Azadmanjir, Zahra; Safdari, Reza; Ghazisaeedi, Marjan; Mokhtaran, Mehrshad; Kameli, Mohammad Esmail

    2017-01-01

    Introduction: Accurately coded data are critical in healthcare. Computer-Assisted Coding (CAC) is an effective tool to improve clinical coding, in particular when a new classification is to be developed and implemented. But determining the appropriate method for development requires considering the specifications of existing CAC systems, the requirements for each type, the available infrastructure, and the classification scheme. Aim: The aim of the study was the development of a decision model for determining the accurate code of each medical intervention in the Iranian Classification of Health Interventions (IRCHI) that can be implemented as a suitable CAC system. Methods: First, a sample of existing CAC systems was reviewed. Then, the feasibility of each type of CAC system was examined with regard to the prerequisites for its implementation. Next, a suitable model was proposed according to the structure of the classification scheme and implemented as an interactive system. Results: There is a significant relationship between the level of assistance of a CAC system and its integration with electronic medical documents. Implementation of fully automated CAC systems was impossible due to the immature development of electronic medical records and problems in the language used for medical documentation. So, a model was proposed to develop a semi-automated CAC system based on hierarchical relationships between entities in the classification scheme, together with decision-making logic to specify the characters of a code step by step through a web-based interactive user interface. It is composed of three phases, which select the Target, Action and Means, respectively, for an intervention. Conclusion: The proposed model suited the current status of clinical documentation and coding in Iran, as well as the structure of the new classification scheme. Our results show it is practical. However, the model needs to be evaluated in the next stage of the research. PMID:28883671
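
    The three-phase selection can be sketched as successive lookups in a nested scheme, one axis per phase. The table below is an invented stand-in for the IRCHI scheme, not its real content.

        # Hypothetical three-phase code assembly: narrow by Target, then Action,
        # then Means. The nested scheme and codes are invented for illustration.
        scheme = {
            "eye":   {"excision": {"open approach": "A01.1", "endoscopic": "A01.2"}},
            "liver": {"biopsy":   {"percutaneous": "B07.3"}},
        }

        def assign_code(target, action, means):
            return scheme[target][action][means]

        print(assign_code("liver", "biopsy", "percutaneous"))  # -> "B07.3"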

  17. Safety equipment list for the 241-SY-101 RAPID mitigation project

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    MORRIS, K.L.

    1999-06-29

    This document provides the safety classification for the safety (safety class and safety significant) structures, systems, and components (SSCs) associated with the 241-SY-101 RAPID Mitigation Project. This document is being issued as the project SEL until the supporting authorization basis documentation is finalized. Upon implementation of the authorization basis documentation, this document will be superseded by the TWRS SEL (LMHC 1999), which will be updated to include the information contained herein.

  18. Text Extraction from Scene Images by Character Appearance and Structure Modeling

    PubMed Central

    Yi, Chucai; Tian, Yingli

    2012-01-01

    In this paper, we propose a novel algorithm to detect text information from natural scene images. Scene text classification and detection are still open research topics. Our proposed algorithm is able to model both character appearance and structure to generate representative and discriminative text descriptors. The contributions of this paper include three aspects: 1) a new character appearance model by a structure correlation algorithm which extracts discriminative appearance features from detected interest points of character samples; 2) a new text descriptor based on structons and correlatons, which model character structure by structure differences among character samples and structure component co-occurrence; and 3) a new text region localization method by combining color decomposition, character contour refinement, and string line alignment to localize character candidates and refine detected text regions. We perform three groups of experiments to evaluate the effectiveness of our proposed algorithm, including text classification, text detection, and character identification. The evaluation results on benchmark datasets demonstrate that our algorithm achieves the state-of-the-art performance on scene text classification and detection, and significantly outperforms the existing algorithms for character identification. PMID:23316111

  19. Statistical text classifier to detect specific type of medical incidents.

    PubMed

    Wong, Zoie Shui-Yee; Akiyama, Masanori

    2013-01-01

    WHO Patient Safety has placed a focus on increasing the coherence and expressiveness of patient safety classification, with the foundation of the International Classification for Patient Safety (ICPS). Text classification and statistical approaches have been shown to be successful in identifying safety problems in the aviation industry using incident text information. It has been challenging to comprehend the taxonomy of medical incidents in a structured manner. Independent reporting mechanisms for patient safety incidents have been established in the UK, Canada, Australia, Japan, Hong Kong, etc. This research demonstrates the potential to construct statistical text classifiers to detect specific types of medical incidents using incident text data. An illustrative example for classifying look-alike sound-alike (LASA) medication incidents, using structured text from 227 advisories related to medication errors from Global Patient Safety Alerts (GPSA), is shown in this poster presentation. The classifier was built using a logistic regression model. The ROC curve and the AUC value indicated that this is a satisfactory model.
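
    A classifier of the kind described is a few lines in any ML toolkit. This sketch pairs bag-of-words features with logistic regression and reports AUC; the advisory texts and LASA labels are invented, not GPSA data.

        # Hedged sketch of a logistic-regression LASA incident detector with ROC/AUC
        # evaluation. Texts and labels are invented placeholders.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import roc_auc_score

        texts = ["hydroxyzine dispensed instead of hydralazine",
                 "pump tubing disconnected during infusion",
                 "celebrex confused with celexa at order entry",
                 "patient identification band missing"]
        lasa = [1, 0, 1, 0]

        X = CountVectorizer().fit_transform(texts)
        clf = LogisticRegression().fit(X, lasa)
        print(roc_auc_score(lasa, clf.predict_proba(X)[:, 1]))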

  20. 32 CFR 1802.26 - Notification of decision and prohibition on adverse action.

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... NATIONAL COUNTERINTELLIGENCE CENTER CHALLENGES TO CLASSIFICATION OF DOCUMENTS BY AUTHORIZED HOLDERS... regard to the challenge and that an appeal of the decision may be made to the Interagency Security Classification Appeals Panel (ISCAP) established pursuant to § 5.4 of this Order. ...

  1. 32 CFR 1802.26 - Notification of decision and prohibition on adverse action.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... NATIONAL COUNTERINTELLIGENCE CENTER CHALLENGES TO CLASSIFICATION OF DOCUMENTS BY AUTHORIZED HOLDERS... regard to the challenge and that an appeal of the decision may be made to the Interagency Security Classification Appeals Panel (ISCAP) established pursuant to § 5.4 of this Order. ...

  2. A System for Automated Extraction of Metadata from Scanned Documents using Layout Recognition and String Pattern Search Models.

    PubMed

    Misra, Dharitri; Chen, Siyuan; Thoma, George R

    2009-01-01

    One of the most expensive aspects of archiving digital documents is the manual acquisition of context-sensitive metadata useful for the subsequent discovery of, and access to, the archived items. For certain types of textual documents, such as journal articles, pamphlets, official government records, etc., where the metadata is contained within the body of the documents, a cost-effective method is to identify and extract the metadata in an automated way, applying machine learning and string pattern search techniques. At the U.S. National Library of Medicine (NLM) we have developed an automated metadata extraction (AME) system that employs layout classification and recognition models with a metadata pattern search model for a text corpus with structured or semi-structured information. A combination of Support Vector Machine and Hidden Markov Model is used to create the layout recognition models from a training set of the corpus, following which a rule-based metadata search model is used to extract the embedded metadata by analyzing the string patterns within and surrounding each field in the recognized layouts. In this paper, we describe the design of our AME system, with focus on the metadata search model. We present the extraction results for a historic collection from the Food and Drug Administration, and outline how the system may be adapted for similar collections. Finally, we discuss some ongoing enhancements to our AME system.
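
    The string-pattern search step reduces, at its simplest, to layout-specific regular expressions applied to recognized text. The patterns and sample page below are hypothetical, not NLM's actual AME rules.

        # Toy version of rule-based metadata pattern search: once a page's layout
        # class is known, field-specific patterns pull out the metadata.
        # Patterns and sample text are invented. Requires Python 3.8+.
        import re

        page = "Report No. FDA-72-1041\nTitle: Survey of Drug Residues\nDate: March 1972"
        rules = {"report_no": r"Report No\.\s*(\S+)",
                 "title":     r"Title:\s*(.+)",
                 "date":      r"Date:\s*(.+)"}

        metadata = {field: m.group(1) for field, pat in rules.items()
                    if (m := re.search(pat, page))}
        print(metadata)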

  3. Topic detection using paragraph vectors to support active learning in systematic reviews.

    PubMed

    Hashimoto, Kazuma; Kontonatsios, Georgios; Miwa, Makoto; Ananiadou, Sophia

    2016-08-01

    Systematic reviews require expert reviewers to manually screen thousands of citations in order to identify all relevant articles to the review. Active learning text classification is a supervised machine learning approach that has been shown to significantly reduce the manual annotation workload by semi-automating the citation screening process of systematic reviews. In this paper, we present a new topic detection method that induces an informative representation of studies, to improve the performance of the underlying active learner. Our proposed topic detection method uses a neural network-based vector space model to capture semantic similarities between documents. We first represent documents within the vector space, and cluster the documents into a predefined number of clusters. The centroids of the clusters are treated as latent topics. We then represent each document as a mixture of latent topics. For evaluation purposes, we employ the active learning strategy using both our novel topic detection method and a baseline topic model (i.e., Latent Dirichlet Allocation). Results obtained demonstrate that our method achieves high sensitivity for eligible studies and a significantly reduced manual annotation cost when compared to the baseline method. This observation is consistent across two clinical and three public health reviews. The tool introduced in this work is available from https://nactem.ac.uk/pvtopic/. Copyright © 2016 The Authors. Published by Elsevier Inc. All rights reserved.
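
    The topic-detection recipe (paragraph vectors, then k-means, with centroids treated as latent topics) can be sketched with gensim and scikit-learn; the corpus and parameter values below are toy placeholders.

        # Hedged sketch: embed studies with Doc2Vec, cluster the embeddings, and
        # treat cluster centroids as latent topics. Corpus and settings are toys.
        from gensim.models.doc2vec import Doc2Vec, TaggedDocument
        from sklearn.cluster import KMeans
        import numpy as np

        docs = ["randomized trial of statin therapy", "cohort study of statin safety",
                "qualitative study of nurse staffing", "survey of nurse workload"]
        tagged = [TaggedDocument(d.split(), [i]) for i, d in enumerate(docs)]

        model = Doc2Vec(tagged, vector_size=16, min_count=1, epochs=40)
        vecs = np.array([model.dv[i] for i in range(len(docs))])

        topics = KMeans(n_clusters=2, n_init=10).fit(vecs)   # centroids = latent topics
        print(topics.labels_)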

  4. Theoretical Synthesis of Mixed Materials for CO2 Capture Applications

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Duan, Yuhua

  5. U.S. speeding-related motor vehicle fatalities by road function classification, 1995-1999

    DOT National Transportation Integrated Search

    2013-10-01

    This document serves as an Operational Concept for the Applications for the Environment: Real-Time Information Synthesis (AERIS) Eco-Signal Operations Transformative Concept. It was developed along with two other Operational Concept documents that de...

  6. Putzmeister Inc. and Implication of Changes in Nonattainment Classification

    EPA Pesticide Factsheets

    This document may be of assistance in applying the New Source Review (NSR) air permitting regulations including the Prevention of Significant Deterioration (PSD) requirements. This document is part of the NSR Policy and Guidance Database. Some documents in the database are a scanned or retyped version of a paper photocopy of the original. Although we have taken considerable effort to quality assure the documents, some may contain typographical errors. Contact the office that issued the document if you need a copy of the original.

  7. PSD Source Classification for Safety Kleen's Lubricating Oil Recovery Facility

    EPA Pesticide Factsheets

    This document may be of assistance in applying the New Source Review (NSR) air permitting regulations including the Prevention of Significant Deterioration (PSD) requirements. This document is part of the NSR Policy and Guidance Database. Some documents in the database are a scanned or retyped version of a paper photocopy of the original. Although we have taken considerable effort to quality assure the documents, some may contain typographical errors. Contact the office that issued the document if you need a copy of the original.

  8. Classification of the Bardstown Fuel Alcohol Company under PSD

    EPA Pesticide Factsheets

    This document may be of assistance in applying the New Source Review (NSR) air permitting regulations including the Prevention of Significant Deterioration (PSD) requirements. This document is part of the NSR Policy and Guidance Database. Some documents in the database are a scanned or retyped version of a paper photocopy of the original. Although we have taken considerable effort to quality assure the documents, some may contain typographical errors. Contact the office that issued the document if you need a copy of the original.

  9. Classification of Ethanol Fuel Plants under PSD

    EPA Pesticide Factsheets

    This document may be of assistance in applying the New Source Review (NSR) air permitting regulations including the Prevention of Significant Deterioration (PSD) requirements. This document is part of the NSR Policy and Guidance Database. Some documents in the database are a scanned or retyped version of a paper photocopy of the original. Although we have taken considerable effort to quality assure the documents, some may contain typographical errors. Contact the office that issued the document if you need a copy of the original.

  10. Classification of Emissions from Landfills for NSR Applicability Purposes

    EPA Pesticide Factsheets

    This document may be of assistance in applying the New Source Review (NSR) air permitting regulations including the Prevention of Significant Deterioration (PSD) requirements. This document is part of the NSR Policy and Guidance Database. Some documents in the database are a scanned or retyped version of a paper photocopy of the original. Although we have taken considerable effort to quality assure the documents, some may contain typographical errors. Contact the office that issued the document if you need a copy of the original.

  11. Centroid-Based Document Classification Algorithms: Analysis & Experimental Results

    DTIC Science & Technology

    2000-03-06

    stories such as baseball, football, basketball, and Olympics. In the first category, most of the documents contain the words Clinton and Lewinsky and hence... document. On the other hand, any of the sports-related words like baseball, football, and basketball appearing in a document will put the document in the... [The remainder of the captured snippet is a spilled table of weighted centroid term stems, e.g. "diseas", "women", "heart", "newspap", "editor", "advertis".]

  12. Report on Information Retrieval and Library Automation Studies.

    ERIC Educational Resources Information Center

    Alberta Univ., Edmonton. Dept. of Computing Science.

    Short abstracts of works in progress or completed in the Department of Computing Science at the University of Alberta are presented under five major headings. The five categories are: Storage and search techniques for document data bases, Automatic classification, Study of indexing and classification languages through computer manipulation of data…

  13. American Indian Languages: Classifications and List.

    ERIC Educational Resources Information Center

    Zisa, Charles A.

    This document lists the indigenous languages of North and South America, with the exception of the Eskimo-Aleut languages, the European-based Creoles, and languages which represent Post-Columbian intrusions. Section I consists of genetic classification of the languages included (948 language-level entries). Section II is an alphabetic list of all…

  14. 78 FR 19637 - National Organic Program: Notice of Draft Guidance on Classification of Materials and Materials...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-04-02

    ... Classification of Materials and Materials for Organic Crop Production AGENCY: Agricultural Marketing Service... organic crop production, livestock production, and handling. The second set of draft guidance documents, NOP 5034, provides clarification regarding materials for use in organic crop production. These...

  15. Hierarchical Clustering: A Bibliography. Technical Report No. 1.

    ERIC Educational Resources Information Center

    Farrell, William T.

    "Classification: Purposes, Principles, Progress, Prospects" by Robert R. Sokal is reprinted in this document. It summarizes the principles of classification and cluster analysis in a manner which is of specific value to the Marine Corps Office of Manpower Utilization. Following the article is a 184 item bibliography on cluster analysis…

  16. Error Detection in Mechanized Classification Systems

    ERIC Educational Resources Information Center

    Hoyle, W. G.

    1976-01-01

    When documentary material is indexed by a mechanized classification system, and the results judged by trained professionals, the number of documents in disagreement, after suitable adjustment, defines the error rate of the system. In a test case disagreement was 22 percent and, of this 22 percent, the computer correctly identified two-thirds of…

  17. 10 CFR 1016.33 - External transmission of documents and material.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... ordinary manner and sealed with tape, the appropriate classification shall be placed on both sides of the... address. (3) The outer envelope or wrapper shall be addressed in the ordinary manner. No classification... clearance or access authorization who have been given written authority by their employers. (2) Confidential...

  18. 10 CFR 1016.33 - External transmission of documents and material.

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ... ordinary manner and sealed with tape, the appropriate classification shall be placed on both sides of the... address. (3) The outer envelope or wrapper shall be addressed in the ordinary manner. No classification... clearance or access authorization who have been given written authority by their employers. (2) Confidential...

  19. Examining the Effect of Transverse Motion on Retinal Biometric Identifiers Relating to Shipboard Security Mechanisms

    DTIC Science & Technology

    1986-03-01

  20. Catalog of War Games

    DTIC Science & Technology

    1992-10-09

    This catalog provides information on the primary war games and combat simulations. [The remainder of the captured snippet is report documentation page residue.]

  1. Evaluation of the Efficiency of Liquid Cooling Garments using a Thermal Manikin

    DTIC Science & Technology

    2005-05-01

    ...temperatures. The software also calculates thermal resistances and evaporative resistances. TM tests were run dry (i.e., no sweating) and wet (i.e., ...). [The remainder of the captured snippet is report documentation page residue.]

  2. SFINX-a drug-drug interaction database designed for clinical decision support systems.

    PubMed

    Böttiger, Ylva; Laine, Kari; Andersson, Marine L; Korhonen, Tuomas; Molin, Björn; Ovesjö, Marie-Louise; Tirkkonen, Tuire; Rane, Anders; Gustafsson, Lars L; Eiermann, Birgit

    2009-06-01

    The aim was to develop a drug-drug interaction database (SFINX) to be integrated into decision support systems or to be used in website solutions for clinical evaluation of interactions. Key elements such as substance properties and names, drug formulations, text structures and references were defined before development of the database. Standard operating procedures for literature searches, text writing rules and a classification system for clinical relevance and documentation level were determined. ATC codes, CAS numbers and country-specific codes for substances were identified and quality assured to ensure safe integration of SFINX into other data systems. Much effort was put into giving short and practical advice regarding clinically relevant drug-drug interactions. SFINX includes over 8,000 interaction pairs and is integrated into Swedish and Finnish computerised decision support systems. Over 31,000 physicians and pharmacists are receiving interaction alerts through SFINX. User feedback is collected for continuous improvement of the content. SFINX is a potentially valuable tool delivering instant information on drug interactions during prescribing and dispensing.

  3. Probing the Topological Properties of Complex Networks Modeling Short Written Texts

    PubMed Central

    Amancio, Diego R.

    2015-01-01

    In recent years, graph theory has been widely employed to probe several language properties. More specifically, the so-called word adjacency model has been proven useful for tackling several practical problems, especially those relying on textual stylistic analysis. The most common approach to treat texts as networks has simply considered either large pieces of texts or entire books. This approach has certainly worked well—many informative discoveries have been made this way—but it raises an uncomfortable question: could there be important topological patterns in small pieces of texts? To address this problem, the topological properties of subtexts sampled from entire books were probed. Statistical analyses performed on a dataset comprising 50 novels revealed that most of the traditional topological measurements are stable for short subtexts. When the performance of the authorship recognition task was analyzed, it was found that a proper sampling yields a discriminability similar to the one found with full texts. Surprisingly, the support vector machine classification based on the characterization of short texts outperformed the one performed with entire books. These findings suggest that a local topological analysis of large documents might improve their global characterization. Most importantly, it was verified, as a proof of principle, that short texts can be analyzed with the methods and concepts of complex networks. As a consequence, the techniques described here can be extended in a straightforward fashion to analyze texts as time-varying complex networks. PMID:25719799
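
    As a minimal sketch of the word adjacency model named above (the tokenization and measurement choices here are assumptions, not taken from the paper), a short subtext can be turned into a network and probed with a standard topological measurement:

      import networkx as nx

      def word_adjacency_network(text: str) -> nx.Graph:
          """Nodes are word types; edges join words that occur next to each other."""
          tokens = [t.lower() for t in text.split() if t.isalpha()]
          G = nx.Graph()
          for a, b in zip(tokens, tokens[1:]):
              if a != b:
                  G.add_edge(a, b)
          return G

      subtext = "the cat sat on the mat and the dog sat by the cat"
      G = word_adjacency_network(subtext)
      print(G.number_of_nodes(), G.number_of_edges())
      print(nx.average_clustering(G))   # one traditional topological measurement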

  4. A stylistic classification of Russian-language texts based on the random walk model

    NASA Astrophysics Data System (ADS)

    Kramarenko, A. A.; Nekrasov, K. A.; Filimonov, V. V.; Zhivoderov, A. A.; Amieva, A. A.

    2017-09-01

    A formal approach to text analysis based on the random walk model is suggested. The frequencies and reciprocal positions of the vowel letters are matched up by a process of quasi-particle migration. A statistically significant difference in the migration parameters is found between texts of different functional styles, demonstrating the possibility of classifying texts with the suggested method. Five groups of texts are singled out that can be distinguished from one another by the parameters of the quasi-particle migration process.
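
    The abstract does not define the migration process precisely; purely as a loose illustration of a vowel-position statistic in this spirit (all details below are assumptions), one can compare the gaps between successive vowel letters across texts:

      import statistics

      VOWELS = set("aeiouy")  # the paper analyzes Russian vowels; English stands in here

      def vowel_gap_stats(text: str) -> tuple[float, float]:
          """Mean and spread of the distances between successive vowel letters."""
          positions = [i for i, ch in enumerate(text.lower()) if ch in VOWELS]
          gaps = [b - a for a, b in zip(positions, positions[1:])]
          return statistics.mean(gaps), statistics.stdev(gaps)

      print(vowel_gap_stats("Formal approaches to literary style can be simple."))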

  5. Emotion models for textual emotion classification

    NASA Astrophysics Data System (ADS)

    Bruna, O.; Avetisyan, H.; Holub, J.

    2016-11-01

    This paper deals with textual emotion classification, which has gained attention in recent years. Emotion classification is used in user experience, product evaluation, national security, and tutoring applications. It attempts to detect the emotional content in the input text and, based on different approaches, establish what kind of emotional content is present, if any. Textual emotion classification is the most difficult variant to handle, since it relies mainly on linguistic resources and introduces many challenges in assigning text to an emotion represented by a proper model. A crucial part of each emotion detector is its emotion model. The focus of this paper is to introduce the emotion models used for classification. Categorical and dimensional models of emotion are explained, and some more advanced approaches are mentioned.
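
    As an illustration of the two families of models named above, the toy sketch below contrasts a categorical label set with a dimensional (valence/arousal) model; the coordinates are rough assumptions for demonstration, not values from the paper:

      # Coordinates are rough illustrative assumptions, not values from the paper.
      CATEGORICAL = {"joy", "anger", "fear", "sadness"}      # a categorical model

      DIMENSIONAL = {                                        # (valence, arousal)
          "joy": (0.8, 0.5),
          "anger": (-0.6, 0.7),
          "fear": (-0.7, 0.6),
          "sadness": (-0.7, -0.4),
      }

      def nearest_category(valence: float, arousal: float) -> str:
          """Map a point in the dimensional model back to a categorical label."""
          return min(DIMENSIONAL, key=lambda e: (DIMENSIONAL[e][0] - valence) ** 2
                                              + (DIMENSIONAL[e][1] - arousal) ** 2)

      print(nearest_category(0.6, 0.4))  # -> joy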

  6. 10 CFR 1045.30 - Purpose and scope.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... and declassification, provides for periodic or systematic review of RD and FRD documents, and describes procedures for the mandatory review of RD and FRD documents. This subpart applies to all RD and... DEPARTMENT OF ENERGY (GENERAL PROVISIONS) NUCLEAR CLASSIFICATION AND DECLASSIFICATION Generation and Review...

  7. 28 CFR 17.29 - Documents of permanent historical value.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... 28 Judicial Administration 1 2010-07-01 2010-07-01 false Documents of permanent historical value. 17.29 Section 17.29 Judicial Administration DEPARTMENT OF JUSTICE CLASSIFIED NATIONAL SECURITY... historical value. The original classification authority, to the greatest extent possible, shall declassify...

  8. 10 CFR 1045.30 - Purpose and scope.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... and declassification, provides for periodic or systematic review of RD and FRD documents, and describes procedures for the mandatory review of RD and FRD documents. This subpart applies to all RD and... DEPARTMENT OF ENERGY (GENERAL PROVISIONS) NUCLEAR CLASSIFICATION AND DECLASSIFICATION Generation and Review...

  9. Negation handling in sentiment classification using rule-based adapted from Indonesian language syntactic for Indonesian text in Twitter

    NASA Astrophysics Data System (ADS)

    Amalia, Rizkiana; Arif Bijaksana, Moch; Darmantoro, Dhinta

    2018-03-01

    The presence of a negation word can change the polarity of a text; if negation is not handled properly, it will degrade the performance of sentiment classification. Negation words in Indonesian are ‘tidak’, ‘bukan’, ‘belum’ and ‘jangan’. There are also conjunction words that can reverse the actual polarity, such as ‘tetapi’ or ‘tapi’. Unigram features fall short in dealing with negation because they treat the negation word and the negated words as separate tokens. A common approach for negation handling in English text tags the words following a negation word with ‘NEG_’ until the first punctuation mark, but this may tag words that are not actually negated, and it does not handle negation and conjunction in one sentence. In this study, we propose a rule-based method that determines which words are negated by adapting the syntactic rules of Indonesian negation to delimit the scope of negation. Adapting these syntactic rules and tagging words with “NEG_”, an SVM classifier with an RBF kernel achieves better performance than the other experiments. In terms of average F1-score, the proposed method improves on the baseline by 1.79% (baseline without negation handling) and 5% (baseline with existing negation handling) for a dataset in which all tweets contain negation words, and by 2.69% (without negation handling) and 3.17% (with existing negation handling) for a second dataset with varying numbers of negation words per tweet.
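
    The baseline English-style scope rule criticized above is easy to sketch (the paper's proposed Indonesian syntactic rules are not reproduced here): every token after a negation word receives the ‘NEG_’ prefix until the first punctuation mark:

      import re

      NEGATION_WORDS = {"tidak", "bukan", "belum", "jangan"}  # from the abstract
      PUNCTUATION = set(".,!?;:")

      def tag_negation(sentence: str) -> list:
          """Baseline handling described in the abstract: prefix every token after
          a negation word with NEG_ until the first punctuation mark."""
          tokens = re.findall(r"\w+|[.,!?;:]", sentence.lower())
          tagged, negating = [], False
          for tok in tokens:
              if tok in PUNCTUATION:
                  negating = False
                  tagged.append(tok)
              elif tok in NEGATION_WORDS:
                  negating = True
                  tagged.append(tok)
              else:
                  tagged.append("NEG_" + tok if negating else tok)
          return tagged

      print(tag_negation("saya tidak suka film ini, tapi aktornya bagus"))
      # ['saya', 'tidak', 'NEG_suka', 'NEG_film', 'NEG_ini', ',', 'tapi', 'aktornya', 'bagus']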

  10. Multi-label literature classification based on the Gene Ontology graph.

    PubMed

    Jin, Bo; Muller, Brian; Zhai, Chengxiang; Lu, Xinghua

    2008-12-08

    The Gene Ontology is a controlled vocabulary for representing knowledge related to genes and proteins in a computable form. The current effort of manually annotating proteins with the Gene Ontology is outpaced by the rate of accumulation of biomedical knowledge in literature, which urges the development of text mining approaches to facilitate the process by automatically extracting the Gene Ontology annotation from literature. The task is usually cast as a text classification problem, and contemporary methods are confronted with unbalanced training data and the difficulties associated with multi-label classification. In this research, we investigated the methods of enhancing automatic multi-label classification of biomedical literature by utilizing the structure of the Gene Ontology graph. We have studied three graph-based multi-label classification algorithms, including a novel stochastic algorithm and two top-down hierarchical classification methods for multi-label literature classification. We systematically evaluated and compared these graph-based classification algorithms to a conventional flat multi-label algorithm. The results indicate that, through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods can significantly improve predictions of the Gene Ontology terms implied by the analyzed text. Furthermore, the graph-based multi-label classifiers are capable of suggesting Gene Ontology annotations (to curators) that are closely related to the true annotations even if they fail to predict the true ones directly. A software package implementing the studied algorithms is available for the research community. Through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods have better potential than the conventional flat multi-label classification approach to facilitate protein annotation based on the literature.
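
    A minimal sketch of the top-down idea follows (the ontology fragment and per-term "classifiers" are toy stand-ins, not the paper's trained models): a term is predicted only if its local classifier fires while descending from the root, which keeps predictions consistent with the graph:

      import networkx as nx

      # Toy ontology fragment: edges point from parent term to child term.
      onto = nx.DiGraph([("GO:root", "GO:binding"),
                         ("GO:binding", "GO:dna_binding"),
                         ("GO:root", "GO:transport")])

      # Stand-ins for trained per-term classifiers (keyword triggers here).
      triggers = {"GO:binding": "bind", "GO:dna_binding": "dna",
                  "GO:transport": "transport"}

      def predict_terms(text: str) -> set:
          """Descend the DAG top-down; visit a child only if its classifier fires,
          so every predicted term is consistent with its ancestors."""
          text, labels, frontier = text.lower(), set(), ["GO:root"]
          while frontier:
              node = frontier.pop()
              for child in onto.successors(node):
                  trigger = triggers.get(child)
                  if trigger and trigger in text:
                      labels.add(child)
                      frontier.append(child)
          return labels

      print(predict_terms("the protein shows dna binding activity"))
      # e.g. {'GO:binding', 'GO:dna_binding'}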

  11. Readability Formulas and User Perceptions of Electronic Health Records Difficulty: A Corpus Study

    PubMed Central

    Yu, Hong

    2017-01-01

    Background Electronic health records (EHRs) are a rich resource for developing applications to engage patients and foster patient activation, thus holding a strong potential to enhance patient-centered care. Studies have shown that providing patients with access to their own EHR notes may improve the understanding of their own clinical conditions and treatments, leading to improved health care outcomes. However, the highly technical language in EHR notes impedes patients’ comprehension. Numerous studies have evaluated the difficulty of health-related text using readability formulas such as Flesch-Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook (SMOG), and Gunning-Fog Index (GFI). They conclude that the materials are often written at a grade level higher than common recommendations. Objective The objective of our study was to explore the relationship between the aforementioned readability formulas and the laypeople’s perceived difficulty on 2 genres of text: general health information and EHR notes. We also validated the formulas’ appropriateness and generalizability on predicting difficulty levels of highly complex technical documents. Methods We collected 140 Wikipedia articles on diabetes and 242 EHR notes with diabetes International Classification of Diseases, Ninth Revision code. We recruited 15 Amazon Mechanical Turk (AMT) users to rate difficulty levels of the documents. Correlations between laypeople’s perceived difficulty levels and readability formula scores were measured, and their difference was tested. We also compared word usage and the impact of medical concepts of the 2 genres of text. Results The distributions of both readability formulas’ scores (P<.001) and laypeople’s perceptions (P=.002) on the 2 genres were different. Correlations of readability predictions and laypeople’s perceptions were weak. Furthermore, despite being graded at similar levels, documents of different genres were still perceived with different difficulty (P<.001). Word usage in the 2 related genres still differed significantly (P<.001). Conclusions Our findings suggested that the readability formulas’ predictions did not align with perceived difficulty in either text genre. The widely used readability formulas were highly correlated with each other but did not show adequate correlation with readers’ perceived difficulty. Therefore, they were not appropriate to assess the readability of EHR notes. PMID:28254738
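
    Of the formulas mentioned, FKGL is the most compact to reproduce; the formula below is the standard one, while the naive syllable counter is an assumption for illustration rather than the counting rule used in the study:

      # FKGL = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
      import re

      def count_syllables(word: str) -> int:
          # Crude approximation: count groups of consecutive vowel letters.
          return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

      def fkgl(text: str) -> float:
          sentences = max(1, len(re.findall(r"[.!?]+", text)))
          words = re.findall(r"[A-Za-z']+", text)
          syllables = sum(count_syllables(w) for w in words)
          return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

      print(round(fkgl("The patient presents with poorly controlled diabetes mellitus."), 1))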

  12. Assessing Unmet Information Needs of Breast Cancer Survivors: Exploratory Study of Online Health Forums Using Text Classification and Retrieval.

    PubMed

    McRoy, Susan; Rastegar-Mojarad, Majid; Wang, Yanshan; Ruddy, Kathryn J; Haddad, Tufia C; Liu, Hongfang

    2018-05-15

    Patient education materials given to breast cancer survivors may not be a good fit for their information needs. Needs may change over time, be forgotten, or be misreported, for a variety of reasons. An automated content analysis of survivors' postings to online health forums can identify expressed information needs over a span of time and be repeated regularly at low cost. Identifying these unmet needs can guide improvements to existing education materials and the creation of new resources. The primary goals of this project are to assess the unmet information needs of breast cancer survivors from their own perspectives and to identify gaps between information needs and current education materials. This approach employs computational methods for content modeling and supervised text classification to data from online health forums to identify explicit and implicit requests for health-related information. Potential gaps between needs and education materials are identified using techniques from information retrieval. We provide a new taxonomy for the classification of sentences in online health forum data. 260 postings from two online health forums were selected, yielding 4179 sentences for coding. After annotation of data and training alternative one-versus-others classifiers, a random forest-based approach achieved F1 scores from 66% (Other, dataset2) to 90% (Medical, dataset1) on the primary information types. 136 expressions of need were used to generate queries to indexed education materials. Upon examination of the best two pages retrieved for each query, 12% (17/136) of queries were found to have relevant content by all coders, and 33% (45/136) were judged to have relevant content by at least one. Text from online health forums can be analyzed effectively using automated methods. Our analysis confirms that breast cancer survivors have many information needs that are not covered by the written documents they typically receive, as our results suggest that at most a third of breast cancer survivors' questions would be addressed by the materials currently provided to them. ©Susan McRoy, Majid Rastegar-Mojarad, Yanshan Wang, Kathryn J. Ruddy, Tufia C. Haddad, Hongfang Liu. Originally published in JMIR Cancer (http://cancer.jmir.org), 15.05.2018.
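
    The gap-analysis step, retrieving the top two education pages for each expressed need, can be sketched with a standard TF-IDF retriever (the corpus and query below are invented examples, not the study's data):

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      pages = ["managing fatigue after chemotherapy",
               "what to expect from tamoxifen side effects",
               "diet and exercise during survivorship"]
      needs = ["will tamoxifen make me gain weight"]

      vec = TfidfVectorizer(stop_words="english")
      P = vec.fit_transform(pages)           # index the education materials
      Q = vec.transform(needs)               # queries built from expressed needs

      sims = cosine_similarity(Q, P)
      for i, need in enumerate(needs):
          top2 = sims[i].argsort()[::-1][:2]  # best two pages per query
          print(need, "->", [pages[j] for j in top2])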

  13. Assessing Unmet Information Needs of Breast Cancer Survivors: Exploratory Study of Online Health Forums Using Text Classification and Retrieval

    PubMed Central

    Rastegar-Mojarad, Majid; Wang, Yanshan; Ruddy, Kathryn J; Haddad, Tufia C; Liu, Hongfang

    2018-01-01

    Background Patient education materials given to breast cancer survivors may not be a good fit for their information needs. Needs may change over time, be forgotten, or be misreported, for a variety of reasons. An automated content analysis of survivors' postings to online health forums can identify expressed information needs over a span of time and be repeated regularly at low cost. Identifying these unmet needs can guide improvements to existing education materials and the creation of new resources. Objective The primary goals of this project are to assess the unmet information needs of breast cancer survivors from their own perspectives and to identify gaps between information needs and current education materials. Methods This approach employs computational methods for content modeling and supervised text classification to data from online health forums to identify explicit and implicit requests for health-related information. Potential gaps between needs and education materials are identified using techniques from information retrieval. Results We provide a new taxonomy for the classification of sentences in online health forum data. 260 postings from two online health forums were selected, yielding 4179 sentences for coding. After annotation of data and training alternative one-versus-others classifiers, a random forest-based approach achieved F1 scores from 66% (Other, dataset2) to 90% (Medical, dataset1) on the primary information types. 136 expressions of need were used to generate queries to indexed education materials. Upon examination of the best two pages retrieved for each query, 12% (17/136) of queries were found to have relevant content by all coders, and 33% (45/136) were judged to have relevant content by at least one. Conclusions Text from online health forums can be analyzed effectively using automated methods. Our analysis confirms that breast cancer survivors have many information needs that are not covered by the written documents they typically receive, as our results suggest that at most a third of breast cancer survivors’ questions would be addressed by the materials currently provided to them. PMID:29764801

  14. Identification of Long Bone Fractures in Radiology Reports Using Natural Language Processing to Support Healthcare Quality Improvement

    PubMed Central

    Masino, Aaron J.; Casper, T. Charles; Dean, Jonathan M.; Bell, Jamie; Enriquez, Rene; Deakyne, Sara; Chamberlain, James M.; Alpern, Elizabeth R.

    2016-01-01

    Summary Background Important information to support healthcare quality improvement is often recorded in free text documents such as radiology reports. Natural language processing (NLP) methods may help extract this information, but these methods have rarely been applied outside the research laboratories where they were developed. Objective To implement and validate NLP tools to identify long bone fractures for pediatric emergency medicine quality improvement. Methods Using freely available statistical software packages, we implemented NLP methods to identify long bone fractures from radiology reports. A sample of 1,000 radiology reports was used to construct three candidate classification models. A test set of 500 reports was used to validate the model performance. Blinded manual review of radiology reports by two independent physicians provided the reference standard. Each radiology report was segmented and word stem and bigram features were constructed. Common English “stop words” and rare features were excluded. We used 10-fold cross-validation to select optimal configuration parameters for each model. Accuracy, recall, precision and the F1 score were calculated. The final model was compared to the use of diagnosis codes for the identification of patients with long bone fractures. Results There were 329 unique word stems and 344 bigrams in the training documents. A support vector machine classifier with Gaussian kernel performed best on the test set with accuracy=0.958, recall=0.969, precision=0.940, and F1 score=0.954. Optimal parameters for this model were cost=4 and gamma=0.005. The three classification models that we tested all performed better than diagnosis codes in terms of accuracy, precision, and F1 score (diagnosis code accuracy=0.932, recall=0.960, precision=0.896, and F1 score=0.927). Conclusions NLP methods using a corpus of 1,000 training documents accurately identified acute long bone fractures from radiology reports. Strategic use of straightforward NLP methods, implemented with freely available software, offers quality improvement teams new opportunities to extract information from narrative documents. PMID:27826610
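
    The modeling recipe reported above (word and bigram features, stop-word removal, an RBF-kernel SVM with cost=4 and gamma=0.005, 10-fold cross-validation) can be sketched with scikit-learn on toy data; stemming is omitted here for brevity:

      from sklearn.pipeline import make_pipeline
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.svm import SVC
      from sklearn.model_selection import cross_val_score

      reports = ["acute fracture of the left femur",
                 "no acute fracture identified",
                 "transverse fracture of the tibia",
                 "normal radiograph of the knee"] * 10
      labels = [1, 0, 1, 0] * 10

      clf = make_pipeline(
          TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),  # words + bigrams
          SVC(kernel="rbf", C=4, gamma=0.005),  # optimal parameters from the abstract
      )
      print(cross_val_score(clf, reports, labels, cv=10).mean())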

  15. Database Design Methodology and Database Management System for Computer-Aided Structural Design Optimization.

    DTIC Science & Technology

    1984-12-01

    Prepared for the Air Force Office of Scientific Research under Grant No. AFOSR 82-0322, December 1984. [The remainder of the captured snippet is report documentation page residue.]

  16. 21 CFR 870.3450 - Vascular graft prosthesis.

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    ... terephthalate and polytetrafluoroethylene, and it may be coated with a biological coating, such as albumin or... animal origin, including human umbilical cords. (b) Classification. Class II (special controls). The special control for this device is the FDA guidance document entitled “Guidance Document for Vascular...

  17. 21 CFR 870.3450 - Vascular graft prosthesis.

    Code of Federal Regulations, 2012 CFR

    2012-04-01

    ... terephthalate and polytetrafluoroethylene, and it may be coated with a biological coating, such as albumin or... animal origin, including human umbilical cords. (b) Classification. Class II (special controls). The special control for this device is the FDA guidance document entitled “Guidance Document for Vascular...

  18. 21 CFR 870.3450 - Vascular graft prosthesis.

    Code of Federal Regulations, 2013 CFR

    2013-04-01

    ... terephthalate and polytetrafluoroethylene, and it may be coated with a biological coating, such as albumin or... animal origin, including human umbilical cords. (b) Classification. Class II (special controls). The special control for this device is the FDA guidance document entitled “Guidance Document for Vascular...

  19. 32 CFR 732.26 - Standard document numbers.

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... 32 National Defense 5 2011-07-01 2011-07-01 false Standard document numbers. 732.26 Section 732.26 National Defense Department of Defense (Continued) DEPARTMENT OF THE NAVY PERSONNEL NONNAVAL MEDICAL AND DENTAL CARE Accounting Classifications for Nonnaval Medical and Dental Care Expenses and Standard...

  20. 32 CFR 732.26 - Standard document numbers.

    Code of Federal Regulations, 2013 CFR

    2013-07-01

    ... 32 National Defense 5 2013-07-01 2013-07-01 false Standard document numbers. 732.26 Section 732.26 National Defense Department of Defense (Continued) DEPARTMENT OF THE NAVY PERSONNEL NONNAVAL MEDICAL AND DENTAL CARE Accounting Classifications for Nonnaval Medical and Dental Care Expenses and Standard...

  1. 32 CFR 732.26 - Standard document numbers.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... 32 National Defense 5 2010-07-01 2010-07-01 false Standard document numbers. 732.26 Section 732.26 National Defense Department of Defense (Continued) DEPARTMENT OF THE NAVY PERSONNEL NONNAVAL MEDICAL AND DENTAL CARE Accounting Classifications for Nonnaval Medical and Dental Care Expenses and Standard...

  2. 32 CFR 732.26 - Standard document numbers.

    Code of Federal Regulations, 2014 CFR

    2014-07-01

    ... 32 National Defense 5 2014-07-01 2014-07-01 false Standard document numbers. 732.26 Section 732.26 National Defense Department of Defense (Continued) DEPARTMENT OF THE NAVY PERSONNEL NONNAVAL MEDICAL AND DENTAL CARE Accounting Classifications for Nonnaval Medical and Dental Care Expenses and Standard...

  3. 32 CFR 732.26 - Standard document numbers.

    Code of Federal Regulations, 2012 CFR

    2012-07-01

    ... 32 National Defense 5 2012-07-01 2012-07-01 false Standard document numbers. 732.26 Section 732.26 National Defense Department of Defense (Continued) DEPARTMENT OF THE NAVY PERSONNEL NONNAVAL MEDICAL AND DENTAL CARE Accounting Classifications for Nonnaval Medical and Dental Care Expenses and Standard...

  4. 21 CFR 870.3450 - Vascular graft prosthesis.

    Code of Federal Regulations, 2011 CFR

    2011-04-01

    ... terephthalate and polytetrafluoroethylene, and it may be coated with a biological coating, such as albumin or... animal origin, including human umbilical cords. (b) Classification. Class II (special controls). The special control for this device is the FDA guidance document entitled “Guidance Document for Vascular...

  5. The Implementation of Cosine Similarity to Calculate Text Relevance between Two Documents

    NASA Astrophysics Data System (ADS)

    Gunawan, D.; Sembiring, C. A.; Budiman, M. A.

    2018-03-01

    Rapidly increasing numbers of web pages and documents lead to topic-specific filtering in order to find web pages or documents efficiently. This preliminary research uses cosine similarity to compute text relevance in order to find topic-specific documents. The research is divided into three parts. The first part is text preprocessing: the punctuation in a document is removed, the document is converted to lower case, stop words are removed, and root words are extracted using the Porter stemming algorithm. The second part is keyword weighting, whose output is used by the third part, the text relevance calculation. The text relevance calculation yields a value between 0 and 1; the closer the value is to 1, the more related the two documents are, and vice versa.
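
    The three-part pipeline reads as follows in a minimal Python sketch (the stop-word list is a tiny stand-in, and TF-IDF is assumed here for the keyword weighting step):

      import re
      from nltk.stem import PorterStemmer
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}  # tiny illustrative list
      stemmer = PorterStemmer()

      def preprocess(doc: str) -> str:
          """Part 1: strip punctuation, lowercase, remove stop words, stem."""
          tokens = re.findall(r"[a-z]+", doc.lower())
          return " ".join(stemmer.stem(t) for t in tokens if t not in STOP_WORDS)

      doc_a = "Classification of text documents is studied."
      doc_b = "We study document classification for texts."

      # Part 2: keyword weighting; Part 3: cosine similarity of the weighted vectors.
      X = TfidfVectorizer().fit_transform([preprocess(doc_a), preprocess(doc_b)])
      print(cosine_similarity(X[0], X[1])[0, 0])  # in [0, 1]; closer to 1 = more related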

  6. Classification of the visual landscape for transmission planning

    Treesearch

    Curtis Miller; Nargis Jetha; Rod MacDonald

    1979-01-01

    The Visual Landscape Type Classification method of the Route and Site Selection Division, Ontario Hydro, defines and delineates the landscape into discrete visual units using parametric and judgmental data. This qualitative and quantitative information is documented in a prescribed format to give each of the approximately 1100 Landscape Types a unique description....

  7. Applications of Location Similarity Measures and Conceptual Spaces to Event Coreference and Classification

    ERIC Educational Resources Information Center

    McConky, Katie Theresa

    2013-01-01

    This work covers topics in event coreference and event classification from spoken conversation. Event coreference is the process of identifying descriptions of the same event across sentences, documents, or structured databases. Existing event coreference work focuses on sentence similarity models or feature based similarity models requiring slot…

  8. T Cell Responses to Arenavirus Infections.

    DTIC Science & Technology

    1991-11-01

  9. Fuel Characteristic Classification System version 3.0: technical documentation

    Treesearch

    Susan J. Prichard; David V. Sandberg; Roger D. Ottmar; Ellen Eberhardt; Anne Andreu; Paige Eagle; Kjell Swedin

    2013-01-01

    The Fuel Characteristic Classification System (FCCS) is a software module that records wildland fuel characteristics and calculates potential fire behavior and hazard potentials based on input environmental variables. The FCCS 3.0 is housed within the Integrated Fuels Treatment Decision Support System (Joint Fire Science Program 2012). It can also be run from command...

  10. 32 CFR 1639.2 - The claim for Class 2-D.

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... National Defense Other Regulations Relating to National Defense SELECTIVE SERVICE SYSTEM CLASSIFICATION OF REGISTRANTS PREPARING FOR THE MINISTRY § 1639.2 The claim for Class 2-D. A claim to classification in Class 2-D must be made by the registrant in writing, such document being placed in his file folder. ...

  11. 32 CFR 1639.2 - The claim for Class 2-D.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... National Defense Other Regulations Relating to National Defense SELECTIVE SERVICE SYSTEM CLASSIFICATION OF REGISTRANTS PREPARING FOR THE MINISTRY § 1639.2 The claim for Class 2-D. A claim to classification in Class 2-D must be made by the registrant in writing, such document being placed in his file folder. ...

  12. AAAS: Automated Affirmative Action System. General Description, Phase 1.

    ERIC Educational Resources Information Center

    Institute for Services to Education, Inc., Washington, DC. TACTICS Management Information Systems Directorate.

    This document describes phase 1 of the Automated Affirmative Action System (AAAS) of the Tuskegee Institute, which was designed to organize an inventory of any patterns of job classification and assignment identifiable by sex or minority group; any job classification or organizational unit where women and minorities are not employed or are…

  13. Short text sentiment classification based on feature extension and ensemble classifier

    NASA Astrophysics Data System (ADS)

    Liu, Yang; Zhu, Xie

    2018-05-01

    With the rapid development of Internet social media, mining the emotional tendencies of short texts from the Internet to acquire useful information has attracted the attention of researchers. At present, the commonly used approaches can be grouped into rule-based classification and statistical machine learning methods. Although micro-blog sentiment analysis has made good progress, shortcomings remain, such as limited accuracy and a strong dependence of the sentiment classification effect on the available features. Addressing the characteristics of Chinese short texts, such as little information, sparse features, and diverse expressions, this paper expands the original text by mining related semantic information from comments, forwarded posts, and other related information. First, Word2vec is used to compute word similarity and extend the feature words. Then an ensemble classifier composed of SVM, KNN and HMM is used to analyze the emotion of short micro-blog texts. The experimental results show that the proposed method makes good use of comment and forwarding information to extend the original features. Compared with the traditional method, the accuracy, recall and F1 value obtained by this method are all improved.
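
    The ensemble step can be sketched as below on toy data; scikit-learn offers no HMM classifier, so SVM and KNN stand in for the paper's three members, and the Word2vec feature-extension step is omitted:

      from sklearn.pipeline import make_pipeline
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.ensemble import VotingClassifier
      from sklearn.svm import SVC
      from sklearn.neighbors import KNeighborsClassifier

      texts = ["great phone love it", "terrible battery very bad",
               "love the screen", "bad service terrible app"] * 5
      labels = [1, 0, 1, 0] * 5

      # Soft voting averages the members' class probabilities.
      ensemble = make_pipeline(
          TfidfVectorizer(),
          VotingClassifier([("svm", SVC(probability=True)),
                            ("knn", KNeighborsClassifier(n_neighbors=3))],
                           voting="soft"),
      )
      ensemble.fit(texts, labels)
      print(ensemble.predict(["battery is bad"]))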

  14. An XML-based system for the flexible classification and retrieval of clinical practice guidelines.

    PubMed Central

    Ganslandt, T.; Mueller, M. L.; Krieglstein, C. F.; Senninger, N.; Prokosch, H. U.

    2002-01-01

    Beneficial effects of clinical practice guidelines (CPGs) have not yet reached expectations due to limited routine adoption. Electronic distribution and reminder systems have the potential to overcome implementation barriers. Existing electronic CPG repositories like the National Guideline Clearinghouse (NGC) provide individual access but lack the standardized computer-readable interfaces necessary for automated guideline retrieval. The aim of this paper was to facilitate automated context-based selection and presentation of CPGs. Using attributes from the NGC classification scheme, an XML-based metadata repository was successfully implemented, providing document storage, classification and retrieval functionality. Semi-automated extraction of attributes was implemented for the import of XML guideline documents using XPath. As an example, a hospital information system interface was implemented for diagnosis-based guideline invocation. Limitations of the implemented system are discussed and possible future work is outlined. Integration of standardized computer-readable search interfaces into existing CPG repositories is proposed. PMID:12463831
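
    The XPath-based attribute extraction can be sketched as follows; the element names are hypothetical, since the NGC scheme itself is not reproduced in the abstract:

      import xml.etree.ElementTree as ET

      xml_doc = """
      <guideline>
        <title>Management of Type 2 Diabetes</title>
        <classification><category>Treatment</category></classification>
      </guideline>
      """

      root = ET.fromstring(xml_doc)
      # Simple XPath-style paths pull classification attributes out of the document.
      metadata = {
          "title": root.findtext("title"),
          "category": root.findtext("classification/category"),
      }
      print(metadata)
      # {'title': 'Management of Type 2 Diabetes', 'category': 'Treatment'}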

  15. Available Tools and Challenges Classifying Cutting-Edge and Historical Astronomical Documents

    NASA Astrophysics Data System (ADS)

    Lagerstrom, Jill

    2015-08-01

    The STScI Library assists the Science Policies Division in evaluating and choosing scientific keywords and categories for proposals for the Hubble Space Telescope mission and the upcoming James Webb Space Telescope mission. In addition, we are often faced with the question “what is the shape of the astronomical literature?” However, subject classification in astronomy has not been widely cultivated in recent times. This talk addresses the available tools and challenges of classifying cutting-edge as well as historical astronomical documents. In the process, we give an overview of current and upcoming practices of subject classification in astronomy.

  16. A Novel Feature Selection Technique for Text Classification Using Naïve Bayes.

    PubMed

    Dey Sarkar, Subhajit; Goswami, Saptarsi; Agarwal, Aman; Aktar, Javed

    2014-01-01

    With the proliferation of unstructured data, text classification or text categorization has found many applications in topic classification, sentiment analysis, authorship identification, spam detection, and so on. There are many classification algorithms available. Naïve Bayes remains one of the oldest and most popular classifiers: on one hand, its implementation is simple, and on the other hand, it requires relatively little training data. However, the literature reports that naïve Bayes performs poorly compared to other classifiers in text classification, which can make the naïve Bayes classifier unusable in spite of the simplicity and intuitiveness of the model. In this paper, we propose a two-step feature selection method based first on univariate feature selection and then on feature clustering: the univariate step reduces the search space, and clustering then selects relatively independent feature sets. We demonstrate the effectiveness of our method by a thorough evaluation and comparison over 13 datasets. The performance improvement thus achieved makes naïve Bayes comparable or superior to other classifiers. The proposed algorithm is also shown to outperform traditional methods like greedy-search-based wrappers or CFS.
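
    A rough sketch of the two-step method on toy data follows; the clustering criterion and the choice of one representative feature per cluster are assumptions for illustration, not necessarily the paper's exact procedure:

      import numpy as np
      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.feature_selection import SelectKBest, chi2
      from sklearn.cluster import KMeans
      from sklearn.naive_bayes import MultinomialNB

      texts = ["cheap pills buy now", "meeting agenda attached",
               "buy cheap meds today", "project meeting notes",
               "buy pills now cheap", "agenda for the project"] * 4
      labels = [1, 0, 1, 0, 1, 0] * 4

      vec = CountVectorizer()
      X = vec.fit_transform(texts)

      # Step 1: univariate selection (chi-squared) shrinks the search space.
      selector = SelectKBest(chi2, k=8).fit(X, labels)
      X_sel = selector.transform(X)

      # Step 2: cluster the surviving features and keep one per cluster, so the
      # final features are relatively independent of one another.
      profiles = np.asarray(X_sel.todense()).T          # one row per feature
      km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(profiles)
      keep = [int(np.where(km.labels_ == c)[0][0]) for c in range(4)]

      clf = MultinomialNB().fit(X_sel[:, keep], labels)
      query = selector.transform(vec.transform(["buy cheap pills"]))[:, keep]
      print(clf.predict(query))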

  17. 78 FR 68849 - Draft Current Intelligence Bulletin “Update of NIOSH Carcinogen Classification and Target Risk...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-11-15

    ...; Docket Number NIOSH 240-A] Draft Current Intelligence Bulletin ``Update of NIOSH Carcinogen... document for public comment entitled ``Current Intelligence Bulletin: Update of NIOSH Carcinogen... obtain comments on the draft document, ``Current Intelligence Bulletin: Update of NIOSH Carcinogen...

  18. 17 CFR 140.24 - Control and accountability procedures.

    Code of Federal Regulations, 2011 CFR

    2011-04-01

    ..., recording discs, spools and tapes shall be given the same classification and secure handling as the... Confidential documents may be reproduced to the extent required by operational needs. (4) Reproduced copies of... originating agency, and of all Secret and Confidential documents which are marked with special dissemination...

  19. 17 CFR 140.24 - Control and accountability procedures.

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    ..., recording discs, spools and tapes shall be given the same classification and secure handling as the... Confidential documents may be reproduced to the extent required by operational needs. (4) Reproduced copies of... originating agency, and of all Secret and Confidential documents which are marked with special dissemination...

  20. 17 CFR 140.24 - Control and accountability procedures.

    Code of Federal Regulations, 2012 CFR

    2012-04-01

    ..., recording discs, spools and tapes shall be given the same classification and secure handling as the... Confidential documents may be reproduced to the extent required by operational needs. (4) Reproduced copies of... originating agency, and of all Secret and Confidential documents which are marked with special dissemination...

  1. 17 CFR 140.24 - Control and accountability procedures.

    Code of Federal Regulations, 2013 CFR

    2013-04-01

    ..., recording discs, spools and tapes shall be given the same classification and secure handling as the... Confidential documents may be reproduced to the extent required by operational needs. (4) Reproduced copies of... originating agency, and of all Secret and Confidential documents which are marked with special dissemination...

  2. Structuring Legacy Pathology Reports by openEHR Archetypes to Enable Semantic Querying.

    PubMed

    Kropf, Stefan; Krücken, Peter; Mueller, Wolf; Denecke, Kerstin

    2017-05-18

    Clinical information is often stored as free text, e.g. in discharge summaries or pathology reports. These documents are semi-structured using section headers, numbered lists, items and classification strings. However, it is still challenging to retrieve relevant documents since keyword searches applied on complete unstructured documents result in many false positive retrieval results. We are concentrating on the processing of pathology reports as an example for unstructured clinical documents. The objective is to transform reports semi-automatically into an information structure that enables an improved access and retrieval of relevant data. The data is expected to be stored in a standardized, structured way to make it accessible for queries that are applied to specific sections of a document (section-sensitive queries) and for information reuse. Our processing pipeline comprises information modelling, section boundary detection and section-sensitive queries. For enabling a focused search in unstructured data, documents are automatically structured and transformed into a patient information model specified through openEHR archetypes. The resulting XML-based pathology electronic health records (PEHRs) are queried by XQuery and visualized by XSLT in HTML. Pathology reports (PRs) can be reliably structured into sections by a keyword-based approach. The information modelling using openEHR allows saving time in the modelling process since many archetypes can be reused. The resulting standardized, structured PEHRs allow accessing relevant data by retrieving data matching user queries. Mapping unstructured reports into a standardized information model is a practical solution for a better access to data. Archetype-based XML enables section-sensitive retrieval and visualisation by well-established XML techniques. Focussing the retrieval to particular sections has the potential of saving retrieval time and improving the accuracy of the retrieval.
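
    Keyword-based section boundary detection can be sketched as follows; the header names are assumed examples, and the openEHR archetype mapping and XQuery layer are omitted:

      import re

      SECTION_HEADERS = ["clinical information", "macroscopic description",
                         "microscopic description", "diagnosis"]

      def split_sections(report: str) -> dict:
          """Split a free-text report at known section headers."""
          pattern = "(" + "|".join(re.escape(h) for h in SECTION_HEADERS) + "):"
          parts = re.split(pattern, report.lower())
          # parts = [preamble, header, body, header, body, ...]
          return {parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}

      report = ("Clinical information: suspected melanoma. "
                "Microscopic description: atypical melanocytes present. "
                "Diagnosis: malignant melanoma.")
      print(split_sections(report))
      # {'clinical information': 'suspected melanoma.',
      #  'microscopic description': 'atypical melanocytes present.',
      #  'diagnosis': 'malignant melanoma.'}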

  3. Classification of Emissions from Landfills for NSR Applicability Purposes

    EPA Pesticide Factsheets

    This document may be of assistance in applying the Title V air operating permit regulations. This document is part of the Title V Policy and Guidance Database available at www2.epa.gov/title-v-operating-permits/title-v-operating-permit-policy-and-guidance-document-index. Some documents in the database are a scanned or retyped version of a paper photocopy of the original. Although we have taken considerable effort to quality assure the documents, some may contain typographical errors. Contact the office that issued the document if you need a copy of the original.

  4. Localizing text in scene images by boundary clustering, stroke segmentation, and string fragment classification.

    PubMed

    Yi, Chucai; Tian, Yingli

    2012-09-01

    In this paper, we propose a novel framework to extract text regions from scene images with complex backgrounds and multiple text appearances. This framework consists of three main steps: boundary clustering (BC), stroke segmentation, and string fragment classification. In BC, we propose a new bigram-color-uniformity-based method to model both text and attachment surface, and cluster edge pixels based on color pairs and spatial positions into boundary layers. Then, stroke segmentation is performed at each boundary layer by color assignment to extract character candidates. We propose two algorithms to combine the structural analysis of text stroke with color assignment and filter out background interferences. Further, we design a robust string fragment classification based on Gabor-based text features. The features are obtained from feature maps of gradient, stroke distribution, and stroke width. The proposed framework of text localization is evaluated on scene images, born-digital images, broadcast video images, and images of handheld objects captured by blind persons. Experimental results on respective datasets demonstrate that the framework outperforms state-of-the-art localization algorithms.

  5. A system for classifying disease comorbidity status from medical discharge summaries using automated hotspot and negated concept detection.

    PubMed

    Ambert, Kyle H; Cohen, Aaron M

    2009-01-01

    OBJECTIVE Free-text clinical reports serve as an important part of patient care management and clinical documentation of patient disease and treatment status. Free-text notes are commonplace in medical practice, but remain an under-used source of information for clinical and epidemiological research, as well as personalized medicine. The authors explore the challenges associated with automatically extracting information from clinical reports using their submission to the Integrating Informatics with Biology and the Bedside (i2b2) 2008 Natural Language Processing Obesity Challenge Task. DESIGN A text mining system for classifying patient comorbidity status, based on the information contained in clinical reports. The approach of the authors incorporates a variety of automated techniques, including hot-spot filtering, negated concept identification, zero-vector filtering, weighting by inverse class-frequency, and error-correcting of output codes with linear support vector machines. MEASUREMENTS Performance was evaluated in terms of the macroaveraged F1 measure. RESULTS The automated system performed well against manual expert rule-based systems, finishing fifth in the Challenge's intuitive task, and 13th in the textual task. CONCLUSIONS The system demonstrates that effective comorbidity status classification by an automated system is possible.
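
    Hot-spot filtering can be sketched as below (the keyword and negation lists are invented examples): only sentence windows that mention a target condition are kept for downstream classification, with a crude negation flag attached:

      import re

      HOTSPOTS = {"obesity", "diabetes", "asthma"}
      NEGATIONS = {"no", "denies", "without"}

      def hotspot_sentences(note: str) -> list:
          """Return (sentence, negated?) pairs for sentences mentioning a hot-spot."""
          sentences = re.split(r"(?<=[.!?])\s+", note)
          out = []
          for s in sentences:
              words = set(re.findall(r"[a-z]+", s.lower()))
              if words & HOTSPOTS:
                  out.append((s, bool(words & NEGATIONS)))
          return out

      note = "Patient denies diabetes. Obesity is noted on exam. Lungs clear."
      print(hotspot_sentences(note))
      # [('Patient denies diabetes.', True), ('Obesity is noted on exam.', False)]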

  6. Using Computational Text Classification for Qualitative Research and Evaluation in Extension

    ERIC Educational Resources Information Center

    Smith, Justin G.; Tissing, Reid

    2018-01-01

    This article introduces a process for computational text classification that can be used in a variety of qualitative research and evaluation settings. The process leverages supervised machine learning based on an implementation of a multinomial Bayesian classifier. Applied to a community of inquiry framework, the algorithm was used to identify…

  7. A Study of United States Air Force Medical Central Processing and Distribution Systems.

    DTIC Science & Technology

    1981-06-01

  8. Inversion of High Frequency Acoustic Data for Sediment Properties Needed for the Detection and Classification of UXOs

    DTIC Science & Technology

    2015-05-26

    Final report: Inversion of High Frequency Acoustic Data for Sediment Properties Needed for the Detection and Classification of UXOs (SERDP, contract W912HQ-12-C-0049). [The remainder of the captured snippet is report documentation page residue.]

  9. Vitamin D3 Analogues with Low Vitamin D Receptor Binding Affinity Regulate Chondrocyte Proliferation, Proteoglycan Synthesis, and Protein Kinase C Activity

    DTIC Science & Technology

    1997-07-11

  10. National Dam Safety Program. Lake Monocan Dam (ID VA-12502), James River Basin, Allen Creek. Nelson County, Virginia. Phase I Inspection Report.

    DTIC Science & Technology

    1980-04-01

  11. HARD PAN I Test Series Test and Instrumentation Plans. Volume I. Test Plan

    DTIC Science & Technology

    1975-12-01

  12. Text extraction method for historical Tibetan document images based on block projections

    NASA Astrophysics Data System (ADS)

    Duan, Li-juan; Zhang, Xi-qun; Ma, Long-long; Wu, Jian

    2017-11-01

    Text extraction is an important initial step in digitizing historical documents. In this paper, we present a text extraction method for historical Tibetan document images based on block projections. The task of text extraction is considered as a text area detection and location problem. The images are divided equally into blocks, and the blocks are filtered using information about the categories of connected components and the corner point density. By analyzing the filtered blocks' projections, the approximate text areas can be located and the text regions extracted. Experiments on a dataset of historical Tibetan documents demonstrate the effectiveness of the proposed method.
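
    The block-projection idea can be sketched with NumPy on a synthetic binarized page; the block grid and ink threshold below are assumed values, and the connected-component and corner-density filters are omitted:

      import numpy as np

      def text_blocks(page: np.ndarray, rows: int = 4, cols: int = 4,
                      min_ink: float = 0.02) -> list:
          """Split a binarized page into equal blocks; keep blocks whose
          horizontal ink projection suggests the presence of text."""
          h, w = page.shape
          bh, bw = h // rows, w // cols
          kept = []
          for r in range(rows):
              for c in range(cols):
                  block = page[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
                  projection = block.sum(axis=1) / max(1, bw)  # ink fraction per row
                  if projection.mean() > min_ink:
                      kept.append((r, c))
          return kept

      page = np.zeros((200, 200))
      page[40:60, 10:190] = 1  # a synthetic line of "ink"
      print(text_blocks(page))  # coordinates of candidate text blocks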

  13. A case study of the introduction of the International Classification for Nursing Practice(®) in Poland.

    PubMed

    Kilańska, D; Gaworska-Krzemińska, A; Grabowska, H; Gorzkowicz, B

    2016-09-01

    The development of a nursing practice, improvements in nurses' autonomy, and increased professional and personal responsibility for the medical services provided all require professional documentation with records of health status assessments, decisions undertaken, actions and their outcomes for each patient. The International Classification for Nursing Practice is a tool that meets all of these needs, and although it requires continuous evaluation, it offers professional documentation and communication in the practitioner and researcher community. The aim of this paper is to present a theoretical critique of an issue related to policy and experience of the current situation in Polish nursing - especially of the efforts to standardize nursing practices through the introduction and development of the Classification in Poland. Despite extensive promotion and training by International Council of Nurses members worldwide, there are still many countries where the Classification has not been implemented as a standard tool in healthcare facilities. Recently, a number of initiatives were undertaken in cooperation with the local and state authorities to disseminate the Classification in healthcare facilities. Thanks to intense efforts by the Polish Nurses Association and the International Council of Nurses Accredited Center for ICNP(®) Research & Development at the Medical University of Łódź, the Classification is known in Poland and has been tested at several centres. Nevertheless, an actual implementation that would allow for national and international interoperability requires strategic governmental decisions and close cooperation with information technology companies operating in the country. Discussing the barriers to the implementation of the Classification can improve understanding of it and its use. At a policy level, decision makers need to understand that using the Classification in eHealth services and tools is necessary to achieve interoperability. © 2016 International Council of Nurses.

  14. The First AO Classification System for Fractures of the Craniomaxillofacial Skeleton: Rationale, Methodological Background, Developmental Process, and Objectives

    PubMed Central

    Audigé, Laurent; Cornelius, Carl-Peter; Ieva, Antonio Di; Prein, Joachim

    2014-01-01

    Validated trauma classification systems are the sole means to provide the basis for reliable documentation and evaluation of patient care, which will open the gateway to evidence-based procedures and healthcare in the coming years. With the support of AO Investigation and Documentation, a classification group was established to develop and evaluate a comprehensive classification system for craniomaxillofacial (CMF) fractures. Blueprints for fracture classification in the major constituents of the human skull were drafted and then evaluated by a multispecialty group of experienced CMF surgeons and a radiologist in a structured process during iterative agreement sessions. At each session, surgeons independently classified the radiological imaging of up to 150 consecutive cases with CMF fractures. During subsequent review meetings, all discrepancies in the classification outcome were critically appraised for clarification and improvement until consensus was reached. The resulting CMF classification system is structured in a hierarchical fashion with three levels of increasing complexity. The most elementary level 1 simply distinguishes four fracture locations within the skull: mandible (code 91), midface (code 92), skull base (code 93), and cranial vault (code 94). Levels 2 and 3 focus on further defining the fracture locations and for fracture morphology, achieving an almost individual mapping of the fracture pattern. This introductory article describes the rationale for the comprehensive AO CMF classification system, discusses the methodological framework, and provides insight into the experiences and interactions during the evaluation process within the core groups. The details of this system in terms of anatomy and levels are presented in a series of focused tutorials illustrated with case examples in this special issue of the Journal. PMID:25489387

  15. The First AO Classification System for Fractures of the Craniomaxillofacial Skeleton: Rationale, Methodological Background, Developmental Process, and Objectives.

    PubMed

    Audigé, Laurent; Cornelius, Carl-Peter; Di Ieva, Antonio; Prein, Joachim

    2014-12-01

    Validated trauma classification systems are the sole means to provide the basis for reliable documentation and evaluation of patient care, which will open the gateway to evidence-based procedures and healthcare in the coming years. With the support of AO Investigation and Documentation, a classification group was established to develop and evaluate a comprehensive classification system for craniomaxillofacial (CMF) fractures. Blueprints for fracture classification in the major constituents of the human skull were drafted and then evaluated by a multispecialty group of experienced CMF surgeons and a radiologist in a structured process during iterative agreement sessions. At each session, surgeons independently classified the radiological imaging of up to 150 consecutive cases with CMF fractures. During subsequent review meetings, all discrepancies in the classification outcome were critically appraised for clarification and improvement until consensus was reached. The resulting CMF classification system is structured in a hierarchical fashion with three levels of increasing complexity. The most elementary level 1 simply distinguishes four fracture locations within the skull: mandible (code 91), midface (code 92), skull base (code 93), and cranial vault (code 94). Levels 2 and 3 focus on further defining the fracture locations and for fracture morphology, achieving an almost individual mapping of the fracture pattern. This introductory article describes the rationale for the comprehensive AO CMF classification system, discusses the methodological framework, and provides insight into the experiences and interactions during the evaluation process within the core groups. The details of this system in terms of anatomy and levels are presented in a series of focused tutorials illustrated with case examples in this special issue of the Journal.

  16. Automatic topic identification of health-related messages in online health community using text classification.

    PubMed

    Lu, Yingjie

    2013-01-01

    To facilitate patient involvement in online health communities and help patients obtain the informative and emotional support they need, this paper proposes a topic identification approach for automatically identifying the topics of health-related messages in an online health community, thus assisting patients in efficiently reaching the messages most relevant to their queries. A feature-based classification framework is presented for automatic topic identification. We first collected messages related to some predefined topics in an online health community. We then combined three different types of features, n-gram-based features, domain-specific features and sentiment features, to build four feature sets for health-related text representation. Finally, three different text classification techniques, C4.5, Naïve Bayes and SVM, were adopted to evaluate our topic classification model. By comparing different feature sets and different classification techniques, we found that n-gram-based features, domain-specific features and sentiment features were all effective in distinguishing different types of health-related topics. In addition, a feature reduction technique based on information gain was also effective in improving topic classification performance. In terms of classification techniques, SVM outperformed C4.5 and Naïve Bayes significantly. The experimental results demonstrate that the proposed approach can identify the topics of online health-related messages efficiently.
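
    A minimal sketch of the kind of pipeline compared in the study, pairing n-gram features with a linear SVM via scikit-learn; the example messages, topic labels, and query are invented for illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Hypothetical health-forum messages and topic labels.
    messages = ["any tips for managing insulin doses?",
                "feeling anxious about my diagnosis",
                "which glucose meter do you recommend?",
                "so grateful for the support here"]
    topics = ["treatment", "emotional", "equipment", "emotional"]

    # Unigram + bigram features feeding a linear SVM.
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    clf.fit(messages, topics)
    print(clf.predict(["has anyone tried a new meter?"]))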

  17. Performance evaluation of MLP and RBF feed forward neural network for the recognition of off-line handwritten characters

    NASA Astrophysics Data System (ADS)

    Rishi, Rahul; Choudhary, Amit; Singh, Ravinder; Dhaka, Vijaypal Singh; Ahlawat, Savita; Rao, Mukta

    2010-02-01

    In this paper we propose a system for the classification of handwritten text. At a broad level, the system is composed of a preprocessing module, a supervised learning module and a recognition module. The preprocessing module digitizes the documents and extracts features (tangent values) for each character. The radial basis function network is used in the learning and recognition modules. The objective is to analyze and improve the performance of the Multi-Layer Perceptron (MLP) using RBF transfer functions instead of the logarithmic sigmoid function. The results of 35 experiments indicate that the feed-forward MLP performs accurately and exhaustively with RBF transfer functions. With the changed weight-update mechanism and the feature-driven preprocessing module, the proposed system achieves good recognition performance.

  18. 10 CFR 1045.40 - Marking requirements.

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    .... (a) RD classifiers shall ensure that each RD and FRD document is clearly marked to convey to the holder that it contains RD or FRD information, the level of classification assigned, and the additional... sanctions. (2) If the document contains FRD but does not contain RD: FORMERLY RESTRICTED DATA Unauthorized...

  19. 10 CFR 1045.40 - Marking requirements.

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    .... (a) RD classifiers shall ensure that each RD and FRD document is clearly marked to convey to the holder that it contains RD or FRD information, the level of classification assigned, and the additional... sanctions. (2) If the document contains FRD but does not contain RD: FORMERLY RESTRICTED DATA Unauthorized...

  20. 10 CFR 1045.40 - Marking requirements.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    .... (a) RD classifiers shall ensure that each RD and FRD document is clearly marked to convey to the holder that it contains RD or FRD information, the level of classification assigned, and the additional... sanctions. (2) If the document contains FRD but does not contain RD: FORMERLY RESTRICTED DATA Unauthorized...

  1. Automatic Term Class Construction Using Relevance--A Summary of Work in Automatic Pseudoclassification.

    ERIC Educational Resources Information Center

    Salton, G.

    1980-01-01

    Summarizes studies of pseudoclassification, a process of utilizing user relevance assessments of certain documents with respect to certain queries to build term classes designed to retrieve relevant documents. Conclusions are reached concerning the effectiveness and feasibility of constructing term classifications based on human relevance…

  2. 78 FR 33699 - Visas: Classification of Immediate Family Members as G Nonimmigrants

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-06-05

    ...? Currently, 22 CFR 41.22(b) requires that an alien entitled to classification as an A-1 or A-2 nonimmigrant...)(4) to clarify that an immediate family member of a principal alien classifiable G-1 or G- 2, G-3 or... CFR Part 41 Aliens, Documentation of nonimmigrants, Foreign officials, Immigration, Passports and...

  3. Forest land cover change (1975-2000) in the Greater Border Lakes region

    Treesearch

    Peter T. Wolter; Brian R. Sturtevant; Brian R. Miranda; Sue M. Lietz; Phillip A. Townsend; John Pastor

    2012-01-01

    This document and accompanying maps describe land cover classifications and change detection for a 13.8 million ha landscape straddling the border between Minnesota and Ontario, Canada (the greater Border Lakes Region). Land cover classifications focus on discerning Anderson Level II forest and nonforest cover to track spatiotemporal changes in forest cover. Multi-...

  4. A System for Automated Extraction of Metadata from Scanned Documents using Layout Recognition and String Pattern Search Models

    PubMed Central

    Misra, Dharitri; Chen, Siyuan; Thoma, George R.

    2010-01-01

    One of the most expensive aspects of archiving digital documents is the manual acquisition of context-sensitive metadata useful for the subsequent discovery of, and access to, the archived items. For certain types of textual documents, such as journal articles, pamphlets, official government records, etc., where the metadata is contained within the body of the documents, a cost effective method is to identify and extract the metadata in an automated way, applying machine learning and string pattern search techniques. At the U. S. National Library of Medicine (NLM) we have developed an automated metadata extraction (AME) system that employs layout classification and recognition models with a metadata pattern search model for a text corpus with structured or semi-structured information. A combination of Support Vector Machine and Hidden Markov Model is used to create the layout recognition models from a training set of the corpus, following which a rule-based metadata search model is used to extract the embedded metadata by analyzing the string patterns within and surrounding each field in the recognized layouts. In this paper, we describe the design of our AME system, with focus on the metadata search model. We present the extraction results for a historic collection from the Food and Drug Administration, and outline how the system may be adapted for similar collections. Finally, we discuss some ongoing enhancements to our AME system. PMID:21179386

  5. Does the Use of a Classification for Nursing Diagnoses Affect Nursing Students’ Choice of Nursing Interventions?

    PubMed Central

    Falk, Joakim; Björvell, Catrin

    2012-01-01

    The Swedish health care system stands before an implementation of standardized language. The first classification of nursing diagnoses translated into Swedish, the NANDA, was released in January 2011. The aim of the present study was to examine whether usage of the NANDA classification affected nursing students' choice of nursing interventions. Thirty-three nursing students in a clinical setting were divided into two groups. The intervention group had access to the NANDA classification text book, while the comparison group did not. In total, 78 nursing assessments were performed and 218 nursing interventions initiated. The principal findings show that there were no statistically significant differences between the groups regarding the amount, quality or category of nursing interventions when using the NANDA classification compared to free-text-format nursing diagnoses. PMID:24199065

  6. Learning Supervised Topic Models for Classification and Regression from Crowds.

    PubMed

    Rodrigues, Filipe; Lourenco, Mariana; Ribeiro, Bernardete; Pereira, Francisco C

    2017-12-01

    The growing need to analyze large collections of documents has led to great developments in topic modeling. Since documents are frequently associated with other related variables, such as labels or ratings, much interest has been placed on supervised topic models. However, the nature of most annotation tasks, prone to ambiguity and noise, often with high volumes of documents, makes learning under a single-annotator assumption unrealistic or impractical for most real-world applications. In this article, we propose two supervised topic models, one for classification and another for regression problems, which account for the heterogeneity and biases among different annotators that are encountered in practice when learning from crowds. We develop an efficient stochastic variational inference algorithm that is able to scale to very large datasets, and we empirically demonstrate the advantages of the proposed model over state-of-the-art approaches.

  7. Classification of proteins with shared motifs and internal repeats in the ECOD database

    PubMed Central

    Kinch, Lisa N.; Liao, Yuxing

    2016-01-01

    Proteins and their domains evolve by a set of events commonly including the duplication and divergence of small motifs. The presence of short repetitive regions in domains has generally constituted a difficult case for structural domain classifications and their hierarchies. We developed the Evolutionary Classification Of protein Domains (ECOD) in part to implement a new schema for the classification of these types of proteins. Here we document the ways in which ECOD classifies proteins with small internal repeats, widespread functional motifs, and assemblies of small domain-like fragments in its evolutionary schema. We illustrate the ways in which the structural genomics project impacted the classification and characterization of new structural domains and sequence families over the past decade. PMID:26833690

  8. BOREAS TE-18 Landsat TM Maximum Likelihood Classification Image of the NSA

    NASA Technical Reports Server (NTRS)

    Hall, Forrest G. (Editor); Knapp, David

    2000-01-01

    The BOREAS TE-18 team focused its efforts on using remotely sensed data to characterize the successional and disturbance dynamics of the boreal forest for use in carbon modeling. The objective of this classification is to provide the BOREAS investigators with a data product that characterizes the land cover of the NSA. A Landsat-5 TM image from 20-Aug-1988 was used to derive this classification. A standard supervised maximum likelihood classification approach was used to produce this classification. The data are provided in a binary image format file. The data files are available on a CD-ROM (see document number 20010000884), or from the Oak Ridge National Laboratory (ORNL) Distributed Activity Archive Center (DAAC).

  9. Comparison of Document Index Graph Using TextRank and HITS Weighting Method in Automatic Text Summarization

    NASA Astrophysics Data System (ADS)

    Hadyan, Fadhlil; Shaufiah; Arif Bijaksana, Moch.

    2017-01-01

    Automatic summarization systems help a reader grasp the core information of a long text quickly by summarizing the text automatically. Many summarization systems have already been developed, but many problems remain in those systems. This work proposes a summarization method using a document index graph. The method adapts the PageRank and HITS formulas, originally used to assess web pages, to assess the words in the sentences of a text document. The expected outcome is a system that can summarize a single document by utilizing a document index graph with TextRank and HITS to improve the quality of the automatically produced summary.
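
    As a rough illustration of ranking sentences with a PageRank-style score over a sentence-similarity graph (one common reading of TextRank, not the authors' document-index-graph implementation), consider the following sketch; the TF-IDF cosine similarity used here is a simplification.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def textrank_summary(sentences, top_k=2, d=0.85, iters=50):
        """Score sentences by PageRank over their cosine-similarity graph."""
        sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
        np.fill_diagonal(sim, 0.0)
        # Row-normalize to get transition probabilities (avoid divide-by-zero).
        norm = sim.sum(axis=1, keepdims=True)
        trans = np.divide(sim, norm, out=np.zeros_like(sim), where=norm > 0)
        rank = np.full(len(sentences), 1.0 / len(sentences))
        for _ in range(iters):
            rank = (1 - d) / len(sentences) + d * trans.T @ rank
        best = np.argsort(rank)[::-1][:top_k]
        return [sentences[i] for i in sorted(best)]  # keep document order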

  10. Classification of Traffic Related Short Texts to Analyse Road Problems in Urban Areas

    NASA Astrophysics Data System (ADS)

    Saldana-Perez, A. M. M.; Moreno-Ibarra, M.; Tores-Ruiz, M.

    2017-09-01

    Volunteer Geographic Information (VGI) can be used to understand urban dynamics. In the classification of traffic-related short texts to analyze road problems in urban areas, VGI data analysis is performed over social media publications in order to classify traffic events in big cities that modify the movement of vehicles and people through the roads, such as car accidents, traffic jams, and closures. The classification of traffic events described in short texts is done by applying a supervised machine learning algorithm. In this approach, users are considered sensors who describe their surroundings and provide their geographic position on the social network. The posts are processed by a text mining pipeline and classified into five groups. Finally, the classified events are grouped into a data corpus and geo-visualized in the study area, to detect the places with the most vehicular problems.

  11. E-documentation as a process management tool for nursing care in hospitals.

    PubMed

    Rajkovic, Uros; Sustersic, Olga; Rajkovic, Vladislav

    2009-01-01

    Appropriate documentation plays a key role in process management in nursing care. It includes holistic data management based on the patient's data along the clinical path with regard to nursing care. We developed an e-documentation model that follows the process method of work in nursing care. It assesses the patient's status on the basis of Henderson's theoretical model of 14 basic living activities and is aligned with internationally recognized nursing classifications. E-documentation development requires reengineering of existing documentation and facilitates process reengineering. A prototype e-nursing documentation solution, currently being tested at the university medical centres in Ljubljana and Maribor, is described.

  12. Handwritten document age classification based on handwriting styles

    NASA Astrophysics Data System (ADS)

    Ramaiah, Chetan; Kumar, Gaurav; Govindaraju, Venu

    2012-01-01

    Handwriting styles change constantly over time. We approach the novel problem of estimating the approximate age of historical handwritten documents from their handwriting styles. Such a system has many applications in handwritten document processing engines, where specialized processing techniques can be applied based on the estimated age of the document. We propose to learn a distribution over styles across centuries using topic models and to apply a classifier over the learned weights in order to estimate the approximate age of the documents. We present a comparison of different distance metrics, such as Euclidean distance and Hellinger distance, within this application.
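
    For reference, the Hellinger distance between two topic-weight distributions, one of the metrics compared in the paper, can be computed as below; the example distributions are invented.

    import numpy as np

    def hellinger(p, q):
        """Hellinger distance between two discrete probability distributions."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

    # Hypothetical topic-model weights for two documents.
    doc_a = [0.6, 0.3, 0.1]
    doc_b = [0.2, 0.5, 0.3]
    print(hellinger(doc_a, doc_b))   # 0 = identical, 1 = disjoint support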

  13. Drug-related webpages classification based on multi-modal local decision fusion

    NASA Astrophysics Data System (ADS)

    Hu, Ruiguang; Su, Xiaojing; Liu, Yanxin

    2018-03-01

    In this paper, multi-modal local decision fusion is used for drug-related webpage classification. First, meaningful text is extracted through HTML parsing, and effective images are chosen by the FOCARSS algorithm. Second, six SVM classifiers are trained for six kinds of drug-taking instruments, which are represented by PHOG. One SVM classifier is trained for cannabis, which is represented by the mid-level feature of the BOW model. For each instance in a webpage, the seven SVMs give seven labels for its image, and another seven labels are given by searching the names of the drug-taking instruments and cannabis in its related text. Concatenating the seven labels of the image and the seven labels of the text generates the representation of each instance in a webpage. Last, Multi-Instance Learning is used to classify the drug-related webpages. Experimental results demonstrate that the classification accuracy of multi-instance learning with multi-modal local decision fusion is much higher than that of single-modal classification.

  14. 78 FR 28258 - mPower\\TM\\ Design-Specific Review Standard

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-05-14

    ... Public Documents'' and then select ``Begin Web- based ADAMS Search.'' For problems with ADAMS, please... Classification ML12272A013 3.2.2 System Quality Group ML12272A015 Classification. 3.3.1 Severe Wind Loading... ML12324A156 3.3.2 Extreme Wind Loads ML12324A166 (Tornado and Hurricane Loads). 3.4.1 Internal Flood...

  15. From Classification to "Knowledge Organization": Dorking Revisited or "Past is Prelude." FID Occasional Paper No. 14.

    ERIC Educational Resources Information Center

    Gilchrist, Alan, Ed.

    This set of papers offers insights into some of the major developments in the field of classification and knowledge organization, and highlights many of the fundamental changes in views and theories which have taken place during the last 40 years. This document begins with a series of reminiscences from former delegates of the first International…

  16. XNDM: An Experimental Network Data Manager.

    DTIC Science & Technology

    1981-06-01


  17. 10 CFR 1045.43 - Systematic review for declassification.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 10 Energy 4 2010-01-01 2010-01-01 false Systematic review for declassification. 1045.43 Section... Systematic review for declassification. (a) The Secretary shall ensure that RD documents, and the DoD shall... Classification (and with the DoD for FRD) to ensure the systematic review of RD and FRD documents. (c) Review of...

  18. 10 CFR 1045.43 - Systematic review for declassification.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 10 Energy 4 2011-01-01 2011-01-01 false Systematic review for declassification. 1045.43 Section... Systematic review for declassification. (a) The Secretary shall ensure that RD documents, and the DoD shall... Classification (and with the DoD for FRD) to ensure the systematic review of RD and FRD documents. (c) Review of...

  19. 26 CFR 1.861-18 - Classification of transactions involving computer programs.

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    ... income. In the case of a transfer of a copyrighted article, this section provides rules for determining... purposes of this paragraph (a)(3), a computer program includes any media, user manuals, documentation, data base or similar item if the media, user manuals, documentation, data base or similar item is incidental...

  20. 26 CFR 1.861-18 - Classification of transactions involving computer programs.

    Code of Federal Regulations, 2013 CFR

    2013-04-01

    ... income. In the case of a transfer of a copyrighted article, this section provides rules for determining... purposes of this paragraph (a)(3), a computer program includes any media, user manuals, documentation, data base or similar item if the media, user manuals, documentation, data base or similar item is incidental...

  1. 26 CFR 1.861-18 - Classification of transactions involving computer programs.

    Code of Federal Regulations, 2012 CFR

    2012-04-01

    ... income. In the case of a transfer of a copyrighted article, this section provides rules for determining... purposes of this paragraph (a)(3), a computer program includes any media, user manuals, documentation, data base or similar item if the media, user manuals, documentation, data base or similar item is incidental...

  2. Spatial Paradigm for Information Retrieval and Exploration

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    The SPIRE system consists of software for visual analysis of primarily text based information sources. This technology enables the content analysis of text documents without reading all the documents. It employs several algorithms for text and word proximity analysis. It identifies the key themes within the text documents. From this analysis, it projects the results onto a visual spatial proximity display (Galaxies or Themescape) where items (documents and/or themes) visually close to each other are known to have content which is close to each other. Innovative interaction techniques then allow for dynamic visual analysis of large text based information spaces.

  3. SPIRE1.03. Spatial Paradigm for Information Retrieval and Exploration

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Adams, K.J.; Bohn, S.; Crow, V.

    The SPIRE system consists of software for visual analysis of primarily text based information sources. This technology enables the content analysis of text documents without reading all the documents. It employs several algorithms for text and word proximity analysis. It identifies the key themes within the text documents. From this analysis, it projects the results onto a visual spatial proximity display (Galaxies or Themescape) where items (documents and/or themes) visually close to each other are known to have content which is close to each other. Innovative interaction techniques then allow for dynamic visual analysis of large text based information spaces.

  4. Feature extraction for document text using Latent Dirichlet Allocation

    NASA Astrophysics Data System (ADS)

    Prihatini, P. M.; Suryawan, I. K.; Mandia, IN

    2018-01-01

    Feature extraction is one of the stages in an information retrieval system, used to extract the unique feature values of a text document. Feature extraction can be done by several methods, one of which is Latent Dirichlet Allocation. However, research on text feature extraction using the Latent Dirichlet Allocation method is rarely found for Indonesian text. Therefore, this research implements text feature extraction for Indonesian text. The research method consists of data acquisition, text pre-processing, initialization, topic sampling and evaluation. The evaluation is done by comparing the Precision, Recall and F-Measure values of Latent Dirichlet Allocation with those of Term Frequency Inverse Document Frequency KMeans, which is commonly used for feature extraction. The evaluation results show that the Precision, Recall and F-Measure values of the Latent Dirichlet Allocation method are higher than those of the Term Frequency Inverse Document Frequency KMeans method. This shows that the Latent Dirichlet Allocation method is able to extract features and cluster Indonesian text better than the Term Frequency Inverse Document Frequency KMeans method.
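
    A minimal scikit-learn sketch of the general technique, extracting per-document topic-weight features with LDA; the placeholder Indonesian snippets and the choice of two topics are assumptions for illustration.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["berita ekonomi pasar saham",      # placeholder Indonesian snippets
            "pertandingan sepak bola nasional",
            "harga saham naik di pasar modal"]

    counts = CountVectorizer().fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    features = lda.fit_transform(counts)   # rows: documents, cols: topic weights
    print(features.round(2))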

  5. Automated Classification of Pathology Reports.

    PubMed

    Oleynik, Michel; Finger, Marcelo; Patrão, Diogo F C

    2015-01-01

    This work develops an automated classifier of pathology reports which infers the topography and the morphology classes of a tumor using codes from the International Classification of Diseases for Oncology (ICD-O). Data from 94,980 patients of the A.C. Camargo Cancer Center were used for training and validation of Naive Bayes classifiers, evaluated by the F1-score. F1 scores greater than 74% in the topographic group and 61% in the morphologic group are reported. Our work provides a successful baseline for future research on the classification of medical documents written in Portuguese and in other domains.
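
    A small sketch of the general approach, training a Naive Bayes text classifier and scoring it with F1, using scikit-learn; the report snippets, ICD-O-like labels, and train/test split are invented for illustration.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import f1_score

    # Invented report snippets labelled with coarse ICD-O-like classes.
    reports = ["adenocarcinoma of the stomach", "melanoma of the skin",
               "gastric adenocarcinoma, ulcerated", "cutaneous melanoma, nodular"]
    labels = ["C16", "C44", "C16", "C44"]

    nb = make_pipeline(CountVectorizer(), MultinomialNB())
    nb.fit(reports[:2], labels[:2])                    # tiny train split
    pred = nb.predict(reports[2:])
    print(f1_score(labels[2:], pred, average="macro"))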

  6. Data Processing And Machine Learning Methods For Multi-Modal Operator State Classification Systems

    NASA Technical Reports Server (NTRS)

    Hearn, Tristan A.

    2015-01-01

    This document is intended as an introduction to a set of common signal processing and machine learning methods that may be used in the software portion of a functional crew state monitoring system. This includes overviews of both the theory of the methods involved and examples of implementation. Practical considerations are discussed for implementing modular, flexible, and scalable processing and classification software for a multi-modal, multi-channel monitoring system. Example source code is also given for all of the discussed processing and classification methods.

  7. Script-independent text line segmentation in freestyle handwritten documents.

    PubMed

    Li, Yi; Zheng, Yefeng; Doermann, David; Jaeger, Stefan; Li, Yi

    2008-08-01

    Text line segmentation in freestyle handwritten documents remains an open document analysis problem. Curvilinear text lines and small gaps between neighboring text lines present a challenge to algorithms developed for machine printed or hand-printed documents. In this paper, we propose a novel approach based on density estimation and a state-of-the-art image segmentation technique, the level set method. From an input document image, we estimate a probability map, where each element represents the probability that the underlying pixel belongs to a text line. The level set method is then exploited to determine the boundary of neighboring text lines by evolving an initial estimate. Unlike connected component based methods ( [1], [2] for example), the proposed algorithm does not use any script-specific knowledge. Extensive quantitative experiments on freestyle handwritten documents with diverse scripts, such as Arabic, Chinese, Korean, and Hindi, demonstrate that our algorithm consistently outperforms previous methods [1]-[3]. Further experiments show the proposed algorithm is robust to scale change, rotation, and noise.

  8. Document segmentation for high-quality printing

    NASA Astrophysics Data System (ADS)

    Ancin, Hakan

    1997-04-01

    A technique to segment dark text on the light background of mixed-mode color documents is presented. This process does not perceptually change graphics and photo regions. Color documents are scanned and printed from various media which usually do not have a clean background. This is especially the case for printouts generated from thin magazine samples; these printouts usually include text and figures from the back of the page, which is called bleeding. Removal of bleeding artifacts improves the perceptual quality of the printed document and reduces color ink usage. By detecting the light background of the document, these artifacts are removed from background regions. Detection of dark text regions also enables the halftoning algorithms to use true black ink for black text pixels instead of composite black. The processed document contains sharp black text on a white background, resulting in improved perceptual quality and better ink utilization. The described method is memory efficient and requires a small number of scan lines of high-resolution color documents during processing.

  9. Computation of term dominance in text documents

    DOEpatents

    Bauer, Travis L [Albuquerque, NM; Benz, Zachary O [Albuquerque, NM; Verzi, Stephen J [Albuquerque, NM

    2012-04-24

    An improved entropy-based term dominance metric is useful for characterizing a corpus of text documents and for comparing the term dominance metrics of a first corpus of documents to those of a second corpus having a different number of documents.
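
    The patent abstract does not give the formula, so the sketch below is only one plausible reading of an entropy-based dominance score: a term is dominant when its occurrences are concentrated in few documents, and normalizing the entropy by log(N) keeps scores comparable across corpora of different sizes.

    import math

    def term_dominance(term_counts):
        """Entropy-based dominance of a term across a corpus.

        term_counts: per-document occurrence counts of one term.
        Normalizing entropy by log(N) makes scores comparable across
        corpora with different numbers of documents (an assumption here,
        not necessarily the patented formula).
        """
        total = sum(term_counts)
        n = len(term_counts)
        if total == 0 or n < 2:
            return 0.0
        probs = [c / total for c in term_counts if c > 0]
        entropy = -sum(p * math.log(p) for p in probs)
        return 1.0 - entropy / math.log(n)   # 1 = concentrated, 0 = uniform

    print(term_dominance([10, 0, 0, 0]))  # concentrated in one document -> 1.0
    print(term_dominance([3, 3, 3, 3]))   # spread evenly -> 0.0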

  10. Nonlinear filtering for character recognition in low quality document images

    NASA Astrophysics Data System (ADS)

    Diaz-Escobar, Julia; Kober, Vitaly

    2014-09-01

    Optical character recognition in scanned printed documents is a well-studied task, where the capture conditions such as sheet position, illumination, contrast and resolution are controlled. Nowadays, it is more practical to use mobile devices than a scanner for document capture. As a consequence, the quality of document images is often poor owing to the presence of geometric distortions, nonhomogeneous illumination, low resolution, etc. In this work we propose to use multiple adaptive nonlinear composite filters for the detection and classification of characters. Computer simulation results obtained with the proposed system are presented and discussed.

  11. Software Design Document SAF Workstation. Volume 1, Sections 1.0 - 2.4. 3.4.86

    DTIC Science & Technology

    1991-06-01


  12. Standard classification of software documentation

    NASA Technical Reports Server (NTRS)

    Tausworthe, R. C.

    1976-01-01

    General conceptual requirements are presented for standard levels of documentation and for the application of these requirements to intended usages. These standards encourage a policy of producing only those forms of documentation that are needed and adequate for the purpose. Documentation standards are defined with respect to detail and format quality. Classes A through D range, in order, from the most definitive down to the least definitive, and categories 1 through 4 range, in order, from high-quality typeset down to handwritten material. Criteria for each of the classes and categories, as well as suggested selection guidelines for each, are given.

  13. Improving Naive Bayes with Online Feature Selection for Quick Adaptation to Evolving Feature Usefulness

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pon, R K; Cardenas, A F; Buttler, D J

    The definition of what makes an article interesting varies from user to user and continually evolves even for a single user. As a result, for news recommendation systems, useless document features cannot be determined a priori and all features are usually considered for interestingness classification. Consequently, the presence of currently useless features degrades classification performance [1], particularly over the initial set of news articles being classified. The initial set of documents is critical for a user when considering which particular news recommendation system to adopt. To address these problems, we introduce an improved version of the naive Bayes classifier with online feature selection. We use correlation to determine the utility of each feature and take advantage of the conditional independence assumption used by naive Bayes for online feature selection and classification. The augmented naive Bayes classifier performs 28% better than the traditional naive Bayes classifier in recommending news articles from the Yahoo! RSS feeds.
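
    A sketch in the spirit of the paper's correlation-based feature utility: each binary word feature is scored by its Pearson correlation with the class label, and only features above a threshold are kept. The threshold and the batch (rather than truly online) computation are assumptions.

    import numpy as np

    def useful_features(X, y, threshold=0.2):
        """Return indices of features whose |Pearson correlation| with the
        binary label exceeds a threshold.

        X: (n_docs, n_features) 0/1 word-presence matrix; y: 0/1 labels.
        In an online setting this would be recomputed as articles arrive,
        so a feature's usefulness can rise or fall over time.
        """
        X = np.asarray(X, float)
        y = np.asarray(y, float)
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
        with np.errstate(invalid="ignore", divide="ignore"):
            corr = np.where(denom > 0,
                            (Xc * yc[:, None]).sum(axis=0) / denom, 0.0)
        return np.flatnonzero(np.abs(corr) >= threshold)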

  14. Online EEG Classification of Covert Speech for Brain-Computer Interfacing.

    PubMed

    Sereshkeh, Alborz Rezazadeh; Trott, Robert; Bricout, Aurélien; Chau, Tom

    2017-12-01

    Brain-computer interfaces (BCIs) for communication can be nonintuitive, often requiring the performance of hand motor imagery or some other conversation-irrelevant task. In this paper, electroencephalography (EEG) was used to develop two intuitive online BCIs based solely on covert speech. The goal of the first BCI was to differentiate between 10[Formula: see text]s of mental repetitions of the word "no" and an equivalent duration of unconstrained rest. The second BCI was designed to discern between 10[Formula: see text]s each of covert repetition of the words "yes" and "no". Twelve participants used these two BCIs to answer yes or no questions. Each participant completed four sessions, comprising two offline training sessions and two online sessions, one for testing each of the BCIs. With a support vector machine and a combination of spectral and time-frequency features, an average accuracy of [Formula: see text] was reached across participants in the online classification of no versus rest, with 10 out of 12 participants surpassing the chance level (60.0% for [Formula: see text]). The online classification of yes versus no yielded an average accuracy of [Formula: see text], with eight participants exceeding the chance level. Task-specific changes in EEG beta and gamma power in language-related brain areas tended to provide discriminatory information. To our knowledge, this is the first report of online EEG classification of covert speech. Our findings support further study of covert speech as a BCI activation task, potentially leading to the development of more intuitive BCIs for communication.

  15. Symbolic rule-based classification of lung cancer stages from free-text pathology reports.

    PubMed

    Nguyen, Anthony N; Lawley, Michael J; Hansen, David P; Bowman, Rayleen V; Clarke, Belinda E; Duhig, Edwina E; Colquist, Shoni

    2010-01-01

    To automatically classify lung tumor-node-metastases (TNM) cancer stages from free-text pathology reports using symbolic rule-based classification. By exploiting report substructure and the symbolic manipulation of systematized nomenclature of medicine-clinical terms (SNOMED CT) concepts in reports, statements in free text can be evaluated for relevance against factors relating to the staging guidelines. Post-coordinated SNOMED CT expressions based on templates were defined and populated by concepts in reports, and tested for subsumption by staging factors. The subsumption results were used to build logic according to the staging guidelines to calculate the TNM stage. The accuracy measure and confusion matrices were used to evaluate the TNM stages classified by the symbolic rule-based system. The system was evaluated against a database of multidisciplinary team staging decisions and a machine learning-based text classification system using support vector machines. Overall accuracies on a corpus of pathology reports for 718 lung cancer patients, measured against a database of pathological TNM staging decisions, were 72%, 78%, and 94% for T, N, and M staging, respectively. The system's performance was also comparable to support vector machine classification approaches. A system to classify lung TNM stages from free-text pathology reports was developed, and it was verified that the symbolic rule-based approach using SNOMED CT can be used for the extraction of key lung cancer characteristics from free-text reports. Future work will investigate the applicability of the proposed methodology to extracting other cancer characteristics and types.

  16. The Labelling Approach to Deviance.

    ERIC Educational Resources Information Center

    Rains, Prudence M.; Kitsuse, John L.; Duster, Troy; Freidson, Eliot

    2003-01-01

    This reprint of one chapter from the 1975 text, "Issues in the Classification of Children" by Nicholas Hobbs and others, addresses the theoretical, methodological, and empirical issues involved in the "labeling" approach to the sociology of deviance. It examines the social process of classification, the use of classification in social agencies,…

  17. Health Instruction Packages: Drug Dosage, Classification, and Mixing.

    ERIC Educational Resources Information Center

    Bracchi, Dorothy P.; And Others

    Text, illustrations, and exercises are utilized in a set of seven learning modules to instruct nursing students in the fundamentals of drug classification, dosage, and mixing. The first module, by Dorothy Bracchi, teaches the student to identify six classifications of medication often administered to orthopedic patients: anti-neurospasmolytic…

  18. Duplicate document detection in DocBrowse

    NASA Astrophysics Data System (ADS)

    Chalana, Vikram; Bruce, Andrew G.; Nguyen, Thien

    1998-04-01

    Duplicate documents are frequently found in large databases of digital documents, such as those found in digital libraries or in the government declassification effort. Efficient duplicate document detection is important not only to allow querying for similar documents, but also to filter out redundant information in large document databases. We have designed three different algorithms to identify duplicate documents. The first algorithm is based on features extracted from the textual content of a document, the second is based on wavelet features extracted from the document image itself, and the third is a combination of the first two. These algorithms are integrated within the DocBrowse system for information retrieval from document images, currently under development at MathSoft. DocBrowse supports duplicate document detection by allowing (1) automatic filtering to hide duplicate documents, and (2) ad hoc querying for similar or duplicate documents. We have tested the duplicate document detection algorithms on 171 documents and found that the text-based method has an average 11-point precision of 97.7 percent while the image-based method has an average 11-point precision of 98.9 percent. In general, however, the text-based method performs better when the document contains enough high-quality machine-printed text, while the image-based method performs better when the document contains little or no quality machine-readable text.
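
    A generic sketch of the text-based route, flagging document pairs whose TF-IDF cosine similarity exceeds a cutoff; the cutoff and the example texts are invented, and the paper's wavelet image features are not shown.

    from itertools import combinations
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def find_duplicates(texts, cutoff=0.9):
        """Flag document pairs whose TF-IDF cosine similarity exceeds cutoff."""
        sims = cosine_similarity(TfidfVectorizer().fit_transform(texts))
        return [(i, j, sims[i, j])
                for i, j in combinations(range(len(texts)), 2)
                if sims[i, j] >= cutoff]

    docs = ["the quarterly budget report for 1997",
            "the quarterly budget report for 1997 (copy)",
            "minutes of the declassification meeting"]
    print(find_duplicates(docs, cutoff=0.8))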

  19. MemexGATE: Unearthing Latent Content Features for Improved Search and Relevancy Ranking Across Scientific Literature

    NASA Astrophysics Data System (ADS)

    Wilson, B. D.; McGibbney, L. J.; Mattmann, C. A.; Ramirez, P.; Joyce, M.; Whitehall, K. D.

    2015-12-01

    Quantifying scientific relevancy is of increasing importance to NASA and the research community. Scientific relevancy may be defined by mapping the impacts of a particular NASA mission, instrument, and/or retrieved variables to disciplines such as climate prediction, natural hazards detection and mitigation, education, and scientific discovery. Related to relevancy is the ability to expose data with similar attributes. This in turn depends upon the ability to extract latent, implicit document features from scientific data and resources and make them explicit, accessible and usable for search activities, among others. This paper presents MemexGATE, a server-side application, command line interface and computing environment for running large-scale metadata extraction, general architecture text engineering, document classification and indexing tasks over document resources such as social media streams, scientific literature archives, legal documentation, etc. This work builds on existing experience using MemexGATE (funded, developed and validated through the DARPA Memex program; PI Mattmann) for extracting and leveraging latent content features from document resources within the materials research domain. We extend the software to the domain of scientific literature, with emphasis on expanding the gazetteer lists, named entity rules, and natural language construct labeling (e.g. synonym, antonym, hyponym, etc.) to enable extraction of latent content features from data hosted by a wide variety of scientific literature vendors (AGU Meeting Abstract Database, Springer, Wiley Online, Elsevier, etc.) hosting earth science literature. Such literature makes both implicit and explicit references to NASA datasets and to relationships between such concepts stored across EOSDIS DAACs; hence we envisage that a significant part of this effort will also include developing an understanding of relevancy signals that can ultimately be utilized for improved search and relevancy ranking across scientific literature.

  20. Benign paroxysmal positional vertigo: Diagnostic criteria Consensus document of the Committee for the Classification of Vestibular Disorders of the Bárány Society.

    PubMed

    von Brevern, Michael; Bertholon, Pierre; Brandt, Thomas; Fife, Terry; Imai, Takao; Nuti, Daniele; Newman-Toker, David

    This article presents operational diagnostic criteria for benign paroxysmal positional vertigo (BPPV), formulated by the Committee for Classification of Vestibular Disorders of the Bárány Society. The classification reflects current knowledge of clinical aspects and pathomechanisms of BPPV and includes both established and emerging syndromes of BPPV. It is anticipated that growing understanding of the disease will lead to further development of this classification. Copyright © 2017 Elsevier España, S.L.U. and Sociedad Española de Otorrinolaringología y Cirugía de Cabeza y Cuello. All rights reserved.

  1. Constitutions and Democratic Consolidation: Brazil in Comparative Perspective

    DTIC Science & Technology

    1989-03-31

    ...The members sought to guarantee the revolutionary changes and continue them into the future by inclusion in the Constitution. ...Bresser Pereira, "Economic Ideologies and Democracy in Brazil," paper presented to a seminar on L'Internationalisation de la Democratie Politique...

  2. Voice/Natural Language Interfacing for Robotic Control.

    DTIC Science & Technology

    1987-11-01

    ...until major computing power can be profitably allocated to the speech recognition process, off-the-shelf units will never have sufficient intelligence to... coordinate transformation for a location, and opening or closing the gripper's toggles. External to world operations, each joint may be rotated...

  3. Evaluation of the Retrieval of Nuclear Science Document References Using the Universal Decimal Classification as the Indexing Language for a Computer-Based System

    ERIC Educational Resources Information Center

    Atherton, Pauline; And Others

    A single issue of Nuclear Science Abstracts, containing about 2,300 abstracts, was indexed by Universal Decimal Classification (UDC) using the Special Subject Edition of UDC for Nuclear Science and Technology. The descriptive cataloging and UDC-indexing records formed a computer-stored data base. A systematic random sample of 500 additional…

  4. Retention in the Navy Nurse Corps

    DTIC Science & Technology

    1990-12-01

    ...nurses from a large urban hospital. A factor analysis of a questionnaire containing measures of job satisfaction and performance on day and night... those observations, and the responses to task significance questions, indicated the three respondents were physicians or dentists. The data were then...

  5. The Connections between Students Self-Motivation, Their Classification (Typical Learners, Academic Intervention Services Learners, and Gifted), and Gender in a Standardized Social Studies Test

    ERIC Educational Resources Information Center

    Dupree, Jeffrey J.; Morote, Elsa Sofia

    2011-01-01

    This study examines differences, if any, between gender, level of motivation, and students' classification (typical learners, academic intervention services learners, and gifted) in scores on DBQs (document-based questions) among sixth grade students. Sixty-four sixth-grade students were given a DBQ as part of their final examination. Students' scores were...

  6. Dewey Decimal Classification Online Project: Interim Reports to the Council on Library Resources, April 1984, September 1984, and February 1985.

    ERIC Educational Resources Information Center

    Markey, Karen; Demeyer, Anh N.

    This research project focuses on the implementation and testing of the Dewey Decimal Classification (DDC) system as an online searcher's tool for subject access, browsing, and display in an online catalog. The research project comprises 12 activities. The three interim reports in this document cover the first seven of these activities: (1) obtain…

  7. OTH Radar Surveillance at WARF During the LRAPP Church Opal Exercise

    DTIC Science & Technology

    1976-11-01


  8. Engineering Design Handbook. Development Guide for Reliability. Part 6. Mathematical Appendix and Glossary

    DTIC Science & Technology

    1976-01-08

    ...Corps, nonmilitary Government agencies, contractors, private industry, individuals, universities, and others must purchase these Handbooks from... verified by an official Department of the Army representative and processed from the Defense Documentation Center (DDC), ATTN: DDC-TSR, Cameron Station... tell, by looking at a failed item, what classification of failure is involved. Some of the classifications are for mathematical convenience only...

  9. Using statistical text classification to identify health information technology incidents

    PubMed Central

    Chai, Kevin E K; Anthony, Stephen; Coiera, Enrico; Magrabi, Farah

    2013-01-01

    Objective To examine the feasibility of using statistical text classification to automatically identify health information technology (HIT) incidents in the USA Food and Drug Administration (FDA) Manufacturer and User Facility Device Experience (MAUDE) database. Design We used a subset of 570 272 incidents including 1534 HIT incidents reported to MAUDE between 1 January 2008 and 1 July 2010. Text classifiers using regularized logistic regression were evaluated with both ‘balanced’ (50% HIT) and ‘stratified’ (0.297% HIT) datasets for training, validation, and testing. Dataset preparation, feature extraction, feature selection, cross-validation, classification, performance evaluation, and error analysis were performed iteratively to further improve the classifiers. Feature-selection techniques such as removing short words and stop words, stemming, lemmatization, and principal component analysis were examined. Measurements κ statistic, F1 score, precision and recall. Results Classification performance was similar on both the stratified (0.954 F1 score) and balanced (0.995 F1 score) datasets. Stemming was the most effective technique, reducing the feature set size to 79% while maintaining comparable performance. Training with balanced datasets improved recall (0.989) but reduced precision (0.165). Conclusions Statistical text classification appears to be a feasible method for identifying HIT reports within large databases of incidents. Automated identification should enable more HIT problems to be detected, analyzed, and addressed in a timely manner. Semi-supervised learning may be necessary when applying machine learning to big data analysis of patient safety incidents and requires further investigation. PMID:23666777
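
    A minimal sketch of the general method, regularized logistic regression over text features, with class balancing standing in for the paper's balanced training sets; the incident narratives and labels are invented.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Invented incident narratives; 1 = health-IT related, 0 = other.
    incidents = ["interface dropped lab results between systems",
                 "infusion pump battery failed during transport",
                 "wrong patient record displayed after software upgrade",
                 "catheter kinked during insertion"]
    labels = [1, 0, 1, 0]

    # L2-regularized logistic regression; class_weight='balanced' stands in
    # for training on a balanced (50% HIT) sample.
    clf = make_pipeline(TfidfVectorizer(),
                        LogisticRegression(class_weight="balanced", C=1.0))
    clf.fit(incidents, labels)
    print(clf.predict(["records lost after database migration"]))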

  10. The impact of OCR accuracy on automated cancer classification of pathology reports.

    PubMed

    Zuccon, Guido; Nguyen, Anthony N; Bergheim, Anton; Wickman, Sandra; Grayson, Narelle

    2012-01-01

    To evaluate the effects of Optical Character Recognition (OCR) on the automatic cancer classification of pathology reports. Scanned images of pathology reports were converted to electronic free-text using a commercial OCR system. A state-of-the-art cancer classification system, the Medical Text Extraction (MEDTEX) system, was used to automatically classify the OCR reports. Classifications produced by MEDTEX on the OCR versions of the reports were compared with the classifications from a human-amended version of the OCR reports. The employed OCR system was found to recognise scanned pathology reports with up to 99.12% character accuracy and up to 98.95% word accuracy. Errors in the OCR processing were found to have minimal impact on the automatic classification of scanned pathology reports into notifiable groups. However, the impact of OCR errors is not negligible when considering the extraction of cancer notification items, such as primary site, histological type, etc. The automatic cancer classification system used in this work, MEDTEX, has proven to be robust to errors produced by the acquisition of free-text pathology reports from scanned images through OCR software. However, issues emerge when considering the extraction of cancer notification items.
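
    Character accuracy of the kind reported here is commonly computed from edit distance; a minimal sketch follows, with invented reference and OCR strings.

    def levenshtein(a, b):
        """Classic dynamic-programming edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,              # deletion
                                curr[j - 1] + 1,          # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    def char_accuracy(reference, ocr_output):
        """1 - normalized edit distance, a simple character-accuracy proxy."""
        if not reference:
            return 0.0
        return 1.0 - levenshtein(reference, ocr_output) / len(reference)

    print(char_accuracy("carcinoma, left lung", "carc1noma, left lunq"))  # 0.9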

  11. Performance Evaluation of Frequency Transform Based Block Classification of Compound Image Segmentation Techniques

    NASA Astrophysics Data System (ADS)

    Selwyn, Ebenezer Juliet; Florinabel, D. Jemi

    2018-04-01

    Compound image segmentation plays a vital role in the compression of computer screen images. Computer screen images are images which mix textual, graphical, or pictorial contents. In this paper, we present a comparison of two transform-based block classification methods for compound images, based on metrics such as speed of classification, precision, and recall rate. Block-based classification approaches normally divide the compound images into fixed-size, non-overlapping blocks. A frequency transform such as the Discrete Cosine Transform (DCT) or the Discrete Wavelet Transform (DWT) is then applied over each block. The mean and standard deviation are computed for each 8 × 8 block and used as a feature set to classify the compound images into text/graphics and picture/background blocks. The classification accuracy of block-classification-based segmentation techniques is measured by evaluation metrics such as precision and recall rate. Compound images with smooth backgrounds and complex-background images containing text of varying size, colour, and orientation are considered for testing. Experimental evidence shows that the DWT-based segmentation provides a significant improvement in recall rate and precision rate, approximately 2.3% over DCT-based segmentation, with an increase in block classification time for both smooth and complex background images.
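
    A rough sketch of the DCT branch of such a block classifier: each 8 × 8 block is transformed with a 2-D DCT and summarized by the mean and standard deviation of its coefficients, with an invented threshold rule standing in for a trained classifier.

    import numpy as np
    from scipy.fftpack import dct

    def block_features(gray_img, block=8):
        """Mean/std of 2-D DCT coefficients for each non-overlapping block."""
        h, w = gray_img.shape
        feats = []
        for r in range(0, h - block + 1, block):
            for c in range(0, w - block + 1, block):
                blk = gray_img[r:r + block, c:c + block].astype(float)
                coeffs = dct(dct(blk, axis=0, norm="ortho"),
                             axis=1, norm="ortho")
                feats.append((coeffs.mean(), coeffs.std()))
        return np.array(feats)

    def label_block(mean_std, std_cut=30.0):
        """Toy rule: high coefficient spread suggests text/graphics edges."""
        return "text/graphics" if mean_std[1] > std_cut else "picture/background"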

  12. Readability Formulas and User Perceptions of Electronic Health Records Difficulty: A Corpus Study.

    PubMed

    Zheng, Jiaping; Yu, Hong

    2017-03-02

    Electronic health records (EHRs) are a rich resource for developing applications to engage patients and foster patient activation, thus holding a strong potential to enhance patient-centered care. Studies have shown that providing patients with access to their own EHR notes may improve the understanding of their own clinical conditions and treatments, leading to improved health care outcomes. However, the highly technical language in EHR notes impedes patients' comprehension. Numerous studies have evaluated the difficulty of health-related text using readability formulas such as Flesch-Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook (SMOG), and Gunning-Fog Index (GFI). They conclude that the materials are often written at a grade level higher than common recommendations. The objective of our study was to explore the relationship between the aforementioned readability formulas and laypeople's perceived difficulty of 2 genres of text: general health information and EHR notes. We also validated the formulas' appropriateness and generalizability for predicting difficulty levels of highly complex technical documents. We collected 140 Wikipedia articles on diabetes and 242 EHR notes with a diabetes International Classification of Diseases, Ninth Revision code. We recruited 15 Amazon Mechanical Turk (AMT) users to rate difficulty levels of the documents. Correlations between laypeople's perceived difficulty levels and readability formula scores were measured, and their difference was tested. We also compared word usage and the impact of medical concepts of the 2 genres of text. The distributions of both readability formulas' scores (P<.001) and laypeople's perceptions (P=.002) on the 2 genres were different. Correlations of readability predictions and laypeople's perceptions were weak. Furthermore, despite being graded at similar levels, documents of different genres were still perceived as differing in difficulty (P<.001). Word usage in the 2 related genres still differed significantly (P<.001). Our findings suggested that the readability formulas' predictions did not align with perceived difficulty in either text genre. The widely used readability formulas were highly correlated with each other but did not show adequate correlation with readers' perceived difficulty. Therefore, they were not appropriate to assess the readability of EHR notes. ©Jiaping Zheng, Hong Yu. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 02.03.2017.
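    For reference, the three formulas the study evaluates can be computed as below; the syllable counter is a crude vowel-group heuristic (an assumption of this sketch, production tools use dictionaries), so the scores are approximate.

    # Sketch: FKGL, SMOG, and GFI readability scores for a text.
    import math
    import re

    def count_syllables(word):
        # Rough heuristic: count runs of vowels as syllables.
        groups = re.findall(r"[aeiouy]+", word.lower())
        return max(1, len(groups))

    def readability(text):
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z]+", text)
        n_words = max(1, len(words))
        syllables = sum(count_syllables(w) for w in words)
        polysyllabic = sum(1 for w in words if count_syllables(w) >= 3)
        fkgl = 0.39 * n_words / sentences + 11.8 * syllables / n_words - 15.59
        smog = 1.0430 * math.sqrt(polysyllabic * 30 / sentences) + 3.1291
        gfi = 0.4 * (n_words / sentences + 100 * polysyllabic / n_words)
        return {"FKGL": fkgl, "SMOG": smog, "GFI": gfi}

    print(readability("The patient exhibits bilateral pedal edema. "
                      "Metformin was discontinued."))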

  13. Sur la classification des adverbes en -ment (On the Classification of -ment Adverbs)

    ERIC Educational Resources Information Center

    Mordrup, Ole

    1976-01-01

    Presents a classification of French "-ment" adverbs based on syntactical criteria. The major divisions, consisting of "sentence adverbs" and "adverbs of manner," are further sub-divided into functional sub-groups. (Text is in French.) Available from: Akademisk Forlag, St. Kannikestraede 6-8, DK-1169 Copenhague K Danemark. (AM)

  14. 48 CFR 19.303 - Determining North American Industry Classification System (NAICS) codes and size standards.

    Code of Federal Regulations, 2012 CFR

    2012-10-01

    ... Industry Classification System (NAICS) codes and size standards. 19.303 Section 19.303 Federal Acquisition... of Small Business Status for Small Business Programs 19.303 Determining North American Industry... user, the added text is set forth as follows: 19.303 Determining North American Industry Classification...

  15. Techniques of Document Management: A Review of Text Retrieval and Related Technologies.

    ERIC Educational Resources Information Center

    Veal, D. C.

    2001-01-01

    Reviews present and possible future developments in the techniques of electronic document management, the major ones being text retrieval and scanning and OCR (optical character recognition). Also addresses document acquisition, indexing and thesauri, publishing and dissemination standards, impact of the Internet, and the document management…

  16. Specifying Skill-Based Training Strategies and Devices: A Model Description

    DTIC Science & Technology

    1990-06-01

    Technical Report 897: Specifying Skill-Based Training Strategies and Devices: A Model Description. Paul J. Sticha and Mark Schlager, Human Resources… Distribution unlimited.

  17. DNA EMP Awareness Course Notes. Supplement to Third Edition.

    DTIC Science & Technology

    1978-07-31

    …the environment through system design and testing. …fields generated by the prompt gammas. Other forms of EMP, such as… …tems mission and deployment factors where these environments should be con…

  18. Computer Center Reference Manual. Volume 1

    DTIC Science & Technology

    1990-09-30

    …with connection to INTERNET (host tables allow transfer to some other networks). OASYS - the DTRC Office Automation System. The following can be reached… …and buffers, two windows, and some word processing commands. Advanced editing commands are entered through the use of a command line. EVE has its own…

  19. Toward a Persistent Object Base.

    DTIC Science & Technology

    1986-07-01

    …would eliminate the user burden of explicitly invoking a decompressing program before each use of the compressed file. Another kind of flexible… …joined to Source.version. It is not the case, however, that if two relations have attributes with the same types it always makes sense to join them…

  20. Evidence for the Existing American Nurses Association-Recognized Standardized Nursing Terminologies: A Systematic Review

    PubMed Central

    Tastan, Sevinc; Linch, Graciele C. F.; Keenan, Gail M.; Stifter, Janet; McKinney, Dawn; Fahey, Linda; Dunn Lopez, Karen; Yao, Yingwei; Wilkie, Diana J.

    2014-01-01

    Objective To determine the state of the science for the five standardized nursing terminology sets in terms of level of evidence and study focus. Design Systematic Review. Data sources Keyword search of PubMed, CINAHL, and EMBASE databases from 1960s to March 19, 2012 revealed 1,257 publications. Review Methods From abstract review we removed duplicate articles, those not in English or with no identifiable standardized nursing terminology, and those with a low-level of evidence. From full text review of the remaining 312 articles, eight trained raters used a coding system to record standardized nursing terminology names, publication year, country, and study focus. Inter-rater reliability confirmed the level of evidence. We analyzed coded results. Results On average there were 4 studies per year between 1985 and 1995. The yearly number increased to 14 for the decade between 1996–2005, 21 between 2006–2010, and 25 in 2011. Investigators conducted the research in 27 countries. By evidence level for the 312 studies 72.4% were descriptive, 18.9% were observational, and 8.7% were intervention studies. Of the 312 reports, 72.1% focused on North American Nursing Diagnosis-International, Nursing Interventions Classification, Nursing Outcome Classification, or some combination of those three standardized nursing terminologies; 9.6% on Omaha System; 7.1% on International Classification for Nursing Practice; 1.6% on Clinical Care Classification/Home Health Care Classification; 1.6% on Perioperative Nursing Data Set; and 8.0% on two or more standardized nursing terminology sets. There were studies in all 10 foci categories including those focused on concept analysis/classification infrastructure (n = 43), the identification of the standardized nursing terminology concepts applicable to a health setting from registered nurses’ documentation (n = 54), mapping one terminology to another (n = 58), implementation of standardized nursing terminologies into electronic health records (n = 12), and secondary use of electronic health record data (n = 19). Conclusions Findings reveal that the number of standardized nursing terminology publications increased primarily since 2000 with most focusing on North American Nursing Diagnosis-International, Nursing Interventions Classification, and Nursing Outcome Classification. The majority of the studies were descriptive, qualitative, or correlational designs that provide a strong base for understanding the validity and reliability of the concepts underlying the standardized nursing terminologies. There is evidence supporting the successful integration and use in electronic health records for two standardized nursing terminology sets: (1) the North American Nursing Diagnosis-International, Nursing Interventions Classification, and Nursing Outcome Classification set; and (2) the Omaha System set. Researchers, however, should continue to strengthen standardized nursing terminology study designs to promote continuous improvement of the standardized nursing terminologies and use in clinical practice. PMID:24412062

  1. Tumor Slice Culture: A New Avatar in Personalized Oncology

    DTIC Science & Technology

    2017-08-01

    …sensitivity and to correlate the results with clinical and molecular data. …differences in pre-operative treatments. Indeed, the viability scores significantly correlated with pathologic assessment of tumor viability/necrosis…

  2. 36 CFR § 1238.14 - What are the microfilming requirements for permanent and unscheduled records?

    Code of Federal Regulations, 2013 CFR

    2013-07-01

    ... processing procedures in ANSI/AIIM MS1 and ANSI/AIIM MS23 (both incorporated by reference, see § 1238.5). (d... reference, see § 1238.5). (2) Background density of images. Agencies must use the background ISO standard... densities for images of documents are as follows: Classification Description of document Background density...

  3. UV Detector Materials Development Program

    DTIC Science & Technology

    1981-12-01

    …collection efficiency within the detector (internal quantum efficiency). As mentioned previously, it was found that reverse biasing the Schottky diodes… …the ratio of the number of carriers collected in the detector versus the number of photons entering the absorbing region. It is, therefore…

  4. Evaluation of the Retrieval of Metallurgical Document References using the Universal Decimal Classification in a Computer-Based System.

    ERIC Educational Resources Information Center

    Freeman, Robert R.

    A set of twenty-five questions was processed against a computer-stored file of 9159 document references in the field of ferrous metallurgy, representing the 1965 coverage of the Iron and Steel Institute (London) information service. A basis for evaluation of system performance characteristics and analysis of system failures was provided by using…

  5. Nonparametric projections of forest and rangeland condition indicators: A technical document supporting the 2005 USDA Forest Service RPA Assessment Update

    Treesearch

    John Hof; Curtis Flather; Tony Baltic; Rudy King

    2006-01-01

    The 2005 Forest and Rangeland Condition Indicator Model is a set of classification trees for forest and rangeland condition indicators at the national scale. This report documents the development of the database and the nonparametric statistical estimation for this analytical structure, with emphasis on three special characteristics of condition indicator production...

  6. A Study of the Role of Categories in a Thesaurus for Educational Documentation.

    ERIC Educational Resources Information Center

    Foskett, D. J.

    The field of education serves as the basis for this discussion on the use of categories in a thesaurus for information processing and documentation purposes. The author briefly shows how a number of writers concerned with the structure of the field of education, as well as makers of classification schemes, have commented on the value of setting up…

  7. Supervised Gamma Process Poisson Factorization

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Anderson, Dylan Zachary

    This thesis develops the supervised gamma process Poisson factorization (S-GPPF) framework, a novel supervised topic model for joint modeling of count matrices and document labels. S-GPPF is fully generative and nonparametric: document labels and count matrices are modeled under a unified probabilistic framework, and the number of latent topics is controlled automatically via a gamma process prior. The framework provides for multi-class classification of documents using a generative max-margin classifier. Several recent data augmentation techniques are leveraged to provide for exact inference using a Gibbs sampling scheme. The first portion of this thesis reviews supervised topic modeling and several key mathematical devices used in the formulation of S-GPPF. The thesis then introduces the S-GPPF generative model and derives the conditional posterior distributions of the latent variables for posterior inference via Gibbs sampling. The S-GPPF is shown to exhibit state-of-the-art performance for joint topic modeling and document classification on a dataset of conference abstracts, beating out competing supervised topic models. The unique properties of S-GPPF along with its competitive performance make it a novel contribution to supervised topic modeling.
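    The generative core that S-GPPF builds on, gamma-Poisson factorization of a document-term count matrix, can be sketched as follows; the supervised max-margin classifier and the gamma-process prior that selects the number of topics are omitted, and the fixed K below is an assumption for illustration.

    # Sketch: the generative core of gamma-Poisson factorization.
    import numpy as np

    rng = np.random.default_rng(0)
    D, V, K = 100, 500, 10          # documents, vocabulary, topics (K fixed here)

    theta = rng.gamma(shape=0.5, scale=1.0, size=(D, K))   # doc-topic rates
    beta = rng.gamma(shape=0.5, scale=1.0, size=(K, V))    # topic-word rates
    counts = rng.poisson(theta @ beta)                     # observed counts

    # Under this model E[counts] = theta @ beta; inference (e.g. Gibbs
    # sampling with data augmentation, as in the thesis) recovers
    # theta and beta from an observed count matrix.
    print(counts.shape, counts.sum())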

  8. A Novel Multi-Class Ensemble Model for Classifying Imbalanced Biomedical Datasets

    NASA Astrophysics Data System (ADS)

    Bikku, Thulasi; Sambasiva Rao, N., Dr; Rao, Akepogu Ananda, Dr

    2017-08-01

    This paper mainly focuses on developing a Hadoop-based framework with feature selection and classification models to classify high-dimensional data in heterogeneous biomedical databases. Extensive research has been performed in the fields of machine learning, big data, and data mining for identifying patterns. The main challenge is extracting useful features generated from diverse biological systems. The proposed model can be used for predicting diseases in various applications and identifying the features relevant to particular diseases. Given the exponential growth of biomedical repositories such as PubMed and Medline, an accurate predictive model is essential for knowledge discovery in a Hadoop environment. Extracting key features from unstructured documents often leads to uncertain results due to outliers and missing values. In this paper, we propose a two-phase map-reduce framework with a text preprocessor and a classification model. In the first phase, a mapper-based preprocessing method was designed to eliminate irrelevant features, missing values, and outliers from the biomedical data. In the second phase, a map-reduce-based multi-class ensemble decision tree model was designed and applied to the preprocessed mapper data to improve the true positive rate and computational time. The experimental results on complex biomedical datasets show that the performance of our proposed Hadoop-based multi-class ensemble model significantly outperforms state-of-the-art baselines.
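    A plain-Python sketch of the two-phase map/reduce shape described above follows; the stop list, the missing-value token, and the toy records are assumptions, and a real deployment would run as Hadoop jobs rather than in-process generators.

    # Sketch: phase-1 mapper drops irrelevant tokens and missing values;
    # the reducer aggregates term counts per class, standing in for the
    # ensemble training of phase 2.
    from collections import defaultdict

    STOP = {"the", "of", "and", "a", "in"}

    def mapper(record):
        label, text = record
        for token in text.lower().split():
            if token and token not in STOP and token != "na":  # "na" = missing
                yield (label, token), 1

    def reducer(pairs):
        totals = defaultdict(int)
        for key, value in pairs:
            totals[key] += value
        return totals

    records = [("diabetes", "elevated glucose na insulin resistance"),
               ("cardiac", "the troponin elevated in infarction")]
    counts = reducer(kv for rec in records for kv in mapper(rec))
    print(dict(counts))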

  9. Classification of clinically useful sentences in clinical evidence resources.

    PubMed

    Morid, Mohammad Amin; Fiszman, Marcelo; Raja, Kalpana; Jonnalagadda, Siddhartha R; Del Fiol, Guilherme

    2016-04-01

    Most patient care questions raised by clinicians can be answered by online clinical knowledge resources. However, important barriers still challenge the use of these resources at the point of care. To design and assess a method for extracting clinically useful sentences from synthesized online clinical resources that represent the most clinically useful information for directly answering clinicians' information needs. We developed a Kernel-based Bayesian Network classification model based on different domain-specific feature types extracted from sentences in a gold standard composed of 18 UpToDate documents. These features included UMLS concepts and their semantic groups, semantic predications extracted by SemRep, patient population identified by a pattern-based natural language processing (NLP) algorithm, and cue words extracted by a feature selection technique. Algorithm performance was measured in terms of precision, recall, and F-measure. The feature-rich approach yielded an F-measure of 74% versus 37% for a feature co-occurrence method (p<0.001). Excluding predication, population, semantic concept or text-based features reduced the F-measure to 62%, 66%, 58% and 69% respectively (p<0.01). The classifier applied to Medline sentences reached an F-measure of 73%, which is equivalent to the performance of the classifier on UpToDate sentences (p=0.62). The feature-rich approach significantly outperformed general baseline methods. This approach significantly outperformed classifiers based on a single type of feature. Different types of semantic features provided a unique contribution to overall classification performance. The classifier's model and features used for UpToDate generalized well to Medline abstracts. Copyright © 2016 Elsevier Inc. All rights reserved.

  10. Detection of text strings from mixed text/graphics images

    NASA Astrophysics Data System (ADS)

    Tsai, Chien-Hua; Papachristou, Christos A.

    2000-12-01

    A robust system for separating text strings from mixed text/graphics images is presented. Based on a union-find (region growing) strategy, the algorithm is able to distinguish text from graphics and adapts to changes in document type, language category (e.g., English, Chinese, and Japanese), text font style and size, and text string orientation within digital images. In addition, it tolerates the skew that commonly occurs in documents, without requiring skew correction prior to discrimination, whereas methods such as projection profiles or run-length coding are not always suitable under such conditions. The method has been tested with a variety of printed documents from different origins using one common set of parameters, and experimental results demonstrate the algorithm's performance in terms of computational efficiency on several test images from the evaluation.
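    The union-find (region growing) step can be sketched as below; the component centroids and the proximity rule are toy assumptions, and the paper's adaptation to font, size, and orientation is not reproduced.

    # Sketch: union-find merging of nearby character components into
    # text strings.
    def find(parent, i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    def union(parent, a, b):
        parent[find(parent, a)] = find(parent, b)

    # (x, y) centroids of connected components in a scanned page
    boxes = [(10, 20), (18, 21), (26, 20), (200, 300)]
    parent = list(range(len(boxes)))
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            (x1, y1), (x2, y2) = boxes[i], boxes[j]
            if abs(x1 - x2) < 12 and abs(y1 - y2) < 6:  # proximity rule
                union(parent, i, j)

    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(parent, i), []).append(boxes[i])
    print(list(groups.values()))  # one three-character string, one isolated mark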

  11. 41 CFR Appendix B to Part 60 - 300-Sample Invitation to Self-Identify

    Code of Federal Regulations, 2014 CFR

    2014-07-01

    ... classifications are defined as follows: • A “disabled veteran” is one of the following: • a veteran of the U.S...). THE DEFINITIONS OF THE SEPARATE CLASSIFICATIONS OF PROTECTED VETERANS SET FORTH IN PARAGRAPH 1 MUST... CLASSIFICATIONS OF PROTECTED VETERAN LISTED ABOVE [ ] I AM NOT A PROTECTED VETERAN [THE FOLLOWING TEXT SHOULD BE...

  12. Optimizing research in symptomatic uterine fibroids with development of a computable phenotype for use with electronic health records.

    PubMed

    Hoffman, Sarah R; Vines, Anissa I; Halladay, Jacqueline R; Pfaff, Emily; Schiff, Lauren; Westreich, Daniel; Sundaresan, Aditi; Johnson, La-Shell; Nicholson, Wanda K

    2018-06-01

    Women with symptomatic uterine fibroids can report a myriad of symptoms, including pain, bleeding, infertility, and psychosocial sequelae. Optimizing fibroid research requires the ability to enroll populations of women with image-confirmed symptomatic uterine fibroids. Our objective was to develop an electronic health record-based algorithm to identify women with symptomatic uterine fibroids for a comparative effectiveness study of medical or surgical treatments on quality-of-life measures. Using an iterative process and text-mining techniques, an effective computable phenotype algorithm, composed of demographics, and clinical and laboratory characteristics, was developed with reasonable performance. Such algorithms provide a feasible, efficient way to identify populations of women with symptomatic uterine fibroids for the conduct of large traditional or pragmatic trials and observational comparative effectiveness studies. Symptomatic uterine fibroids, due to menorrhagia, pelvic pain, bulk symptoms, or infertility, are a source of substantial morbidity for reproductive-age women. Comparing Treatment Options for Uterine Fibroids is a multisite registry study to compare the effectiveness of hormonal or surgical fibroid treatments on women's perceptions of their quality of life. Electronic health record-based algorithms are able to identify large numbers of women with fibroids, but additional work is needed to develop electronic health record algorithms that can identify women with symptomatic fibroids to optimize fibroid research. We sought to develop an efficient electronic health record-based algorithm that can identify women with symptomatic uterine fibroids in a large health care system for recruitment into large-scale observational and interventional research in fibroid management. We developed and assessed the accuracy of 3 algorithms to identify patients with symptomatic fibroids using an iterative approach. The data source was the Carolina Data Warehouse for Health, a repository for the health system's electronic health record data. In addition to International Classification of Diseases, Ninth Revision diagnosis and procedure codes and clinical characteristics, text data-mining software was used to derive information from imaging reports to confirm the presence of uterine fibroids. Results of each algorithm were compared with expert manual review to calculate the positive predictive values for each algorithm. Algorithm 1 was composed of the following criteria: (1) age 18-54 years; (2) either ≥1 International Classification of Diseases, Ninth Revision diagnosis codes for uterine fibroids or mention of fibroids using text-mined key words in imaging records or documents; and (3) no International Classification of Diseases, Ninth Revision or Current Procedural Terminology codes for hysterectomy and no reported history of hysterectomy. The positive predictive value was 47% (95% confidence interval 39-56%). Algorithm 2 required ≥2 International Classification of Diseases, Ninth Revision diagnosis codes for fibroids and positive text-mined key words and had a positive predictive value of 65% (95% confidence interval 50-79%). 
In algorithm 3, further refinements included ≥2 International Classification of Diseases, Ninth Revision diagnosis codes for fibroids on separate outpatient visit dates, the exclusion of women who had a positive pregnancy test within 3 months of their fibroid-related visit, and exclusion of incidentally detected fibroids during prenatal or emergency department visits. Algorithm 3 achieved a positive predictive value of 76% (95% confidence interval 71-81%). An electronic health record-based algorithm is capable of identifying cases of symptomatic uterine fibroids with moderate positive predictive value and may be an efficient approach for large-scale study recruitment. Copyright © 2018 Elsevier Inc. All rights reserved.
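    The shape of the third algorithm's inclusion and exclusion rules can be sketched as follows; the record fields, code list, and keyword check are illustrative stand-ins, not the study's Carolina Data Warehouse queries.

    # Sketch: rule-based computable phenotype in the shape of algorithm 3.
    FIBROID_ICD9 = {"218.0", "218.1", "218.2", "218.9"}
    KEYWORDS = ("fibroid", "leiomyoma", "myoma")

    def eligible(patient):
        if not 18 <= patient["age"] <= 54:
            return False
        # >= 2 fibroid diagnoses on separate outpatient visit dates
        visits = {date for code, date in patient["dx"] if code in FIBROID_ICD9}
        if len(visits) < 2:
            return False
        # imaging text must mention fibroids (text-mined key words)
        if not any(k in patient["imaging_text"].lower() for k in KEYWORDS):
            return False
        # exclusions: hysterectomy, recent positive pregnancy test
        if patient["hysterectomy"] or patient["pregnant_within_3mo"]:
            return False
        return True

    print(eligible({"age": 42,
                    "dx": [("218.9", "2017-01-03"), ("218.9", "2017-04-11")],
                    "imaging_text": "Multiple uterine fibroids noted.",
                    "hysterectomy": False, "pregnant_within_3mo": False}))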

  13. Graph-based layout analysis for PDF documents

    NASA Astrophysics Data System (ADS)

    Xu, Canhui; Tang, Zhi; Tao, Xin; Li, Yun; Shi, Cao

    2013-03-01

    To increase flexibility and enrich the reading experience of e-books on small portable screens, a graph-based method is proposed to perform layout analysis on Portable Document Format (PDF) documents. Digital-born documents have inherent advantages, such as representing texts and fractional images in explicit form, which can be straightforwardly exploited. To integrate traditional image-based document analysis with the inherent metadata provided by the PDF parser, the page primitives, including text, image, and path elements, are processed to produce text and non-text layers for separate analysis. The graph-based method operates at the superpixel representation level, and page text elements corresponding to vertices are used to construct an undirected graph. Euclidean distance between adjacent vertices is applied in a top-down manner to cut the graph tree formed by Kruskal's algorithm, and edge orientation is then used in a bottom-up manner to extract text lines from each subtree. Non-textual objects, on the other hand, are segmented by connected component analysis. For each segmented text and non-text composite, a 13-dimensional feature vector is extracted for labelling purposes. Experimental results on selected pages from PDF books are presented.
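    A compact sketch of the top-down step, Kruskal's algorithm over text-element centroids followed by a Euclidean distance cut, is shown below; the coordinates and cut threshold are toy assumptions, and the bottom-up edge-orientation pass is omitted.

    # Sketch: Kruskal's algorithm with a distance cut to group text
    # elements into blocks.
    import math

    points = [(10, 10), (30, 10), (50, 10), (10, 200), (30, 200)]
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(len(points)) for j in range(i + 1, len(points))
    )

    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    CUT = 60.0  # edges longer than this separate blocks (toy threshold)
    for w, i, j in edges:
        if w < CUT and find(i) != find(j):
            parent[find(i)] = find(j)     # Kruskal: add edge to the forest

    blocks = {}
    for i in range(len(points)):
        blocks.setdefault(find(i), []).append(points[i])
    print(list(blocks.values()))  # two blocks: the y=10 row and the y=200 row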

  14. Telemetry Standards, RCC Standard 106-17, Annex A.1, Pulse Amplitude Modulation Standards

    DTIC Science & Technology

    2017-07-01

    …conform to either of two figures. The first shows 50 percent duty cycle PAM with amplitude synchronization; a 20-25 percent deviation reserved for pulse synchronization is recommended…

  15. Reading and Writing in the 21st Century.

    ERIC Educational Resources Information Center

    Soloway, Elliot; And Others

    1993-01-01

    Describes MediaText, a multimedia document processor developed at the University of Michigan that allows the incorporation of video, music, sound, animations, still images, and text into one document. Interactive documents are discussed, and the need for users to be able to write documents as well as read them is emphasized. (four references) (LRW)

  16. Electronic Documentation Support Tools and Text Duplication in the Electronic Medical Record

    ERIC Educational Resources Information Center

    Wrenn, Jesse

    2010-01-01

    In order to ease the burden of electronic note entry on physicians, electronic documentation support tools have been developed to assist in note authoring. There is little evidence of the effects of these tools on attributes of clinical documentation, including document quality. Furthermore, the resultant abundance of duplicated text and…

  17. Integration of a knowledge-based system and a clinical documentation system via a data dictionary.

    PubMed

    Eich, H P; Ohmann, C; Keim, E; Lang, K

    1997-01-01

    This paper describes the design and realisation of a knowledge-based system and a clinical documentation system linked via a data dictionary. The software was developed as a shell with object oriented methods and C++ for IBM-compatible PC's and WINDOWS 3.1/95. The data dictionary covers terminology and document objects with relations to external classifications. It controls the terminology in the documentation program with form-based entry of clinical documents and in the knowledge-based system with scores and rules. The software was applied to the clinical field of acute abdominal pain by implementing a data dictionary with 580 terminology objects, 501 document objects, and 2136 links; a documentation module with 8 clinical documents and a knowledge-based system with 10 scores and 7 sets of rules.

  18. Semi-Automated Methods for Refining a Domain-Specific Terminology Base

    DTIC Science & Technology

    2011-02-01

    …only as a resource for written and oral translation, but also for Natural Language Processing (NLP) applications, text retrieval, document indexing, and other knowledge management tasks. The objective of this… The National…

  19. Thematic clustering of text documents using an EM-based approach

    PubMed Central

    2012-01-01

    Clustering textual contents is an important step in mining useful information on the web or other text-based resources. The common task in text clustering is to handle text in a multi-dimensional space, and to partition documents into groups, where each group contains documents that are similar to each other. However, this strategy lacks a comprehensive view for humans in general since it cannot explain the main subject of each cluster. Utilizing semantic information can solve this problem, but it needs a well-defined ontology or pre-labeled gold standard set. In this paper, we present a thematic clustering algorithm for text documents. Given text, subject terms are extracted and used for clustering documents in a probabilistic framework. An EM approach is used to ensure documents are assigned to correct subjects, hence it converges to a locally optimal solution. The proposed method is distinctive because its results are sufficiently explanatory for human understanding as well as efficient for clustering performance. The experimental results show that the proposed method provides a competitive performance compared to other state-of-the-art approaches. We also show that the extracted themes from the MEDLINE® dataset represent the subjects of clusters reasonably well. PMID:23046528
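    The underlying mechanism, EM over a mixture of multinomials on term counts, can be sketched as follows; the toy count matrix and the fixed number of clusters are assumptions, and the paper's subject-term extraction is not reproduced.

    # Sketch: EM for a mixture of multinomials over document-term counts.
    import numpy as np

    rng = np.random.default_rng(1)
    X = np.array([[5, 1, 0], [4, 2, 0],      # cluster A: term 0 heavy
                  [0, 1, 5], [0, 2, 4]])     # cluster B: term 2 heavy
    D, V = X.shape
    K = 2
    pi = np.full(K, 1.0 / K)                 # mixing weights
    phi = rng.dirichlet(np.ones(V), size=K)  # per-cluster term distributions

    for _ in range(50):
        # E-step: responsibilities from each document's log-likelihood
        log_r = np.log(pi) + X @ np.log(phi).T
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and term distributions
        pi = r.mean(axis=0)
        phi = (r.T @ X) + 1e-9
        phi /= phi.sum(axis=1, keepdims=True)

    print(r.argmax(axis=1))  # cluster assignments, e.g. [0 0 1 1]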

  20. Document image improvement for OCR as a classification problem

    NASA Astrophysics Data System (ADS)

    Summers, Kristen M.

    2003-01-01

    In support of the goal of automatically selecting methods of enhancing an image to improve the accuracy of OCR on that image, we consider the problem of determining whether to apply each of a set of methods as a supervised classification problem for machine learning. We characterize each image according to a combination of two sets of measures: a set that are intended to reflect the degree of particular types of noise present in documents in a single font of Roman or similar script and a more general set based on connected component statistics. We consider several potential methods of image improvement, each of which constitutes its own 2-class classification problem, according to whether transforming the image with this method improves the accuracy of OCR. In our experiments, the results varied for the different image transformation methods, but the system made the correct choice in 77% of the cases in which the decision affected the OCR score (in the range [0,1]) by at least .01, and it made the correct choice 64% of the time overall.
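    The paper's framing, one binary classifier per candidate image transformation, can be sketched as below; the features, labels, and the decision-tree learner are illustrative assumptions rather than the authors' choices.

    # Sketch: treating each image-improvement method as its own 2-class
    # problem ("apply it or not") learned from image features.
    from sklearn.tree import DecisionTreeClassifier

    # feature vector per image: [speckle_score, cc_density, mean_cc_size]
    X_train = [[0.9, 0.8, 2.1], [0.1, 0.2, 9.5],
               [0.7, 0.9, 1.8], [0.2, 0.1, 8.9]]
    y_despeckle = [1, 0, 1, 0]   # 1 = despeckling improved OCR accuracy

    despeckle_clf = DecisionTreeClassifier(max_depth=2).fit(X_train, y_despeckle)
    # one such classifier is trained per candidate transformation method
    print(despeckle_clf.predict([[0.8, 0.7, 2.0]]))  # [1]: apply despeckling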

  1. Fast words boundaries localization in text fields for low quality document images

    NASA Astrophysics Data System (ADS)

    Ilin, Dmitry; Novikov, Dmitriy; Polevoy, Dmitry; Nikolaev, Dmitry

    2018-04-01

    The paper examines the problem of precise localization of word boundaries in document text zones. Document processing on a mobile device consists of document localization, perspective correction, localization of individual fields, finding words in separate zones, segmentation, and recognition. While capturing an image with a mobile digital camera under uncontrolled conditions, digital noise, perspective distortions, or glares may occur. Further document processing is complicated by document specifics: layout elements, complex backgrounds, static text, document security elements, and a variety of text fonts. The problem of word boundary localization, however, has to be solved at runtime on a mobile CPU with limited computing capabilities under the specified restrictions. At the moment, there are several groups of methods optimized for different conditions. Methods for scanned printed text are quick but limited to images of high quality. Methods for text in the wild have excessively high computational complexity and thus are hardly suitable for running on mobile devices as part of a mobile document recognition system. The method presented in this paper solves a more specialized problem than the task of finding text in natural images. It uses local features, a sliding window, and a lightweight neural network in order to achieve an optimal speed-precision ratio. The algorithm takes 12 ms per field running on an ARM processor of a mobile device. The error rate for boundary localization on a test sample of 8000 fields is 0.3

  2. Documentation of pharmaceutical care: Validation of an intervention oriented classification system.

    PubMed

    Maes, Karen A; Studer, Helene; Berger, Jérôme; Hersberger, Kurt E; Lampert, Markus L

    2017-12-01

    During the dispensing process, pharmacists may come across technical and clinical issues requiring a pharmaceutical intervention (PI). An intervention-oriented classification system is a helpful tool to document these PIs in a structured manner. Therefore, we developed the PharmDISC classification system (Pharmacists' Documentation of Interventions in Seamless Care). The aim of this study was to evaluate the PharmDISC system in the daily practice environment (in terms of interrater reliability, appropriateness, interpretability, acceptability, feasibility, and validity); to assess its user satisfaction, the descriptive manual, and the online training; and to explore first implementation aspects. Twenty-one pharmacists from different community pharmacies each classified 30 prescriptions requiring a PI with the PharmDISC system on 5 selected days within 5 weeks. Interrater reliability was determined using model PIs and Fleiss's kappa coefficients (κ) were calculated. User satisfaction was assessed by questionnaire with a 4-point Likert scale. The main outcome measures were interrater reliability (κ); appropriateness, interpretability, validity (ratio of completely classified PIs/all PIs); feasibility, and acceptability (user satisfaction and suggestions). The PharmDISC system reached an average substantial agreement (κ = 0.66). Of documented 519 PIs, 430 (82.9%) were completely classified. Most users found the system comprehensive (median user agreement 3 [2/3.25 quartiles]) and practical (3[2.75/3]). The PharmDISC system raised the awareness regarding drug-related problems for most users (n = 16). To facilitate its implementation, an electronic version that automatically connects to the prescription together with a task manager for PIs needing follow-up was suggested. Barriers could be time expenditure and lack of understanding the benefits. Substantial interrater reliability and acceptable user satisfaction indicate that the PharmDISC system is a valid system to document PIs in daily community pharmacy practice. © 2017 John Wiley & Sons, Ltd.

  3. Automated Authorship Attribution Using Advanced Signal Classification Techniques

    PubMed Central

    Ebrahimpour, Maryam; Putniņš, Tālis J.; Berryman, Matthew J.; Allison, Andrew; Ng, Brian W.-H.; Abbott, Derek

    2013-01-01

    In this paper, we develop two automated authorship attribution schemes, one based on Multiple Discriminant Analysis (MDA) and the other based on a Support Vector Machine (SVM). The classification features we exploit are based on word frequencies in the text. We adopt an approach of preprocessing each text by stripping it of all characters except a-z and space. This is in order to increase the portability of the software to different types of texts. We test the methodology on a corpus of undisputed English texts, and use leave-one-out cross validation to demonstrate classification accuracies in excess of 90%. We further test our methods on the Federalist Papers, which have a partly disputed authorship and a fair degree of scholarly consensus. And finally, we apply our methodology to the question of the authorship of the Letter to the Hebrews by comparing it against a number of original Greek texts of known authorship. These tests identify where some of the limitations lie, motivating a number of open questions for future work. An open source implementation of our methodology is freely available for use at https://github.com/matthewberryman/author-detection. PMID:23437047
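    The preprocessing and classification steps described above can be sketched as follows; the toy excerpts and the specific vectorizer/SVM configuration are assumptions, and the authors' open-source implementation (linked above) is the authoritative version.

    # Sketch: strip all characters except a-z and space, then classify
    # authors from word frequencies with a linear SVM.
    import re
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    def strip_text(text):
        return re.sub(r"[^a-z ]", "", text.lower())

    texts = ["It is a truth universally acknowledged...",
             "Whilst the debate continues, the union endures.",
             "A single man in possession of a good fortune...",
             "The union of the states must be preserved."]
    authors = ["austen", "federalist", "austen", "federalist"]

    clf = make_pipeline(CountVectorizer(), LinearSVC())
    clf.fit([strip_text(t) for t in texts], authors)
    print(clf.predict([strip_text("The states and the union continue.")]))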

  4. Thermal Runaway Due to Strain-Heating Feedback

    DTIC Science & Technology

    1985-05-28

    …instability. The thermal runaway phenomenon has been discussed in the geophysics literature (e.g. Brun and Cobbold 1980, Cary et al. 1979 and Wan et al.…) …pp. 325-342. Boley, B.A. and Weiner, J.H., 1960, Theory of Thermal Stresses, John Wiley and Sons, Inc. Brun, Y.P. and Cobbold, P.R., 1980, Strain…

  5. Guidance for Maintenance Task Identification and Analysis: Organizational and Intermediate Maintenance.

    DTIC Science & Technology

    1980-09-01

    …application of that specification. …Sample data sheet for use in user analysis… Sample data sheet G for use in user analysis…

  6. Authorship Discovery in Blogs Using Bayesian Classification with Corrective Scaling

    DTIC Science & Technology

    2008-06-01

    …W. Fucks' diagram of n-syllable word frequencies… …confusion matrix for all test documents of 500… …of the books which scholars believed he had. Wilhelm Fucks discriminated between authors using the average number of syllables per word and average distance between equal-syllabled words [8]. Fucks, too, concluded that a study such as his reveals a “possibility of a quantitative classification…

  7. A Comparison of Tactical Leader Decision Making Between Automated and Live Counterparts in a Virtual Environment

    DTIC Science & Technology

    2014-06-01

    Scott A. Patton, June 2014. Thesis Advisor: Quinn Kennedy; Second Reader: Jonathan Alt. …Robotic Integration…

  8. How Space - The Fourth Operational Medium - Supports Operational Maneuver.

    DTIC Science & Technology

    1987-05-17

    …Space technology, superior and enhanced weapons, and space systems combine to form spacepower that can be exploited to enhance ground force mission…

  9. Management Overview of System Technical Support Plan for the FIREFINDER System Support Center.

    DTIC Science & Technology

    1980-08-06

    …Evaluation Agency; USASA: U.S. Army Security Agency; UTM: Universal Transverse Mercator; V: Volts; V&V: Verification and Validation; VCSA: Vice-Chief of Staff, Army; VDD… …classification, configuration audits, and so forth. INSTR 5010.27, Management of Automatic Data System Development, 9 November 1971: establishes uniform…

  10. Hardened Reentry Vehicle Development Program. Erosion-Resistant Nosetip Technology

    DTIC Science & Technology

    1978-01-01

    …tests indicate a low probability of survival for…

  11. Clinical, aetiological, anatomical and pathological classification (CEAP): gold standard and limits.

    PubMed

    Rabe, E; Pannier, F

    2012-03-01

    The first CEAP (clinical, aetiological, anatomical and pathological elements) consensus document was published after a consensus conference of the American Venous Forum, held at the sixth annual meeting of the AVF in February 1994 in Maui, Hawaii. In the following years, the CEAP classification was published in many international journals and books, which has led to widespread international use of the classification since 1995. The aim of this paper is to review the benefits and limits of CEAP from the available literature. In a recent Medline search with the keywords 'CEAP' and 'venous insufficiency', 266 publications using the CEAP classification in venous diseases are available. The CEAP classification was accepted in the venous community and used in scientific publications, but in most cases only the clinical classification was used. Limitations of the first version, including a lack of clear definitions of clinical signs, led to a revised version. The CEAP classification is the gold standard of classification of chronic venous disorders today. Nevertheless, for proper use some facts have to be taken into account: the CEAP classification is not a severity classification, C2 summarizes all kinds of varicose veins, in C3 it may be difficult to separate venous and other reasons for oedema, and corona phlebectatica is not included in the classification. Further revisions of the CEAP classification may help to overcome the still-existing deficits.

  12. Documents Similarity Measurement Using Field Association Terms.

    ERIC Educational Resources Information Center

    Atlam, El-Sayed; Fuketa, M.; Morita, K.; Aoe, Jun-ichi

    2003-01-01

    Discussion of text analysis and information retrieval and measurement of document similarity focuses on a new text manipulation system called FA (field association)-Sim that is useful for retrieving information in large heterogeneous texts and for recognizing content similarity in text excerpts. Discusses recall and precision, automatic indexing…

  13. Project development process.

    DOT National Transportation Integrated Search

    2007-08-01

    The following chapter explains the purpose of this document, outlines the essential elements involved in : the Project Development Process, describes the differences in the three main project classifications, and : provides the necessary background i...

  14. The first step in the development of text mining technology for cancer risk assessment: identifying and organizing scientific evidence in risk assessment literature

    PubMed Central

    Korhonen, Anna; Silins, Ilona; Sun, Lin; Stenius, Ulla

    2009-01-01

    Background One of the most neglected areas of biomedical Text Mining (TM) is the development of systems based on carefully assessed user needs. We have recently investigated the user needs of an important task yet to be tackled by TM -- Cancer Risk Assessment (CRA). Here we take the first step towards the development of TM technology for the task: identifying and organizing the scientific evidence required for CRA in a taxonomy which is capable of supporting extensive data gathering from biomedical literature. Results The taxonomy is based on expert annotation of 1297 abstracts downloaded from relevant PubMed journals. It classifies 1742 unique keywords found in the corpus to 48 classes which specify core evidence required for CRA. We report promising results with inter-annotator agreement tests and automatic classification of PubMed abstracts to taxonomy classes. A simple user test is also reported in a near real-world CRA scenario which demonstrates along with other evaluation that the resources we have built are well-defined, accurate, and applicable in practice. Conclusion We present our annotation guidelines and a tool which we have designed for expert annotation of PubMed abstracts. A corpus annotated for keywords and document relevance is also presented, along with the taxonomy which organizes the keywords into classes defining core evidence for CRA. As demonstrated by the evaluation, the materials we have constructed provide a good basis for classification of CRA literature along multiple dimensions. They can support current manual CRA as well as facilitate the development of an approach based on TM. We discuss extending the taxonomy further via manual and machine learning approaches and the subsequent steps required to develop TM technology for the needs of CRA. PMID:19772619

  15. v3NLP Framework: Tools to Build Applications for Extracting Concepts from Clinical Text

    PubMed Central

    Divita, Guy; Carter, Marjorie E.; Tran, Le-Thuy; Redd, Doug; Zeng, Qing T; Duvall, Scott; Samore, Matthew H.; Gundlapalli, Adi V.

    2016-01-01

    Introduction: Substantial amounts of clinically significant information are contained only within the narrative of the clinical notes in electronic medical records. The v3NLP Framework is a set of “best-of-breed” functionalities developed to transform this information into structured data for use in quality improvement, research, population health surveillance, and decision support. Background: MetaMap, cTAKES and similar well-known natural language processing (NLP) tools do not have sufficient scalability out of the box. The v3NLP Framework evolved out of the necessity to scale these tools up and to provide a framework to customize and tune techniques that fit a variety of tasks, including document classification, tuned concept extraction for specific conditions, patient classification, and information retrieval. Innovation: Beyond scalability, several v3NLP Framework-developed projects have been efficacy tested and benchmarked. While v3NLP Framework includes annotators, pipelines and applications, its functionalities enable developers to create novel annotators and to place annotators into pipelines and scaled applications. Discussion: The v3NLP Framework has been successfully utilized in many projects including general concept extraction, risk factors for homelessness among veterans, and identification of mentions of the presence of an indwelling urinary catheter. Projects as diverse as predicting colonization with methicillin-resistant Staphylococcus aureus and extracting references to military sexual trauma are being built using v3NLP Framework components. Conclusion: The v3NLP Framework is a set of functionalities and components that provide Java developers with the ability to create novel annotators and to place those annotators into pipelines and applications to extract concepts from clinical text. There are scale-up and scale-out functionalities to process large numbers of records. PMID:27683667

  16. Means of storage and automated monitoring of versions of text technical documentation

    NASA Astrophysics Data System (ADS)

    Leonovets, S. A.; Shukalov, A. V.; Zharinov, I. O.

    2018-03-01

    The paper considers automation of the preparation, storage, and version monitoring of textual designer and program documentation by means of specialized software. Automated preparation of documentation is based on processing the engineering data contained in the specifications and technical documentation. Data handling assumes the existence of strictly structured electronic documents, prepared in widespread formats according to templates based on industry standards, from which the program or designer text document is generated by an automated method. The subsequent life cycle of the document and the engineering data it contains is then controlled, with archival data storage carried out at each stage of the life cycle. Performance studies of different widespread document formats under automated monitoring and storage are given. The newly developed software and the workbenches available to the developer of instrumentation equipment are described.
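    A minimal sketch of hash-based version monitoring follows, one plausible mechanism for the automated monitoring described above; the archive layout and file names are assumptions, not the authors' system.

    # Sketch: detect and record new versions of a generated document by
    # content hash.
    import hashlib
    import json
    from pathlib import Path

    ARCHIVE = Path("doc_versions.json")  # hypothetical archive file

    def sha256(path):
        return hashlib.sha256(Path(path).read_bytes()).hexdigest()

    def check_in(path):
        # Load the stored version history and append a new entry only
        # when the document's content hash has changed.
        history = json.loads(ARCHIVE.read_text()) if ARCHIVE.exists() else {}
        digest = sha256(path)
        versions = history.setdefault(str(path), [])
        if not versions or versions[-1] != digest:
            versions.append(digest)        # new version detected
        ARCHIVE.write_text(json.dumps(history, indent=2))
        return len(versions)

    # print(check_in("spec_document.txt"))  # version count after check-in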

  17. Fusion and Sense Making of Heterogeneous Sensor Network and Other Sources

    DTIC Science & Technology

    2017-03-16

    …multimodal fusion framework that uses both training data and web resources for scene classification; the experimental results on the benchmark datasets show that the proposed text-aided scene classification framework can significantly improve classification performance. Experimental results also show… …human whose adaptability is achieved by reliability-dependent weighting of different sensory modalities. Experimental results show that the proposed…

  18. Sampling surface and subsurface particle-size distributions in wadable gravel- and cobble-bed streams for analyses in sediment transport, hydraulics, and streambed monitoring

    Treesearch

    Kristin Bunte; Steven R. Abt

    2001-01-01

    This document provides guidance for sampling surface and subsurface sediment from wadable gravel- and cobble-bed streams. After a short introduction to stream types and classifications in gravel-bed rivers, the document explains the field and laboratory measurement of particle sizes and the statistical analysis of particle-size distributions. Analysis of particle...

  19. Developing a maximum energy efficiency improvement target for SIC 28: chemicals and allied products. Volume 3. Draft target and support document. Appendices. Part 2. [Soaps, cosmetics, detergents, and perfumes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Not Available

    1976-07-01

    Part 2 of this appendix contains the detailed supporting documentation and rationale for the energy efficiency improvement goals for each of the component industries in Standard Industrial Classification (SIC) 284 which includes soap, detergents and cleaning preparations, and cosmetics, perfumes and other toilet preparations.

  20. Structural Characterisation of Proteins from the Peroxiredoxin Family

    DTIC Science & Technology

    2014-01-01

    The oligomerisation of protein subunits is an area of much research interest, in particular the relationship to protein… …peroxiredoxin, tecton, supramolecular assembly, TEM…
