Science.gov

Sample records for automatic text classification

  1. Information fusion for automatic text classification

    SciTech Connect

    Dasigi, V.; Mann, R.C.; Protopopescu, V.A.

    1996-08-01

    Analysis and classification of free text documents encompass decision-making processes that rely on several clues derived from text and other contextual information. When using multiple clues, it is generally not known a priori how these should be integrated into a decision. An algorithmic sensor based on Latent Semantic Indexing (LSI) (a recent successful method for text retrieval rather than classification) is the primary sensor used in our work, but its utility is limited by the reference library of documents. Thus, there is an important need to complement or at least supplement this sensor. We have developed a system that uses a neural network to integrate the LSI-based sensor with other clues derived from the text. This approach allows for systematic fusion of several information sources in order to determine a combined best decision about the category to which a document belongs.
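
    A minimal sketch of the idea, assuming scikit-learn: TruncatedSVD over a TF-IDF term-document matrix stands in for the LSI sensor, a couple of crude surface statistics stand in for the other clues, and a small neural network fuses them. The corpus, labels, and feature choices are invented for illustration and are not the system described in this record.

    ```python
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import train_test_split

    # Toy corpus; repeated to give the classifier enough samples.
    docs = ["reactor core temperature exceeded safety threshold",
            "quarterly earnings rose on strong retail demand",
            "neutron flux measurements during shutdown procedure",
            "stock prices fell after the central bank announcement"] * 25
    labels = ["nuclear", "finance", "nuclear", "finance"] * 25

    # Sensor 1: LSI-style features from a TF-IDF term-document matrix.
    X_tfidf = TfidfVectorizer().fit_transform(docs)
    X_lsi = TruncatedSVD(n_components=10, random_state=0).fit_transform(X_tfidf)

    # Sensor 2: crude surface clues (token count, mean word length).
    X_surface = np.array([[len(d.split()), np.mean([len(w) for w in d.split()])]
                          for d in docs])

    # Fusion: concatenate the sensors and let a neural network learn the weighting.
    X = np.hstack([X_lsi, X_surface])
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
    clf.fit(X_tr, y_tr)
    print("fused-sensor accuracy:", clf.score(X_te, y_te))
    ```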

  2. Portable Automatic Text Classification for Adverse Drug Reaction Detection via Multi-corpus Training

    PubMed Central

    Gonzalez, Graciela

    2014-01-01

    Objective Automatic detection of Adverse Drug Reaction (ADR) mentions from text has recently received significant interest in pharmacovigilance research. Current research focuses on various sources of text-based information, including social media, where enormous amounts of user-posted data are available and have the potential for use in pharmacovigilance if collected and filtered accurately. The aims of this study are: (i) to explore natural language processing approaches for generating useful features from text, and utilizing them in optimized machine learning algorithms for automatic classification of ADR-assertive text segments; (ii) to present two data sets that we prepared for the task of ADR detection from user-posted internet data; and (iii) to investigate if combining training data from distinct corpora can improve automatic classification accuracies. Methods One of our three data sets contains annotated sentences from clinical reports, and the two other data sets, built in-house, consist of annotated posts from social media. Our text classification approach relies on generating a large set of features, representing semantic properties (e.g., sentiment, polarity, and topic), from short text nuggets. Importantly, using our expanded feature sets, we combine training data from different corpora in attempts to boost classification accuracies. Results Our feature-rich classification approach performs significantly better than previously published approaches with ADR class F-scores of 0.812 (previously reported best: 0.770), 0.538 and 0.678 for the three data sets. Combining training data from multiple compatible corpora further improves the ADR F-scores for the in-house data sets to 0.597 (improvement of 5.9 units) and 0.704 (improvement of 2.6 units) respectively. Conclusions Our research results indicate that using advanced NLP techniques for generating information-rich features from text can significantly improve classification accuracies over existing
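
    The multi-corpus idea can be illustrated with a short scikit-learn sketch: train once on a single annotated corpus, then again with a second corpus appended, and compare ADR-class F-scores. All sentences, labels, and scores here are invented placeholders, not the paper's data or feature set.

    ```python
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.metrics import f1_score

    corpus_a = [("severe headache after taking the drug", 1),
                ("no side effects so far", 0),
                ("felt dizzy and nauseous since starting treatment", 1),
                ("the pharmacy was closed today", 0)] * 20
    corpus_b = [("rash appeared two days after the first dose", 1),
                ("refilled my prescription this morning", 0)] * 20
    test_set = [("terrible nausea from this medication", 1),
                ("picked up the medication at noon", 0)] * 10

    def train_and_eval(train):
        """Fit a TF-IDF + linear SVM pipeline and return the ADR-class F1."""
        X, y = zip(*train)
        X_test, y_test = zip(*test_set)
        clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
        clf.fit(X, y)
        return f1_score(y_test, clf.predict(X_test))

    print("corpus A only      :", train_and_eval(corpus_a))
    print("corpus A + corpus B:", train_and_eval(corpus_a + corpus_b))
    ```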

  3. Toward a multi-sensor-based approach to automatic text classification

    SciTech Connect

    Dasigi, V.R.; Mann, R.C.

    1995-10-01

    Many automatic text indexing and retrieval methods use a term-document matrix that is automatically derived from the text in question. Latent Semantic Indexing is a method, recently proposed in the Information Retrieval (IR) literature, for approximating a large and sparse term-document matrix with a relatively small number of factors, and is based on a solid mathematical foundation. LSI appears to be quite useful in the problem of text information retrieval, rather than text classification. In this report, we outline a method that attempts to combine the strength of the LSI method with that of neural networks, in addressing the problem of text classification. In doing so, we also indicate ways to improve performance by adding additional "logical sensors" to the neural network, something that is hard to do with the LSI method when employed by itself. The various programs that can be used in testing the system with the TIPSTER data set are described. Preliminary results are summarized, but much work remains to be done.

  4. Text Classification for Automatic Detection of E-Cigarette Use and Use for Smoking Cessation from Twitter: A Feasibility Pilot

    PubMed Central

    Aphinyanaphongs, Yin; Lulejian, Armine; Brown, Duncan Penfold; Bonneau, Richard; Krebs, Paul

    2015-01-01

    Rapid increases in e-cigarette use and potential exposure to harmful byproducts have shifted public health focus to e-cigarettes as a possible drug of abuse. Effective surveillance of use and prevalence would allow appropriate regulatory responses. An ideal surveillance system would collect usage data in real time, focus on populations of interest, include populations unable to take the survey, allow a breadth of questions to answer, and enable geo-location analysis. Social media streams may provide this ideal system. To realize this use case, a foundational question is whether we can detect e-cigarette use at all. This work reports two pilot tasks using text classification to automatically identify Tweets that indicate e-cigarette use and/or e-cigarette use for smoking cessation. We build and define both datasets and compare the performance of 4 state-of-the-art classifiers and a keyword search for each task. Our results demonstrate excellent classifier performance of up to 0.90 and 0.94 area under the curve in each category. These promising initial results form the foundation for further studies to realize the ideal surveillance solution. PMID:26776211
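
    A hedged sketch of this kind of evaluation with scikit-learn: vectorize short posts, train several standard classifiers, and compare them by area under the ROC curve. The tweets, labels, and the four classifiers chosen are illustrative assumptions, not the study's actual data or models.

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    tweets = ["just bought a new vape pen, loving it",
              "quit smoking last month thanks to my e-cig",
              "watching the game tonight with friends",
              "coffee and rain, perfect morning"] * 30
    labels = [1, 1, 0, 0] * 30  # 1 = post indicates e-cigarette use

    X_tr, X_te, y_tr, y_te = train_test_split(tweets, labels, test_size=0.3,
                                              random_state=0, stratify=labels)
    vec = TfidfVectorizer()
    X_tr_v, X_te_v = vec.fit_transform(X_tr), vec.transform(X_te)

    models = {"logreg": LogisticRegression(max_iter=1000),
              "naive bayes": MultinomialNB(),
              "random forest": RandomForestClassifier(random_state=0),
              "svm (rbf)": SVC(probability=True, random_state=0)}
    for name, model in models.items():
        model.fit(X_tr_v, y_tr)
        scores = model.predict_proba(X_te_v)[:, 1]
        print(f"{name:14s} AUC = {roc_auc_score(y_te, scores):.3f}")
    ```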

  5. Automatic segmentation of clinical texts.

    PubMed

    Apostolova, Emilia; Channin, David S; Demner-Fushman, Dina; Furst, Jacob; Lytinen, Steven; Raicu, Daniela

    2009-01-01

    Clinical narratives, such as radiology and pathology reports, are commonly available in electronic form. However, they are also commonly entered and stored as free text. Knowledge of the structure of clinical narratives is necessary for enhancing the productivity of healthcare departments and facilitating research. This study attempts to automatically segment medical reports into semantic sections. Our goal is to develop a robust and scalable medical report segmentation system requiring minimum user input for efficient retrieval and extraction of information from free-text clinical narratives. Hand-crafted rules were used to automatically identify a high-confidence training set. This automatically created training dataset was later used to develop metrics and an algorithm that determines the semantic structure of the medical reports. A word-vector cosine similarity metric combined with several heuristics was used to classify each report sentence into one of several pre-defined semantic sections. This baseline algorithm achieved 79% accuracy. A Support Vector Machine (SVM) classifier trained on additional formatting and contextual features was able to achieve 90% accuracy. Plans for future work include developing a configurable system that could accommodate various medical report formatting and content standards. PMID:19965054
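
    A minimal sketch of the cosine-similarity baseline, assuming scikit-learn: build a TF-IDF centroid for each pre-defined section from a few seed sentences and assign new sentences to the most similar section. The sections and sentences are invented, and the SVM stage with formatting features is omitted.

    ```python
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    seed_sentences = {
        "Indication": ["chest pain and shortness of breath",
                       "evaluation of persistent cough"],
        "Findings": ["the lungs are clear without focal consolidation",
                     "no pleural effusion or pneumothorax is seen"],
        "Impression": ["no acute cardiopulmonary abnormality",
                       "findings consistent with mild emphysema"],
    }

    vec = TfidfVectorizer()
    vec.fit([s for sents in seed_sentences.values() for s in sents])
    centroids = {section: np.asarray(vec.transform(sents).mean(axis=0))
                 for section, sents in seed_sentences.items()}

    def classify(sentence):
        """Assign the sentence to the section with the most similar centroid."""
        v = vec.transform([sentence]).toarray()
        sims = {sec: cosine_similarity(v, c)[0, 0] for sec, c in centroids.items()}
        return max(sims, key=sims.get)

    print(classify("patient presents with worsening dyspnea"))
    print(classify("heart size is within normal limits"))
    ```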

  6. Computing symmetrical strength of N-grams: a two pass filtering approach in automatic classification of text documents.

    PubMed

    Agnihotri, Deepak; Verma, Kesari; Tripathi, Priyanka

    2016-01-01

    The contiguous sequences of terms (N-grams) in documents are symmetrically distributed among different classes. This symmetrical distribution raises uncertainty about the class to which an N-gram belongs. In this paper, we focus on the selection of the most discriminating N-grams by reducing the effects of symmetrical distribution. In this context, a new text feature selection method, named the symmetrical strength of the N-grams (SSNG), is proposed using a two pass filtering based feature selection (TPF) approach. Initially, in the first pass of the TPF, the SSNG method chooses various informative N-grams from the entire set of N-grams extracted from the corpus. Subsequently, in the second pass, the well-known Chi Square (χ²) method is used to select the few most informative N-grams. Further, to classify the documents, the two standard classifiers Multinomial Naive Bayes and Linear Support Vector Machine were applied to ten standard text data sets. In most of the datasets, the experimental results show that the performance and success rate of the SSNG method with the TPF approach are superior to the state-of-the-art methods viz. Mutual Information, Information Gain, Odds Ratio, Discriminating Feature Selection and χ². PMID:27386386
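
    A hedged sketch of the second filtering pass only, assuming scikit-learn: score word N-grams with the chi-square statistic, keep the top-ranked ones, and train Multinomial Naive Bayes and a linear SVM. The first (SSNG) pass is not reproduced, and the documents are toy data.

    ```python
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    docs = ["the match ended in a late penalty shootout",
            "parliament passed the new budget bill",
            "the striker scored a stunning goal",
            "the senate debated the tax reform"] * 25
    labels = ["sport", "politics", "sport", "politics"] * 25

    for clf in (MultinomialNB(), LinearSVC()):
        pipe = make_pipeline(CountVectorizer(ngram_range=(1, 3)),  # word 1- to 3-grams
                             SelectKBest(chi2, k=40),              # second-pass filter
                             clf)
        acc = cross_val_score(pipe, docs, labels, cv=5).mean()
        print(type(clf).__name__, "accuracy:", round(acc, 3))
    ```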

  7. Injury narrative text classification using factorization model

    PubMed Central

    2015-01-01

    Narrative text is a useful way of identifying injury circumstances from routine emergency department data collections. Automatically classifying narratives with machine learning techniques is a promising approach that can reduce the tedious manual classification process. Existing work focuses on Naive Bayes, which does not always offer the best performance. This paper proposes Matrix Factorization approaches along with a learning enhancement process for this task. The results are compared with the performance of various other classification approaches. The impact of parameter settings on the classification results for a medical text dataset is discussed. With the right choice of dimension k, the Non-negative Matrix Factorization model achieves a 10-fold cross-validation accuracy of 0.93. PMID:26043671
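
    A small sketch of the factorization approach, assuming scikit-learn: project TF-IDF narrative vectors into a k-dimensional NMF topic space and classify there, reporting 10-fold cross-validation accuracy. The narratives, codes, and the choice of k are illustrative only.

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    narratives = ["fell from a ladder while cleaning gutters",
                  "cut finger with kitchen knife preparing dinner",
                  "slipped on wet floor at the supermarket",
                  "burned hand on hot stove while cooking"] * 30
    codes = ["fall", "cut", "fall", "burn"] * 30

    k = 5  # latent dimension; the record notes that choosing k well matters
    pipe = make_pipeline(TfidfVectorizer(),
                         NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0),
                         LogisticRegression(max_iter=1000))
    print("10-fold CV accuracy:", cross_val_score(pipe, narratives, codes, cv=10).mean())
    ```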

  8. Automatic lexical classification: bridging research and practice.

    PubMed

    Korhonen, Anna

    2010-08-13

    Natural language processing (NLP)--the automatic analysis, understanding and generation of human language by computers--is vitally dependent on accurate knowledge about words. Because words change their behaviour between text types, domains and sub-languages, a fully accurate static lexical resource (e.g. a dictionary, word classification) is unattainable. Researchers are now developing techniques that could be used to automatically acquire or update lexical resources from textual data. If successful, the automatic approach could considerably enhance the accuracy and portability of language technologies, such as machine translation, text mining and summarization. This paper reviews the recent and on-going research in automatic lexical acquisition. Focusing on lexical classification, it discusses the many challenges that still need to be met before the approach can benefit NLP on a large scale. PMID:20603372

  9. Autoclass: An automatic classification system

    NASA Technical Reports Server (NTRS)

    Stutz, John; Cheeseman, Peter; Hanson, Robin

    1991-01-01

    The task of inferring a set of classes and class descriptions most likely to explain a given data set can be placed on a firm theoretical foundation using Bayesian statistics. Within this framework, and using various mathematical and algorithmic approximations, the AutoClass System searches for the most probable classifications, automatically choosing the number of classes and complexity of class descriptions. A simpler version of AutoClass has been applied to many large real data sets, has discovered new independently-verified phenomena, and has been released as a robust software package. Recent extensions allow attributes to be selectively correlated within particular classes, and allow classes to inherit, or share, model parameters through a class hierarchy. The mathematical foundations of AutoClass are summarized.
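
    AutoClass itself searches over Bayesian mixture models; as a rough stand-in, the sketch below (assuming scikit-learn) fits Gaussian mixtures with different class counts and keeps the one with the lowest BIC, so the number of classes is chosen from the data rather than fixed in advance. The data are synthetic.

    ```python
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc, 0.3, size=(100, 2))
                   for loc in ([0, 0], [3, 0], [0, 3])])  # three hidden classes

    best_k, best_bic, best_model = None, np.inf, None
    for k in range(1, 8):
        gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
        bic = gmm.bic(X)
        if bic < best_bic:
            best_k, best_bic, best_model = k, bic, gmm

    print("chosen number of classes:", best_k)
    print("assignments of the first 5 points:", best_model.predict(X[:5]))
    ```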

  10. Automatic Syntactic Analysis of Free Text.

    ERIC Educational Resources Information Center

    Schwarz, Christoph

    1990-01-01

    Discusses problems encountered with the syntactic analysis of free text documents in indexing. Postcoordination and precoordination of terms are discussed, an automatic indexing system called COPSY (context operator syntax) that uses natural language processing techniques is described, and future developments are explained. (60 references) (LRW)

  11. Automatic Classification of Marine Mammals with Speaker Classification Methods.

    PubMed

    Kreimeyer, Roman; Ludwig, Stefan

    2016-01-01

    We present an automatic acoustic classifier for marine mammals based on human speaker classification methods as an element of a passive acoustic monitoring (PAM) tool. This work is part of the Protection of Marine Mammals (PoMM) project under the framework of the European Defense Agency (EDA) and joined by the Research Department for Underwater Acoustics and Geophysics (FWG), Bundeswehr Technical Centre (WTD 71) and Kiel University. The automatic classification should support sonar operators in the risk mitigation process before and during sonar exercises with a reliable automatic classification result. PMID:26611006

  12. Spatial Text Visualization Using Automatic Typographic Maps.

    PubMed

    Afzal, S; Maciejewski, R; Jang, Yun; Elmqvist, N; Ebert, D S

    2012-12-01

    We present a method for automatically building typographic maps that merge text and spatial data into a visual representation where text alone forms the graphical features. We further show how to use this approach to visualize spatial data such as traffic density, crime rate, or demographic data. The technique accepts a vector representation of a geographic map and spatializes the textual labels in the space onto polylines and polygons based on user-defined visual attributes and constraints. Our sample implementation runs as a Web service, spatializing shape files from the OpenStreetMap project into typographic maps for any region. PMID:26357164

  13. An Optimized NBC Approach in Text Classification

    NASA Astrophysics Data System (ADS)

    Yao, Zhao; Zhi-Min, Chen

    State-of-the-art text classification algorithms are good at categorizing Web documents into a few categories. However, such classification does not give the user very detailed topic-related class information, because the first two levels of large-scale text hierarchies are often too coarse. In this paper, we propose a method named DNB that, in our experiments, effectively improves classification performance.

  14. Automatic figure classification in bioscience literature.

    PubMed

    Kim, Daehyun; Ramesh, Balaji Polepalli; Yu, Hong

    2011-10-01

    Millions of figures appear in biomedical articles, and it is important to develop an intelligent figure search engine to return relevant figures based on user entries. In this study we report a figure classifier that automatically classifies biomedical figures into five predefined figure types: Gel-image, Image-of-thing, Graph, Model, and Mix. The classifier explored rich image features and integrated them with text features. We performed feature selection and explored different classification models, including a rule-based figure classifier, a supervised machine-learning classifier, and a multi-model classifier, the latter of which integrated the first two classifiers. Our results show that feature selection improved figure classification and the novel image features we explored were the best among image features that we have examined. Our results also show that integrating text and image features achieved better performance than using either of them individually. The best system is a multi-model classifier which combines the rule-based hierarchical classifier and a support vector machine (SVM) based classifier, achieving a 76.7% F1-score for five-type classification. We demonstrated our system at http://figureclassification.askhermes.org/. PMID:21645638

  15. Automatic discourse connective detection in biomedical text

    PubMed Central

    Polepalli Ramesh, Balaji; Prasad, Rashmi; Miller, Tim; Harrington, Brian

    2012-01-01

    Objective Relation extraction in biomedical text mining systems has largely focused on identifying clause-level relations, but increasing sophistication demands the recognition of relations at discourse level. A first step in identifying discourse relations involves the detection of discourse connectives: words or phrases used in text to express discourse relations. In this study supervised machine-learning approaches were developed and evaluated for automatically identifying discourse connectives in biomedical text. Materials and Methods Two supervised machine-learning models (support vector machines and conditional random fields) were explored for identifying discourse connectives in biomedical literature. In-domain supervised machine-learning classifiers were trained on the Biomedical Discourse Relation Bank, an annotated corpus of discourse relations over 24 full-text biomedical articles (∼112 000 word tokens), a subset of the GENIA corpus. Novel domain adaptation techniques were also explored to leverage the larger open-domain Penn Discourse Treebank (∼1 million word tokens). The models were evaluated using the standard evaluation metrics of precision, recall and F1 scores. Results and Conclusion Supervised machine-learning approaches can automatically identify discourse connectives in biomedical text, and the novel domain adaptation techniques yielded the best performance: 0.761 F1 score. A demonstration version of the fully implemented classifier BioConn is available at: http://bioconn.askhermes.org. PMID:22744958

  16. Experiments in Automatic Library of Congress Classification.

    ERIC Educational Resources Information Center

    Larson, Ray R.

    1992-01-01

    Presents the results of research into the automatic selection of Library of Congress Classification numbers based on the titles and subject headings in MARC records from a test database at the University of California at Berkeley Library School library. Classification clustering and matching techniques are described. (44 references) (LRW)

  17. Automatic Behavior Pattern Classification for Social Robots

    NASA Astrophysics Data System (ADS)

    Prieto, Abraham; Bellas, Francisco; Caamaño, Pilar; Duro, Richard J.

    In this paper, we focus our attention on providing robots with a system that allows them to automatically detect behavior patterns in other robots, as a first step to introducing socially responsive robots. The system is called ANPAC (Automatic Neural-based Pattern Classification). Its main feature is that ANPAC automatically adjusts the optimal processing window size and obtains the appropriate features through a dimensional transformation process that allows for the classification of behavioral patterns of large groups of entities from perception datasets. Here we present the basic elements and operation of ANPAC, and illustrate its applicability through the detection of behavior patterns in the motion of flocks.

  18. Semiconductor yield improvements through automatic defect classification

    SciTech Connect

    Gleason, S.; Kulkarni, A.

    1995-09-30

    Detection of defects during the fabrication of semiconductor wafers is largely automated, but the classification of those defects is still performed manually by technicians. Projections by semiconductor manufacturers predict that with larger wafer sizes and smaller line width technology the number of defects to be manually classified will increase exponentially. This cooperative research and development agreement (CRADA) between Martin Marietta Energy Systems (MMES) and KLA Instruments developed concepts, algorithms and systems to automate the classification of wafer defects to decrease inspection time, improve the reliability of defect classification, and hence increase process throughput and yield. Image analysis, feature extraction, pattern recognition and classification schemes were developed that are now being used as research tools for future products and are being integrated into the KLA line of wafer inspection hardware. An automatic defect classification software research tool was developed and delivered to the CRADA partner to facilitate continuation of this research beyond the end of the partnership.

  19. Towards Automatic Classification of Neurons

    PubMed Central

    Armañanzas, Rubén; Ascoli, Giorgio A.

    2015-01-01

    The classification of neurons into types has been much debated since the inception of modern neuroscience. Recent experimental advances are accelerating the pace of data collection. The resulting information growth of morphological, physiological, and molecular properties encourages efforts to automate neuronal classification by powerful machine learning techniques. We review state-of-the-art analysis approaches and availability of suitable data and resources, highlighting prominent challenges and opportunities. The effective solution of the neuronal classification problem will require continuous development of computational methods, high-throughput data production, and systematic metadata organization to enable cross-lab integration. PMID:25765323

  20. Multimodal Excitatory Interfaces with Automatic Content Classification

    NASA Astrophysics Data System (ADS)

    Williamson, John; Murray-Smith, Roderick

    We describe a non-visual interface for displaying data on mobile devices, based around active exploration: devices are shaken, revealing the contents rattling around inside. This combines sample-based contact sonification with event playback vibrotactile feedback for a rich and compelling display which produces an illusion much like balls rattling inside a box. Motion is sensed from accelerometers, directly linking the motions of the user to the feedback they receive in a tightly closed loop. The resulting interface requires no visual attention and can be operated blindly with a single hand: it is reactive rather than disruptive. This interaction style is applied to the display of an SMS inbox. We use language models to extract salient features from text messages automatically. The output of this classification process controls the timbre and physical dynamics of the simulated objects. The interface gives a rapid semantic overview of the contents of an inbox, without compromising privacy or interrupting the user.

  1. Feature selection for ordinal text classification.

    PubMed

    Baccianella, Stefano; Esuli, Andrea; Sebastiani, Fabrizio

    2014-03-01

    Ordinal classification (also known as ordinal regression) is a supervised learning task that consists of estimating the rating of a data item on a fixed, discrete rating scale. This problem is receiving increased attention from the sentiment analysis and opinion mining community due to the importance of automatically rating large amounts of product review data in digital form. As in other supervised learning tasks such as binary or multiclass classification, feature selection is often needed in order to improve efficiency and avoid overfitting. However, although feature selection has been extensively studied for other classification tasks, it has not for ordinal classification. In this letter, we present six novel feature selection methods that we have specifically devised for ordinal classification and test them on two data sets of product review data against three methods previously known from the literature, using two learning algorithms from the support vector regression tradition. The experimental results show that all six proposed metrics largely outperform all three baseline techniques (and are more stable than these others by an order of magnitude), on both data sets and for both learning algorithms. PMID:24320850
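
    A hedged sketch of the ordinal setting, assuming scikit-learn: treat the 1-5 rating as the target of a support vector regressor (the learning-algorithm family the letter builds on) and round predictions back onto the scale. The reviews and the plain chi-square feature selector are stand-ins, not the six proposed selection metrics.

    ```python
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.svm import SVR
    from sklearn.pipeline import make_pipeline

    reviews = ["absolutely terrible, broke after one day",
               "not great, would not buy again",
               "does the job, nothing special",
               "good value and works well",
               "outstanding product, exceeded expectations"] * 20
    ratings = [1, 2, 3, 4, 5] * 20

    pipe = make_pipeline(TfidfVectorizer(),
                         SelectKBest(chi2, k=20),  # plain chi-square stand-in selector
                         SVR(kernel="linear"))
    pipe.fit(reviews, ratings)
    raw = pipe.predict(["pretty good overall"])
    print("predicted rating:", int(np.clip(np.rint(raw), 1, 5)[0]))
    ```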

  2. Automatic classification of animal vocalizations

    NASA Astrophysics Data System (ADS)

    Clemins, Patrick J.

    2005-11-01

    Bioacoustics, the study of animal vocalizations, has begun to use increasingly sophisticated analysis techniques in recent years. Some common tasks in bioacoustics are repertoire determination, call detection, individual identification, stress detection, and behavior correlation. Each research study, however, uses a wide variety of different measured variables, called features, and classification systems to accomplish these tasks. The well-established field of human speech processing has developed a number of different techniques to perform many of the aforementioned bioacoustics tasks. Mel-frequency cepstral coefficients (MFCCs) and perceptual linear prediction (PLP) coefficients are two popular feature sets. The hidden Markov model (HMM), a statistical model similar to a finite automaton, is the most commonly used supervised classification model and is capable of modeling both temporal and spectral variations. This research designs a framework that applies models from human speech processing for bioacoustic analysis tasks. The development of the generalized perceptual linear prediction (gPLP) feature extraction model is one of the more important novel contributions of the framework. Perceptual information from the species under study can be incorporated into the gPLP feature extraction model to represent the vocalizations as the animals might perceive them. By including this perceptual information and modifying parameters of the HMM classification system, this framework can be applied to a wide range of species. The effectiveness of the framework is shown by analyzing African elephant and beluga whale vocalizations. The features extracted from the African elephant data are used as input to a supervised classification system and compared to results from traditional statistical tests. The gPLP features extracted from the beluga whale data are used in an unsupervised classification system and the results are compared to labels assigned by experts. The

  3. Presentation video retrieval using automatically recovered slide and spoken text

    NASA Astrophysics Data System (ADS)

    Cooper, Matthew

    2013-03-01

    Video is becoming a prevalent medium for e-learning. Lecture videos contain text information in both the presentation slides and lecturer's speech. This paper examines the relative utility of automatically recovered text from these sources for lecture video retrieval. To extract the visual information, we automatically detect slides within the videos and apply optical character recognition to obtain their text. Automatic speech recognition is used similarly to extract spoken text from the recorded audio. We perform controlled experiments with manually created ground truth for both the slide and spoken text from more than 60 hours of lecture video. We compare the automatically extracted slide and spoken text in terms of accuracy relative to ground truth, overlap with one another, and utility for video retrieval. Results reveal that automatically recovered slide text and spoken text contain different content with varying error profiles. Experiments demonstrate that automatically extracted slide text enables higher precision video retrieval than automatically recovered spoken text.

  4. Performance Measurement Framework for Hierarchical Text Classification.

    ERIC Educational Resources Information Center

    Sun, Aixin; Lim, Ee-Peng; Ng, Wee-Keong

    2003-01-01

    Discusses hierarchical text classification for electronic information retrieval and the measures used to evaluate performance. Proposes new performance measures that consist of category similarity measures and distance-based measures that consider the contributions of misclassified documents, and explains a blocking measure that identifies…

  5. Bayesian Automatic Classification Of HMI Images

    NASA Astrophysics Data System (ADS)

    Ulrich, R. K.; Beck, John G.

    2011-05-01

    The Bayesian automatic classification system known as "AutoClass" finds a set of class definitions based on a set of observed data and assigns data to classes without human supervision. It has been applied to Mt Wilson data to improve modeling of total solar irradiance variations (Ulrich, et al, 2010). We apply AutoClass to HMI observables to automatically identify regions of the solar surface. To prevent small instrument artifacts from interfering with class identification, we apply a flat-field correction and a rotationally shifted temporal average to the HMI images prior to processing with AutoClass. Additionally, the sensitivity of AutoClass to instrumental artifacts is investigated.

  6. Automatic classification of blank substrate defects

    NASA Astrophysics Data System (ADS)

    Boettiger, Tom; Buck, Peter; Paninjath, Sankaranarayanan; Pereira, Mark; Ronald, Rob; Rost, Dan; Samir, Bhamidipati

    2014-10-01

    Mask preparation stages are crucial in mask manufacturing, since the mask later acts as a template for a considerable number of dies on the wafer. Defects on the initial blank substrate, and on subsequent cleaned and coated substrates, can have a profound impact on the usability of the finished mask. This emphasizes the need for early and accurate identification of blank substrate defects and the risk they pose to the patterned reticle. While Automatic Defect Classification (ADC) is a well-developed technology for inspection and analysis of defects on patterned wafers and masks in the semiconductor industry, ADC for mask blanks is still in the early stages of adoption and development. Calibre ADC is a powerful analysis tool for fast, accurate, consistent and automatic classification of defects on mask blanks. Accurate, automated classification of mask blanks leads to better usability of blanks by enabling defect avoidance technologies during mask writing. Detailed information on blank defects can help to select appropriate job-decks to be written on the mask by defect avoidance tools [1][4][5]. Smart algorithms separate critical defects from the potentially large number of non-critical defects or false defects detected at various stages during mask blank preparation. Mechanisms used by Calibre ADC to identify and characterize defects include defect location and size, signal polarity (dark, bright) in both transmitted and reflected review images, and distinguishing defect signals from background noise in defect images. The Calibre ADC engine then uses a decision tree to translate this information into a defect classification code. Using this automated process improves classification accuracy, repeatability and speed, while avoiding the subjectivity of human judgment compared to the alternative of manual defect classification by trained personnel [2]. This paper focuses on the results from the evaluation of the Automatic Defect Classification (ADC) product at MP Mask

  7. Automatic morphological classification of galaxy images

    PubMed Central

    Shamir, Lior

    2009-01-01

    We describe an image analysis supervised learning algorithm that can automatically classify galaxy images. The algorithm is first trained using manually classified images of elliptical, spiral, and edge-on galaxies. A large set of image features is extracted from each image, and the most informative features are selected using Fisher scores. Test images can then be classified using a simple Weighted Nearest Neighbor rule such that the Fisher scores are used as the feature weights. Experimental results show that galaxy images from Galaxy Zoo can be classified automatically into spiral, elliptical and edge-on galaxies with an accuracy of ~90% compared to classifications carried out by the author. Full compilable source code of the algorithm is available for free download, and its general-purpose nature makes it suitable for other uses that involve automatic image analysis of celestial objects. PMID:20161594
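
    A minimal NumPy sketch of the core recipe: score features with a Fisher discriminant ratio and use the scores as weights in a nearest-neighbour rule. The features here are random stand-ins for real image descriptors, and the scoring formula is the standard Fisher score rather than the paper's exact implementation.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    n_per_class, n_features = 60, 20
    # Synthetic "spiral" vs "elliptical" feature vectors; only the first five
    # features actually carry class information.
    X_spiral = rng.normal(0.0, 1.0, (n_per_class, n_features))
    X_spiral[:, :5] += 2.0
    X_ellip = rng.normal(0.0, 1.0, (n_per_class, n_features))
    X = np.vstack([X_spiral, X_ellip])
    y = np.array([0] * n_per_class + [1] * n_per_class)

    def fisher_scores(X, y):
        """Per-feature Fisher ratio: between-class over within-class variance."""
        classes = np.unique(y)
        mean_all = X.mean(axis=0)
        num = sum(len(X[y == c]) * (X[y == c].mean(axis=0) - mean_all) ** 2
                  for c in classes)
        den = sum(len(X[y == c]) * X[y == c].var(axis=0) for c in classes)
        return num / (den + 1e-12)

    weights = fisher_scores(X, y)

    def weighted_nn_predict(x, X_train, y_train, w):
        """1-nearest-neighbour with Fisher scores as per-feature distance weights."""
        dists = np.sqrt((((X_train - x) ** 2) * w).sum(axis=1))
        return y_train[np.argmin(dists)]

    query = rng.normal(0.0, 1.0, n_features)
    query[:5] += 2.0
    print("predicted class:", weighted_nn_predict(query, X, y, weights))  # expect 0
    ```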

  8. Semi-automatic approach for music classification

    NASA Astrophysics Data System (ADS)

    Zhang, Tong

    2003-11-01

    Audio categorization is essential when managing a music database, either a professional library or a personal collection. However, a complete automation in categorizing music into proper classes for browsing and searching is not yet supported by today's technology. Also, the issue of music classification is subjective to some extent as each user may have his own criteria for categorizing music. In this paper, we propose the idea of semi-automatic music classification. With this approach, a music browsing system is set up which contains a set of tools for separating music into a number of broad types (e.g. male solo, female solo, string instruments performance, etc.) using existing music analysis methods. With results of the automatic process, the user may further cluster music pieces in the database into finer classes and/or adjust misclassifications manually according to his own preferences and definitions. Such a system may greatly improve the efficiency of music browsing and retrieval, while at the same time guarantee accuracy and user's satisfaction of the results. Since this semi-automatic system has two parts, i.e. the automatic part and the manual part, they are described separately in the paper, with detailed descriptions and examples of each step of the two parts included.

  9. Improved VSM for Incremental Text Classification

    NASA Astrophysics Data System (ADS)

    Yang, Zhen; Lei, Jianjun; Wang, Jian; Zhang, Xing; Guo, Jim

    2008-11-01

    As a simple classification method, VSM has been widely applied in the text information processing field. Traditional VSM has difficulty selecting a refined vector model representation that makes a good tradeoff between complexity and performance, especially for incremental text mining. To solve these problems, in this paper several improvements, such as VSM based on improved TF, TFIDF and BM25, are discussed. Then maximum mutual information feature selection is introduced to achieve a low-dimension VSM with less complexity while keeping an acceptable precision. The experimental results on spam filtering and short message classification show that the algorithm can achieve higher precision than existing algorithms under the same conditions.
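
    A small sketch of the general setup, assuming scikit-learn: a TF-IDF vector space model with mutual-information feature selection to keep the dimension low before classification. The messages, labels, and the choice of k are invented for illustration.

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    messages = ["win a free prize now, click the link",
                "meeting moved to three pm tomorrow",
                "urgent offer, claim your cash reward today",
                "can you send the report before lunch"] * 25
    labels = ["spam", "ham", "spam", "ham"] * 25

    pipe = make_pipeline(TfidfVectorizer(),
                         SelectKBest(mutual_info_classif, k=15),  # low-dimension VSM
                         LogisticRegression(max_iter=1000))
    print("5-fold accuracy:", cross_val_score(pipe, messages, labels, cv=5).mean())
    ```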

  10. Stemming Malay Text and Its Application in Automatic Text Categorization

    NASA Astrophysics Data System (ADS)

    Yasukawa, Michiko; Lim, Hui Tian; Yokoo, Hidetoshi

    In the Malay language, there are no conjugations or declensions, and affixes have important grammatical functions. In Malay, the same word may function as a noun, an adjective, an adverb, or a verb, depending on its position in the sentence. Although simple root words are used extensively in informal conversations, it is essential to use the precise words in formal speech or written texts. In Malay, to make sentences clear, derivative words are used. Derivation is achieved mainly by the use of affixes. There are approximately a hundred possible derivative forms of a root word in the written language of educated Malay speakers. Therefore, the composition of Malay words may be complicated. Although there are several types of stemming algorithms available for text processing in English and some other languages, they cannot be used to overcome the difficulties in Malay word stemming. Stemming is the process of reducing various words to their root forms in order to improve the effectiveness of text processing in information systems. It is essential to avoid both over-stemming and under-stemming errors. We have developed a new Malay stemmer (stemming algorithm) for removing inflectional and derivational affixes. Our stemmer uses a set of affix rules and two types of dictionaries: a root-word dictionary and a derivative-word dictionary. The use of the affix rules is aimed at reducing the occurrence of under-stemming errors, while that of the dictionaries is believed to reduce the occurrence of over-stemming errors. We performed an experiment to evaluate the application of our stemmer in text mining software. For the experiment, the text data used were actual web pages collected from the World Wide Web to demonstrate the effectiveness of our Malay stemming algorithm. The experimental results showed that our stemmer can effectively increase the precision of the extracted Boolean expressions for text categorization.
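
    A toy sketch of the general mechanism only: strip candidate affixes and accept a result when it appears in a root-word dictionary, which guards against over-stemming. The affix lists and dictionary below are tiny illustrative samples, not the stemmer's actual rules or lexicons.

    ```python
    PREFIXES = ["meng", "mem", "men", "me", "ber", "di", "pe"]
    SUFFIXES = ["kan", "an", "i", "nya"]
    ROOT_WORDS = {"ajar", "baca", "main", "tulis"}  # teach, read, play, write

    def stem(word):
        """Return the first affix-stripped candidate found in the root dictionary."""
        candidates = [word]
        for prefix in PREFIXES:
            if word.startswith(prefix):
                candidates.append(word[len(prefix):])
        for cand in list(candidates):          # also try suffix removal
            for suffix in SUFFIXES:
                if cand.endswith(suffix):
                    candidates.append(cand[:-len(suffix)])
        for cand in candidates:
            if cand in ROOT_WORDS:             # dictionary check limits over-stemming
                return cand
        return word                            # no confirmed root: leave word unchanged

    for w in ["mengajar", "membaca", "bermain", "tulisan"]:
        print(w, "->", stem(w))
    ```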

  11. Learning regular expressions for clinical text classification

    PubMed Central

    Bui, Duy Duc An; Zeng-Treitler, Qing

    2014-01-01

    Objectives Natural language processing (NLP) applications typically use regular expressions that have been developed manually by human experts. Our goal is to automate both the creation and utilization of regular expressions in text classification. Methods We designed a novel regular expression discovery (RED) algorithm and implemented two text classifiers based on RED. The RED+ALIGN classifier combines RED with an alignment algorithm, and RED+SVM combines RED with a support vector machine (SVM) classifier. Two clinical datasets were used for testing and evaluation: the SMOKE dataset, containing 1091 text snippets describing smoking status; and the PAIN dataset, containing 702 snippets describing pain status. We performed 10-fold cross-validation to calculate accuracy, precision, recall, and F-measure metrics. In the evaluation, an SVM classifier was trained as the control. Results The two RED classifiers achieved 80.9–83.0% in overall accuracy on the two datasets, which is 1.3–3% higher than SVM's accuracy (p<0.001). Similarly, small but consistent improvements have been observed in precision, recall, and F-measure when RED classifiers are compared with SVM alone. More significantly, RED+ALIGN correctly classified many instances that were misclassified by the SVM classifier (8.1–10.3% of the total instances and 43.8–53.0% of SVM's misclassifications). Conclusions Machine-generated regular expressions can be effectively used in clinical text classification. The regular expression-based classifier can be combined with other classifiers, like SVM, to improve classification performance. PMID:24578357
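
    A hedged sketch of how regular expressions drive such a classifier: each pattern maps matching snippets to a class label. The patterns below are hand-written for illustration; in the paper they would instead be learned automatically by the RED algorithm.

    ```python
    import re

    RULES = [
        (re.compile(r"\b(never|non)[- ]?smoker\b", re.I), "non-smoker"),
        (re.compile(r"\bquit smoking\b|\bformer smoker\b", re.I), "former smoker"),
        (re.compile(r"\bsmokes?\b.*\b(daily|per day|pack)\b", re.I), "current smoker"),
    ]

    def classify(snippet):
        """Return the label of the first matching pattern, or 'unknown'."""
        for pattern, label in RULES:
            if pattern.search(snippet):
                return label
        return "unknown"

    for s in ["Patient is a never smoker.",
              "She quit smoking five years ago.",
              "He smokes one pack per day.",
              "Denies alcohol use."]:
        print(classify(s), "<-", s)
    ```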

  12. Towards automatic classification of all WISE sources

    NASA Astrophysics Data System (ADS)

    Kurcz, A.; Bilicki, M.; Solarz, A.; Krupa, M.; Pollo, A.; Małek, K.

    2016-07-01

    Context. The Wide-field Infrared Survey Explorer (WISE) has detected hundreds of millions of sources over the entire sky. Classifying them reliably is, however, a challenging task owing to degeneracies in WISE multicolour space and low levels of detection in its two longest-wavelength bandpasses. Simple colour cuts are often not sufficient; for satisfactory levels of completeness and purity, more sophisticated classification methods are needed. Aims: Here we aim to obtain comprehensive and reliable star, galaxy, and quasar catalogues based on automatic source classification in full-sky WISE data. This means that the final classification will employ only parameters available from WISE itself, in particular those which are reliably measured for the majority of sources. Methods: For the automatic classification we applied a supervised machine learning algorithm, support vector machines (SVM). It requires a training sample with relevant classes already identified, and we chose to use the SDSS spectroscopic dataset (DR10) for that purpose. We tested the performance of two kernels used by the classifier, and determined the minimum number of sources in the training set required to achieve stable classification, as well as the minimum dimension of the parameter space. We also tested SVM classification accuracy as a function of extinction and apparent magnitude. Thus, the calibrated classifier was finally applied to all-sky WISE data, flux-limited to 16 mag (Vega) in the 3.4 μm channel. Results: By calibrating on the test data drawn from SDSS, we first established that a polynomial kernel is preferred over a radial one for this particular dataset. Next, using three classification parameters (W1 magnitude, W1-W2 colour, and a differential aperture magnitude) we obtained very good classification efficiency in all the tests. At the bright end, the completeness for stars and galaxies reaches ~95%, deteriorating to ~80% at W1 = 16 mag, while for quasars it stays at a level of
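
    A minimal sketch of the classification step, assuming scikit-learn: a polynomial-kernel SVM trained on three photometric parameters (W1 magnitude, W1-W2 colour, and an aperture-difference measure). The synthetic training clouds below merely mimic the setup; they are not SDSS or WISE data.

    ```python
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    n = 200
    # Crude synthetic clouds in (W1, W1-W2, aperture difference) space.
    stars = np.column_stack([rng.normal(12.0, 1.0, n),
                             rng.normal(0.0, 0.1, n),
                             rng.normal(0.0, 0.05, n)])
    galaxies = np.column_stack([rng.normal(15.0, 1.0, n),
                                rng.normal(0.3, 0.1, n),
                                rng.normal(0.4, 0.05, n)])
    quasars = np.column_stack([rng.normal(15.0, 1.0, n),
                               rng.normal(1.0, 0.2, n),
                               rng.normal(0.0, 0.05, n)])
    X = np.vstack([stars, galaxies, quasars])
    y = np.array(["star"] * n + ["galaxy"] * n + ["quasar"] * n)

    clf = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3))
    clf.fit(X, y)
    print(clf.predict([[12.1, 0.05, 0.02], [15.2, 1.1, 0.0]]))  # expect star, quasar
    ```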

  13. Designing a Knowledge Base for Automatic Book Classification.

    ERIC Educational Resources Information Center

    Kim, Jeong-Hyen; Lee, Kyung-Ho

    2002-01-01

    Reports on the design of a knowledge base for an automatic classification in the library science field by using the facet classification principles of colon classification. Discusses inputting titles or key words into the computer to create class numbers through automatic subject recognition and processing title key words. (Author/LRW)

  14. Automatic analysis and classification of surface electromyography.

    PubMed

    Abou-Chadi, F E; Nashar, A; Saad, M

    2001-01-01

    In this paper, parametric modeling of surface electromyography (SEMG) signals, which facilitates automatic feature extraction, is combined with artificial neural networks (ANNs) to provide an integrated system for the automatic analysis and diagnosis of myopathic disorders. Three paradigms of ANN were investigated: the multilayer backpropagation algorithm, the self-organizing feature map algorithm and a probabilistic neural network model. The performance of the three classifiers was compared with that of the classical Fisher linear discriminant (FLD) classifier. The results have shown that the three ANN models give higher performance. The percentage of correct classification reaches 90%. Poorer diagnostic performance was obtained from the FLD classifier. The system presented here indicates that surface EMG, when properly processed, can be used to provide the physician with a diagnostic assist device. PMID:11556501

  15. Automatically generating extraction patterns from untagged text

    SciTech Connect

    Riloff, E.

    1996-12-31

    Many corpus-based natural language processing systems rely on text corpora that have been manually annotated with syntactic or semantic tags. In particular, all previous dictionary construction systems for information extraction have used an annotated training corpus or some form of annotated input. We have developed a system called AutoSlog-TS that creates dictionaries of extraction patterns using only untagged text. AutoSlog-TS is based on the AutoSlog system, which generated extraction patterns using annotated text and a set of heuristic rules. By adapting AutoSlog and combining it with statistical techniques, we eliminated its dependency on tagged text. In experiments with the MUC-4 terrorism domain, AutoSlog-TS created a dictionary of extraction patterns that performed comparably to a dictionary created by AutoSlog, using only preclassified texts as input.

  16. Automatic extraction of angiogenesis bioprocess from text

    PubMed Central

    Wang, Xinglong; McKendrick, Iain; Barrett, Ian; Dix, Ian; French, Tim; Tsujii, Jun'ichi; Ananiadou, Sophia

    2011-01-01

    Motivation: Understanding key biological processes (bioprocesses) and their relationships with constituent biological entities and pharmaceutical agents is crucial for drug design and discovery. One way to harvest such information is searching the literature. However, bioprocesses are difficult to capture because they may occur in text in a variety of textual expressions. Moreover, a bioprocess is often composed of a series of bioevents, where a bioevent denotes changes to one or a group of cells involved in the bioprocess. Such bioevents are often used to refer to bioprocesses in text, which current techniques, relying solely on specialized lexicons, struggle to find. Results: This article presents a range of methods for finding bioprocess terms and events. To facilitate the study, we built a gold standard corpus in which terms and events related to angiogenesis, a key biological process of the growth of new blood vessels, were annotated. Statistics of the annotated corpus revealed that over 36% of the text expressions that referred to angiogenesis appeared as events. The proposed methods respectively employed domain-specific vocabularies, a manually annotated corpus and unstructured domain-specific documents. Evaluation results showed that, while a supervised machine-learning model yielded the best precision, recall and F1 scores, the other methods achieved reasonable performance and less cost to develop. Availability: The angiogenesis vocabularies, gold standard corpus, annotation guidelines and software described in this article are available at http://text0.mib.man.ac.uk/~mbassxw2/angiogenesis/ Contact: xinglong.wang@gmail.com PMID:21821664

  17. Automatic image classification for the urinoculture screening.

    PubMed

    Andreini, Paolo; Bonechi, Simone; Bianchini, Monica; Garzelli, Andrea; Mecocci, Alessandro

    2016-03-01

    Urinary tract infections (UTIs) are considered to be the most common bacterial infection; it is estimated that about 150 million UTIs occur worldwide yearly, giving rise to roughly $6 billion in healthcare expenditures and resulting in 100,000 hospitalizations. Nevertheless, it is difficult to carefully assess the incidence of UTIs, since an accurate diagnosis depends both on the presence of symptoms and on a positive urinoculture, whereas in most outpatient settings this diagnosis is made without an ad hoc analysis protocol. On the other hand, in the traditional urinoculture test, a sample of midstream urine is put onto a Petri dish, where a growth medium favors the proliferation of germ colonies. Then, the infection severity is evaluated by a visual inspection of a human expert, an error-prone and lengthy process. In this paper, we propose a fully automated system for the urinoculture screening that can provide quick and easily traceable results for UTIs. Based on advanced image processing and machine learning tools, the infection type recognition, together with the estimation of the bacterial load, can be automatically carried out, yielding accurate diagnoses. The proposed AID (Automatic Infection Detector) system provides support during the whole analysis process: first, digital color images of Petri dishes are automatically captured, then specific preprocessing and spatial clustering algorithms are applied to isolate the colonies from the culture ground and, finally, an accurate classification of the infections and their severity evaluation are performed. The AID system speeds up the analysis, contributes to the standardization of the process, allows result repeatability, and reduces the costs. Moreover, the continuous transition between sterile and external environments (typical of the standard analysis procedure) is completely avoided. PMID:26780249

  18. Automatic Approach to VHR Satellite Image Classification

    NASA Astrophysics Data System (ADS)

    Kupidura, P.; Osińska-Skotak, K.; Pluto-Kossakowska, J.

    2016-06-01

    In this paper, we present a proposal for fully automatic classification of VHR satellite images. Unlike the most widespread approaches: supervised classification, which requires prior defining of class signatures, or unsupervised classification, which must be followed by an interpretation of its results, the proposed method requires no human intervention except for the setting of the initial parameters. The presented approach is based on both spectral and textural analysis of the image and consists of 3 steps. The first step, the analysis of spectral data, relies on NDVI values. Its purpose is to distinguish between basic classes, such as water, vegetation and non-vegetation, which all differ significantly spectrally, so they can be easily extracted by spectral analysis. The second step relies on granulometric maps. These are the product of local granulometric analysis of an image and present information on the texture of each pixel neighbourhood, depending on the texture grain. The purpose of texture analysis is to distinguish between different classes that are spectrally similar but of different texture, e.g. bare soil from a built-up area, or low vegetation from a wooded area. Due to the use of granulometric analysis, based on mathematical morphology opening and closing, the results are resistant to the border effect (qualifying borders of objects in an image as spaces of high texture), which affects other methods of texture analysis like GLCM statistics or fractal analysis. Therefore, the effectiveness of the analysis is relatively high. Several indices based on values of different granulometric maps have been developed to simplify the extraction of classes of different texture. The third and final step of the process relies on a vegetation index based on near infrared and blue bands. Its purpose is to correct partially misclassified pixels. All the indices used in the classification model developed relate to reflectance values, so the preliminary step
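
    A small NumPy sketch of the first step only: compute NDVI from red and near-infrared reflectance and threshold it into rough water / non-vegetation / vegetation masks. The threshold values are illustrative assumptions, not those used in the paper.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    red = rng.uniform(0.02, 0.5, size=(4, 4))  # toy red reflectance band
    nir = rng.uniform(0.02, 0.7, size=(4, 4))  # toy near-infrared band

    ndvi = (nir - red) / (nir + red + 1e-12)   # NDVI = (NIR - Red) / (NIR + Red)

    classes = np.full(ndvi.shape, "non-veg", dtype=object)
    classes[ndvi < 0.0] = "water"       # strongly negative NDVI: open water
    classes[ndvi > 0.4] = "vegetation"  # high NDVI: vegetation
    print(np.round(ndvi, 2))
    print(classes)
    ```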

  19. Profiling School Shooters: Automatic Text-Based Analysis

    PubMed Central

    Neuman, Yair; Assaf, Dan; Cohen, Yochai; Knoll, James L.

    2015-01-01

    School shooters present a challenge to both forensic psychiatry and law enforcement agencies. The relatively small number of school shooters, their various characteristics, and the lack of in-depth analysis of all of the shooters prior to the shooting add complexity to our understanding of this problem. In this short paper, we introduce a new methodology for automatically profiling school shooters. The methodology involves automatic analysis of texts and the production of several measures relevant for the identification of the shooters. Comparing texts written by 6 school shooters to 6056 texts written by a comparison group of male subjects, we found that the shooters’ texts scored significantly higher on the Narcissistic Personality dimension as well as on the Humiliated and Revengeful dimensions. Using a ranking/prioritization procedure, similar to the one used for the automatic identification of sexual predators, we provide support for the validity and relevance of the proposed methodology. PMID:26089804

  20. Profiling School Shooters: Automatic Text-Based Analysis.

    PubMed

    Neuman, Yair; Assaf, Dan; Cohen, Yochai; Knoll, James L

    2015-01-01

    School shooters present a challenge to both forensic psychiatry and law enforcement agencies. The relatively small number of school shooters, their various characteristics, and the lack of in-depth analysis of all of the shooters prior to the shooting add complexity to our understanding of this problem. In this short paper, we introduce a new methodology for automatically profiling school shooters. The methodology involves automatic analysis of texts and the production of several measures relevant for the identification of the shooters. Comparing texts written by 6 school shooters to 6056 texts written by a comparison group of male subjects, we found that the shooters' texts scored significantly higher on the Narcissistic Personality dimension as well as on the Humiliated and Revengeful dimensions. Using a ranking/prioritization procedure, similar to the one used for the automatic identification of sexual predators, we provide support for the validity and relevance of the proposed methodology. PMID:26089804

  1. Measures of voiced frication for automatic classification

    NASA Astrophysics Data System (ADS)

    Jackson, Philip J. B.; Jesus, Luis M. T.; Shadle, Christine H.; Pincas, Jonathan

    2001-05-01

    As an approach to understanding the characteristics of the acoustic sources in voiced fricatives, it seems apt to draw on knowledge of vowels and voiceless fricatives, which have been relatively well studied. However, the presence of both phonation and frication in these mixed-source sounds offers the possibility of mutual interaction effects, with variations across place of articulation. This paper examines the acoustic and articulatory consequences of these interactions and explores automatic techniques for finding parametric and statistical descriptions of these phenomena. A reliable and consistent set of such acoustic cues could be used for phonetic classification or speech recognition. Following work on devoicing of European Portuguese voiced fricatives [Jesus and Shadle, in Mamede et al. (eds.) (Springer-Verlag, Berlin, 2003), pp. 1-8]. and the modulating effect of voicing on frication [Jackson and Shadle, J. Acoust. Soc. Am. 108, 1421-1434 (2000)], the present study focuses on three types of information: (i) sequences and durations of acoustic events in VC transitions, (ii) temporal, spectral and modulation measures from the periodic and aperiodic components of the acoustic signal, and (iii) voicing activity derived from simultaneous EGG data. Analysis of interactions observed in British/American English and European Portuguese speech corpora will be compared, and the principal findings discussed.

  2. Multi-sensor text classification experiments -- a comparison

    SciTech Connect

    Dasigi, V.R.; Mann, R.C.; Protopopescu, V.

    1997-01-01

    In this paper, the authors report recent results on automatic classification of free text documents into a given number of categories. The method uses multiple sensors to derive informative clues about patterns of interest in the input text, and fuses this information using a neural network. Encouraging preliminary results were obtained by applying this approach to a set of free text documents from the Associated Press (AP) news wire. New free text documents have been made available by the Reuters news agency. The advantages of this collection compared to the AP data are that the Reuters stories were already manually classified, and included sufficiently high numbers of stories per category. The results indicate the usefulness of the new method: after the network is fully trained, if data belonging to only one category are used for testing, correctness is about 90%, representing nearly 15% over the best results for the AP data. Based on the performance of the method with the AP and the Reuters collections, the authors now have conclusive evidence that the approach is viable and practical. More work remains to be done for handling data belonging to multiple categories.

  3. Automatic inpainting scheme for video text detection and removal.

    PubMed

    Mosleh, Ali; Bouguila, Nizar; Ben Hamza, Abdessamad

    2013-11-01

    We present a two-stage framework for automatic video text removal, to detect and remove embedded video texts and fill in their remaining regions with appropriate data. In the video text detection stage, text locations in each frame are found via an unsupervised clustering performed on the connected components produced by the stroke width transform (SWT). Since SWT needs an accurate edge map, we develop a novel edge detector which benefits from the geometric features revealed by the bandlet transform. Next, the motion patterns of the text objects of each frame are analyzed to localize video texts. The detected video text regions are removed, then the video is restored by an inpainting scheme. The proposed video inpainting approach applies spatio-temporal geometric flows extracted by bandlets to reconstruct the missing data. A 3D volume regularization algorithm, which takes advantage of bandlet bases in exploiting the anisotropic regularities, is introduced to carry out the inpainting task. The method does not need extra processes to satisfy visual consistency. The experimental results demonstrate the effectiveness of both our proposed video text detection approach and the video completion technique, and consequently the entire automatic video text removal and restoration process. PMID:24057006

  4. Super pixel density based clustering automatic image classification method

    NASA Astrophysics Data System (ADS)

    Xu, Mingxing; Zhang, Chuan; Zhang, Tianxu

    2015-12-01

    Image classification is an important means of image segmentation and data mining, and achieving rapid automated image classification has been a focus of research. In this paper, a clustering algorithm based on the density of super-pixel cluster centers is applied to automatic image classification and outlier identification. Pixel location coordinates and gray values are used to compute density and distance, enabling automatic image classification and outlier extraction. Because the large number of pixels dramatically increases the computational complexity, the image is preprocessed into a small number of super-pixel sub-blocks before the density and distance calculations, and a normalized density-and-distance discrimination rule is designed to select cluster centers automatically, whereby the image is classified and outliers are identified. Extensive experiments show that our method requires no human intervention, categorizes images faster than the baseline density clustering algorithm, and performs effective automated classification and outlier extraction.
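
    A hedged NumPy sketch of the underlying density/distance idea on plain 2-D points: each point gets a local density and the distance to its nearest denser neighbour; points where both are large become cluster centres and the rest inherit the label of that neighbour. Super-pixel preprocessing, the normalized discrimination rule, and outlier handling from the paper are omitted.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.2, size=(50, 2)) for c in ([0, 0], [2, 2], [0, 2])])

    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    dc = 0.4                                                      # cutoff distance
    rho = np.exp(-(dist / dc) ** 2).sum(axis=1) - 1.0             # local density

    delta = np.zeros(len(X))              # distance to the nearest denser point
    nearest_denser = np.full(len(X), -1)
    for i in range(len(X)):
        denser = np.where(rho > rho[i])[0]
        if len(denser) == 0:              # globally densest point
            delta[i] = dist[i].max()
        else:
            j = denser[np.argmin(dist[i, denser])]
            delta[i], nearest_denser[i] = dist[i, j], j

    centers = np.argsort(rho * delta)[-3:]     # the 3 strongest density peaks
    labels = np.full(len(X), -1)
    labels[centers] = np.arange(len(centers))
    for i in np.argsort(-rho):                 # assign by decreasing density
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    print("cluster sizes:", np.bincount(labels))
    ```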

  5. Text Classification Using ESC-Based Stochastic Decision Lists.

    ERIC Educational Resources Information Center

    Li, Hang; Yamanishi, Kenji

    2002-01-01

    Proposes a new method of text classification using stochastic decision lists, ordered sequences of IF-THEN-ELSE rules. The method can be viewed as a rule-based method for text classification having advantages of readability and refinability of acquired knowledge. Advantages of rule-based methods over non-rule-based ones are empirically verified.…

  6. Document Exploration and Automatic Knowledge Extraction for Unstructured Biomedical Text

    NASA Astrophysics Data System (ADS)

    Chu, S.; Totaro, G.; Doshi, N.; Thapar, S.; Mattmann, C. A.; Ramirez, P.

    2015-12-01

    We describe our work on building a web-browser based document reader with a built-in exploration tool and automatic concept extraction of medical entities for biomedical text. Vast amounts of biomedical information are offered in unstructured text form through scientific publications and R&D reports. Text mining can help us mine this information and extract relevant knowledge from a plethora of biomedical text. The ability to employ such technologies to aid researchers in coping with information overload is greatly desirable. In recent years, there has been increased interest in automatic biomedical concept extraction [1, 2] and intelligent PDF reader tools with the ability to search on content and find related articles [3]. Such reader tools are typically desktop applications and are limited to specific platforms. Our goal is to provide researchers with a simple tool to aid them in finding, reading, and exploring documents. Thus, we propose a web-based document explorer, which we call Shangri-Docs, which combines a document reader with automatic concept extraction and highlighting of relevant terms. Shangri-Docs also provides the ability to evaluate a wide variety of document formats (e.g. PDF, Word, PPT, text, etc.) and to exploit the linked nature of the Web and personal content by performing searches on content from public sites (e.g. Wikipedia, PubMed) and private cataloged databases simultaneously. Shangri-Docs utilizes Apache cTAKES (clinical Text Analysis and Knowledge Extraction System) [4] and the Unified Medical Language System (UMLS) to automatically identify and highlight terms and concepts, such as specific symptoms, diseases, drugs, and anatomical sites, mentioned in the text. cTAKES was originally designed specifically to extract information from clinical medical records. Our investigation led us to extend the automatic knowledge extraction process of cTAKES to the biomedical research domain by improving the ontology-guided information extraction

  7. A scheme for automatic text rectification in real scene images

    NASA Astrophysics Data System (ADS)

    Wang, Baokang; Liu, Changsong; Ding, Xiaoqing

    2015-03-01

    Digital cameras are gradually replacing traditional flat-bed scanners as the main means of obtaining text information, owing to their usability, low cost and high resolution, and a large amount of research has been done on camera-based text understanding. Unfortunately, an arbitrary position of the camera lens relative to the text area can frequently cause perspective distortion, which most current OCR systems cannot manage, thus creating demand for automatic text rectification. Current rectification-related research has mainly focused on document images; distortion of natural scene text is seldom considered. In this paper, a scheme for automatic text rectification in natural scene images is proposed. It relies on geometric information extracted from the characters themselves as well as from their surroundings. In the first step, linear segments are extracted from the region of interest, and J-Linkage based clustering is performed, followed by customized refinement, to estimate the primary vanishing points (VPs). To achieve a more comprehensive VP estimation, a second stage is performed by inspecting the internal structure of characters, which involves analysis of pixels and connected components of text lines. Finally, the VPs are verified and used to implement perspective rectification. Experiments demonstrate an increase in recognition rate and an improvement over some related algorithms.

  8. Towards the automatic classification of neurons.

    PubMed

    Armañanzas, Rubén; Ascoli, Giorgio A

    2015-05-01

    The classification of neurons into types has been much debated since the inception of modern neuroscience. Recent experimental advances are accelerating the pace of data collection. The resulting growth of information about morphological, physiological, and molecular properties encourages efforts to automate neuronal classification by powerful machine learning techniques. We review state-of-the-art analysis approaches and the availability of suitable data and resources, highlighting prominent challenges and opportunities. The effective solution of the neuronal classification problem will require continuous development of computational methods, high-throughput data production, and systematic metadata organization to enable cross-laboratory integration. PMID:25765323

  9. Automatic extraction of linguistic knowledge from an international classification.

    PubMed

    Baud, R; Lovis, C; Rassinoux, A M; Michel, P A; Scherrer, J R

    1998-01-01

    Automatic extraction of knowledge from large corpora of text is an essential step toward linguistic knowledge acquisition in the medical domain. The current situation shows a lack of large computer-readable medical lexicons, with a partial exception for the English language. Moreover, multilingual lexicons versatile enough for multi-language applications are out of reach as long as only manual extraction is considered. Computer-assisted linguistic knowledge acquisition is a must. A multilingual lexicon differs from a monolingual one in the necessity to bridge the words in different languages. A kind of interlingua has to be built in the form of concepts to which the language-specific entries are attached. In the present approach, the authors have developed an intelligent rule-based tool in order to focus on a multilingual source of medical knowledge, such as the International Classification of Diseases (ICD), which contains a vocabulary of some 20,000 words translated into numerous languages. PMID:10384521

  10. Automatic stellar spectral classification via sparse representations and dictionary learning

    NASA Astrophysics Data System (ADS)

    Díaz-Hernández, R.; Peregrina-Barreto, H.; Altamirano-Robles, L.; González-Bernal, J. A.; Ortiz-Esquivel, A. E.

    2014-11-01

    Stellar classification is an important topic in astronomical tasks such as the study of stellar populations. However, stellar classification of a region of the sky is a time-consuming process due to the large number of objects present in an image. Therefore, automatic techniques to speed up the process are required. In this work, we study the application of sparse representation and dictionary learning to automatic spectral stellar classification. Our dataset consists of 529 calibrated stellar spectra of classes B to K, belonging to the Pulkovo Spectrophotometric Catalog, in the 3400-5500 Å range. These stellar spectra are used for both training and testing of the proposed methodology. The sparse technique is applied by using the greedy Orthogonal Matching Pursuit (OMP) algorithm to find an approximate solution, and K-SVD (K-Singular Value Decomposition) for the dictionary learning step. Thus, sparse classification is based on the recognition of the common characteristics of a particular stellar type through the construction of a trained basis. In this work, we propose a classification criterion that evaluates the results of the sparse representation techniques and determines the final classification of the spectra. This methodology demonstrates its ability to achieve levels of classification comparable with previously reported automatic methodologies such as the Maximum Correlation Coefficient (MCC) and Artificial Neural Networks (ANN).
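
    A minimal sketch of residual-based sparse classification in the spirit described above: a test spectrum is coded with Orthogonal Matching Pursuit against a per-class dictionary and assigned to the class with the smallest reconstruction residual. The spectra are synthetic and the K-SVD dictionary-learning step is omitted.

```python
# Sketch (synthetic spectra): code a test spectrum against each class
# dictionary with OMP and pick the class with the lowest residual.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.RandomState(0)
wavelengths = np.linspace(3400, 5500, 300)

def fake_spectrum(cls):                     # toy stand-in for a calibrated spectrum
    return np.exp(-((wavelengths - (4000 + 400 * cls)) / 150) ** 2) \
           + 0.05 * rng.randn(len(wavelengths))

train = {cls: np.column_stack([fake_spectrum(cls) for _ in range(20)])
         for cls in range(3)}               # one dictionary per class (atoms = columns)
test = fake_spectrum(1)

residuals = {}
for cls, D in train.items():
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=5,
                                    fit_intercept=False).fit(D, test)
    residuals[cls] = np.linalg.norm(test - D @ omp.coef_)
print(min(residuals, key=residuals.get))    # expected: 1
```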

  11. Text Passage Retrieval Based on Colon Classification: Retrieval Performance.

    ERIC Educational Resources Information Center

    Shepherd, Michael A.

    1981-01-01

    Reports the results of experiments using colon classification for the analysis, representation, and retrieval of primary information from the full text of documents. Recall, precision, and search length measures indicate colon classification did not perform significantly better than Boolean or simple word occurrence systems. Thirteen references…

  12. Automatic UXO classification for fully polarimetric GPR data

    NASA Astrophysics Data System (ADS)

    Youn, Hyoung-Sun; Chen, Chi-Chih

    2003-09-01

    This paper presents an automatic UXO classification system using a neural network and fuzzy inference based on the classification rules developed by OSU. These rules incorporate scattering pattern, polarization and resonance features extracted from an ultra-wide bandwidth, fully polarimetric radar system. These features allow one to discriminate an elongated object. The algorithm consists of two stages. The first stage classifies objects into clutter (groups A and D), horizontal linear objects (group B) and vertical linear objects (group C) according to the spatial distribution of the Estimated Linear Factor (ELF) values. The second stage then discriminates UXO-like targets from clutter within groups B and C. The rule in the first stage was implemented with a neural network, and the rules in the second stage were realized by fuzzy inference with quantitative variables, i.e. ELF level, flatness of the Estimated Target Orientation (ETO), the consistency of the target orientation, and the magnitude of the target response. It was found that the classification performance of this automatic algorithm is comparable with or superior to that obtained from a trained expert. However, the automatic classification procedure does not require the involvement of an operator and assigns an unbiased quantitative confidence level (or quality factor) to each classification. Classification errors and inconsistencies associated with fatigue, memory fading or complex features should be greatly reduced.

  13. Automatic breast density classification using neural network

    NASA Astrophysics Data System (ADS)

    Arefan, D.; Talebpour, A.; Ahmadinejhad, N.; Kamali Asl, A.

    2015-12-01

    According to studies, the risk of breast cancer is directly associated with breast density. Much research has been done on the automatic diagnosis of breast density using mammography. In the current study, artifacts in mammograms are removed using image processing techniques, and with the method presented here, which includes detecting points on the pectoral muscle edge and estimating the edge using regression techniques, the pectoral muscle is detected with high accuracy and the breast tissue is extracted fully automatically. In order to classify mammography images into three categories (Fatty, Glandular, Dense), a feature based on the difference in gray levels between hard tissue and soft tissue in mammograms has been used in addition to statistical features, together with a neural network classifier with one hidden layer. The image database used in this research is the mini-MIAS database, and the maximum accuracy of the system in classifying images is reported as 97.66% with 8 hidden-layer neurons in the neural network.

  14. Classification and automatic transcription of primate calls.

    PubMed

    Versteegh, Maarten; Kuhn, Jeremy; Synnaeve, Gabriel; Ravaux, Lucie; Chemla, Emmanuel; Cäsar, Cristiane; Fuller, James; Murphy, Derek; Schel, Anne; Dunbar, Ewan

    2016-07-01

    This paper reports on an automated and openly available tool for automatic acoustic analysis and transcription of primate calls, which takes raw field recordings and outputs call labels time-aligned with the audio. The system's output predicts a majority of the start times of calls accurately within 200 milliseconds. The tools do not require any manual acoustic analysis or selection of spectral features by the researcher. PMID:27475207

  15. Automatic Classification of Kepler Planetary Transit Candidates

    NASA Astrophysics Data System (ADS)

    McCauliff, Sean D.; Jenkins, Jon M.; Catanzarite, Joseph; Burke, Christopher J.; Coughlin, Jeffrey L.; Twicken, Joseph D.; Tenenbaum, Peter; Seader, Shawn; Li, Jie; Cote, Miles

    2015-06-01

    In the first three years of operation, the Kepler mission found 3697 planet candidates (PCs) from a set of 18,406 transit-like features detected on more than 200,000 distinct stars. Vetting candidate signals manually by inspecting light curves and other diagnostic information is a labor intensive effort. Additionally, this classification methodology does not yield any information about the quality of PCs; all candidates are as credible as any other. The torrent of exoplanet discoveries will continue after Kepler, because a number of exoplanet surveys will have an even broader search area. This paper presents the application of machine-learning techniques to the classification of the exoplanet transit-like signals present in the Kepler light curve data. Transit-like detections are transformed into a uniform set of real-numbered attributes, the most important of which are described in this paper. Each of the known transit-like detections is assigned a class of PC; astrophysical false positive; or systematic, instrumental noise. We use a random forest algorithm to learn the mapping from attributes to classes on this training set. The random forest algorithm has been used previously to classify variable stars; this is the first time it has been used for exoplanet classification. We are able to achieve an overall error rate of 5.85% and an error rate for classifying exoplanets candidates of 2.81%.
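
    The following sketch (synthetic attributes, not Kepler data) shows the general pattern: a random forest trained on real-valued attributes assigns each detection to one of three classes, and the class probabilities can be used to rank candidates from most to least credible. The attribute names and labelling rule are illustrative assumptions.

```python
# Sketch: random-forest classification of transit-like detections into
# planet candidate (PC), astrophysical false positive (AFP) or
# non-transiting phenomenon (NTP), ranked by classifier probability.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n = 600
X = np.column_stack([
    rng.lognormal(0.0, 1.0, n),    # e.g. transit depth proxy
    rng.uniform(0.5, 500.0, n),    # e.g. orbital period (days)
    rng.normal(10.0, 3.0, n),      # e.g. multiple-event statistic
])
# Toy labelling rule so the forest has something to learn.
y = np.where(X[:, 2] > 12, "PC", np.where(X[:, 0] > 1.5, "AFP", "NTP"))

clf = RandomForestClassifier(n_estimators=300, oob_score=True,
                             random_state=0).fit(X, y)
proba = clf.predict_proba(X)                    # per-class probabilities
pc_col = list(clf.classes_).index("PC")
ranking = np.argsort(proba[:, pc_col])[::-1]    # most credible PCs first
print(round(clf.oob_score_, 3), ranking[:5])
```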

  16. Research on Automatic Classification, Indexing and Extracting. Annual Progress Report.

    ERIC Educational Resources Information Center

    Baker, F.T.; And Others

    In order to contribute to the success of several studies for automatic classification, indexing and extracting currently in progress, as well as to further the theoretical and practical understanding of textual item distributions, the development of a frequency program capable of supplying these types of information was undertaken. The program…

  17. Automatic classification of stellar spectra using the SOFM method

    NASA Astrophysics Data System (ADS)

    Xui, Jian-qiao; Li, Qi-bin; Zhao, Yong-heng

    2001-04-01

    In this paper, an automatic classification method for stellar spectra using the Self-Organizing Feature Map (SOFM) is given. The SOFM is an unsupervised learning algorithm from Artificial Neural Networks (ANN). It allows the data to be organized onto a feature map while conserving most of the topological characteristics of the original data space. We used this method to classify stellar spectra automatically. The result is very similar to the Harvard sequence, with an accuracy comparable to that obtained by human experts. The SOFM should be a useful method for on-line classification of very large samples of stellar spectra. The SOFM can be executed automatically, so it can be applied to process large numbers of target spectra.

  18. Automatically classifying sentences in full-text biomedical articles into Introduction, Methods, Results and Discussion

    PubMed Central

    Agarwal, Shashank; Yu, Hong

    2009-01-01

    Biomedical texts can be typically represented by four rhetorical categories: Introduction, Methods, Results and Discussion (IMRAD). Classifying sentences into these categories can benefit many other text-mining tasks. Although many studies have applied different approaches for automatically classifying sentences in MEDLINE abstracts into the IMRAD categories, few have explored the classification of sentences that appear in full-text biomedical articles. We first evaluated whether sentences in full-text biomedical articles could be reliably annotated into the IMRAD format and then explored different approaches for automatically classifying these sentences into the IMRAD categories. Our results show an overall annotation agreement of 82.14% with a Kappa score of 0.756. The best classification system is a multinomial naïve Bayes classifier trained on manually annotated data that achieved 91.95% accuracy and an average F-score of 91.55%, which is significantly higher than baseline systems. A web version of this system is available online at—http://wood.ims.uwm.edu/full_text_classifier/. Contact: hongyu@uwm.edu PMID:19783830
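
    A minimal sketch of a multinomial naïve Bayes sentence classifier for the IMRAD categories, with toy training sentences; it is not the system evaluated in the paper, and the example sentences are illustrative.

```python
# Sketch: multinomial naive Bayes over word unigrams and bigrams for
# classifying sentences into the IMRAD categories (toy training data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

sentences = [
    "Little is known about the role of this receptor.",        # Introduction
    "Cells were cultured for 48 hours at 37 degrees.",          # Methods
    "Expression increased threefold compared with controls.",   # Results
    "These findings suggest a regulatory mechanism.",           # Discussion
]
labels = ["Introduction", "Methods", "Results", "Discussion"]

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(sentences, labels)
print(clf.predict(["Samples were incubated overnight."]))  # expected: ['Methods']
```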

  19. Research on Automatic Classification, Indexing and Extracting: A General-Purpose Frequency Program.

    ERIC Educational Resources Information Center

    Baker, F. T.; Williams, John H., Jr.

    To support studies in automatic indexing, classification and extracting, a general purpose frequency program was developed to further theoretical and practical understanding of text word distributions. While the program is primarily designed for counting strings of character-oriented data, it can be used without change for counting any items which…

  20. Semi-automatic classification of textures in thoracic CT scans

    NASA Astrophysics Data System (ADS)

    Kockelkorn, Thessa T. J. P.; de Jong, Pim A.; Schaefer-Prokop, Cornelia M.; Wittenberg, Rianne; Tiehuis, Audrey M.; Gietema, Hester A.; Grutters, Jan C.; Viergever, Max A.; van Ginneken, Bram

    2016-08-01

    The textural patterns in the lung parenchyma, as visible on computed tomography (CT) scans, are essential to make a correct diagnosis in interstitial lung disease. We developed one automatic and two interactive protocols for classification of normal and seven types of abnormal lung textures. Lungs were segmented and subdivided into volumes of interest (VOIs) with homogeneous texture using a clustering approach. In the automatic protocol, VOIs were classified automatically by an extra-trees classifier that was trained using annotations of VOIs from other CT scans. In the interactive protocols, an observer iteratively trained an extra-trees classifier to distinguish the different textures, by correcting mistakes the classifier makes in a slice-by-slice manner. The difference between the two interactive methods was whether or not training data from previously annotated scans was used in classification of the first slice. The protocols were compared in terms of the percentages of VOIs that observers needed to relabel. Validation experiments were carried out using software that simulated observer behavior. In the automatic classification protocol, observers needed to relabel on average 58% of the VOIs. During interactive annotation without the use of previous training data, the average percentage of relabeled VOIs decreased from 64% for the first slice to 13% for the second half of the scan. Overall, 21% of the VOIs were relabeled. When previous training data was available, the average overall percentage of VOIs requiring relabeling was 20%, decreasing from 56% in the first slice to 13% in the second half of the scan.

  1. Automatic lung nodule classification with radiomics approach

    NASA Astrophysics Data System (ADS)

    Ma, Jingchen; Wang, Qian; Ren, Yacheng; Hu, Haibo; Zhao, Jun

    2016-03-01

    Lung cancer is the first killer among the cancer deaths. Malignant lung nodules have extremely high mortality while some of the benign nodules don't need any treatment .Thus, the accuracy of diagnosis between benign or malignant nodules diagnosis is necessary. Notably, although currently additional invasive biopsy or second CT scan in 3 months later may help radiologists to make judgments, easier diagnosis approaches are imminently needed. In this paper, we propose a novel CAD method to distinguish the benign and malignant lung cancer from CT images directly, which can not only improve the efficiency of rumor diagnosis but also greatly decrease the pain and risk of patients in biopsy collecting process. Briefly, according to the state-of-the-art radiomics approach, 583 features were used at the first step for measurement of nodules' intensity, shape, heterogeneity and information in multi-frequencies. Further, with Random Forest method, we distinguish the benign nodules from malignant nodules by analyzing all these features. Notably, our proposed scheme was tested on all 79 CT scans with diagnosis data available in The Cancer Imaging Archive (TCIA) which contain 127 nodules and each nodule is annotated by at least one of four radiologists participating in the project. Satisfactorily, this method achieved 82.7% accuracy in classification of malignant primary lung nodules and benign nodules. We believe it would bring much value for routine lung cancer diagnosis in CT imaging and provide improvement in decision-support with much lower cost.

  2. Automatic Classification of Kepler Threshold Crossing Events

    NASA Astrophysics Data System (ADS)

    McCauliff, Sean; Catanzarite, Joseph; Jenkins, Jon Michael

    2014-06-01

    Over the course of its 4-year primary mission the Kepler mission has discovered numerous planets. Part of the process of planet discovery has involved generating threshold crossing events (TCEs); a light curve with a repeating exoplanet transit-like feature. The large number of diagnostics 100) makes it difficult to examine all the information available for each TCE. The effort required for vetting all threshold-crossing events (TCEs) takes several months by many individuals associated with the Kepler Threshold Crossing Event Review Team (TCERT). The total number of objects with transit-like features identified in the light curves has increased to as many as 18,000, just examining the first three years of data. In order to accelerate the process by which new planet candidates are classified, we propose a machine learning approach to establish a preliminary list of planetary candidates ranked from most credible to least credible. The classifier must distinguish between three classes of detections: non-transiting phenomena, astrophysical false positives, and planet candidates. We use random forests, a supervised classification algorithm to this end. We report on the performance of the classifier and identify diagnostics that are important for discriminating between these classes of TCEs.Funding for this mission is provided by NASA’s Science Mission Directorate.

  3. Automatic classification of time-variable X-ray sources

    SciTech Connect

    Lo, Kitty K.; Farrell, Sean; Murphy, Tara; Gaensler, B. M.

    2014-05-01

    To maximize the discovery potential of future synoptic surveys, especially in the field of transient science, it will be necessary to use automatic classification to identify some of the astronomical sources. The data mining technique of supervised classification is suitable for this problem. Here, we present a supervised learning method to automatically classify variable X-ray sources in the Second XMM-Newton Serendipitous Source Catalog (2XMMi-DR2). Random Forest is our classifier of choice since it is one of the most accurate learning algorithms available. Our training set consists of 873 variable sources, and their features are derived from time series, spectra, and other multi-wavelength contextual information. The 10-fold cross-validation accuracy on the training data is ∼97% for a 7-class data set. We applied the trained classification model to 411 unknown variable 2XMM sources to produce a probabilistically classified catalog. Using the classification margin and the Random Forest derived outlier measure, we identified 12 anomalous sources, of which 2XMM J180658.7–500250 appears to be the most unusual source in the sample. Its X-ray spectrum is suggestive of an ultraluminous X-ray source, but its variability makes it highly unusual. Machine-learned classification and anomaly detection will facilitate scientific discoveries in the era of all-sky surveys.

  4. Lexical Inference Mechanisms for Text Understanding and Classification.

    ERIC Educational Resources Information Center

    Figa, Elizabeth; Tarau, Paul

    2003-01-01

    Describes a framework for building "story traces" (compact global views of a narrative) and "story projects" (selections of key elements of a narrative) and their applications in text understanding and classification. The resulting "abstract story traces" provide a compact view of the underlying narrative's key content elements and a means for…

  5. PADMA: PArallel Data Mining Agents for scalable text classification

    SciTech Connect

    Kargupta, H.; Hamzaoglu, I.; Stafford, B.

    1997-03-01

    This paper introduces PADMA (PArallel Data Mining Agents), a parallel agent based system for scalable text classification. PADMA contains modules for (1) parallel data accessing operations, (2) parallel hierarchical clustering, and (3) web-based data visualization. This paper introduces the general architecture of PADMA and presents a detailed description of its different modules.

  6. Automatic classification of seismic events within a regional seismograph network

    NASA Astrophysics Data System (ADS)

    Tiira, Timo; Kortström, Jari; Uski, Marja

    2015-04-01

    A fully automatic method for seismic event classification within a sparse regional seismograph network is presented. The tool is based on a supervised pattern recognition technique, the Support Vector Machine (SVM), trained here to distinguish weak local earthquakes from a bulk of human-made or spurious seismic events. The classification rules rely on differences in signal energy distribution between natural and artificial seismic sources. Seismic records are divided into four windows: P, P coda, S, and S coda. For each signal window, the STA is computed in 20 narrow frequency bands between 1 and 41 Hz. The resulting 80 discrimination parameters are used as training data for the SVM. The SVM models are calculated for 19 on-line seismic stations in Finland. The event data are compiled mainly from fully automatic event solutions that are manually classified after the automatic location process. The station-specific SVM training events include 11-302 positive (earthquake) and 227-1048 negative (non-earthquake) examples. The best voting rules for combining results from different stations are determined during an independent testing period. Finally, the network processing rules are applied to an independent evaluation period comprising 4681 fully automatic event determinations, of which 98% have been manually identified as explosions or noise and 2% as earthquakes. The SVM method correctly identifies 94% of the non-earthquakes and all of the earthquakes. The results imply that the SVM tool can identify and filter out blasts and spurious events from fully automatic event solutions with a high level of confidence. The tool helps to reduce the workload in manual seismic analysis by leaving only ~5% of the automatic event determinations, i.e. the probable earthquakes, for more detailed seismological analysis. The approach presented is easy to adjust to the requirements of a denser or wider high-frequency network, once enough training examples for building a station-specific data set are available.
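
    The feature scheme described above can be approximated as band-wise short-term averages fed to an SVM. The sketch below uses synthetic waveforms and only four bands instead of the 20-band, four-window layout; all signals, bands and parameters are illustrative assumptions.

```python
# Sketch (synthetic waveforms): band-wise short-term averages as features
# for an SVM earthquake/non-earthquake discriminator.
import numpy as np
from scipy.signal import butter, sosfilt
from sklearn.svm import SVC

fs = 100.0                                    # sampling rate (Hz)
bands = [(1, 5), (5, 10), (10, 20), (20, 40)] # narrow frequency bands

def band_features(trace):
    feats = []
    for lo, hi in bands:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        feats.append(np.mean(np.abs(sosfilt(sos, trace))))   # STA per band
    return feats

rng = np.random.RandomState(0)
def synth(kind):                              # toy stand-ins for signal windows
    t = np.arange(0, 10, 1 / fs)
    f = 3.0 if kind == "earthquake" else 15.0
    return np.sin(2 * np.pi * f * t) * np.exp(-t) + 0.1 * rng.randn(len(t))

X = [band_features(synth(k)) for k in ["earthquake", "blast"] * 50]
y = ["earthquake", "blast"] * 50
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([band_features(synth("earthquake"))]))
```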

  7. Automatic Classification of Variable Stars in Catalogs with Missing Data

    NASA Astrophysics Data System (ADS)

    Pichara, Karim; Protopapas, Pavlos

    2013-11-01

    We present an automatic classification method for astronomical catalogs with missing data. We use Bayesian networks and a probabilistic graphical model that allows us to perform inference to predict missing values given observed data and dependency relationships between variables. To learn a Bayesian network from incomplete data, we use an iterative algorithm that utilizes sampling methods and expectation maximization to estimate the distributions and probabilistic dependencies of variables from data with missing values. To test our model, we use three catalogs with missing data (SAGE, Two Micron All Sky Survey, and UBVI) and one complete catalog (MACHO). We examine how classification accuracy changes when information from missing data catalogs is included, how our method compares to traditional missing data approaches, and at what computational cost. Integrating these catalogs with missing data, we find that classification of variable objects improves by a few percent and by 15% for quasar detection while keeping the computational cost the same.

  8. AUTOMATIC CLASSIFICATION OF VARIABLE STARS IN CATALOGS WITH MISSING DATA

    SciTech Connect

    Pichara, Karim; Protopapas, Pavlos

    2013-11-10

    We present an automatic classification method for astronomical catalogs with missing data. We use Bayesian networks and a probabilistic graphical model that allows us to perform inference to predict missing values given observed data and dependency relationships between variables. To learn a Bayesian network from incomplete data, we use an iterative algorithm that utilizes sampling methods and expectation maximization to estimate the distributions and probabilistic dependencies of variables from data with missing values. To test our model, we use three catalogs with missing data (SAGE, Two Micron All Sky Survey, and UBVI) and one complete catalog (MACHO). We examine how classification accuracy changes when information from missing data catalogs is included, how our method compares to traditional missing data approaches, and at what computational cost. Integrating these catalogs with missing data, we find that classification of variable objects improves by a few percent and by 15% for quasar detection while keeping the computational cost the same.

  9. Simple-random-sampling-based multiclass text classification algorithm.

    PubMed

    Liu, Wuying; Wang, Lin; Yi, Mianzhu

    2014-01-01

    Multiclass text classification (MTC) is a challenging issue, and the corresponding MTC algorithms can be used in many applications. The space-time overhead of these algorithms is a major concern in the era of big data. Through an investigation of the token frequency distribution in a Chinese web document collection, this paper reexamines the power law and proposes a simple-random-sampling-based MTC (SRSMTC) algorithm. Supported by a token-level memory that stores labeled documents, the SRSMTC algorithm uses a text retrieval approach to solve text classification problems. The experimental results on the TanCorp data set show that the SRSMTC algorithm can achieve state-of-the-art performance at greatly reduced space-time requirements. PMID:24778587

  10. Simple-Random-Sampling-Based Multiclass Text Classification Algorithm

    PubMed Central

    Liu, Wuying; Wang, Lin; Yi, Mianzhu

    2014-01-01

    Multiclass text classification (MTC) is a challenging issue, and the corresponding MTC algorithms can be used in many applications. The space-time overhead of these algorithms is a major concern in the era of big data. Through an investigation of the token frequency distribution in a Chinese web document collection, this paper reexamines the power law and proposes a simple-random-sampling-based MTC (SRSMTC) algorithm. Supported by a token-level memory that stores labeled documents, the SRSMTC algorithm uses a text retrieval approach to solve text classification problems. The experimental results on the TanCorp data set show that the SRSMTC algorithm can achieve state-of-the-art performance at greatly reduced space-time requirements. PMID:24778587

  11. Semi-automatic classification of textures in thoracic CT scans.

    PubMed

    Kockelkorn, Thessa T J P; de Jong, Pim A; Schaefer-Prokop, Cornelia M; Wittenberg, Rianne; Tiehuis, Audrey M; Gietema, Hester A; Grutters, Jan C; Viergever, Max A; van Ginneken, Bram

    2016-08-21

    The textural patterns in the lung parenchyma, as visible on computed tomography (CT) scans, are essential to make a correct diagnosis in interstitial lung disease. We developed one automatic and two interactive protocols for classification of normal and seven types of abnormal lung textures. Lungs were segmented and subdivided into volumes of interest (VOIs) with homogeneous texture using a clustering approach. In the automatic protocol, VOIs were classified automatically by an extra-trees classifier that was trained using annotations of VOIs from other CT scans. In the interactive protocols, an observer iteratively trained an extra-trees classifier to distinguish the different textures, by correcting mistakes the classifier makes in a slice-by-slice manner. The difference between the two interactive methods was whether or not training data from previously annotated scans was used in classification of the first slice. The protocols were compared in terms of the percentages of VOIs that observers needed to relabel. Validation experiments were carried out using software that simulated observer behavior. In the automatic classification protocol, observers needed to relabel on average 58% of the VOIs. During interactive annotation without the use of previous training data, the average percentage of relabeled VOIs decreased from 64% for the first slice to 13% for the second half of the scan. Overall, 21% of the VOIs were relabeled. When previous training data was available, the average overall percentage of VOIs requiring relabeling was 20%, decreasing from 56% in the first slice to 13% in the second half of the scan. PMID:27436568

  12. Neural net learning issues in classification of free text documents

    SciTech Connect

    Dasigi, V.R.; Mann, R.C.

    1996-03-01

    In intelligent analysis of large amounts of text, not any single clue indicates reliably that a pattern of interest has been found. When using multiple clues, it is not known how these should be integrated into a decision. In the context of this investigation, we have been using neural nets as parameterized mappings that allow for fusion of higher level clues extracted from free text. By using higher level clues and features, we avoid very large networks. By using the dominant singular values computed by Latent Semantic Indexing (LSI) and applying neural network algorithms for integrating these values and the outputs from other ``sensors,`` we have obtained preliminary encouraging results with text classification.

  13. Automatic Cataract Hardness Classification Ex Vivo by Ultrasound Techniques.

    PubMed

    Caixinha, Miguel; Santos, Mário; Santos, Jaime

    2016-04-01

    To demonstrate the feasibility of a new methodology for cataract hardness characterization and automatic classification using ultrasound techniques, different cataract degrees were induced in 210 porcine lenses. A 25-MHz ultrasound transducer was used to obtain acoustical parameters (velocity and attenuation) and backscattering signals. B-Scan and parametric Nakagami images were constructed. Ninety-seven parameters were extracted and subjected to a Principal Component Analysis. Bayes, K-Nearest-Neighbours, Fisher Linear Discriminant and Support Vector Machine (SVM) classifiers were used to automatically classify the different cataract severities. Statistically significant increases with cataract formation were found for velocity, attenuation, mean brightness intensity of the B-Scan images and mean Nakagami m parameter (p < 0.01). The four classifiers showed a good performance for healthy versus cataractous lenses (F-measure ≥ 92.68%), while for initial versus severe cataracts the SVM classifier showed the higher performance (90.62%). The results showed that ultrasound techniques can be used for non-invasive cataract hardness characterization and automatic classification. PMID:26742891
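
    A minimal sketch of the dimensionality-reduction-plus-classifier stage (here PCA followed by an RBF SVM) on synthetic feature vectors; the features, class shifts and component counts are illustrative and do not reproduce the study's pipeline.

```python
# Sketch (synthetic features): PCA followed by an SVM for grading lenses
# into healthy / initial / severe classes.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n_per_class, n_features = 70, 97          # e.g. 97 acoustic/texture parameters
X = np.vstack([rng.normal(loc=shift, scale=1.0, size=(n_per_class, n_features))
               for shift in (0.0, 0.6, 1.2)])   # healthy / initial / severe
y = (["healthy"] * n_per_class + ["initial"] * n_per_class
     + ["severe"] * n_per_class)

clf = make_pipeline(StandardScaler(), PCA(n_components=10), SVC(kernel="rbf"))
print(cross_val_score(clf, X, y, cv=5).mean())   # cross-validated accuracy
```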

  14. Exploring Automaticity in Text Processing: Syntactic Ambiguity as a Test Case

    ERIC Educational Resources Information Center

    Rawson, Katherine A.

    2004-01-01

    A prevalent assumption in text comprehension research is that many aspects of text processing are automatic, with automaticity typically defined in terms of properties (e.g., speed and effort). The present research advocates conceptualization of automaticity in terms of underlying mechanisms and evaluates two such accounts, a…

  15. Automatic template-guided classification of remnant trees

    NASA Astrophysics Data System (ADS)

    Kennedy, Peter

    Spectral features within satellite images change so frequently and unpredictably that spectral definitions of land cover are often only accurate for a single image. Consequently, land-cover maps are expensive, because the superior pattern recognition skills of human analysts are required to manually tune spectral definitions of land cover to individual images. To reduce mapping costs, this study developed the Template-Guided Classification (TGC) algorithm, which classifies land cover automatically by reusing class information embedded in freely available large-area land-cover maps. TGC was applied to map remnant forest within six 10-m resolution SPOT images of the Vermilion River watershed in Alberta, Canada. Although the accuracy of the resulting forest maps was low (58% forest user's accuracy and 67% forest producer's accuracy), there were 25% and 8% fewer errors of omission and commission than in the original maps, respectively. This improvement would be very useful if it could be obtained automatically over large areas.

  16. An integrated spatial signature analysis and automatic defect classification system

    SciTech Connect

    Gleason, S.S.; Tobin, K.W.; Karnowski, T.P.

    1997-08-01

    An integrated Spatial Signature Analysis (SSA) and automatic defect classification (ADC) system for improved automatic semiconductor wafer manufacturing characterization is presented. Both concepts of SSA and ADC methodologies are reviewed and then the benefits of an integrated system are described, namely, focused ADC and signature-level sampling. Focused ADC involves the use of SSA information on a defect signature to reduce the number of possible classes that an ADC system must consider, thus improving the ADC system performance. Signature-level sampling improved the ADC system throughput and accuracy by intelligently sampling defects within a given spatial signature for subsequent off-line, high-resolution ADC. A complete example of wafermap characterization via an integrated SSA/ADC system is presented where a wafer with 3274 defects is completely characterized by revisiting only 25 defects on an off-line ADC review station. 13 refs., 7 figs.

  17. Classification of protein-protein interaction full-text documents using text and citation network features.

    PubMed

    Kolchinsky, Artemy; Abi-Haidar, Alaa; Kaur, Jasleen; Hamed, Ahmed Abdeen; Rocha, Luis M

    2010-01-01

    We participated (as Team 9) in the Article Classification Task of the BioCreative II.5 Challenge: binary classification of full-text documents relevant for protein-protein interaction. We used two distinct classifiers for the online and offline challenges: 1) the lightweight Variable Trigonometric Threshold (VTT) linear classifier we successfully introduced in BioCreative 2 for binary classification of abstracts and 2) a novel Naive Bayes classifier using features from the citation network of the relevant literature. We supplemented the supplied training data with full-text documents from the MIPS database. The lightweight VTT classifier was very competitive in this new full-text scenario: it was a top-performing submission in this task, taking into account the rank product of the Area Under the interpolated precision and recall Curve, Accuracy, Balanced F-Score, and Matthews Correlation Coefficient performance measures. The novel citation network classifier for the biomedical text mining domain, while not a top performing classifier in the challenge, performed above the central tendency of all submissions, and therefore indicates a promising new avenue to investigate further in bibliome informatics. PMID:20671313

  18. Evolutionary synthesis of automatic classification on astroinformatic big data

    NASA Astrophysics Data System (ADS)

    Kojecky, Lumir; Zelinka, Ivan; Saloun, Petr

    2016-06-01

    This article describes initial experiments using a new approach to the automatic identification of Be and B[e] star spectra in large archives. With the enormous amount of such data, it is no longer feasible to analyze them using classical approaches. We introduce an evolutionary synthesis of the classification by means of analytic programming, one of the methods of symbolic regression. With this method, we synthesize the mathematical formulas that best approximate chosen samples of the stellar spectra. The category whose formula has the lowest difference from the particular spectrum is then selected as the result. The results show that classification of stellar spectra by means of analytic programming is able to identify different shapes of the spectra.

  19. Improve mask inspection capacity with Automatic Defect Classification (ADC)

    NASA Astrophysics Data System (ADS)

    Wang, Crystal; Ho, Steven; Guo, Eric; Wang, Kechang; Lakkapragada, Suresh; Yu, Jiao; Hu, Peter; Tolani, Vikram; Pang, Linyong

    2013-09-01

    As optical lithography continues to extend into low-k1 regime, resolution of mask patterns continues to diminish. The adoption of RET techniques like aggressive OPC, sub-resolution assist features combined with the requirements to detect even smaller defects on masks due to increasing MEEF, poses considerable challenges for mask inspection operators and engineers. Therefore a comprehensive approach is required in handling defects post-inspections by correctly identifying and classifying the real killer defects impacting the printability on wafer, and ignoring nuisance defect and false defects caused by inspection systems. This paper focuses on the results from the evaluation of Automatic Defect Classification (ADC) product at the SMIC mask shop for the 40nm technology node. Traditionally, each defect is manually examined and classified by the inspection operator based on a set of predefined rules and human judgment. At SMIC mask shop due to the significant total number of detected defects, manual classification is not cost-effective due to increased inspection cycle time, resulting in constrained mask inspection capacity, since the review has to be performed while the mask stays on the inspection system. Luminescent Technologies Automated Defect Classification (ADC) product offers a complete and systematic approach for defect disposition and classification offline, resulting in improved utilization of the current mask inspection capability. Based on results from implementation of ADC in SMIC mask production flow, there was around 20% improvement in the inspection capacity compared to the traditional flow. This approach of computationally reviewing defects post mask-inspection ensures no yield loss by qualifying reticles without the errors associated with operator mis-classification or human error. The ADC engine retrieves the high resolution inspection images and uses a decision-tree flow to classify a given defect. Some identification mechanisms adopted by ADC to

  20. Automatic coding of reasons for hospital referral from general medicine free-text reports.

    PubMed Central

    Letrilliart, L.; Viboud, C.; Boëlle, P. Y.; Flahault, A.

    2000-01-01

    Although the coding of medical data is expected to benefit both patients and the health care system, its implementation as a manual process often represents an unattractive workload for the physician. For epidemiological purposes, we developed a simple automatic coding system based on string matching, which was designed to process free-text sentences stating reasons for hospital referral, as collected from general practitioners (GPs). This system relied on a look-up table, built from 2590 reports giving a single reason for referral, which were coded manually according to the International Classification of Primary Care (ICPC). We tested the system by entering 797 new reasons for referral. The match rate was estimated at 77%, and the accuracy rate at 80% at code level and 92% at chapter level. This simple system is now routinely used by a national epidemiological network of sentinel physicians. PMID:11079931
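
    A minimal sketch of a lookup-table, string-matching coder in the spirit of the system described above; the table entries and ICPC codes below are illustrative examples, not the study's actual table.

```python
# Sketch: code a free-text reason for referral by matching phrases from a
# manually built lookup table (illustrative entries).
import re

lookup = {                      # reason-for-referral phrase -> ICPC code (examples)
    "chest pain": "K01",
    "suspected appendicitis": "D88",
    "fracture of the wrist": "L72",
}

def code_reason(free_text):
    text = free_text.lower()
    for phrase, code in lookup.items():
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            return code
    return None                 # no match: left for manual coding

print(code_reason("Referred for chest pain at rest"))   # -> K01
```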

  1. Clinically-inspired automatic classification of ovarian carcinoma subtypes

    PubMed Central

    BenTaieb, Aïcha; Nosrati, Masoud S; Li-Chang, Hector; Huntsman, David; Hamarneh, Ghassan

    2016-01-01

    Context: It has been shown that ovarian carcinoma subtypes are distinct pathologic entities with differing prognostic and therapeutic implications. Histotyping by pathologists has good reproducibility, but occasional cases are challenging and require immunohistochemistry and subspecialty consultation. Motivated by the need for more accurate and reproducible diagnoses and to facilitate pathologists’ workflow, we propose an automatic framework for ovarian carcinoma classification. Materials and Methods: Our method is inspired by pathologists’ workflow. We analyse imaged tissues at two magnification levels and extract clinically-inspired color, texture, and segmentation-based shape descriptors using image-processing methods. We propose a carefully designed machine learning technique composed of four modules: A dissimilarity matrix, dimensionality reduction, feature selection and a support vector machine classifier to separate the five ovarian carcinoma subtypes using the extracted features. Results: This paper presents the details of our implementation and its validation on a clinically derived dataset of eighty high-resolution histopathology images. The proposed system achieved a multiclass classification accuracy of 95.0% when classifying unseen tissues. Assessment of the classifier's confusion (confusion matrix) between the five different ovarian carcinoma subtypes agrees with clinician's confusion and reflects the difficulty in diagnosing endometrioid and serous carcinomas. Conclusions: Our results from this first study highlight the difficulty of ovarian carcinoma diagnosis which originate from the intrinsic class-imbalance observed among subtypes and suggest that the automatic analysis of ovarian carcinoma subtypes could be valuable to clinician's diagnostic procedure by providing a second opinion. PMID:27563487

  2. Image classification approach for automatic identification of grassland weeds

    NASA Astrophysics Data System (ADS)

    Gebhardt, Steffen; Kühbauch, Walter

    2006-08-01

    The potential of digital image processing for weed mapping in arable crops has been widely investigated in recent decades. In grassland farming, these techniques have rarely been applied so far. The project presented here focuses on the automatic identification of one of the most invasive and persistent grassland weed species, the broad-leaved dock (Rumex obtusifolius L.), in complex mixtures of grass and herbs. A total of 108 RGB images were acquired at close range from a field experiment under constant illumination conditions using a commercial digital camera. The objects of interest were separated from the background by transforming the 24-bit RGB images into 8-bit intensity images and then calculating local homogeneity images. These images were binarised by applying a dynamic grey-value threshold. Finally, morphological opening was applied to the binary images. The remaining contiguous regions were considered to be objects. In order to classify these objects into three different weed species, a soil class and a residue class, a total of 17 object features related to the shape, color and texture of the weeds were extracted. Using MANOVA, 12 of them were identified as contributing to classification. Maximum-likelihood classification was conducted to discriminate the weed species. The total classification rate across all classes ranged from 76% to 83%. The classification of Rumex obtusifolius achieved detection rates between 85% and 93%, with misclassification below 10%. Furthermore, Rumex obtusifolius distribution and density maps were generated based on the classification results and the transformation of image coordinates into the Gauss-Krueger system. These promising results show the high potential of image analysis for weed mapping in grassland and for the implementation of site-specific herbicide spraying.

  3. Automatic classification of spectra from the Infrared Astronomical Satellite (IRAS)

    NASA Technical Reports Server (NTRS)

    Cheeseman, Peter; Stutz, John; Self, Matthew; Taylor, William; Goebel, John; Volk, Kevin; Walker, Helen

    1989-01-01

    A new classification of Infrared spectra collected by the Infrared Astronomical Satellite (IRAS) is presented. The spectral classes were discovered automatically by a program called Auto Class 2. This program is a method for discovering (inducing) classes from a data base, utilizing a Bayesian probability approach. These classes can be used to give insight into the patterns that occur in the particular domain, in this case, infrared astronomical spectroscopy. The classified spectra are the entire Low Resolution Spectra (LRS) Atlas of 5,425 sources. There are seventy-seven classes in this classification and these in turn were meta-classified to produce nine meta-classes. The classification is presented as spectral plots, IRAS color-color plots, galactic distribution plots and class commentaries. Cross-reference tables, listing the sources by IRAS name and by Auto Class class, are also given. These classes show some of the well known classes, such as the black-body class, and silicate emission classes, but many other classes were unsuspected, while others show important subtle differences within the well known classes.

  4. On Multilabel Classification Methods of Incompletely Labeled Biomedical Text Data

    PubMed Central

    Kamyshenkov, Dmitry; Smekalova, Elena; Golovizin, Alexey; Zhavoronkov, Alex

    2014-01-01

    Multilabel classification is often hindered by incompletely labeled training datasets; for some items of such dataset (or even for all of them) some labels may be omitted. In this case, we cannot know if any item is labeled fully and correctly. When we train a classifier directly on incompletely labeled dataset, it performs ineffectively. To overcome the problem, we added an extra step, training set modification, before training a classifier. In this paper, we try two algorithms for training set modification: weighted k-nearest neighbor (WkNN) and soft supervised learning (SoftSL). Both of these approaches are based on similarity measurements between data vectors. We performed the experiments on AgingPortfolio (text dataset) and then rechecked on the Yeast (nontext genetic data). We tried SVM and RF classifiers for the original datasets and then for the modified ones. For each dataset, our experiments demonstrated that both classification algorithms performed considerably better when preceded by the training set modification step. PMID:24587817

  5. An Approach for Automatic Classification of Radiology Reports in Spanish.

    PubMed

    Cotik, Viviana; Filippo, Darío; Castaño, José

    2015-01-01

    Automatic detection of relevant terms in medical reports is useful for educational purposes and for clinical research. Natural language processing (NLP) techniques can be applied in order to identify them. In this work we present an approach to classify radiology reports written in Spanish into two sets: the ones that indicate pathological findings and the ones that do not. In addition, the entities corresponding to pathological findings are identified in the reports. We use RadLex, a lexicon of English radiology terms, and NLP techniques to identify the occurrence of pathological findings. Reports are classified using a simple algorithm based on the presence of pathological findings, negation and hedge terms. The implemented algorithms were tested with a test set of 248 reports annotated by an expert, obtaining a best result of 0.72 F1 measure. The output of the classification task can be used to look for specific occurrences of pathological findings. PMID:26262128
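
    A minimal sketch of a presence/negation/hedge rule of the kind described above; the Spanish term lists are illustrative stand-ins for the RadLex-derived vocabulary used in the paper.

```python
# Sketch: flag a report as pathological if a finding term occurs that is not
# negated in its preceding context; hedged findings are tracked separately.
FINDINGS = {"nodulo", "derrame pleural", "fractura"}     # example finding terms
NEGATIONS = {"sin", "no se observa", "ausencia de"}      # example negation terms
HEDGES = {"probable", "posible"}                         # example hedge terms

def classify_report(report):
    text = report.lower()
    findings, uncertain = [], []
    for term in FINDINGS:
        idx = text.find(term)
        if idx == -1:
            continue
        window = text[max(0, idx - 30):idx]              # context before the term
        if any(neg in window for neg in NEGATIONS):
            continue                                     # negated -> ignored
        (uncertain if any(h in window for h in HEDGES) else findings).append(term)
    label = "pathological" if findings or uncertain else "non-pathological"
    return label, findings, uncertain

print(classify_report("No se observa derrame pleural. Probable nodulo apical."))
# expected: ('pathological', [], ['nodulo'])
```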

  6. Automatic music genres classification as a pattern recognition problem

    NASA Astrophysics Data System (ADS)

    Ul Haq, Ihtisham; Khan, Fauzia; Sharif, Sana; Shaukat, Arsalan

    2013-12-01

    Music genres are the simplest and most effective descriptors for searching music libraries, stores or catalogues. This paper compares the results of two automatic music genre classification systems implemented using two different yet simple classifiers (K-Nearest Neighbor and Naïve Bayes). First, a 10-12 second sample is selected and features are extracted from it; then, based on those features, the results of both classifiers are represented in the form of an accuracy table and a confusion matrix. An experiment carried out on test samples taken from the middle of a song shows that they represent the true essence of its genre better than samples taken from the beginning or end of a song. The techniques achieved accuracies of 91% and 78% using the Naïve Bayes and KNN classifiers, respectively.

  7. Automatic classification of penicillin-induced epileptic EEG spikes.

    PubMed

    Kortelainen, Jukka; Silfverhuth, Minna; Suominen, Kalervo; Sonkajarvi, Eila; Alahuhta, Seppo; Jantti, Ville; Seppanen, Tapio

    2010-01-01

    Penicillin-induced focal epilepsy is a well-known model in epilepsy research. In this model, epileptic activity is generated by delivering penicillin focally to the cortex. The drug induces interictal electroencephalographic (EEG) spikes which evolve in time and may later change to ictal discharges. This paper proposes a method for automatic classification of these interictal epileptic spikes using iterative K-means clustering. The method is shown to be able to detect different spike waveforms and describe their characteristic occurrence in time during penicillin-induced focal epilepsy. The study offers potential for future research by providing a method to objectively and quantitatively analyze the time sequence of interictal epileptic activity. PMID:21096740

  8. Automatic Coding of Short Text Responses via Clustering in Educational Assessment

    ERIC Educational Resources Information Center

    Zehner, Fabian; Sälzer, Christine; Goldhammer, Frank

    2016-01-01

    Automatic coding of short text responses opens new doors in assessment. We implemented and integrated baseline methods of natural language processing and statistical modelling by means of software components that are available under open licenses. The accuracy of automatic text coding is demonstrated by using data collected in the "Programme…

  9. Text Classification Using the Sum of Frequency Ratios of Word andN-gram Over Categories

    NASA Astrophysics Data System (ADS)

    Suzuki, Makoto; Hirasawa, Shigeichi

    In this paper, we consider automatic text classification as a series of information processing steps and propose a new classification technique, namely the “Frequency Ratio Accumulation Method (FRAM)”. This is a simple technique that calculates the sum of ratios of term frequency in each category. However, it has the desirable property that feature terms can be used without a separate extraction procedure. We therefore use “character N-grams” and “word N-grams” as feature terms, exploiting this property of our classification technique. Next, we evaluate our technique in experiments in which we classify newspaper articles from the Japanese “CD-Mainichi 2002” and the English “Reuters-21578” collections using Naive Bayes (the baseline method) and the proposed method. The results show that the classification accuracy of the proposed method improves greatly over the baseline: it is 89.6% for Mainichi and 87.8% for Reuters. Thus, the proposed method has very high performance. Though the proposed method is a simple technique, it offers a new viewpoint, has high potential and is language-independent, so further development can be expected in the future.
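
    Because FRAM is fully specified by the abstract, a toy character-N-gram version is easy to sketch: accumulate, for every N-gram of the input, the ratio of its per-category frequency to its overall frequency, and pick the category with the largest sum. The training snippets below are illustrative.

```python
# Sketch: Frequency Ratio Accumulation Method (FRAM) with character N-grams.
from collections import Counter, defaultdict

def char_ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

train = {
    "economy": ["stocks fell as rates rose", "the bank cut interest rates"],
    "sport":   ["the striker scored twice", "the team won the league match"],
}

# Per-category and overall N-gram frequencies.
cat_freq = {c: Counter(g for doc in docs for g in char_ngrams(doc))
            for c, docs in train.items()}
total_freq = Counter()
for counts in cat_freq.values():
    total_freq.update(counts)

def classify(text):
    scores = defaultdict(float)
    for g in char_ngrams(text):
        if g in total_freq:
            for c, counts in cat_freq.items():
                scores[c] += counts[g] / total_freq[g]    # frequency ratio
    return max(scores, key=scores.get)

print(classify("interest rates rose again"))   # expected: economy
```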

  10. Automatic classification of spatial signatures on semiconductor wafermaps

    SciTech Connect

    Tobin, K.W.; Gleason, S.S.; Karnowski, T.P.; Cohen, S.L.; Lakhani, F.

    1997-03-01

    This paper describes Spatial Signature Analysis (SSA), a cooperative research project between SEMATECH and Oak Ridge National Laboratory for automatically analyzing and reducing semiconductor wafermap defect data to useful information. Trends toward larger wafer formats and smaller critical dimensions have caused an exponential increase in the volume of visual and parametric defect data which must be analyzed and stored, therefore necessitating the development of automated tools for wafer defect analysis. Contamination particles that did not create problems with 1 micron design rules can now be categorized as killer defects. SSA is an automated wafermap analysis procedure which performs a sophisticated defect clustering and signature classification of electronic wafermaps. This procedure has been realized in a software system that contains a signature classifier that is user-trainable. Known examples of historically problematic process signatures are added to a training database for the classifier. Once a suitable training set has been established, the software can automatically segment and classify multiple signatures from a standard electronic wafermap file into user-defined categories. It is anticipated that successful integration of this technology with other wafer monitoring strategies will result in reduced time-to-discovery and ultimately improved product yield.

  11. Automatic text extraction in news images using morphology

    NASA Astrophysics Data System (ADS)

    Jang, InYoung; Ko, ByoungChul; Byun, HyeRan; Choi, Yeongwoo

    2002-01-01

    In this paper we present a new method to extract both superimposed and embedded graphical texts in a freeze-frame of news video. The algorithm is summarized in the following three steps. For the first step, we convert a color image into a gray-level image and apply contrast stretching to enhance the contrast of the input image. Then, a modified local adaptive thresholding is applied to the contrast-stretched image. The second step is divided into three processes: eliminating text-like components by applying erosion, dilation, and (OpenClose + CloseOpen)/2 morphological operations, maintaining text components using (OpenClose + CloseOpen)/2 operation with a new Geo-correction method, and subtracting two result images for eliminating false-positive components further. In the third filtering step, the characteristics of each component such as the ratio of the number of pixels in each candidate component to the number of its boundary pixels and the ratio of the minor to the major axis of each bounding box are used. Acceptable results have been obtained using the proposed method on 300 news images with a recognition rate of 93.6%. Also, our method indicates a good performance on all the various kinds of images by adjusting the size of the structuring element.
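
    The sketch below loosely follows the pipeline outlined in the abstract (grayscale conversion, contrast stretching, local thresholding, open/close morphology, and a crude geometric filter on connected components); the synthetic frame, kernel size, and filter thresholds are assumptions, and the Geo-correction and image-subtraction steps are not reproduced.

        # Loose sketch of a morphology-based text candidate extractor.
        import cv2
        import numpy as np

        img = np.full((240, 320, 3), 255, np.uint8)               # placeholder news frame
        cv2.putText(img, "BREAKING NEWS", (20, 130), cv2.FONT_HERSHEY_SIMPLEX,
                    1.2, (0, 0, 0), 4)

        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        stretched = cv2.normalize(gray, None, 0, 255, cv2.NORM_MINMAX)    # contrast stretch
        binary = cv2.adaptiveThreshold(stretched, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                       cv2.THRESH_BINARY_INV, 31, 10)     # local threshold

        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
        open_close = cv2.morphologyEx(cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel),
                                      cv2.MORPH_CLOSE, kernel)
        close_open = cv2.morphologyEx(cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel),
                                      cv2.MORPH_OPEN, kernel)
        smoothed = ((open_close.astype(np.uint16) + close_open) // 2).astype(np.uint8)

        # Keep connected components whose geometry looks text-like.
        n, labels, stats, _ = cv2.connectedComponentsWithStats((smoothed > 0).astype(np.uint8))
        for i in range(1, n):
            x, y, w, h, area = stats[i]
            if 0.1 < w / max(h, 1) < 20 and area > 30:
                print("text candidate at", (x, y, w, h))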

  12. Automatic Refinement of Training Data for Classification of Satellite Imagery

    NASA Astrophysics Data System (ADS)

    Büschenfeld, T.; Ostermann, J.

    2012-07-01

    In this paper, we present a method for automatic refinement of training data. Many machine learning classifiers used in remote sensing applications rely on previously labelled training data. This labelling is often done by human operators under time constraints; hence, the selection of training data must be kept practical, which implies a certain inaccuracy and results in erroneously tagged regions enclosed within competing classes. To address this, we propose a method that removes outliers from training data using an iterative training-classification scheme. Outliers are detected by their newly determined class membership as well as through analysis of the uncertainty of classified samples. A sample selection method that incorporates the quality of neighbouring samples is presented and compared to alternative strategies. Because iterative approaches tend to propagate errors, which might lead to degenerating classes, a robust stopping criterion based on training data characteristics is also described. Our experiments using a support vector machine (SVM) show that outliers are reliably removed, allowing a more convenient sample selection. The classification result for unknown scenes of the corresponding validation set improves from 70.36% to 79.12% on average, and the average complexity of the SVM model is decreased by 82.75%, resulting in a similar reduction of processing time.
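
    A schematic sketch of such an iterative train-classify-filter loop is given below: a classifier is repeatedly trained, training samples that are confidently assigned to a competing class are dropped as outliers, and the loop stops when nothing more is removed. The SVM settings, confidence threshold, stopping rule, and toy data are simplified assumptions, not the published procedure.

        # Schematic iterative refinement of noisy training labels with an SVM.
        import numpy as np
        from sklearn.svm import SVC

        def refine_training_data(X, y, max_iter=5, conf_threshold=0.8):
            keep = np.ones(len(y), dtype=bool)
            clf = None
            for _ in range(max_iter):
                clf = SVC(probability=True, gamma="scale").fit(X[keep], y[keep])
                proba = clf.predict_proba(X[keep])
                pred = clf.classes_[proba.argmax(axis=1)]
                conf = proba.max(axis=1)
                outlier = (pred != y[keep]) & (conf > conf_threshold)   # competing-class samples
                if not outlier.any():
                    break                                               # simplified stopping criterion
                idx = np.where(keep)[0]
                keep[idx[outlier]] = False
            return keep, clf

        rng = np.random.default_rng(1)
        X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
        y = np.array([0] * 100 + [1] * 100)
        y[rng.choice(200, 10, replace=False)] ^= 1      # inject some label noise
        mask, model = refine_training_data(X, y)
        print(f"kept {mask.sum()} of {len(y)} samples")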

  13. Automatic defect classification using topography map from SEM photometric stereo

    NASA Astrophysics Data System (ADS)

    Serulnik, Sergio D.; Cohen, Jacob; Sherman, Boris; Ben-Porath, Ariel

    2004-04-01

    As the industry moves to smaller design rules, shrinking process windows and shorter product lifecycles, the need for enhanced yield management methodology is increasing. Defect classification is required for identification and isolation of yield loss sources. Practice demonstrates that operators rely heavily on 3D information while classifying defects. Therefore, Defect Topographic Map (DTM) information can enhance Automatic Defect Classification (ADC) capabilities dramatically. In the present article, we describe the manner in which reliable and rapid SEM measurements of defect topography characteristics increase the classifier's ability to achieve fast identification of the exact process step at which a given defect was introduced. Special multiple-perspective SEM imaging allows efficient application of photometric stereo methods. Physical properties of a defect can be derived from the 3D topography map using straightforward computer vision algorithms. We will show several examples, from both production fabs and R&D lines, of instances where the depth map is essential in correctly partitioning the defects, thus reducing time to source and overall fab expenses due to defect excursions.

  14. Automatic Classification of Specific Melanocytic Lesions Using Artificial Intelligence

    PubMed Central

    Jaworek-Korjakowska, Joanna; Kłeczek, Paweł

    2016-01-01

    Background. Given its propensity to metastasize, and lack of effective therapies for most patients with advanced disease, early detection of melanoma is a clinical imperative. Different computer-aided diagnosis (CAD) systems have been proposed to increase the specificity and sensitivity of melanoma detection. Although such computer programs are developed for different diagnostic algorithms, to the best of our knowledge, a system to classify different melanocytic lesions has not been proposed yet. Method. In this research we present a new approach to the classification of melanocytic lesions. This work is focused not only on categorization of skin lesions as benign or malignant but also on specifying the exact type of a skin lesion including melanoma, Clark nevus, Spitz/Reed nevus, and blue nevus. The proposed automatic algorithm contains the following steps: image enhancement, lesion segmentation, feature extraction, and selection as well as classification. Results. The algorithm has been tested on 300 dermoscopic images and achieved accuracy of 92% indicating that the proposed approach classified most of the melanocytic lesions correctly. Conclusions. A proposed system can not only help to precisely diagnose the type of the skin mole but also decrease the amount of biopsies and reduce the morbidity related to skin lesion excision. PMID:26885520

  15. Semantic Similarity for Automatic Classification of Chemical Compounds

    PubMed Central

    Ferreira, João D.; Couto, Francisco M.

    2010-01-01

    With the increasing amount of data made available in the chemical field, there is a strong need for systems capable of comparing and classifying chemical compounds in an efficient and effective way. The best approaches existing today are based on the structure-activity relationship premise, which states that biological activity of a molecule is strongly related to its structural or physicochemical properties. This work presents a novel approach to the automatic classification of chemical compounds by integrating semantic similarity with existing structural comparison methods. Our approach was assessed based on the Matthews Correlation Coefficient for the prediction, and achieved values of 0.810 when used as a prediction of blood-brain barrier permeability, 0.694 for P-glycoprotein substrate, and 0.673 for estrogen receptor binding activity. These results expose a significant improvement over the currently existing methods, whose best performances were 0.628, 0.591, and 0.647 respectively. It was demonstrated that the integration of semantic similarity is a feasible and effective way to improve existing chemical compound classification systems. Among other possible uses, this tool helps the study of the evolution of metabolic pathways, the study of the correlation of metabolic networks with properties of those networks, or the improvement of ontologies that represent chemical information. PMID:20885779

  16. Text mining for the Vaccine Adverse Event Reporting System: medical text classification using informative feature selection

    PubMed Central

    Nguyen, Michael D; Woo, Emily Jane; Markatou, Marianthi; Ball, Robert

    2011-01-01

    Objective The US Vaccine Adverse Event Reporting System (VAERS) collects spontaneous reports of adverse events following vaccination. Medical officers review the reports and often apply standardized case definitions, such as those developed by the Brighton Collaboration. Our objective was to demonstrate a multi-level text mining approach for automated text classification of VAERS reports that could potentially reduce human workload. Design We selected 6034 VAERS reports for H1N1 vaccine that were classified by medical officers as potentially positive (Npos=237) or negative for anaphylaxis. We created a categorized corpus of text files that included the class label and the symptom text field of each report. A validation set of 1100 labeled text files was also used. Text mining techniques were applied to extract three feature sets covering important keywords and low- and high-level patterns. A rule-based classifier processed the high-level feature representation, while several machine learning classifiers were trained for the remaining two feature representations. Measurements Classifiers' performance was evaluated by macro-averaging recall, precision, and F-measure, and Friedman's test; misclassification error rate analysis was also performed. Results The rule-based classifier, boosted trees, and weighted support vector machines performed well in terms of macro-recall, although at the expense of a higher mean misclassification error rate. The rule-based classifier performed very well in terms of average sensitivity and specificity (79.05% and 94.80%, respectively). Conclusion Our validated results showed the possibility of developing effective medical text classifiers for VAERS reports by combining text mining with informative feature selection; this strategy has the potential to reduce reviewer workload considerably. PMID:21709163

  17. Text Classification for Assisting Moderators in Online Health Communities

    PubMed Central

    Huh, Jina; Yetisgen-Yildiz, Meliha; Pratt, Wanda

    2013-01-01

    Objectives Patients increasingly visit online health communities to get help on managing health. The large scale of these online communities makes it impossible for the moderators to engage in all conversations; yet, some conversations need their expertise. Our work explores low-cost text classification methods for this new domain of determining whether a thread in an online health forum needs moderators’ help. Methods We employed a binary classifier on WebMD’s online diabetes community data. To train the classifier, we considered three feature types: (1) word unigram, (2) sentiment analysis features, and (3) thread length. We applied feature selection methods based on χ2 statistics and undersampling to account for unbalanced data. We then performed a qualitative error analysis to investigate the appropriateness of the gold standard. Results Using sentiment analysis features, feature selection methods, and balanced training data increased the AUC value up to 0.75 and the F1-score up to 0.54 compared to the baseline of using word unigrams with no feature selection methods on unbalanced data (0.65 AUC and 0.40 F1-score). The error analysis uncovered additional reasons why moderators respond to patients’ posts. Discussion We showed how feature selection methods and balanced training data can improve the overall classification performance. We present implications of weighing precision versus recall for assisting moderators of online health communities. Our error analysis uncovered social, legal, and ethical issues around addressing community members’ needs. We also note challenges in producing a gold standard, and discuss potential solutions for addressing these challenges. Conclusion Social media environments provide popular venues in which patients gain health-related information. Our work contributes to understanding scalable solutions for providing moderators’ expertise in these large-scale, social media environments. PMID:24025513
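
    The sketch below shows the flavour of such a pipeline: word unigrams, chi-square feature selection, and a simple undersampling of the majority class before training a binary classifier. The toy threads, labels, and parameter values are placeholders, and the sentiment and thread-length features described above are omitted.

        # Sketch: unigram features + chi-square selection + undersampling.
        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.feature_selection import SelectKBest, chi2
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        threads = ["my sugar reading is 400 what should i do",
                   "thanks everyone for the kind words",
                   "is it safe to mix these two insulins",
                   "congrats on the great lab results"]
        needs_moderator = np.array([1, 0, 1, 0])

        # Undersample the majority class so both classes are equally represented.
        pos, neg = np.where(needs_moderator == 1)[0], np.where(needs_moderator == 0)[0]
        n = min(len(pos), len(neg))
        rng = np.random.default_rng(0)
        idx = np.concatenate([rng.choice(pos, n, replace=False),
                              rng.choice(neg, n, replace=False)])

        clf = make_pipeline(CountVectorizer(),
                            SelectKBest(chi2, k=5),
                            LogisticRegression(max_iter=1000))
        clf.fit([threads[i] for i in idx], needs_moderator[idx])
        print(clf.predict(["should i stop taking my medication"]))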

  18. Automatic theory generation from analyst text files using coherence networks

    NASA Astrophysics Data System (ADS)

    Shaffer, Steven C.

    2014-05-01

    This paper describes a three-phase process of extracting knowledge from analyst textual reports. Phase 1 involves performing natural language processing on the source text to extract subject-predicate-object triples. In phase 2, these triples are then fed into a coherence network analysis process, using a genetic algorithm optimization. Finally, the highest-value subnetworks are processed into a semantic network graph for display. Initial work on a well-known data set (a Wikipedia article on Abraham Lincoln) has shown excellent results without any specific tuning. Next, we ran the process on the SYNthetic Counter-INsurgency (SYNCOIN) data set, developed at Penn State, yielding interesting and potentially useful results.

  19. Automatic removal of crossed-out handwritten text and the effect on writer verification and identification

    NASA Astrophysics Data System (ADS)

    Brink, Axel; van der Klauw, Harro; Schomaker, Lambert

    2008-01-01

    A method is presented for automatically identifying and removing crossed-out text in off-line handwriting. It classifies connected components by simply comparing two scalar features with thresholds. The performance is quantified based on manually labeled connected components of 250 pages of a forensic dataset. 47% of connected components consisting of crossed-out text can be removed automatically while 99% of the normal text components are preserved. The influence of automatically removing crossed-out text on writer verification and identification is also quantified. This influence is not significant.

  20. 77 FR 60475 - Draft of SWGDOC Standard Classification of Typewritten Text

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-10-03

    ... From the Federal Register Online via the Government Publishing Office DEPARTMENT OF JUSTICE Office of Justice Programs Draft of SWGDOC Standard Classification of Typewritten Text AGENCY: National... general public a draft document entitled, ``SWGDOC Standard Classification of Typewritten Text''....

  1. Deep transfer learning for automatic target classification: MWIR to LWIR

    NASA Astrophysics Data System (ADS)

    Ding, Zhengming; Nasrabadi, Nasser; Fu, Yun

    2016-05-01

    When dealing with sparse or no labeled data in the target domain, transfer learning achieves appealing performance by borrowing supervised knowledge from external domains. Recently, deep structure learning has been exploited in transfer learning because of its power in extracting effective knowledge through a multi-layer strategy, making deep transfer learning a promising way to address cross-domain mismatch. In general, cross-domain disparity can result from differences between the source and target distributions or from different modalities, e.g., Midwave IR (MWIR) and Longwave IR (LWIR). In this paper, we propose a Weighted Deep Transfer Learning framework for automatic target classification in a task-driven fashion. Specifically, deep features and classifier parameters are obtained simultaneously for optimal classification performance: the proposed deep structures extract more effective features under the guidance of classifier performance, and the classifier performance is in turn improved because it is optimized on more discriminative features. Furthermore, we build a weighted scheme to couple source and target outputs by assigning pseudo labels to target data, so that knowledge can be transferred from the source (i.e., MWIR) to the target (i.e., LWIR). Experimental results on real databases demonstrate the superiority of the proposed algorithm in comparison with others.

  2. Automatic segmentation and classification of multiple sclerosis in multichannel MRI.

    PubMed

    Akselrod-Ballin, Ayelet; Galun, Meirav; Gomori, John Moshe; Filippi, Massimo; Valsasina, Paola; Basri, Ronen; Brandt, Achi

    2009-10-01

    We introduce a multiscale approach that combines segmentation with classification to detect abnormal brain structures in medical imagery, and demonstrate its utility in automatically detecting multiple sclerosis (MS) lesions in 3-D multichannel magnetic resonance (MR) images. Our method uses segmentation to obtain a hierarchical decomposition of the multichannel, anisotropic MR scans. It then produces a rich set of features describing the segments in terms of intensity, shape, location, neighborhood relations, and anatomical context. These features are then fed into a decision forest classifier, trained with data labeled by experts, enabling the detection of lesions at all scales. Unlike common approaches that use voxel-by-voxel analysis, our system can utilize regional properties that are often important for characterizing abnormal brain structures. We provide experiments on two types of real MR images: a multichannel proton-density-, T2-, and T1-weighted dataset of 25 MS patients and a single-channel fluid attenuated inversion recovery (FLAIR) dataset of 16 MS patients. Comparing our results with lesion delineation by a human expert and with previously extensively validated results shows the promise of the approach. PMID:19758850

  3. Automatic classification for pathological prostate images based on fractal analysis.

    PubMed

    Huang, Po-Whei; Lee, Cheng-Hsiung

    2009-07-01

    Accurate grading for prostatic carcinoma in pathological images is important to prognosis and treatment planning. Since human grading is always time-consuming and subjective, this paper presents a computer-aided system to automatically grade pathological images according to the Gleason grading system, which is the most widespread method for histological grading of prostate tissues. We proposed two feature extraction methods based on fractal dimension to analyze variations of intensity and texture complexity in regions of interest. Each image can be classified into an appropriate grade by using Bayesian, k-NN, and support vector machine (SVM) classifiers, respectively. Leave-one-out and k-fold cross-validation procedures were used to estimate the correct classification rates (CCR). Experimental results show that 91.2%, 93.7%, and 93.7% CCR can be achieved by Bayesian, k-NN, and SVM classifiers, respectively, for a set of 205 pathological prostate images. If our fractal-based feature set is optimized by the sequential floating forward selection method, the CCR can be increased to 94.6%, 94.2%, and 94.6%, respectively, using each of the above three classifiers. Experimental results also show that our feature set is better than the feature sets extracted from multiwavelets, Gabor filters, and gray-level co-occurrence matrix methods because it has a much smaller size while retaining the most powerful discriminating capability in grading prostate images. PMID:19164082

  4. An automatic agricultural zone classification procedure for crop inventory satellite images

    NASA Technical Reports Server (NTRS)

    Parada, N. D. J. (Principal Investigator); Kux, H. J.; Velasco, F. R. D.; Deoliveira, M. O. B.

    1982-01-01

    A classification procedure for assessing crop areal proportion in multispectral scanner images is discussed. The procedure is divided into four parts: labeling; classification; proportion estimation; and evaluation. The procedure also has the following characteristics: multitemporal classification; the need for a minimum of field information; and verification capability between automatic classification and analyst labeling. The processing steps and the main algorithms involved are discussed. An outlook on the future of this technology is also presented.

  5. Automatic sleep classification according to Rechtschaffen and Kales.

    PubMed

    Anderer, Peter; Gruber, Georg; Parapatics, Silvia; Dorffner, Georg

    2007-01-01

    Conventionally, polysomnographic recordings are classified according to the rules published in 1968 by Rechtschaffen and Kales (R&K). The present paper describes an automatic classification system embedded in an e-health solution that has been developed and validated in a large database of healthy controls and sleep disturbed patients. The Somnolyzer 24x7 adheres to the decision rules for visual scoring as closely as possible and includes a structured quality control procedure by a human expert. The final system consists of a raw data quality check, a feature extraction algorithm (density and intensity of sleep/wake-related patterns such as sleep spindles, delta waves, slow and rapid eye movements), a feature matrix plausibility check, a classifier designed as an expert system and a rule-based smoothing procedure for the start and the end of stages REM and 2. The validation based on 286 recordings in both normal healthy subjects aged 20 to 95 years and patients suffering from organic or nonorganic sleep disorders demonstrated an overall epoch-by-epoch agreement of 80% (Cohen's Kappa: 0.72) between the Somnolyzer 24x7 and the human expert scoring, as compared with an inter-rater reliability of 77% (Cohen's Kappa: 0.68) between two human experts. Two Somnolyzer 24x7 analyses (including a structured quality control by two human experts) revealed an inter-rater reliability close to 1 (Cohen's Kappa: 0.991). Moreover, correlation analysis in R&K derived target variables revealed similar -- in 36 out of 38 variables even higher -- relationships between Somnolyzer 24x7 and expert evaluations as compared to the concordance between two human experts. Thus, the validation study proved the high reliability and validity of the Somnolyzer 24x7 both, on the epoch-by-epoch and on the target variable level. These results demonstrate the applicability of the Somnolyzer 24x7 evaluation in clinical routine and sleep studies. PMID:18002875

  6. Automatic Classification of Trees from Laser Scanning Point Clouds

    NASA Astrophysics Data System (ADS)

    Sirmacek, B.; Lindenbergh, R.

    2015-08-01

    Development of laser scanning technologies has promoted tree monitoring studies to a new level, as the laser scanning point clouds enable accurate 3D measurements in a fast and environmentally friendly manner. In this paper, we introduce a probability matrix computation based algorithm for automatically classifying laser scanning point clouds into 'tree' and 'non-tree' classes. Our method uses the 3D coordinates of the laser scanning points as input and generates a new point cloud which holds a label for each point indicating if it belongs to the 'tree' or 'non-tree' class. To do so, a grid surface is assigned to the lowest height level of the point cloud. The grids are filled with probability values which are calculated by checking the point density above the grid. Since the tree trunk locations appear with very high values in the probability matrix, selecting the local maxima of the grid surface helps to detect the tree trunks. Further points are assigned to tree trunks if they appear in close proximity to trunks. Since heavy mathematical computations (such as point cloud organization, detailed shape 3D detection methods, graph network generation) are not required, the proposed algorithm works very fast compared to the existing methods. The tree classification results are found to be reliable even on point clouds of cities containing many different objects. As the most significant weakness, false detection of light poles, traffic signs and other objects close to trees cannot be prevented. Nevertheless, the experimental results on mobile and airborne laser scanning point clouds indicate the possible usage of the algorithm as an important step for tree growth observation, tree counting and similar applications. While the laser scanning point cloud offers the opportunity to classify even very small trees, the accuracy of the results is reduced in low point density areas farther away from the scanning location. These advantages and disadvantages of two laser scanning point
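
    A compact sketch of the grid-based idea follows: points are binned into a ground-level grid, cell counts are normalised into a probability-like surface, and local maxima of that surface are taken as trunk candidates. The cell size, maximum-filter window, threshold, and synthetic point cloud are assumptions rather than the published algorithm.

        # Sketch: probability matrix over a ground grid, local maxima = trunk candidates.
        import numpy as np
        from scipy.ndimage import maximum_filter

        def trunk_candidates(points, cell=0.5):
            """points: (n, 3) array of x, y, z laser scanning coordinates."""
            x, y = points[:, 0], points[:, 1]
            xe = np.arange(x.min(), x.max() + cell, cell)
            ye = np.arange(y.min(), y.max() + cell, cell)
            density, _, _ = np.histogram2d(x, y, bins=[xe, ye])   # points above each ground cell
            prob = density / density.max()                        # probability matrix
            local_max = (prob == maximum_filter(prob, size=5)) & (prob > 0.3)
            ix, iy = np.nonzero(local_max)
            return np.c_[xe[ix] + cell / 2, ye[iy] + cell / 2]

        rng = np.random.default_rng(2)
        trunk = rng.normal([5.0, 5.0, 0.0], [0.1, 0.1, 3.0], size=(500, 3))   # dense "tree"
        ground = rng.uniform([0, 0, 0], [10, 10, 0.2], size=(500, 3))         # sparse ground
        print(trunk_candidates(np.vstack([trunk, ground])))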

  7. Using back error propagation networks for automatic document image classification

    NASA Astrophysics Data System (ADS)

    Hauser, Susan E.; Cookson, Timothy J.; Thoma, George R.

    1993-09-01

    The Lister Hill National Center for Biomedical Communications is a Research and Development Division of the National Library of Medicine. One of the Center's current research projects involves the conversion of entire journals to bitmapped binary page images. In an effort to reduce operator errors that sometimes occur during document capture, three back error propagation networks were designed to automatically identify journal title based on features in the binary image of the journal's front cover page. For all three network designs, twenty five journal titles were randomly selected from the stored database of image files. Seven cover page images from each title were selected as the training set. For each title, three other cover page images were selected as the test set. Each bitmapped image was initially processed by counting the total number of black pixels in 32-pixel wide rows and columns of the page image. For the first network, these counts were scaled to create 122-element count vectors as the input vectors to a back error propagation network. The network had one output node for each journal classification. Although the network was successful in correctly classifying the 25 journals, the large input vector resulted in a large network and, consequently, a long training period. In an alternative approach, the first thirty-five coefficients of the Fast Fourier Transform of the count vector were used as the input vector to a second network. A third approach was to train a separate network for each journal using the original count vectors as input and with only one output node. The output of the network could be 'yes' (it is this journal) or 'no' (it is not this journal). This final design promises to be most efficient for a system in which journal titles are added or removed as it does not require retraining a large network for each change.
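
    The count-vector features described above are easy to reproduce in outline: sum the black pixels over 32-pixel-wide bands of rows and columns, and optionally keep only the first Fourier coefficients of that vector. The page size and random placeholder image below are assumptions; the networks themselves are not shown.

        # Sketch of the cover-page features: banded black-pixel counts (+ optional FFT).
        import numpy as np

        def count_vector(binary_page, band=32):
            """binary_page: 2D array with 1 = black pixel, 0 = white."""
            h, w = binary_page.shape
            rows = [binary_page[i:i + band, :].sum() for i in range(0, h, band)]
            cols = [binary_page[:, j:j + band].sum() for j in range(0, w, band)]
            return np.array(rows + cols, dtype=float)

        def fft_features(counts, n_coeffs=35):
            """First FFT magnitude coefficients of the count vector (second design)."""
            return np.abs(np.fft.rfft(counts))[:n_coeffs]

        page = (np.random.default_rng(3).random((2048, 1856)) > 0.9).astype(np.uint8)
        v = count_vector(page)                 # 64 + 58 = 122 elements for this page size
        print(v.shape, fft_features(v).shape)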

  8. Automated ancillary cancer history classification for mesothelioma patients from free-text clinical reports

    PubMed Central

    Wilson, Richard A.; Chapman, Wendy W.; DeFries, Shawn J.; Becich, Michael J.; Chapman, Brian E.

    2010-01-01

    Background: Clinical records are often unstructured, free-text documents that create information extraction challenges and costs. Healthcare delivery and research organizations, such as the National Mesothelioma Virtual Bank, require the aggregation of both structured and unstructured data types. Natural language processing offers techniques for automatically extracting information from unstructured, free-text documents. Methods: Five hundred and eight history and physical reports from mesothelioma patients were split into development (208) and test sets (300). A reference standard was developed and each report was annotated by experts with regard to the patient’s personal history of ancillary cancer and family history of any cancer. The Hx application was developed to process reports, extract relevant features, perform reference resolution and classify them with regard to cancer history. Two methods, Dynamic-Window and ConText, for extracting information were evaluated. Hx’s classification responses using each of the two methods were measured against the reference standard. The average Cohen’s weighted kappa served as the human benchmark in evaluating the system. Results: Hx had a high overall accuracy, with each method, scoring 96.2%. F-measures using the Dynamic-Window and ConText methods were 91.8% and 91.6%, which were comparable to the human benchmark of 92.8%. For the personal history classification, Dynamic-Window scored highest with 89.2% and for the family history classification, ConText scored highest with 97.6%, in which both methods were comparable to the human benchmark of 88.3% and 97.2%, respectively. Conclusion: We evaluated an automated application’s performance in classifying a mesothelioma patient’s personal and family history of cancer from clinical reports. To do so, the Hx application must process reports, identify cancer concepts, distinguish the known mesothelioma from ancillary cancers, recognize negation, perform reference

  9. Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text

    PubMed Central

    2013-01-01

    Background Most of the institutional and research information in the biomedical domain is available in the form of English text. Even in countries where English is an official language, such as the United States, language can be a barrier for accessing biomedical information for non-native speakers. Recent progress in machine translation suggests that this technique could help make English texts accessible to speakers of other languages. However, the lack of adequate specialized corpora needed to train statistical models currently limits the quality of automatic translations in the biomedical domain. Results We show how a large-sized parallel corpus can automatically be obtained for the biomedical domain, using the MEDLINE database. The corpus generated in this work comprises article titles obtained from MEDLINE and abstract text automatically retrieved from journal websites, which substantially extends the corpora used in previous work. After assessing the quality of the corpus for two language pairs (English/French and English/Spanish) we use the Moses package to train a statistical machine translation model that outperforms previous models for automatic translation of biomedical text. Conclusions We have built translation data sets in the biomedical domain that can easily be extended to other languages available in MEDLINE. These sets can successfully be applied to train statistical machine translation models. While further progress should be made by incorporating out-of-domain corpora and domain-specific lexicons, we believe that this work improves the automatic translation of biomedical texts. PMID:23631733

  10. Automatic Classification of Book Material Represented by Back-of-the-Book Index.

    ERIC Educational Resources Information Center

    Enser, P. G. B.

    1985-01-01

    Investigates techniques for automatic classification of book material focusing on: computer-based surrogation of monographic material, book surrogate clustering on basis of content association, evaluation of resultant classifications. Test collection (250 books) is described with surrogation by means of back-of-the-book index, table of contents,…

  11. Gaussian process classification using automatic relevance determination for SAR target recognition

    NASA Astrophysics Data System (ADS)

    Zhang, Xiangrong; Gou, Limin; Hou, Biao; Jiao, Licheng

    2010-10-01

    In this paper, a Synthetic Aperture Radar Automatic Target Recognition approach based on Gaussian process (GP) classification is proposed. It adopts kernel principal component analysis to extract sample features and implements target recognition using GP classification with an automatic relevance determination (ARD) function. Compared with the k-Nearest Neighbor, Naïve Bayes and Support Vector Machine classifiers, GP with ARD has the advantage of automatic model selection and hyper-parameter optimization. Experiments on UCI datasets and the MSTAR database show that our algorithm is self-tuning and also achieves better recognition accuracy.
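
    In scikit-learn terms, the ARD behaviour corresponds to an anisotropic RBF kernel with one length scale per input dimension, all optimised during fitting; the sketch below shows this on a generic dataset, since the KPCA features and MSTAR imagery used in the paper are not reproduced here.

        # Sketch: GP classification with an ARD-style (anisotropic RBF) kernel.
        import numpy as np
        from sklearn.datasets import load_iris
        from sklearn.gaussian_process import GaussianProcessClassifier
        from sklearn.gaussian_process.kernels import RBF

        X, y = load_iris(return_X_y=True)
        X, y = X[y < 2], y[y < 2]                                  # binary problem for brevity
        ard_kernel = 1.0 * RBF(length_scale=np.ones(X.shape[1]))   # one length scale per feature
        gpc = GaussianProcessClassifier(kernel=ard_kernel, random_state=0).fit(X, y)

        print("training accuracy:", gpc.score(X, y))
        print("learned kernel:", gpc.kernel_)   # small length scales = relevant features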

  12. Automatic Method of Supernovae Classification by Modeling Human Procedure of Spectrum Analysis

    NASA Astrophysics Data System (ADS)

    Módolo, Marcelo; Rosa, Reinaldo; Guimaraes, Lamartine N. F.

    2016-07-01

    The classification of a recently discovered supernova must be done as quickly as possible in order to define what information will be captured and analyzed in the following days. This classification is not trivial and only a few expert astronomers are able to perform it. This paper proposes an automatic method that models the human procedure of classification. It uses Multilayer Perceptron Neural Networks to analyze the supernovae spectra. Experiments were performed using different pre-processing methods and multiple neural network configurations to identify the classic types of supernovae. Significant results were obtained, indicating the viability of using this method in places that have no specialist or that require an automatic analysis.

  13. Hierarchic Agglomerative Clustering Methods for Automatic Document Classification.

    ERIC Educational Resources Information Center

    Griffiths, Alan; And Others

    1984-01-01

    Considers classifications produced by application of single linkage, complete linkage, group average, and word clustering methods to Keen and Cranfield document test collections, and studies structure of hierarchies produced, extent to which methods distort input similarity matrices during classification generation, and retrieval effectiveness…

  14. Automatic Cataloguing and Searching for Retrospective Data by Use of OCR Text.

    ERIC Educational Resources Information Center

    Tseng, Yuen-Hsien

    2001-01-01

    Describes efforts in supporting information retrieval from OCR (optical character recognition) degraded text. Reports on approaches used in an automatic cataloging and searching contest for books in multiple languages, including a vector space retrieval model, an n-gram indexing method, and a weighting scheme; and discusses problems of Asian…

  15. A Feature Selection Method Based on Fisher's Discriminant Ratio for Text Sentiment Classification

    NASA Astrophysics Data System (ADS)

    Wang, Suge; Li, Deyu; Wei, Yingjie; Li, Hongxia

    With the rapid growth of e-commerce, product reviews on the Web have become an important information source for customers' decision making when they intend to buy a product. As there are often too many reviews for customers to go through, how to automatically classify them into different sentiment orientation categories (i.e., positive/negative) has become a research problem. In this paper, an effective feature selection method based on Fisher's discriminant ratio is proposed for product review text sentiment classification. To validate the proposed method, we compared it with methods based on information gain and mutual information, with a support vector machine adopted as the classifier. Six subexperiments were conducted by combining the different feature selection methods with two kinds of candidate feature sets. On 1006 car review documents, the experimental results indicate that Fisher's discriminant ratio based on word frequency estimation achieves the best performance, with an F value of 83.3%, when the candidate features are the words that appear in both positive and negative texts.
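
    For a single feature w with word-frequency values in the positive and negative classes, Fisher's discriminant ratio can be written as F(w) = (mean_pos(w) - mean_neg(w))^2 / (var_pos(w) + var_neg(w)); the sketch below ranks unigram features by this score. The toy reviews and the small smoothing constant are illustrative assumptions.

        # Sketch: rank word features by Fisher's discriminant ratio.
        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer

        docs = ["great car smooth ride", "love the comfortable seats",
                "terrible engine noise", "awful fuel consumption and noise"]
        labels = np.array([1, 1, 0, 0])            # 1 = positive, 0 = negative review

        vec = CountVectorizer()
        X = vec.fit_transform(docs).toarray().astype(float)

        pos, neg = X[labels == 1], X[labels == 0]
        eps = 1e-9                                  # avoid division by zero
        fisher = (pos.mean(0) - neg.mean(0)) ** 2 / (pos.var(0) + neg.var(0) + eps)

        top = np.argsort(fisher)[::-1][:5]
        print([vec.get_feature_names_out()[i] for i in top])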

  16. A case-comparison study of automatic document classification utilizing both serial and parallel approaches

    NASA Astrophysics Data System (ADS)

    Wilges, B.; Bastos, R. C.; Mateus, G. P.; Dantas, M. A. R.

    2014-10-01

    A well-known problem faced by any organization nowadays is the high volume of data that is available and the process required to transform this volume into differential information. In this study, a case-comparison of automatic document classification (ADC) approaches is presented, utilizing both serial and parallel paradigms. The serial approach was implemented with the RapidMiner software tool, which is recognized as the world-leading open-source system for data mining. For the parallel approach, the Hadoop software environment, based on the MapReduce programming model, was used. The main goal of this case-comparison study is to explore the differences between these two paradigms, especially when large volumes of data such as Web text documents are used to build a category database. In the literature, many studies point out that distributed processing of unstructured documents with Hadoop has been yielding efficient results. Results from our research indicate a threshold for such efficiency.

  17. Drug related webpages classification using images and text information based on multi-kernel learning

    NASA Astrophysics Data System (ADS)

    Hu, Ruiguang; Xiao, Liping; Zheng, Wenjuan

    2015-12-01

    In this paper, multi-kernel learning (MKL) is used for drug-related webpage classification. First, body text and image-label text are extracted through HTML parsing, and valid images are chosen by the FOCARSS algorithm. Second, a text-based BOW model is used to generate the text representation, and an image-based BOW model is used to generate the image representation. Last, the text and image representations are fused using several methods. Experimental results demonstrate that the classification accuracy of MKL is higher than that of all other fusion methods at the decision level and feature level, and much higher than the accuracy of single-modal classification.

  18. Automatic counting and classification of bacterial colonies using hyperspectral imaging

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Detection and counting of bacterial colonies on agar plates is a routine microbiology practice to get a rough estimate of the number of viable cells in a sample. There have been a variety of different automatic colony counting systems and software algorithms mainly based on color or gray-scale pictu...

  19. Automatic Classification of E-Mail Messages by Message Type.

    ERIC Educational Resources Information Center

    May, Andrew D.

    1997-01-01

    Describes a system that automatically classifies e-mail messages in the HUMANIST electronic discussion group into one of four classes: question, response, announcement, or administrative. The system's ability to accurately classify a message was compared against manually assigned codes. Results suggested that there was a statistical agreement in…

  20. A Simple Semi-Automatic Approach for Land Cover Classification from Multispectral Remote Sensing Imagery

    PubMed Central

    Jiang, Dong; Huang, Yaohuan; Zhuang, Dafang; Zhu, Yunqiang; Xu, Xinliang; Ren, Hongyan

    2012-01-01

    Land cover data represent a fundamental data source for various types of scientific research. The classification of land cover based on satellite data is a challenging task, and an efficient classification method is needed. In this study, an automatic scheme is proposed for the classification of land use using multispectral remote sensing images based on change detection and a semi-supervised classifier. The satellite image can be automatically classified using only the prior land cover map and existing images; therefore human involvement is reduced to a minimum, ensuring the operability of the method. The method was tested in the Qingpu District of Shanghai, China. Using Environment Satellite 1 (HJ-1) images of 2009 with 30 m spatial resolution, the areas were classified into five main types of land cover based on previous land cover data and spectral features. The results agreed well with validation land cover maps, with a Kappa value of 0.79 and statistical area biases of less than 6%. This study proposed a simple semi-automatic approach for land cover classification that uses prior maps with satisfactory accuracy, integrating the accuracy of visual interpretation with the performance of automatic classification methods. The method can conveniently be used for land cover mapping in areas lacking ground reference information or for identifying regions of rapid land cover variation (such as rapid urbanization). PMID:23049886

  2. Automatic parquet block sorting using real-time spectral classification

    NASA Astrophysics Data System (ADS)

    Astrom, Anders; Astrand, Erik; Johansson, Magnus

    1999-03-01

    This paper presents a real-time spectral classification system based on the PGP spectrograph and a smart image sensor. The PGP is a spectrograph which extracts the spectral information from a scene and projects the information on an image sensor, which is a method often referred to as Imaging Spectroscopy. The classification is based on linear models and categorizes a number of pixels along a line. Previous systems adopting this method have used standard sensors, which often resulted in poor performance. The new system, however, is based on a patented near-sensor classification method, which exploits analogue features on the smart image sensor. The method reduces the enormous amount of data to be processed at an early stage, thus making true real-time spectral classification possible. The system has been evaluated on hardwood parquet boards showing very good results. The color defects considered in the experiments were blue stain, white sapwood, yellow decay and red decay. In addition to these four defect classes, a reference class was used to indicate correct surface color. The system calculates a statistical measure for each parquet block, giving the pixel defect percentage. The patented method makes it possible to run at very high speeds with a high spectral discrimination ability. Using a powerful illuminator, the system can run with a line frequency exceeding 2000 line/s. This opens up the possibility to maintain high production speed and still measure with good resolution.

  3. The Fractal Patterns of Words in a Text: A Method for Automatic Keyword Extraction

    PubMed Central

    Najafi, Elham; Darooneh, Amir H.

    2015-01-01

    A text can be considered as a one dimensional array of words. The locations of each word type in this array form a fractal pattern with certain fractal dimension. We observe that important words responsible for conveying the meaning of a text have dimensions considerably different from one, while the fractal dimensions of unimportant words are close to one. We introduce an index quantifying the importance of the words in a given text using their fractal dimensions and then ranking them according to their importance. This index measures the difference between the fractal pattern of a word in the original text relative to a shuffled version. Because the shuffled text is meaningless (i.e., words have no importance), the difference between the original and shuffled text can be used to ascertain degree of fractality. The degree of fractality may be used for automatic keyword detection. Words with the degree of fractality higher than a threshold value are assumed to be the retrieved keywords of the text. We measure the efficiency of our method for keywords extraction, making a comparison between our proposed method and two other well-known methods of automatic keyword extraction. PMID:26091207
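
    A much-simplified sketch of the idea is given below: estimate a box-counting dimension for the positions of each word type, repeat the estimate on a shuffled copy of the text, and treat the difference as a degree of fractality for ranking keywords. The estimator, the occurrence cutoff, and the toy text are assumptions; the paper's exact dimension estimate may differ.

        # Simplified box-counting "degree of fractality" for keyword ranking.
        import random
        import numpy as np

        def box_counting_dimension(positions, n_words):
            sizes = [2 ** k for k in range(1, int(np.log2(n_words)))]
            counts = [len({p // s for p in positions}) for s in sizes]   # occupied boxes
            slope, _ = np.polyfit(np.log(1.0 / np.array(sizes)), np.log(counts), 1)
            return slope

        def fractality_ranking(words):
            shuffled = words[:]
            random.shuffle(shuffled)
            scores = {}
            for w in set(words):
                orig = [i for i, t in enumerate(words) if t == w]
                shuf = [i for i, t in enumerate(shuffled) if t == w]
                if len(orig) >= 5:                  # need several occurrences
                    scores[w] = abs(box_counting_dimension(orig, len(words))
                                    - box_counting_dimension(shuf, len(words)))
            return sorted(scores, key=scores.get, reverse=True)

        text = ("the cat sat on the mat and the cat chased the mouse while "
                "the dog slept near the mat and the mouse hid under the mat ") * 20
        print(fractality_ranking(text.split())[:5])   # candidate keywords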

  4. Text Categorization Based on K-Nearest Neighbor Approach for Web Site Classification.

    ERIC Educational Resources Information Center

    Kwon, Oh-Woog; Lee, Jong-Hyeok

    2003-01-01

    Discusses text categorization and Web site classification and proposes a three-step classification system that includes the use of Web pages linked with the home page. Highlights include the k-nearest neighbor (k-NN) approach; improving performance with a feature selection method and a term weighting scheme using HTML tags; and similarity…

  5. [Automatic classification method of star spectrum data based on classification pattern tree].

    PubMed

    Zhao, Xu-Jun; Cai, Jiang-Hui; Zhang, Ji-Fu; Yang, Hai-Feng; Ma, Yang

    2013-10-01

    Frequent patterns, which appear frequently in a data set, play an important role in data mining. For stellar spectrum classification tasks, a classification rule mining method based on a classification pattern tree is presented on the basis of frequent patterns. The procedure is as follows. First, a new tree structure, the classification pattern tree, is introduced based on the different frequencies of stellar spectral attributes in the database and their different importance for classification; the related concepts and the construction method of the classification pattern tree are also described. Then, the characteristics of the stellar spectrum are mapped to the classification pattern tree, and top-down and bottom-up traversals are used to extract the classification rules. Meanwhile, the concept of pattern capability is introduced to adjust the number of classification rules and improve the construction efficiency of the classification pattern tree. Finally, SDSS (Sloan Digital Sky Survey) stellar spectral data provided by the National Astronomical Observatory are used to verify the accuracy of the method. The results show that a higher classification accuracy is obtained. PMID:24409754

  6. Realizing parameterless automatic classification of remote sensing imagery using ontology engineering and cyberinfrastructure techniques

    NASA Astrophysics Data System (ADS)

    Sun, Ziheng; Fang, Hui; Di, Liping; Yue, Peng

    2016-09-01

    Realizing fully automatic image classification without inputting any parameter values has long been an elusive dream for remote sensing experts. Experts usually spend hours and hours tuning the input parameters of classification algorithms in order to obtain the best results. With the rapid development of knowledge engineering and cyberinfrastructure, many data processing and knowledge reasoning capabilities have become online accessible, shareable and interoperable. Based on these recent improvements, this paper presents an idea of parameterless automatic classification which only requires an image and automatically outputs a labeled vector. No parameters or operations are needed from endpoint consumers. An approach is proposed to realize the idea. It adopts an ontology database to store the experiences of tuning values for classifiers. A sample database is used to record training samples of image segments. Geoprocessing Web services are used as functionality blocks to perform the basic classification steps. Workflow technology is involved to turn the overall image classification into a fully automatic process. A Web-based prototypical system named PACS (Parameterless Automatic Classification System) is implemented. A number of images are fed into the system for evaluation purposes. The results show that the approach can automatically classify remote sensing images with fairly good average accuracy. The classified results will be more accurate if the two databases are of higher quality; once the databases accumulate as much experience and as many samples as an expert has, the approach should be able to produce results of similar quality to those a human expert can obtain. Since the approach is fully automatic and parameterless, it can not only relieve remote sensing workers from the heavy and time-consuming parameter tuning work, but also significantly shorten the waiting time for consumers and facilitate them to engage in image

  7. An automatic system to detect and extract texts in medical images for de-identification

    NASA Astrophysics Data System (ADS)

    Zhu, Yingxuan; Singh, P. D.; Siddiqui, Khan; Gillam, Michael

    2010-03-01

    Recently, there is an increasing need to share medical images for research purposes. In order to respect and preserve patient privacy, most medical images are de-identified, with protected health information (PHI) removed, before research sharing. Since manual de-identification is time-consuming and tedious, an automatic de-identification system is necessary and helpful for removing text from medical images. Many papers have been written about text detection and extraction algorithms; however, little of this work has been applied to the de-identification of medical images. Since the de-identification system is designed for end users, it should be effective, accurate and fast. This paper proposes an automatic system to detect and extract text from medical images for de-identification purposes, while keeping the anatomic structures intact. First, considering that text has strong contrast with the background, a region-variance-based algorithm is used to detect the text regions. In post-processing, geometric constraints are applied to the detected text regions to eliminate over-segmentation, e.g., lines and anatomic structures. After that, a region-based level set method is used to extract text from the detected text regions. A GUI for the prototype application of the text detection and extraction system has been implemented, showing that our method can detect most of the text in the images. Experimental results validate that our method can detect and extract text in medical images with a 99% recall rate. Future research on this system includes algorithm improvement, performance evaluation, and computation optimization.

  8. Automatic body flexibility classification using laser doppler flowmeter

    NASA Astrophysics Data System (ADS)

    Lien, I.-Chan; Li, Yung-Hui; Bau, Jian-Guo

    2015-10-01

    Body flexibility is an important indicator of whether an individual is healthy. Traditionally, a protractor must be prepared and the subject needs to perform a pre-defined set of actions; the measurement takes place while the subject performs the required action, which is clumsy and inconvenient. In this paper, we propose a statistical learning model using the random forest technique. The proposed system can classify body flexibility based on LDF signals analyzed in the frequency domain. The reasons for using random forests are their efficiency (fast classification), interpretable structure and ability to filter out irrelevant features. In addition, random forests help prevent over-fitting, making the output model more robust to noise. In our experiment, we use the chirp Z-transform (CZT) to transform an LDF signal into its energy values in five frequency bands. Combining the power of the random forest algorithm and frequency band analysis methods, a maximum recognition rate of 66% is achieved. Compared to the traditional flexibility measuring process, the proposed system shortens the long and tedious measurement stages to a simple, fast and pre-defined activity set. The major contributions of our work include (1) a novel body flexibility classification scheme using a non-invasive biomedical sensor; (2) a designed protocol that is easy to conduct and practice; (3) a high-precision classification scheme which combines the power of spectrum analysis and machine learning algorithms.
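
    The feature pipeline can be sketched as band energies of the LDF signal fed to a random forest; for simplicity the sketch below computes the energies from an FFT rather than the chirp Z-transform used in the study, and the sampling rate, band edges, labels, and synthetic signals are all illustrative assumptions.

        # Sketch: five-band spectral energies of an LDF signal + random forest.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        FS = 40.0                                             # assumed sampling rate (Hz)
        BANDS = [(0.01, 0.05), (0.05, 0.15), (0.15, 0.4), (0.4, 1.5), (1.5, 5.0)]

        def band_energies(signal, fs=FS):
            freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
            power = np.abs(np.fft.rfft(signal)) ** 2
            return [power[(freqs >= lo) & (freqs < hi)].sum() for lo, hi in BANDS]

        rng = np.random.default_rng(4)
        signals = rng.normal(size=(60, 2400))                 # placeholder LDF recordings
        labels = rng.integers(0, 2, size=60)                  # 0 = low, 1 = high flexibility

        X = np.array([band_energies(s) for s in signals])
        clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)
        print("training accuracy:", clf.score(X, labels))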

  9. Iterative Strategies for Aftershock Classification in Automatic Seismic Processing Pipelines

    NASA Astrophysics Data System (ADS)

    Gibbons, Steven J.; Kværna, Tormod; Harris, David B.; Dodge, Douglas A.

    2016-04-01

    Aftershock sequences following very large earthquakes present enormous challenges to near-realtime generation of seismic bulletins. The increase in analyst resources needed to relocate an inflated number of events is compounded by failures of phase association algorithms and a significant deterioration in the quality of underlying fully automatic event bulletins. Current processing pipelines were designed a generation ago and, due to computational limitations of the time, are usually limited to single passes over the raw data. With current processing capability, multiple passes over the data are feasible. Processing the raw data at each station currently generates parametric data streams which are then scanned by a phase association algorithm to form event hypotheses. We consider the scenario where a large earthquake has occurred and propose to define a region of likely aftershock activity in which events are detected and accurately located using a separate specially targeted semi-automatic process. This effort may focus on so-called pattern detectors, but here we demonstrate a more general grid search algorithm which may cover wider source regions without requiring waveform similarity. Given many well-located aftershocks within our source region, we may remove all associated phases from the original detection lists prior to a new iteration of the phase association algorithm. We provide a proof-of-concept example for the 2015 Gorkha sequence, Nepal, recorded on seismic arrays of the International Monitoring System. Even with very conservative conditions for defining event hypotheses within the aftershock source region, we can automatically remove over half of the original detections which could have been generated by Nepal earthquakes and reduce the likelihood of false associations and spurious event hypotheses. Further reductions in the number of detections in the parametric data streams are likely using correlation and subspace detectors and/or empirical matched

  10. An examination of the potential applications of automatic classification techniques to Georgia management problems

    NASA Technical Reports Server (NTRS)

    Rado, B. Q.

    1975-01-01

    Automatic classification techniques are described in relation to future information and natural resource planning systems, with emphasis on application to Georgia resource management problems. The concept, design, and purpose of Georgia's statewide Resource Assessment Program are reviewed, along with participation in a workshop at the Earth Resources Laboratory. Potential areas of application discussed include: agriculture, forestry, water resources, environmental planning, and geology.

  11. Using a MaxEnt Classifier for the Automatic Content Scoring of Free-Text Responses

    NASA Astrophysics Data System (ADS)

    Sukkarieh, Jana Z.

    2011-03-01

    Criticisms against multiple-choice item assessments in the USA have prompted researchers and organizations to move towards constructed-response (free-text) items. Constructed-response (CR) items pose many challenges to the education community—one of which is that they are expensive to score by humans. At the same time, there has been widespread movement towards computer-based assessment and hence, assessment organizations are competing to develop automatic content scoring engines for such items types—which we view as a textual entailment task. This paper describes how MaxEnt Modeling is used to help solve the task. MaxEnt has been used in many natural language tasks but this is the first application of the MaxEnt approach to textual entailment and automatic content scoring.
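
    Since a MaxEnt classifier over discrete features is equivalent to multinomial logistic regression, a toy version of the content-scoring setup can be sketched as below: simple lexical-overlap features between a response and a reference answer, fed to a logistic regression scorer. The reference answer, features, responses, and score labels are placeholders, not the engine described above.

        # Sketch: MaxEnt-style (logistic regression) scoring of short responses.
        from sklearn.linear_model import LogisticRegression

        reference = "plants use sunlight water and carbon dioxide to make glucose"

        def overlap_features(response):
            ref, resp = set(reference.split()), set(response.split())
            inter = ref & resp
            return [len(inter), len(inter) / len(ref), len(resp - ref)]

        responses = ["plants make glucose from sunlight water and carbon dioxide",
                     "plants need sunlight to grow",
                     "the moon orbits the earth",
                     "photosynthesis uses sunlight water and carbon dioxide"]
        scores = [2, 1, 0, 2]                     # rubric points from human raters

        X = [overlap_features(r) for r in responses]
        clf = LogisticRegression(max_iter=1000).fit(X, scores)
        print(clf.predict([overlap_features("plants turn sunlight and water into glucose")]))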

  13. Interpretable Probabilistic Latent Variable Models for Automatic Annotation of Clinical Text

    PubMed Central

    Kotov, Alexander; Hasan, Mehedi; Carcone, April; Dong, Ming; Naar-King, Sylvie; BroganHartlieb, Kathryn

    2015-01-01

    We propose Latent Class Allocation (LCA) and Discriminative Labeled Latent Dirichlet Allocation (DL-LDA), two novel interpretable probabilistic latent variable models for automatic annotation of clinical text. Both models separate the terms that are highly characteristic of textual fragments annotated with a given set of labels from other non-discriminative terms, but rely on generative processes with different structure of latent variables. LCA directly learns class-specific multinomials, while DL-LDA breaks them down into topics (clusters of semantically related words). Extensive experimental evaluation indicates that the proposed models outperform Naïve Bayes, a standard probabilistic classifier, and Labeled LDA, a state-of-the-art topic model for labeled corpora, on the task of automatic annotation of transcripts of motivational interviews, while the output of the proposed models can be easily interpreted by clinical practitioners. PMID:26958214

  14. Automatic classification and speaker identification of African elephant (Loxodonta africana) vocalizations

    NASA Astrophysics Data System (ADS)

    Clemins, Patrick J.; Johnson, Michael T.; Leong, Kirsten M.; Savage, Anne

    2005-02-01

    A hidden Markov model (HMM) system is presented for automatically classifying African elephant vocalizations. The development of the system is motivated by successful models from human speech analysis and recognition. Classification features include frequency-shifted Mel-frequency cepstral coefficients (MFCCs) and log energy, spectrally motivated features which are commonly used in human speech processing. Experiments, including vocalization type classification and speaker identification, are performed on vocalizations collected from captive elephants in a naturalistic environment. The system classified vocalizations with accuracies of 94.3% and 82.5% for type classification and speaker identification classification experiments, respectively. Classification accuracy, statistical significance tests on the model parameters, and qualitative analysis support the effectiveness and robustness of this approach for vocalization analysis in nonhuman species. .

  15. Automatic classification of Polish fricatives on the basis of optimization of the parameter space

    NASA Astrophysics Data System (ADS)

    Domagala, Piotr; Richter, Lutoslawa

    1994-08-01

    The subject of this article is a study of the possibility of automatic classification and recognition of fricatives on the basis of a certain linear combination of values of an autocorrelation function. Automatic classification of fricative consonants constituted one phase of a study which involves the classification of all Polish phones for the purpose of enabling automatic speech recognition regardless of the phonetic context or the speaker. The study was conducted using phonetic material consisting of 166 nonsense syllables which included fricatives and had CVCV and VCCV structures. There were a total of about 20 contexts for each consonant. Separate classifications were performed for 5 female voices and 5 male voices and then both groups of voices were classified together. The three series of classifications had success rates of 60%, 69%, and 60% respectively. These results were about 10% better than the results obtained using classical discrimination analysis (CSS:Statistica 3.1 software). The use of cluster analysis and multidimensional scaling yielded information on the relative probabilities of the acoustic patterns of these phones in reference to perception tests.

  16. Automatic Classification of Subdwarf Spectra using a Neural Network

    NASA Astrophysics Data System (ADS)

    Winter, C.; Jeffery, C. S.; Drilling, J. S.

    2004-06-01

    We apply a multilayer feed-forward back propagation artificial neural network to a sample of 380 subdwarf spectra classified by Drilling et al. (Drilling, J.S., Moehler, S., Jeffery, C.S., Heber, U., and Napiwotzki, R.: in press in: R. Gray (ed.), Probing the Personalities of Stars and Galaxies), showing that it is possible to use this technique on large sets of spectra and obtain classifications in good agreement with the standard. We briefly investigate the impact of training set size, showing that large training sets do not necessarily perform significantly better than small sets.

  17. A Classification Method of Inquiry E-mails for Describing FAQ with Automatic Setting Mechanism of Judgment Thresholds

    NASA Astrophysics Data System (ADS)

    Tsuda, Yuki; Akiyoshi, Masanori; Samejima, Masaki; Oka, Hironori

    In this paper the authors propose a classification method of inquiry e-mails for describing FAQ (Frequently Asked Questions) and automatic setting mechanism of judgment thresholds. In this method, a dictionary used for classification of inquiries is generated and updated automatically by statistical information of characteristic words in clusters, and inquiries are classified correctly to each proper cluster by using the dictionary. Threshold values are automatically set by using statistical information.

  18. Automatic classification of DMSA scans using an artificial neural network

    NASA Astrophysics Data System (ADS)

    Wright, J. W.; Duguid, R.; Mckiddie, F.; Staff, R. T.

    2014-04-01

    DMSA imaging is carried out in nuclear medicine to assess the level of functional renal tissue in patients. This study investigated the use of an artificial neural network to perform diagnostic classification of these scans. Using the radiological report as the gold standard, the network was trained to classify DMSA scans as positive or negative for defects using a representative sample of 257 previously reported images. The trained network was then independently tested using a further 193 scans and achieved a binary classification accuracy of 95.9%. The performance of the network was compared with three qualified expert observers who were asked to grade each scan in the 193 image testing set on a six point defect scale, from ‘definitely normal’ to ‘definitely abnormal’. A receiver operating characteristic analysis comparison between a consensus operator, generated from the scores of the three expert observers, and the network revealed a statistically significant increase (α < 0.05) in performance between the network and operators. A further result from this work was that when suitably optimized, a negative predictive value of 100% for renal defects was achieved by the network, while still managing to identify 93% of the negative cases in the dataset. These results are encouraging for application of such a network as a screening tool or quality assurance assistant in clinical practice.

  19. Automatic Fault Characterization via Abnormality-Enhanced Classification

    SciTech Connect

    Bronevetsky, G; Laguna, I; de Supinski, B R

    2010-12-20

    Enterprise and high-performance computing systems are growing extremely large and complex, employing hundreds to hundreds of thousands of processors and software/hardware stacks built by many people across many organizations. As the growing scale of these machines increases the frequency of faults, system complexity makes these faults difficult to detect and to diagnose. Current system management techniques, which focus primarily on efficient data access and query mechanisms, require system administrators to examine the behavior of various system services manually. Growing system complexity is making this manual process unmanageable: administrators require more effective management tools that can detect faults and help to identify their root causes. System administrators need timely notification when a fault is manifested that includes the type of fault, the time period in which it occurred and the processor on which it originated. Statistical modeling approaches can accurately characterize system behavior. However, the complex effects of system faults make these tools difficult to apply effectively. This paper investigates the application of classification and clustering algorithms to fault detection and characterization. We show experimentally that naively applying these methods achieves poor accuracy. Further, we design novel techniques that combine classification algorithms with information on the abnormality of application behavior to improve detection and characterization accuracy. Our experiments demonstrate that these techniques can detect and characterize faults with 65% accuracy, compared to just 5% accuracy for naive approaches.

  20. Text Mining and Natural Language Processing Approaches for Automatic Categorization of Lay Requests to Web-Based Expert Forums

    PubMed Central

    Reincke, Ulrich; Michelmann, Hans Wilhelm

    2009-01-01

    Background Both healthy and sick people increasingly use electronic media to obtain medical information and advice. For example, Internet users may send requests to Web-based expert forums, or so-called “ask the doctor” services. Objective To automatically classify lay requests to an Internet medical expert forum using a combination of different text-mining strategies. Methods We first manually classified a sample of 988 requests directed to a involuntary childlessness forum on the German website “Rund ums Baby” (“Everything about Babies”) into one or more of 38 categories belonging to two dimensions (“subject matter” and “expectations”). After creating start and synonym lists, we calculated the average Cramer’s V statistic for the association of each word with each category. We also used principle component analysis and singular value decomposition as further text-mining strategies. With these measures we trained regression models and determined, on the basis of best regression models, for any request the probability of belonging to each of the 38 different categories, with a cutoff of 50%. Recall and precision of a test sample were calculated as a measure of quality for the automatic classification. Results According to the manual classification of 988 documents, 102 (10%) documents fell into the category “in vitro fertilization (IVF),” 81 (8%) into the category “ovulation,” 79 (8%) into “cycle,” and 57 (6%) into “semen analysis.” These were the four most frequent categories in the subject matter dimension (consisting of 32 categories). The expectation dimension comprised six categories; we classified 533 documents (54%) as “general information” and 351 (36%) as a wish for “treatment recommendations.” The generation of indicator variables based on the chi-square analysis and Cramer’s V proved to be the best approach for automatic classification in about half of the categories. In combination with the two other

  1. Assessing the impact of graphical quality on automatic text recognition in digital maps

    NASA Astrophysics Data System (ADS)

    Chiang, Yao-Yi; Leyk, Stefan; Honarvar Nazari, Narges; Moghaddam, Sima; Tan, Tian Xiang

    2016-08-01

    Converting geographic features (e.g., place names) in map images into a vector format is the first step for incorporating cartographic information into a geographic information system (GIS). With the advancement in computational power and algorithm design, map processing systems have been considerably improved over the last decade. However, the fundamental map processing techniques such as color image segmentation, (map) layer separation, and object recognition are sensitive to minor variations in graphical properties of the input image (e.g., scanning resolution). As a result, most map processing results would not meet user expectations if the user does not "properly" scan the map of interest, pre-process the map image (e.g., using compression or not), and train the processing system, accordingly. These issues could slow down the further advancement of map processing techniques as such unsuccessful attempts create a discouraged user community, and less sophisticated tools would be perceived as more viable solutions. Thus, it is important to understand what kinds of maps are suitable for automatic map processing and what types of results and process-related errors can be expected. In this paper, we shed light on these questions by using a typical map processing task, text recognition, to discuss a number of map instances that vary in suitability for automatic processing. We also present an extensive experiment on a diverse set of scanned historical maps to provide measures of baseline performance of a standard text recognition tool under varying map conditions (graphical quality) and text representations (that can vary even within the same map sheet). Our experimental results help the user understand what to expect when a fully or semi-automatic map processing system is used to process a scanned map with certain (varying) graphical properties and complexities in map content.

  2. Ipsilateral coordination features for automatic classification of Parkinson's disease

    NASA Astrophysics Data System (ADS)

    Sarmiento, Fernanda; Atehortúa, Angélica; Martínez, Fabio; Romero, Eduardo

    2015-12-01

    A reliable diagnosis of the Parkinson Disease lies on the objective evaluation of different motor sub-systems. Discovering specific motor patterns associated to the disease is fundamental for the development of unbiased assessments that facilitate the disease characterization, independently of the particular examiner. This paper proposes a new objective screening of patients with Parkinson, an approach that optimally combines ipsilateral global descriptors. These ipsilateral gait features are simple upper-lower limb relationships in frequency and relative phase spaces. These low level characteristics feed a simple SVM classifier with a polynomial kernel function. The strategy was assessed in a binary classification task, normal against Parkinson, under a leave-one-out scheme in a population of 16 Parkinson patients and 7 healthy control subjects. Results showed an accuracy of 94;6% using relative phase spaces and 82;1% with simple frequency relations.

  3. Automatic Entity Recognition and Typing from Massive Text Corpora: A Phrase and Network Mining Approach

    PubMed Central

    Ren, Xiang; El-Kishky, Ahmed; Wang, Chi; Han, Jiawei

    2015-01-01

    In today’s computerized and information-based society, we are soaked with vast amounts of text data, ranging from news articles, scientific publications, product reviews, to a wide range of textual information from social media. To unlock the value of these unstructured text data from various domains, it is of great importance to gain an understanding of entities and their relationships. In this tutorial, we introduce data-driven methods to recognize typed entities of interest in massive, domain-specific text corpora. These methods can automatically identify token spans as entity mentions in documents and label their types (e.g., people, product, food) in a scalable way. We demonstrate on real datasets including news articles and tweets how these typed entities aid in knowledge discovery and management. PMID:26705508

  4. Automatic Classification of Cellular Expression by Nonlinear Stochastic Embedding (ACCENSE).

    PubMed

    Shekhar, Karthik; Brodin, Petter; Davis, Mark M; Chakraborty, Arup K

    2014-01-01

    Mass cytometry enables an unprecedented number of parameters to be measured in individual cells at a high throughput, but the large dimensionality of the resulting data severely limits approaches relying on manual "gating." Clustering cells based on phenotypic similarity comes at a loss of single-cell resolution and often the number of subpopulations is unknown a priori. Here we describe ACCENSE, a tool that combines nonlinear dimensionality reduction with density-based partitioning, and displays multivariate cellular phenotypes on a 2D plot. We apply ACCENSE to 35-parameter mass cytometry data from CD8(+) T cells derived from specific pathogen-free and germ-free mice, and stratify cells into phenotypic subpopulations. Our results show significant heterogeneity within the known CD8(+) T-cell subpopulations, and of particular note is that we find a large novel subpopulation in both specific pathogen-free and germ-free mice that has not been described previously. This subpopulation possesses a phenotypic signature that is distinct from conventional naive and memory subpopulations when analyzed by ACCENSE, but is not distinguishable on a biaxial plot of standard markers. We are able to automatically identify cellular subpopulations based on all proteins analyzed, thus aiding the full utilization of powerful new single-cell technologies such as mass cytometry. PMID:24344260

  5. Automatic Classification of Cellular Expression by Nonlinear Stochastic Embedding (ACCENSE)

    PubMed Central

    Shekhar, Karthik; Brodin, Petter; Davis, Mark M.; Chakraborty, Arup K.

    2014-01-01

    Mass cytometry enables an unprecedented number of parameters to be measured in individual cells at a high throughput, but the large dimensionality of the resulting data severely limits approaches relying on manual “gating.” Clustering cells based on phenotypic similarity comes at a loss of single-cell resolution and often the number of subpopulations is unknown a priori. Here we describe ACCENSE, a tool that combines nonlinear dimensionality reduction with density-based partitioning, and displays multivariate cellular phenotypes on a 2D plot. We apply ACCENSE to 35-parameter mass cytometry data from CD8+ T cells derived from specific pathogen-free and germ-free mice, and stratify cells into phenotypic subpopulations. Our results show significant heterogeneity within the known CD8+ T-cell subpopulations, and of particular note is that we find a large novel subpopulation in both specific pathogen-free and germ-free mice that has not been described previously. This subpopulation possesses a phenotypic signature that is distinct from conventional naive and memory subpopulations when analyzed by ACCENSE, but is not distinguishable on a biaxial plot of standard markers. We are able to automatically identify cellular subpopulations based on all proteins analyzed, thus aiding the full utilization of powerful new single-cell technologies such as mass cytometry. PMID:24344260

  6. Automatically Detecting Failures in Natural Language Processing Tools for Online Community Text

    PubMed Central

    Hartzler, Andrea L; Huh, Jina; McDonald, David W; Pratt, Wanda

    2015-01-01

    Background The prevalence and value of patient-generated health text are increasing, but processing such text remains problematic. Although existing biomedical natural language processing (NLP) tools are appealing, most were developed to process clinician- or researcher-generated text, such as clinical notes or journal articles. In addition to being constructed for different types of text, other challenges of using existing NLP include constantly changing technologies, source vocabularies, and characteristics of text. These continuously evolving challenges warrant the need for applying low-cost systematic assessment. However, the primarily accepted evaluation method in NLP, manual annotation, requires tremendous effort and time. Objective The primary objective of this study is to explore an alternative approach—using low-cost, automated methods to detect failures (eg, incorrect boundaries, missed terms, mismapped concepts) when processing patient-generated text with existing biomedical NLP tools. We first characterize common failures that NLP tools can make in processing online community text. We then demonstrate the feasibility of our automated approach in detecting these common failures using one of the most popular biomedical NLP tools, MetaMap. Methods Using 9657 posts from an online cancer community, we explored our automated failure detection approach in two steps: (1) to characterize the failure types, we first manually reviewed MetaMap’s commonly occurring failures, grouped the inaccurate mappings into failure types, and then identified causes of the failures through iterative rounds of manual review using open coding, and (2) to automatically detect these failure types, we then explored combinations of existing NLP techniques and dictionary-based matching for each failure cause. Finally, we manually evaluated the automatically detected failures. Results From our manual review, we characterized three types of failure: (1) boundary failures, (2) missed

  7. Automatic Evaluation of Voice Quality Using Text-Based Laryngograph Measurements and Prosodic Analysis

    PubMed Central

    Haderlein, Tino; Schwemmle, Cornelia; Döllinger, Michael; Matoušek, Václav; Ptok, Martin; Nöth, Elmar

    2015-01-01

    Due to low intra- and interrater reliability, perceptual voice evaluation should be supported by objective, automatic methods. In this study, text-based, computer-aided prosodic analysis and measurements of connected speech were combined in order to model perceptual evaluation of the German Roughness-Breathiness-Hoarseness (RBH) scheme. 58 connected speech samples (43 women and 15 men; 48.7 ± 17.8 years) containing the German version of the text “The North Wind and the Sun” were evaluated perceptually by 19 speech and voice therapy students according to the RBH scale. For the human-machine correlation, Support Vector Regression with measurements of the vocal fold cycle irregularities (CFx) and the closed phases of vocal fold vibration (CQx) of the Laryngograph and 33 features from a prosodic analysis module were used to model the listeners' ratings. The best human-machine results for roughness were obtained from a combination of six prosodic features and CFx (r = 0.71, ρ = 0.57). These correlations were approximately the same as the interrater agreement among human raters (r = 0.65, ρ = 0.61). CQx was one of the substantial features of the hoarseness model. For hoarseness and breathiness, the human-machine agreement was substantially lower. Nevertheless, the automatic analysis method can serve as the basis for a meaningful objective support for perceptual analysis. PMID:26136813

  8. Validating Automatically Generated Students' Conceptual Models from Free-text Answers at the Level of Concepts

    NASA Astrophysics Data System (ADS)

    Pérez-Marín, Diana; Pascual-Nieto, Ismael; Rodríguez, Pilar; Anguiano, Eloy; Alfonseca, Enrique

    2008-11-01

    Students' conceptual models can be defined as networks of interconnected concepts, in which a confidence-value (CV) is estimated per each concept. This CV indicates how confident the system is that each student knows the concept according to how the student has used it in the free-text answers provided to an automatic free-text scoring system. In a previous work, a preliminary validation was done based on the global comparison between the score achieved by each student in the final exam and the score associated to his or her model (calculated as the average of the CVs of the concepts). 50% Pearson correlation statistically significant (p = 0.01) was reached. In order to complete those results, in this paper, the level of granularity has been lowered down to each particular concept. In fact, the correspondence between the human estimation of how well each concept of the conceptual model is known versus the computer estimation is calculated. 0.08 mean quadratic error between both values has been attained, which validates the automatically generated students' conceptual models at the concept level of granularity.

  9. Automatic Evaluation of Voice Quality Using Text-Based Laryngograph Measurements and Prosodic Analysis.

    PubMed

    Haderlein, Tino; Schwemmle, Cornelia; Döllinger, Michael; Matoušek, Václav; Ptok, Martin; Nöth, Elmar

    2015-01-01

    Due to low intra- and interrater reliability, perceptual voice evaluation should be supported by objective, automatic methods. In this study, text-based, computer-aided prosodic analysis and measurements of connected speech were combined in order to model perceptual evaluation of the German Roughness-Breathiness-Hoarseness (RBH) scheme. 58 connected speech samples (43 women and 15 men; 48.7 ± 17.8 years) containing the German version of the text "The North Wind and the Sun" were evaluated perceptually by 19 speech and voice therapy students according to the RBH scale. For the human-machine correlation, Support Vector Regression with measurements of the vocal fold cycle irregularities (CFx) and the closed phases of vocal fold vibration (CQx) of the Laryngograph and 33 features from a prosodic analysis module were used to model the listeners' ratings. The best human-machine results for roughness were obtained from a combination of six prosodic features and CFx (r = 0.71, ρ = 0.57). These correlations were approximately the same as the interrater agreement among human raters (r = 0.65, ρ = 0.61). CQx was one of the substantial features of the hoarseness model. For hoarseness and breathiness, the human-machine agreement was substantially lower. Nevertheless, the automatic analysis method can serve as the basis for a meaningful objective support for perceptual analysis. PMID:26136813

  10. Automatic identification of ROI in figure images toward improving hybrid (text and image) biomedical document retrieval

    NASA Astrophysics Data System (ADS)

    You, Daekeun; Antani, Sameer; Demner-Fushman, Dina; Rahman, Md Mahmudur; Govindaraju, Venu; Thoma, George R.

    2011-01-01

    Biomedical images are often referenced for clinical decision support (CDS), educational purposes, and research. They appear in specialized databases or in biomedical publications and are not meaningfully retrievable using primarily textbased retrieval systems. The task of automatically finding the images in an article that are most useful for the purpose of determining relevance to a clinical situation is quite challenging. An approach is to automatically annotate images extracted from scientific publications with respect to their usefulness for CDS. As an important step toward achieving the goal, we proposed figure image analysis for localizing pointers (arrows, symbols) to extract regions of interest (ROI) that can then be used to obtain meaningful local image content. Content-based image retrieval (CBIR) techniques can then associate local image ROIs with identified biomedical concepts in figure captions for improved hybrid (text and image) retrieval of biomedical articles. In this work we present methods that make robust our previous Markov random field (MRF)-based approach for pointer recognition and ROI extraction. These include use of Active Shape Models (ASM) to overcome problems in recognizing distorted pointer shapes and a region segmentation method for ROI extraction. We measure the performance of our methods on two criteria: (i) effectiveness in recognizing pointers in images, and (ii) improved document retrieval through use of extracted ROIs. Evaluation on three test sets shows 87% accuracy in the first criterion. Further, the quality of document retrieval using local visual features and text is shown to be better than using visual features alone.

  11. Automatic classification of retinal vessels into arteries and veins

    NASA Astrophysics Data System (ADS)

    Niemeijer, Meindert; van Ginneken, Bram; Abràmoff, Michael D.

    2009-02-01

    Separating the retinal vascular tree into arteries and veins is important for quantifying vessel changes that preferentially affect either the veins or the arteries. For example the ratio of arterial to venous diameter, the retinal a/v ratio, is well established to be predictive of stroke and other cardiovascular events in adults, as well as the staging of retinopathy of prematurity in premature infants. This work presents a supervised, automatic method that can determine whether a vessel is an artery or a vein based on intensity and derivative information. After thinning of the vessel segmentation, vessel crossing and bifurcation points are removed leaving a set of vessel segments containing centerline pixels. A set of features is extracted from each centerline pixel and using these each is assigned a soft label indicating the likelihood that it is part of a vein. As all centerline pixels in a connected segment should be the same type we average the soft labels and assign this average label to each centerline pixel in the segment. We train and test the algorithm using the data (40 color fundus photographs) from the DRIVE database1 with an enhanced reference standard. In the enhanced reference standard a fellowship trained retinal specialist (MDA) labeled all vessels for which it was possible to visually determine whether it was a vein or an artery. After applying the proposed method to the 20 images of the DRIVE test set we obtained an area under the receiver operator characteristic (ROC) curve of 0.88 for correctly assigning centerline pixels to either the vein or artery classes.

  12. An Automatic Multidocument Text Summarization Approach Based on Naïve Bayesian Classifier Using Timestamp Strategy

    PubMed Central

    Ramanujam, Nedunchelian; Kaliappan, Manivannan

    2016-01-01

    Nowadays, automatic multidocument text summarization systems can successfully retrieve the summary sentences from the input documents. But, it has many limitations such as inaccurate extraction to essential sentences, low coverage, poor coherence among the sentences, and redundancy. This paper introduces a new concept of timestamp approach with Naïve Bayesian Classification approach for multidocument text summarization. The timestamp provides the summary an ordered look, which achieves the coherent looking summary. It extracts the more relevant information from the multiple documents. Here, scoring strategy is also used to calculate the score for the words to obtain the word frequency. The higher linguistic quality is estimated in terms of readability and comprehensibility. In order to show the efficiency of the proposed method, this paper presents the comparison between the proposed methods with the existing MEAD algorithm. The timestamp procedure is also applied on the MEAD algorithm and the results are examined with the proposed method. The results show that the proposed method results in lesser time than the existing MEAD algorithm to execute the summarization process. Moreover, the proposed method results in better precision, recall, and F-score than the existing clustering with lexical chaining approach. PMID:27034971

  13. An Automatic Multidocument Text Summarization Approach Based on Naïve Bayesian Classifier Using Timestamp Strategy.

    PubMed

    Ramanujam, Nedunchelian; Kaliappan, Manivannan

    2016-01-01

    Nowadays, automatic multidocument text summarization systems can successfully retrieve the summary sentences from the input documents. But, it has many limitations such as inaccurate extraction to essential sentences, low coverage, poor coherence among the sentences, and redundancy. This paper introduces a new concept of timestamp approach with Naïve Bayesian Classification approach for multidocument text summarization. The timestamp provides the summary an ordered look, which achieves the coherent looking summary. It extracts the more relevant information from the multiple documents. Here, scoring strategy is also used to calculate the score for the words to obtain the word frequency. The higher linguistic quality is estimated in terms of readability and comprehensibility. In order to show the efficiency of the proposed method, this paper presents the comparison between the proposed methods with the existing MEAD algorithm. The timestamp procedure is also applied on the MEAD algorithm and the results are examined with the proposed method. The results show that the proposed method results in lesser time than the existing MEAD algorithm to execute the summarization process. Moreover, the proposed method results in better precision, recall, and F-score than the existing clustering with lexical chaining approach. PMID:27034971

  14. Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining

    PubMed Central

    2010-01-01

    Background Previously, we developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text based on a number of publicly available databases and tested it on an annotated corpus. To achieve an acceptable recall and precision we used a number of automatic and semi-automatic processing steps together with disambiguation rules. However, it remained to be investigated which impact an extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text. ChemSpider is a chemical database that has undergone extensive manual curation aimed at establishing valid chemical name-to-structure relationships. Results We acquired the component of ChemSpider containing only manually curated names and synonyms. Rule-based term filtering, semi-automatic manual curation, and disambiguation rules were applied. We tested the dictionary from ChemSpider on an annotated corpus and compared the results with those for the Chemlist dictionary. The ChemSpider dictionary of ca. 80 k names was only a 1/3 to a 1/4 the size of Chemlist at around 300 k. The ChemSpider dictionary had a precision of 0.43 and a recall of 0.19 before the application of filtering and disambiguation and a precision of 0.87 and a recall of 0.19 after filtering and disambiguation. The Chemlist dictionary had a precision of 0.20 and a recall of 0.47 before the application of filtering and disambiguation and a precision of 0.67 and a recall of 0.40 after filtering and disambiguation. Conclusions We conclude the following: (1) The ChemSpider dictionary achieved the best precision but the Chemlist dictionary had a higher recall and the best F-score; (2) Rule-based filtering and disambiguation is necessary to achieve a high precision for both the automatically generated and the manually curated dictionary. ChemSpider is available as a web service at http://www.chemspider.com/ and the Chemlist dictionary is freely available as an XML file in

  15. Material classification and automatic content enrichment of images using supervised learning and knowledge bases

    NASA Astrophysics Data System (ADS)

    Mallepudi, Sri Abhishikth; Calix, Ricardo A.; Knapp, Gerald M.

    2011-02-01

    In recent years there has been a rapid increase in the size of video and image databases. Effective searching and retrieving of images from these databases is a significant current research area. In particular, there is a growing interest in query capabilities based on semantic image features such as objects, locations, and materials, known as content-based image retrieval. This study investigated mechanisms for identifying materials present in an image. These capabilities provide additional information impacting conditional probabilities about images (e.g. objects made of steel are more likely to be buildings). These capabilities are useful in Building Information Modeling (BIM) and in automatic enrichment of images. I2T methodologies are a way to enrich an image by generating text descriptions based on image analysis. In this work, a learning model is trained to detect certain materials in images. To train the model, an image dataset was constructed containing single material images of bricks, cloth, grass, sand, stones, and wood. For generalization purposes, an additional set of 50 images containing multiple materials (some not used in training) was constructed. Two different supervised learning classification models were investigated: a single multi-class SVM classifier, and multiple binary SVM classifiers (one per material). Image features included Gabor filter parameters for texture, and color histogram data for RGB components. All classification accuracy scores using the SVM-based method were above 85%. The second model helped in gathering more information from the images since it assigned multiple classes to the images. A framework for the I2T methodology is presented.

  16. Automatic breast density classification using a convolutional neural network architecture search procedure

    NASA Astrophysics Data System (ADS)

    Fonseca, Pablo; Mendoza, Julio; Wainer, Jacques; Ferrer, Jose; Pinto, Joseph; Guerrero, Jorge; Castaneda, Benjamin

    2015-03-01

    Breast parenchymal density is considered a strong indicator of breast cancer risk and therefore useful for preventive tasks. Measurement of breast density is often qualitative and requires the subjective judgment of radiologists. Here we explore an automatic breast composition classification workflow based on convolutional neural networks for feature extraction in combination with a support vector machines classifier. This is compared to the assessments of seven experienced radiologists. The experiments yielded an average kappa value of 0.58 when using the mode of the radiologists' classifications as ground truth. Individual radiologist performance against this ground truth yielded kappa values between 0.56 and 0.79.

  17. Motif-Based Text Mining of Microbial Metagenome Redundancy Profiling Data for Disease Classification.

    PubMed

    Wang, Yin; Li, Rudong; Zhou, Yuhua; Ling, Zongxin; Guo, Xiaokui; Xie, Lu; Liu, Lei

    2016-01-01

    Background. Text data of 16S rRNA are informative for classifications of microbiota-associated diseases. However, the raw text data need to be systematically processed so that features for classification can be defined/extracted; moreover, the high-dimension feature spaces generated by the text data also pose an additional difficulty. Results. Here we present a Phylogenetic Tree-Based Motif Finding algorithm (PMF) to analyze 16S rRNA text data. By integrating phylogenetic rules and other statistical indexes for classification, we can effectively reduce the dimension of the large feature spaces generated by the text datasets. Using the retrieved motifs in combination with common classification methods, we can discriminate different samples of both pneumonia and dental caries better than other existing methods. Conclusions. We extend the phylogenetic approaches to perform supervised learning on microbiota text data to discriminate the pathological states for pneumonia and dental caries. The results have shown that PMF may enhance the efficiency and reliability in analyzing high-dimension text data. PMID:27057545

  18. Motif-Based Text Mining of Microbial Metagenome Redundancy Profiling Data for Disease Classification

    PubMed Central

    Wang, Yin; Zhou, Yuhua; Ling, Zongxin; Guo, Xiaokui; Xie, Lu; Liu, Lei

    2016-01-01

    Background. Text data of 16S rRNA are informative for classifications of microbiota-associated diseases. However, the raw text data need to be systematically processed so that features for classification can be defined/extracted; moreover, the high-dimension feature spaces generated by the text data also pose an additional difficulty. Results. Here we present a Phylogenetic Tree-Based Motif Finding algorithm (PMF) to analyze 16S rRNA text data. By integrating phylogenetic rules and other statistical indexes for classification, we can effectively reduce the dimension of the large feature spaces generated by the text datasets. Using the retrieved motifs in combination with common classification methods, we can discriminate different samples of both pneumonia and dental caries better than other existing methods. Conclusions. We extend the phylogenetic approaches to perform supervised learning on microbiota text data to discriminate the pathological states for pneumonia and dental caries. The results have shown that PMF may enhance the efficiency and reliability in analyzing high-dimension text data. PMID:27057545

  19. Automatic classification and accurate size measurement of blank mask defects

    NASA Astrophysics Data System (ADS)

    Bhamidipati, Samir; Paninjath, Sankaranarayanan; Pereira, Mark; Buck, Peter

    2015-07-01

    complexity of defects encountered. The variety arises due to factors such as defect nature, size, shape and composition; and the optical phenomena occurring around the defect. This paper focuses on preliminary characterization results, in terms of classification and size estimation, obtained by Calibre MDPAutoClassify tool on a variety of mask blank defects. It primarily highlights the challenges faced in achieving the results with reference to the variety of defects observed on blank mask substrates and the underlying complexities which make accurate defect size measurement an important and challenging task.

  20. Automatic classification of MR scans in Alzheimer's disease.

    PubMed

    Klöppel, Stefan; Stonnington, Cynthia M; Chu, Carlton; Draganski, Bogdan; Scahill, Rachael I; Rohrer, Jonathan D; Fox, Nick C; Jack, Clifford R; Ashburner, John; Frackowiak, Richard S J

    2008-03-01

    To be diagnostically useful, structural MRI must reliably distinguish Alzheimer's disease (AD) from normal aging in individual scans. Recent advances in statistical learning theory have led to the application of support vector machines to MRI for detection of a variety of disease states. The aims of this study were to assess how successfully support vector machines assigned individual diagnoses and to determine whether data-sets combined from multiple scanners and different centres could be used to obtain effective classification of scans. We used linear support vector machines to classify the grey matter segment of T1-weighted MR scans from pathologically proven AD patients and cognitively normal elderly individuals obtained from two centres with different scanning equipment. Because the clinical diagnosis of mild AD is difficult we also tested the ability of support vector machines to differentiate control scans from patients without post-mortem confirmation. Finally we sought to use these methods to differentiate scans between patients suffering from AD from those with frontotemporal lobar degeneration. Up to 96% of pathologically verified AD patients were correctly classified using whole brain images. Data from different centres were successfully combined achieving comparable results from the separate analyses. Importantly, data from one centre could be used to train a support vector machine to accurately differentiate AD and normal ageing scans obtained from another centre with different subjects and different scanner equipment. Patients with mild, clinically probable AD and age/sex matched controls were correctly separated in 89% of cases which is compatible with published diagnosis rates in the best clinical centres. This method correctly assigned 89% of patients with post-mortem confirmed diagnosis of either AD or frontotemporal lobar degeneration to their respective group. Our study leads to three conclusions: Firstly, support vector machines successfully

  1. Challenges for automatically extracting molecular interactions from full-text articles

    PubMed Central

    McIntosh, Tara; Curran, James R

    2009-01-01

    Background The increasing availability of full-text biomedical articles will allow more biomedical knowledge to be extracted automatically with greater reliability. However, most Information Retrieval (IR) and Extraction (IE) tools currently process only abstracts. The lack of corpora has limited the development of tools that are capable of exploiting the knowledge in full-text articles. As a result, there has been little investigation into the advantages of full-text document structure, and the challenges developers will face in processing full-text articles. Results We manually annotated passages from full-text articles that describe interactions summarised in a Molecular Interaction Map (MIM). Our corpus tracks the process of identifying facts to form the MIM summaries and captures any factual dependencies that must be resolved to extract the fact completely. For example, a fact in the results section may require a synonym defined in the introduction. The passages are also annotated with negated and coreference expressions that must be resolved. We describe the guidelines for identifying relevant passages and possible dependencies. The corpus includes 2162 sentences from 78 full-text articles. Our corpus analysis demonstrates the necessity of full-text processing; identifies the article sections where interactions are most commonly stated; and quantifies the proportion of interaction statements requiring coherent dependencies. Further, it allows us to report on the relative importance of identifying synonyms and resolving negated expressions. We also experiment with an oracle sentence retrieval system using the corpus as a gold-standard evaluation set. Conclusion We introduce the MIM corpus, a unique resource that maps interaction facts in a MIM to annotated passages within full-text articles. It is an invaluable case study providing guidance to developers of biomedical IR and IE systems, and can be used as a gold-standard evaluation set for full-text IR tasks

  2. Groupwise conditional random forests for automatic shape classification and contour quality assessment in radiotherapy planning.

    PubMed

    McIntosh, Chris; Svistoun, Igor; Purdie, Thomas G

    2013-06-01

    Radiation therapy is used to treat cancer patients around the world. High quality treatment plans maximally radiate the targets while minimally radiating healthy organs at risk. In order to judge plan quality and safety, segmentations of the targets and organs at risk are created, and the amount of radiation that will be delivered to each structure is estimated prior to treatment. If the targets or organs at risk are mislabelled, or the segmentations are of poor quality, the safety of the radiation doses will be erroneously reviewed and an unsafe plan could proceed. We propose a technique to automatically label groups of segmentations of different structures from a radiation therapy plan for the joint purposes of providing quality assurance and data mining. Given one or more segmentations and an associated image we seek to assign medically meaningful labels to each segmentation and report the confidence of that label. Our method uses random forests to learn joint distributions over the training features, and then exploits a set of learned potential group configurations to build a conditional random field (CRF) that ensures the assignment of labels is consistent across the group of segmentations. The CRF is then solved via a constrained assignment problem. We validate our method on 1574 plans, consisting of 17[Formula: see text] 579 segmentations, demonstrating an overall classification accuracy of 91.58%. Our results also demonstrate the stability of RF with respect to tree depth and the number of splitting variables in large data sets. PMID:23475352

  3. Large-scale automatic extraction of side effects associated with targeted anticancer drugs from full-text oncological articles.

    PubMed

    Xu, Rong; Wang, QuanQiu

    2015-06-01

    Targeted anticancer drugs such as imatinib, trastuzumab and erlotinib dramatically improved treatment outcomes in cancer patients, however, these innovative agents are often associated with unexpected side effects. The pathophysiological mechanisms underlying these side effects are not well understood. The availability of a comprehensive knowledge base of side effects associated with targeted anticancer drugs has the potential to illuminate complex pathways underlying toxicities induced by these innovative drugs. While side effect association knowledge for targeted drugs exists in multiple heterogeneous data sources, published full-text oncological articles represent an important source of pivotal, investigational, and even failed trials in a variety of patient populations. In this study, we present an automatic process to extract targeted anticancer drug-associated side effects (drug-SE pairs) from a large number of high profile full-text oncological articles. We downloaded 13,855 full-text articles from the Journal of Oncology (JCO) published between 1983 and 2013. We developed text classification, relationship extraction, signaling filtering, and signal prioritization algorithms to extract drug-SE pairs from downloaded articles. We extracted a total of 26,264 drug-SE pairs with an average precision of 0.405, a recall of 0.899, and an F1 score of 0.465. We show that side effect knowledge from JCO articles is largely complementary to that from the US Food and Drug Administration (FDA) drug labels. Through integrative correlation analysis, we show that targeted drug-associated side effects positively correlate with their gene targets and disease indications. In conclusion, this unique database that we built from a large number of high-profile oncological articles could facilitate the development of computational models to understand toxic effects associated with targeted anticancer drugs. PMID:25817969

  4. Automatic classification framework for ventricular septal defects: a pilot study on high-throughput mouse embryo cardiac phenotyping.

    PubMed

    Xie, Zhongliu; Liang, Xi; Guo, Liucheng; Kitamoto, Asanobu; Tamura, Masaru; Shiroishi, Toshihiko; Gillies, Duncan

    2015-10-01

    Intensive international efforts are underway toward phenotyping the entire mouse genome by modifying all its [Formula: see text] genes one-by-one for comparative studies. A workload of this scale has triggered numerous studies harnessing image informatics for the identification of morphological defects. However, existing work in this line primarily rests on abnormality detection via structural volumetrics between wild-type and gene-modified mice, which generally fails when the pathology involves no severe volume changes, such as ventricular septal defects (VSDs) in the heart. Furthermore, in embryo cardiac phenotyping, the lack of relevant work in embryonic heart segmentation, the limited availability of public atlases, and the general requirement of manual labor for the actual phenotype classification after abnormality detection, along with other limitations, have collectively restricted existing practices from meeting the high-throughput demands. This study proposes, to the best of our knowledge, the first fully automatic VSD classification framework in mouse embryo imaging. Our approach leverages a combination of atlas-based segmentation and snake evolution techniques to derive the segmentation of heart ventricles, where VSD classification is achieved by checking whether the left and right ventricles border or overlap with each other. A pilot study has validated our approach at a proof-of-concept level and achieved a classification accuracy of 100% through a series of empirical experiments on a database of 15 images. PMID:26835488

  5. Automatic pathology classification using a single feature machine learning support - vector machines

    NASA Astrophysics Data System (ADS)

    Yepes-Calderon, Fernando; Pedregosa, Fabian; Thirion, Bertrand; Wang, Yalin; Lepore, Natasha

    2014-03-01

    Magnetic Resonance Imaging (MRI) has been gaining popularity in the clinic in recent years as a safe in-vivo imaging technique. As a result, large troves of data are being gathered and stored daily that may be used as clinical training sets in hospitals. While numerous machine learning (ML) algorithms have been implemented for Alzheimer's disease classification, their outputs are usually difficult to interpret in the clinical setting. Here, we propose a simple method of rapid diagnostic classification for the clinic using Support Vector Machines (SVM)1 and easy to obtain geometrical measurements that, together with a cortical and sub-cortical brain parcellation, create a robust framework capable of automatic diagnosis with high accuracy. On a significantly large imaging dataset consisting of over 800 subjects taken from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, classification-success indexes of up to 99.2% are reached with a single measurement.

  6. Key issues in automatic classification of defects in post-inspection review process of photomasks

    NASA Astrophysics Data System (ADS)

    Pereira, Mark; Maji, Manabendra; Pai, Ravi R.; B. V. R., Samir; Seshadri, R.; Patil, Pradeepkumar

    2012-11-01

    The mask inspection and defect classification is a crucial part of mask preparation technology and consumes a significant amount of mask preparation time. As the patterns on a mask become smaller and more complex, the need for a highly precise mask inspection system with high detection sensitivity becomes greater. However, due to the high sensitivity, in addition to the detection of smaller defects on finer geometries, the inspection machine could report large number of false defects. The total number of defects becomes significantly high and the manual classification of these defects, where the operator should review each of the defects and classify them, may take huge amount of time. Apart from false defects, many of the very small real defects may not print on the wafer and user needs to spend time on classifying them as well. Also, sometimes, manual classification done by different operators may not be consistent. So, need for an automatic, consistent and fast classification tool becomes more acute in more advanced nodes. Automatic Defect Classification tool (NxADC) which is in advanced stage of development as part of NxDAT1, can automatically classify defects accurately and consistently in very less amount of time, compared to a human operator. Amongst the prospective defects as detected by the Mask Inspection System, NxADC identifies several types of false defects such as false defects due to registration error, false defects due to problems with CCD, noise, etc. It is also able to automatically classify real defects such as, pin-dot, pin-hole, clear extension, multiple-edges opaque, missing chrome, chrome-over-MoSi, etc. We faced a large set of algorithmic challenges during the course of the development of our NxADC tool. These include selecting the appropriate image alignment algorithm to detect registration errors (especially when there are sub-pixel registration errors or misalignment in repetitive patterns such as line space), differentiating noise from

  7. Emergency Medical Text Classifier: New system improves processing and classification of triage notes

    PubMed Central

    Haas, Stephanie W.; Travers, Debbie; Waller, Anna; Mahalingam, Deepika; Crouch, John; Schwartz, Todd A.; Mostafa, Javed

    2014-01-01

    Objective Automated syndrome classification aims to aid near real-time syndromic surveillance to serve as an early warning system for disease outbreaks, using Emergency Department (ED) data. We present a system that improves the automatic classification of an ED record with triage note into one or more syndrome categories using the vector space model coupled with a ‘learning’ module that employs a pseudo-relevance feedback mechanism. Materials and Methods: Terms from standard syndrome definitions are used to construct an initial reference dictionary for generating the syndrome and triage note vectors. Based on cosine similarity between the vectors, each record is classified into a syndrome category. We then take terms from the top-ranked records that belong to the syndrome of interest as feedback. These terms are added to the reference dictionary and the process is repeated to determine the final classification. The system was tested on two different datasets for each of three syndromes: Gastro-Intestinal (GI), Respiratory (Resp) and Fever-Rash (FR). Performance was measured in terms of sensitivity (Se) and specificity (Sp). Results: The use of relevance feedback produced high values of sensitivity and specificity for all three syndromes in both test sets: GI: 90% and 71%, Resp: 97% and 73%, FR: 100% and 87%, respectively, in test set 1, and GI: 88% and 69%, Resp: 87% and 61%, FR: 97% and 71%, respectively, in test set 2. Conclusions: The new system for pre-processing and syndromic classification of ED records with triage notes achieved improvements in Se and Sp. Our results also demonstrate that the system can be tuned to achieve different levels of performance based on user requirements. PMID:25379126

  8. Towards Automatic Classification of Exoplanet-Transit-Like Signals: A Case Study on Kepler Mission Data

    NASA Astrophysics Data System (ADS)

    Valizadegan, Hamed; Martin, Rodney; McCauliff, Sean D.; Jenkins, Jon Michael; Catanzarite, Joseph; Oza, Nikunj C.

    2015-08-01

    Building new catalogues of planetary candidates, astrophysical false alarms, and non-transiting phenomena is a challenging task that currently requires a reviewing team of astrophysicists and astronomers. These scientists need to examine more than 100 diagnostic metrics and associated graphics for each candidate exoplanet-transit-like signal to classify it into one of the three classes. Considering that the NASA Explorer Program's TESS mission and ESA's PLATO mission will survey an even larger area of space, the classification of their transit-like signals will be even more time-consuming for human agents and a bottleneck to successfully constructing the new catalogues in a timely manner. This encourages building automatic classification tools that can quickly and reliably classify the new signal data from these missions. The standard tool for building automatic classification systems is supervised machine learning, which requires a large set of highly accurate labeled examples in order to build an effective classifier. This requirement cannot be easily met for classifying transit-like signals because not only are existing labeled signals very limited, but also the current labels may not be reliable (because the labeling process is a subjective task). Our experiments using different supervised classifiers to categorize transit-like signals verify that the labeled signals are not rich enough to provide the classifier with enough power to generalize well beyond the observed cases (e.g. to unseen or test signals). This motivated us to utilize a different category of learning techniques, so-called semi-supervised learning, which combines label information from the costly labeled signals with distribution information from the cheaply available unlabeled signals in order to construct more effective classifiers. Our study on the Kepler Mission data shows that semi-supervised learning can significantly improve the result of multiple base classifiers (e.g. Support Vector Machines, Ada
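
    The general recipe of combining a few labeled signals with many unlabeled ones can be illustrated with scikit-learn's self-training wrapper. This is a generic semi-supervised sketch on synthetic data, not the classifiers or Kepler diagnostic features used in the study; the array shapes, labeled fraction and probability threshold are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic "diagnostic metrics" per signal; -1 marks an unlabeled signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = np.full(500, -1)
y[:40] = (X[:40, 0] > 0).astype(int)      # only a small labeled subset

# Self-training: the base SVM labels confident unlabeled signals and retrains.
model = SelfTrainingClassifier(SVC(probability=True, kernel="rbf"), threshold=0.9)
model.fit(X, y)                            # uses labeled + unlabeled data
print(model.predict(X[-5:]))
```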

  9. A Novel Feature Selection Technique for Text Classification Using Naïve Bayes

    PubMed Central

    Dey Sarkar, Subhajit; Goswami, Saptarsi; Agarwal, Aman; Aktar, Javed

    2014-01-01

    With the proliferation of unstructured data, text classification or text categorization has found many applications in topic classification, sentiment analysis, authorship identification, spam detection, and so on. There are many classification algorithms available. Naïve Bayes remains one of the oldest and most popular classifiers. On one hand, implementation of naïve Bayes is simple; on the other hand, it also requires relatively little training data. However, the literature reports that naïve Bayes performs poorly compared to other classifiers in text classification, which makes the naïve Bayes classifier unusable in spite of the simplicity and intuitiveness of the model. In this paper, we propose a two-step feature selection method: univariate feature selection followed by feature clustering, where the univariate step reduces the search space and clustering then selects relatively independent feature sets. We demonstrate the effectiveness of our method by a thorough evaluation and comparison over 13 datasets. The performance improvement thus achieved makes naïve Bayes comparable or superior to other classifiers. The proposed algorithm is shown to outperform other traditional methods such as greedy-search-based wrappers and CFS.
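
    A minimal sketch of the two-step idea, univariate selection to shrink the search space followed by clustering to keep one representative per feature group, might look like the following. The toy corpus, the number of kept features, and the use of chi-square scores and k-means are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap pills buy now", "meeting agenda attached", "buy cheap meds now",
        "project meeting notes", "discount pills offer", "agenda for project call"]
labels = np.array([1, 0, 1, 0, 1, 0])

vec = CountVectorizer()
X = vec.fit_transform(docs)

# Step 1: univariate selection keeps the strongest features (smaller search space).
scores, _ = chi2(X, labels)
keep = np.argsort(scores)[::-1][:8]

# Step 2: cluster the kept features and retain one representative per cluster,
# so the final feature set is relatively independent.
feat_vectors = X[:, keep].toarray().T          # one row per kept feature
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(feat_vectors)
reps = [keep[np.where(km.labels_ == c)[0][np.argmax(scores[keep][km.labels_ == c])]]
        for c in range(km.n_clusters)]

clf = MultinomialNB().fit(X[:, reps], labels)
print(clf.predict(vec.transform(["cheap offer now"])[:, reps]))
```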

  10. Automatic extraction of property norm-like data from large text corpora.

    PubMed

    Kelly, Colin; Devereux, Barry; Korhonen, Anna

    2014-01-01

    Traditional methods for deriving property-based representations of concepts from text have focused on either extracting only a subset of possible relation types, such as hyponymy/hypernymy (e.g., car is-a vehicle) or meronymy/metonymy (e.g., car has wheels), or unspecified relations (e.g., car--petrol). We propose a system for the challenging task of automatic, large-scale acquisition of unconstrained, human-like property norms from large text corpora, and discuss the theoretical implications of such a system. We employ syntactic, semantic, and encyclopedic information to guide our extraction, yielding concept-relation-feature triples (e.g., car be fast, car require petrol, car cause pollution), which approximate property-based conceptual representations. Our novel method extracts candidate triples from parsed corpora (Wikipedia and the British National Corpus) using syntactically and grammatically motivated rules, then reweights triples with a linear combination of their frequency and four statistical metrics. We assess our system output in three ways: lexical comparison with norms derived from human-generated property norm data, direct evaluation by four human judges, and a semantic distance comparison with both WordNet similarity data and human-judged concept similarity ratings. Our system offers a viable and performant method of plausible triple extraction: Our lexical comparison shows comparable performance to the current state-of-the-art, while subsequent evaluations exhibit the human-like character of our generated properties. PMID:25019134

  11. Applying active learning to assertion classification of concepts in clinical text.

    PubMed

    Chen, Yukun; Mani, Subramani; Xu, Hua

    2012-04-01

    Supervised machine learning methods for clinical natural language processing (NLP) research require a large number of annotated samples, which are very expensive to build because of the involvement of physicians. Active learning, an approach that actively samples from a large pool, provides an alternative solution. Its major goal in classification is to reduce the annotation effort while maintaining the quality of the predictive model. However, few studies have investigated its uses in clinical NLP. This paper reports an application of active learning to a clinical text classification task: to determine the assertion status of clinical concepts. The annotated corpus for the assertion classification task in the 2010 i2b2/VA Clinical NLP Challenge was used in this study. We implemented several existing and newly developed active learning algorithms and assessed their uses. The outcome is reported in the global ALC score, based on the Area under the average Learning Curve of the AUC (Area Under the Curve) score. Results showed that when the same number of annotated samples was used, active learning strategies could generate better classification models (best ALC = 0.7715) than the passive learning method (random sampling) (ALC = 0.7411). Moreover, to achieve the same classification performance, active learning strategies required fewer samples than the random sampling method. For example, to achieve an AUC of 0.79, the random sampling method used 32 samples, while our best active learning algorithm required only 12 samples, a reduction of 62.5% in manual annotation effort. PMID:22127105
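
    Uncertainty sampling, one of the standard active-learning strategies evaluated in studies like this, can be sketched as a simple query loop. The snippet is a generic illustration on synthetic features, not the authors' algorithms or the i2b2 assertion data; it assumes the initial random batch contains both classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(X_pool, y_pool, n_init=10, n_queries=20, batch=1):
    """Iteratively query the pool samples the current model is least certain about."""
    rng = np.random.default_rng(0)
    labeled = list(rng.choice(len(X_pool), n_init, replace=False))
    clf = LogisticRegression(max_iter=1000)
    for _ in range(n_queries):
        clf.fit(X_pool[labeled], y_pool[labeled])
        proba = clf.predict_proba(X_pool)[:, 1]
        uncertainty = -np.abs(proba - 0.5)      # closest to 0.5 = least certain
        uncertainty[labeled] = -np.inf          # never re-query labeled samples
        labeled.extend(np.argsort(uncertainty)[-batch:].tolist())
    return clf, labeled

# Toy usage on synthetic features with a roughly linearly separable label.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model, queried = uncertainty_sampling(X, y)
print(len(queried), "samples annotated")
```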

  12. Automatic Generation Of Training Data For Hyperspectral Image Classification Using Support Vector Machine

    NASA Astrophysics Data System (ADS)

    Abbasi, B.; Arefi, H.; Bigdeli, B.; Roessner, S.

    2015-04-01

    An image classification method based on Support Vector Machine (SVM) is proposed for hyperspectral and 3K DSM data. To obtain training data, we applied an automatic method for four classes, namely building, grass, tree, and ground pixels. First, initial segments for the building, tree, grass, and ground classes are produced using different feature descriptors. The feature descriptors are generated from the optical (hyperspectral) as well as the range (3K DSM) images. The initial building regions are created using DSM segmentation. Fusion of NDVI and elevation information helps provide initial segments for the grass and tree areas. Initial segments for ground pixels are created after geodesic-based filtering of the DSM and elimination of the non-ground pixels. To improve classification accuracy, the hyperspectral image and the 3K DSM were utilized simultaneously to perform image classification. To obtain testing data, the labelled pixels were divided into two parts: test and training. Experimental results show a final classification accuracy of about 90% using the Support Vector Machine. The 3K DSM is derived from imagery provided by the 3K camera system; both datasets correspond to the Munich area in Germany.

  13. [Automatic classification method of star spectra data based on manifold fuzzy twin support vector machine].

    PubMed

    Liu, Zhong-bao; Gao, Yan-yun; Wang, Jian-zhen

    2015-01-01

    Support vector machine (SVM), with good learning ability and generalization, is widely used in star spectra data classification. But when the scale of the data becomes larger, the shortcomings of SVM appear: the amount of calculation is quite large and the classification speed is too slow. In order to solve these problems, the twin support vector machine (TWSVM) was proposed by Jayadeva. The advantage of TWSVM is that its time cost is reduced to 1/4 of that of SVM. However, all the methods mentioned above focus only on global characteristics and neglect local characteristics. In view of this, an automatic classification method for star spectra data based on a manifold fuzzy twin support vector machine (MF-TSVM) is proposed in this paper. In MF-TSVM, manifold-based discriminant analysis (MDA) is used to obtain the global and local characteristics of the input data, and fuzzy membership is introduced to reduce the influence of noise and singular data on the classification results. Comparative experiments with current classification methods, such as C-SVM and KNN, on the SDSS star spectra datasets verify the effectiveness of the proposed method. PMID:25993861

  14. Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning.

    PubMed

    Stowell, Dan; Plumbley, Mark D

    2014-01-01

    Automatic species classification of birds from their sound is a computational tool of increasing importance in ecology, conservation monitoring and vocal communication studies. To make classification useful in practice, it is crucial to improve its accuracy while ensuring that it can run at big data scales. Many approaches use acoustic measures based on spectrogram-type data, such as the Mel-frequency cepstral coefficient (MFCC) features which represent a manually-designed summary of spectral information. However, recent work in machine learning has demonstrated that features learnt automatically from data can often outperform manually-designed feature transforms. Feature learning can be performed at large scale and "unsupervised", meaning it requires no manual data labelling, yet it can improve performance on "supervised" tasks such as classification. In this work we introduce a technique for feature learning from large volumes of bird sound recordings, inspired by techniques that have proven useful in other domains. We experimentally compare twelve different feature representations derived from the Mel spectrum (of which six use this technique), using four large and diverse databases of bird vocalisations, classified using a random forest classifier. We demonstrate that in our classification tasks, MFCCs can often lead to worse performance than the raw Mel spectral data from which they are derived. Conversely, we demonstrate that unsupervised feature learning provides a substantial boost over MFCCs and Mel spectra without adding computational complexity after the model has been trained. The boost is particularly notable for single-label classification tasks at large scale. The spectro-temporal activations learned through our procedure resemble spectro-temporal receptive fields calculated from avian primary auditory forebrain. However, for one of our datasets, which contains substantial audio data but few annotations, increased performance is not discernible. We
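
    The unsupervised feature-learning step can be approximated with a small codebook model: learn cluster centres from spectrogram frames, then describe each recording by its codeword activations and classify with a random forest. The sketch below uses MiniBatchKMeans as a stand-in for the spherical k-means used in this line of work, log-spectrogram frames in place of Mel frames, and random signals in place of real bird recordings; frame sizes and codebook size are arbitrary.

```python
import numpy as np
from scipy.signal import spectrogram
from sklearn.cluster import MiniBatchKMeans
from sklearn.ensemble import RandomForestClassifier

def frames(wave, fs=22050):
    # Log-magnitude spectrogram frames stand in for the Mel frames in the paper.
    _, _, S = spectrogram(wave, fs=fs, nperseg=512, noverlap=256)
    return np.log(S + 1e-10).T                 # shape: (frames, frequency bins)

# Toy "recordings": random signals standing in for bird sound clips.
rng = np.random.default_rng(0)
recs = [rng.normal(size=22050) for _ in range(20)]
labels = rng.integers(0, 2, size=20)

# 1) Learn a codebook from frames pooled across recordings (unsupervised step).
codebook = MiniBatchKMeans(n_clusters=32, n_init=3, random_state=0).fit(
    np.vstack([frames(w) for w in recs]))

# 2) Encode each recording as a normalised histogram of codeword activations.
def encode(wave):
    assign = codebook.predict(frames(wave))
    return np.bincount(assign, minlength=32) / len(assign)

X = np.vstack([encode(w) for w in recs])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(clf.predict(X[:3]))
```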

  15. Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning

    PubMed Central

    Plumbley, Mark D.

    2014-01-01

    Automatic species classification of birds from their sound is a computational tool of increasing importance in ecology, conservation monitoring and vocal communication studies. To make classification useful in practice, it is crucial to improve its accuracy while ensuring that it can run at big data scales. Many approaches use acoustic measures based on spectrogram-type data, such as the Mel-frequency cepstral coefficient (MFCC) features which represent a manually-designed summary of spectral information. However, recent work in machine learning has demonstrated that features learnt automatically from data can often outperform manually-designed feature transforms. Feature learning can be performed at large scale and “unsupervised”, meaning it requires no manual data labelling, yet it can improve performance on “supervised” tasks such as classification. In this work we introduce a technique for feature learning from large volumes of bird sound recordings, inspired by techniques that have proven useful in other domains. We experimentally compare twelve different feature representations derived from the Mel spectrum (of which six use this technique), using four large and diverse databases of bird vocalisations, classified using a random forest classifier. We demonstrate that in our classification tasks, MFCCs can often lead to worse performance than the raw Mel spectral data from which they are derived. Conversely, we demonstrate that unsupervised feature learning provides a substantial boost over MFCCs and Mel spectra without adding computational complexity after the model has been trained. The boost is particularly notable for single-label classification tasks at large scale. The spectro-temporal activations learned through our procedure resemble spectro-temporal receptive fields calculated from avian primary auditory forebrain. However, for one of our datasets, which contains substantial audio data but few annotations, increased performance is not

  16. Approach for Text Classification Based on the Similarity Measurement between Normal Cloud Models

    PubMed Central

    Dai, Jin; Liu, Xin

    2014-01-01

    The similarity between objects is a core research area of data mining. In order to reduce the interference caused by the uncertainty of natural language, a similarity measurement between normal cloud models is adopted for text classification research. On this basis, a novel text classifier based on cloud concept jumping up (CCJU-TC) is proposed. It can efficiently accomplish the conversion between qualitative concepts and quantitative data. Through the conversion from a text set to a text information table based on the VSM model, the qualitative concepts extracted from texts of the same category are jumped up into a whole category concept. According to the cloud similarity between the test text and each category concept, the test text is assigned to the most similar category. Comparisons among different text classifiers over different feature selection sets show that not only does CCJU-TC have a strong ability to adapt to different text features, but its classification performance is also better than that of the traditional classifiers. PMID:24711737

  17. Automatic Classification of Volcanic Earthquakes Using Multi-Station Waveforms and Dynamic Neural Networks

    NASA Astrophysics Data System (ADS)

    Bruton, C. P.; West, M. E.

    2013-12-01

    Earthquakes and seismicity have long been used to monitor volcanoes. In addition to time, location, and magnitude of an earthquake, the characteristics of the waveform itself are important. For example, low-frequency or hybrid type events could be generated by magma rising toward the surface. A rockfall event could indicate a growing lava dome. Classification of earthquake waveforms is thus a useful tool in volcano monitoring. A procedure to perform such classification automatically could flag certain event types immediately, instead of waiting for a human analyst's review. Inspired by speech recognition techniques, we have developed a procedure to classify earthquake waveforms using artificial neural networks. A neural network can be "trained" with an existing set of input and desired output data; in this case, we use a set of earthquake waveforms (input) that has been classified by a human analyst (desired output). After training the neural network, new waveforms can be classified automatically as they are presented. Our procedure uses waveforms from multiple stations, making it robust to seismic network changes and outages. The use of a dynamic time-delay neural network allows waveforms to be presented without precise alignment in time, and thus could be applied to continuous data or to seismic events without clear start and end times. We have evaluated several different training algorithms and neural network structures to determine their effects on classification performance. We apply this procedure to earthquakes recorded at Mount Spurr and Katmai in Alaska, and Uturuncu Volcano in Bolivia.

  18. Automatic Crack Detection and Classification Method for Subway Tunnel Safety Monitoring

    PubMed Central

    Zhang, Wenyu; Zhang, Zhenjiang; Qi, Dapeng; Liu, Yun

    2014-01-01

    Cracks are an important indicator reflecting the safety status of infrastructures. This paper presents an automatic crack detection and classification methodology for subway tunnel safety monitoring. With the application of high-speed complementary metal-oxide-semiconductor (CMOS) industrial cameras, the tunnel surface can be captured and stored in digital images. In a next step, the local dark regions with potential crack defects are segmented from the original gray-scale images by utilizing morphological image processing techniques and thresholding operations. In the feature extraction process, we present a distance histogram based shape descriptor that effectively describes the spatial shape difference between cracks and other irrelevant objects. Along with other features, the classification results successfully remove over 90% misidentified objects. Also, compared with the original gray-scale images, over 90% of the crack length is preserved in the last output binary images. The proposed approach was tested on the safety monitoring for Beijing Subway Line 1. The experimental results revealed the rules of parameter settings and also proved that the proposed approach is effective and efficient for automatic crack detection and classification. PMID:25325337
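
    The segmentation stage, adaptive thresholding of locally dark regions followed by morphological cleanup and a simple shape test, can be sketched with scikit-image as below. Eccentricity stands in for the paper's distance-histogram shape descriptor, all parameter values are assumptions, and the input is assumed to be a float grayscale image in [0, 1].

```python
import numpy as np
from skimage import morphology, measure
from skimage.filters import threshold_local

def crack_candidates(gray, block_size=51, min_area=30):
    """Segment dark regions as crack candidates, then keep only elongated ones."""
    # Adaptive thresholding: cracks are darker than their local neighbourhood.
    dark = gray < threshold_local(gray, block_size=block_size, offset=0.02)
    dark = morphology.remove_small_objects(dark, min_size=min_area)
    dark = morphology.binary_closing(dark, morphology.disk(2))

    keep = np.zeros_like(dark)
    for region in measure.regionprops(measure.label(dark)):
        # Cracks are thin and elongated; blobs and stains are not.
        if region.eccentricity > 0.95 and region.area >= min_area:
            keep[tuple(region.coords.T)] = True
    return keep

# Synthetic example: a dark diagonal "crack" on a brighter background.
img = np.full((200, 200), 0.8)
rr = np.arange(180)
img[rr + 10, rr + 10] = 0.2
img[rr + 10, rr + 11] = 0.2
print(crack_candidates(img).sum(), "crack pixels kept")
```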

  19. Automatic crack detection and classification method for subway tunnel safety monitoring.

    PubMed

    Zhang, Wenyu; Zhang, Zhenjiang; Qi, Dapeng; Liu, Yun

    2014-01-01

    Cracks are an important indicator reflecting the safety status of infrastructures. This paper presents an automatic crack detection and classification methodology for subway tunnel safety monitoring. With the application of high-speed complementary metal-oxide-semiconductor (CMOS) industrial cameras, the tunnel surface can be captured and stored in digital images. In a next step, the local dark regions with potential crack defects are segmented from the original gray-scale images by utilizing morphological image processing techniques and thresholding operations. In the feature extraction process, we present a distance histogram based shape descriptor that effectively describes the spatial shape difference between cracks and other irrelevant objects. Along with other features, the classification results successfully remove over 90% misidentified objects. Also, compared with the original gray-scale images, over 90% of the crack length is preserved in the last output binary images. The proposed approach was tested on the safety monitoring for Beijing Subway Line 1. The experimental results revealed the rules of parameter settings and also proved that the proposed approach is effective and efficient for automatic crack detection and classification. PMID:25325337

  20. Texting

    ERIC Educational Resources Information Center

    Tilley, Carol L.

    2009-01-01

    With the increasing ranks of cell phone ownership is an increase in text messaging, or texting. During 2008, more than 2.5 trillion text messages were sent worldwide--that's an average of more than 400 messages for every person on the planet. Although many of the messages teenagers text each day are perhaps nothing more than "how r u?" or "c u…

  1. Assessing the Utility of Automatic Cancer Registry Notifications Data Extraction from Free-Text Pathology Reports

    PubMed Central

    Nguyen, Anthony N.; Moore, Julie; O’Dwyer, John; Philpot, Shoni

    2015-01-01

    Cancer Registries record cancer data by reading and interpreting pathology cancer specimen reports. For some Registries this can be a manual process, which is labour and time intensive and subject to errors. A system for automatic extraction of cancer data from HL7 electronic free-text pathology reports has been proposed to improve the workflow efficiency of the Cancer Registry. The system is currently processing an incoming trickle feed of HL7 electronic pathology reports from across the state of Queensland in Australia to produce an electronic cancer notification. Natural language processing and symbolic reasoning using SNOMED CT were adopted in the system; Queensland Cancer Registry business rules were also incorporated. A set of 220 unseen pathology reports selected from patients with a range of cancers was used to evaluate the performance of the system. The system achieved overall recall of 0.78, precision of 0.83 and F-measure of 0.80 over seven categories, namely, basis of diagnosis (3 classes), primary site (66 classes), laterality (5 classes), histological type (94 classes), histological grade (7 classes), metastasis site (19 classes) and metastatic status (2 classes). These results are encouraging given the large cross-section of cancers. The system allows for the provision of clinical coding support as well as indicative statistics on the current state of cancer, which is not otherwise available. PMID:26958232

  2. Automatically Detecting Medications and the Reason for their Prescription in Clinical Narrative Text Documents

    PubMed Central

    Meystre, Stéphane M.; Thibault, Julien; Shen, Shuying; Hurdle, John F.; South, Brett R.

    2011-01-01

    An important proportion of the information about the medications a patient is taking is mentioned only in narrative text in the electronic health record. Automated information extraction can make this information accessible for decision-support, research, or any other automated processing. In the context of the “i2b2 medication extraction challenge,” we have developed a new NLP application called Textractor to automatically extract medications and details about them (e.g., dosage, frequency, reason for their prescription). This application and its evaluation with part of the reference standard for this “challenge” are presented here, along with an analysis of the development of this reference standard. During this evaluation, Textractor reached a system-level overall F1-measure, the reference metric for this challenge, of about 77% for exact matches. The best performance was measured with medication routes (F1-measure 86.4%), and the worst with prescription reasons (F1-measure 29%). These results are consistent with the agreement observed between human annotators when developing the reference standard, and with other published research. PMID:20841823

  3. Assessing the Utility of Automatic Cancer Registry Notifications Data Extraction from Free-Text Pathology Reports.

    PubMed

    Nguyen, Anthony N; Moore, Julie; O'Dwyer, John; Philpot, Shoni

    2015-01-01

    Cancer Registries record cancer data by reading and interpreting pathology cancer specimen reports. For some Registries this can be a manual process, which is labour and time intensive and subject to errors. A system for automatic extraction of cancer data from HL7 electronic free-text pathology reports has been proposed to improve the workflow efficiency of the Cancer Registry. The system is currently processing an incoming trickle feed of HL7 electronic pathology reports from across the state of Queensland in Australia to produce an electronic cancer notification. Natural language processing and symbolic reasoning using SNOMED CT were adopted in the system; Queensland Cancer Registry business rules were also incorporated. A set of 220 unseen pathology reports selected from patients with a range of cancers was used to evaluate the performance of the system. The system achieved overall recall of 0.78, precision of 0.83 and F-measure of 0.80 over seven categories, namely, basis of diagnosis (3 classes), primary site (66 classes), laterality (5 classes), histological type (94 classes), histological grade (7 classes), metastasis site (19 classes) and metastatic status (2 classes). These results are encouraging given the large cross-section of cancers. The system allows for the provision of clinical coding support as well as indicative statistics on the current state of cancer, which is not otherwise available. PMID:26958232

  4. Automatic classification of EEG segments and extraction of representative ones by dynamic clusters method.

    PubMed

    Krajca, V

    1984-06-01

    The use of the dynamic clusters method for automatic extraction of compressed information about a recorded EEG signal is presented. The computer first divides the record into quasi-stationary segments by means of adaptive segmentation. Second, the extracted segments are classified into homogeneous classes by the dynamic clusters method. One part of the clustering algorithm makes it possible to identify and display the most typical class members, which may represent the whole EEG signal under study and may be used as input for the next phase of automatic EEG analysis, i.e. the classification of whole EEG records. The above procedure was applied to a 75 s EEG record of an anaesthetized cat intoxicated by CO. PMID:6475475

  5. Neural Network Classification of Receiver Functions as a Step Towards Automatic Crustal Parameter Determination

    NASA Astrophysics Data System (ADS)

    Jemberie, A.; Dugda, M. T.; Reusch, D.; Nyblade, A.

    2006-12-01

    Neural networks are decision-making mathematical/engineering tools which, if trained properly, can automatically (and objectively) do jobs that normally require particular expertise and/or tedious repetition. Here we explore two techniques from the field of artificial neural networks (ANNs) that seek to reduce the time requirements and increase the objectivity of quality control (QC) and Event Identification (EI) on seismic datasets. We explore applying multilayer Feed-Forward (FF) Artificial Neural Networks (ANNs) and Self-Organizing Maps (SOMs) in combination with Hk stacking of receiver functions in an attempt to test the usefulness of automatic classification of receiver functions for crustal parameter determination. Feed-forward ANNs (FFNNs) are a supervised classification tool, while self-organizing maps (SOMs) can provide unsupervised classification of large, complex geophysical data sets into a fixed number of distinct generalized patterns or modes. Hk stacking is a methodology used to stack receiver functions based on the relative arrival times of the P-to-S converted phase and the next two reverberations to determine crustal thickness H and the Vp-to-Vs ratio (k). We use receiver functions from teleseismic events recorded by the 2000-2002 Ethiopia Broadband Seismic Experiment. Preliminary results of applying an FFNN and Hk stacking of receiver functions for automatic receiver function classification, as a step towards automatic crustal parameter determination, look encouraging. After training, the FFNN could separate the best receiver functions from bad ones with a success rate of about 75 to 95%. Applying Hk stacking on the receiver functions classified by this FFNN as the best, we obtained a crustal thickness and Vp/Vs ratio of 31±4 km and 1.75±0.05, respectively, for the crust beneath station ARBA in the Main Ethiopian Rift. To make a comparison, we applied Hk

  6. Automatic detection and classification of obstacles with applications in autonomous mobile robots

    NASA Astrophysics Data System (ADS)

    Ponomaryov, Volodymyr I.; Rosas-Miranda, Dario I.

    2016-04-01

    A hardware implementation of automatic detection and classification of objects that can represent an obstacle for an autonomous mobile robot, using stereo vision algorithms, is presented. We propose and evaluate a new method to detect and classify objects for a mobile robot in outdoor conditions. The method is divided into two parts: the first is the object detection step, based on the distance from the objects to the camera and a BLOB analysis; the second is the classification step, based on visual primitives and an SVM classifier. The proposed method runs on a GPU in order to reduce processing time. This is achieved with hardware based on multi-core processors and a GPU platform, using an NVIDIA GeForce GT640 graphics card and Matlab on a PC running Windows 10.

  7. Analysis of steranes and triterpanes in geolipid extracts by automatic classification of mass spectra

    NASA Technical Reports Server (NTRS)

    Wardroper, A. M. K.; Brooks, P. W.; Humberston, M. J.; Maxwell, J. R.

    1977-01-01

    A computer method is described for the automatic classification of triterpanes and steranes into gross structural type from their mass spectral characteristics. The method has been applied to the spectra obtained by gas-chromatographic/mass-spectroscopic analysis of two mixtures of standards and of hydrocarbon fractions isolated from Green River and Messel oil shales. Almost all of the steranes and triterpanes identified previously in both shales were classified, in addition to a number of new components. The results indicate that classification of such alkanes is possible with a laboratory computer system. The method has application to diagenesis and maturation studies as well as to oil/oil and oil/source rock correlations in which rapid screening of large numbers of samples is required.

  8. The effect of atmospheric water vapor on automatic classification of ERTS data

    NASA Technical Reports Server (NTRS)

    Pitts, D. E.; Mcallum, W. E.; Dillinger, A. E.

    1974-01-01

    Absorption by atmospheric water vapor changes the spectral signatures collected by multispectral scanners if channels are not chosen to avoid the atmospheric water bands. For ERTS (Earth Resources Technology Satellite), the Multispectral Scanner band 7 (MSS 7, .8 to 1.1 micron) is the only band significantly affected. Line-by-line atmospheric absorption calculations showed that this effect can multiply the intensity by factors ranging from .77 to 1.0. If horizontal gradients in atmospheric water exist between training fields and the rest of the scene, errors are introduced in automatic classification of the imagery. The degradation of the classification of corn and soybeans was determined by using actual ERTS data and simulating the absorption effects on the MSS 7 band.

  9. Using complex networks for text classification: Discriminating informative and imaginative documents

    NASA Astrophysics Data System (ADS)

    de Arruda, Henrique F.; Costa, Luciano da F.; Amancio, Diego R.

    2016-01-01

    Statistical methods have been widely employed in recent years to grasp many language properties. The application of such techniques has allowed improvements in several linguistic applications, such as machine translation and document classification. In the latter, many approaches have emphasised the semantic content of texts, as is the case with bag-of-words language models. These approaches have certainly yielded reasonable performance. However, some potential features, such as the structural organization of texts, have been used in only a few studies. In this context, we probe how features derived from textual structure analysis can be effectively employed in a classification task. More specifically, we performed a supervised classification aiming at discriminating informative from imaginative documents. Using a networked model that describes the local topological/dynamical properties of function words, we achieved an accuracy rate of up to 95%, which is much higher than that of similar networked approaches. A systematic analysis of feature relevance revealed that symmetry and accessibility measurements are among the most prominent network measurements. Our results suggest that these measurements could be used in related language applications, as they play a complementary role in characterising texts.
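
    The networked representation behind this kind of approach can be illustrated with a small word co-occurrence graph whose topological measurements become classifier features. The sketch below uses networkx and generic measures (mean degree, clustering, density) as stand-ins for the symmetry and accessibility measurements reported in the paper; the sliding-window construction is a common simplification, not the authors' exact model.

```python
import numpy as np
import networkx as nx

def network_features(text, window=2):
    """Build a word co-occurrence network and summarize its topology as features."""
    words = text.lower().split()
    g = nx.Graph()
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:                       # avoid self-loops
                g.add_edge(w, words[j])
    degs = [d for _, d in g.degree()]
    return np.array([
        np.mean(degs),                              # average degree
        nx.average_clustering(g),                   # local clustering
        nx.density(g),                              # global connectivity
    ])

# These feature vectors would then feed an ordinary supervised classifier.
print(network_features("the quick brown fox jumps over the lazy dog and the quick dog sleeps"))
```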

  10. Automatic tissue classification for high-resolution breast CT images based on bilateral filtering

    NASA Astrophysics Data System (ADS)

    Yang, Xiaofeng; Sechopoulos, Ioannis; Fei, Baowei

    2011-03-01

    Breast tissue classification can provide quantitative measurements of breast composition, density and tissue distribution for diagnosis and identification of high-risk patients. In this study, we present an automatic method to classify high-resolution dedicated breast CT images. The breast is classified into skin, fat and glandular tissue. First, we use a multiscale bilateral filter to reduce noise while preserving edges in the images. As skin and glandular tissue have similar CT values in breast CT images, we use morphologic operations to obtain the skin mask based on its position. Second, we use a modified fuzzy C-means classification method twice, once for the skin and once for the fatty and glandular tissue. We compared our classification results with manual segmentation results and used Dice overlap ratios to evaluate our classification method. We also tested our method on images with added noise. The overlap ratios for glandular tissue were above 94.7% for data from five patients. Evaluation results showed that our method is robust and accurate.
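
    The two main ingredients, edge-preserving smoothing and soft clustering of intensities, can be sketched as below. The fuzzy C-means here is a minimal NumPy implementation applied to a synthetic slice, not the authors' modified version, and the bilateral-filter parameters and cluster count are assumptions.

```python
import numpy as np
from skimage.restoration import denoise_bilateral

def fuzzy_cmeans(values, n_clusters=2, m=2.0, n_iter=50, seed=0):
    """Plain fuzzy C-means on 1-D intensity values (minimal NumPy sketch)."""
    rng = np.random.default_rng(seed)
    u = rng.dirichlet(np.ones(n_clusters), size=len(values))    # soft memberships
    for _ in range(n_iter):
        um = u ** m
        centers = um.T @ values / um.sum(axis=0)                # weighted means
        d = np.abs(values[:, None] - centers[None, :]) + 1e-12
        u = 1.0 / (d ** (2 / (m - 1)))
        u /= u.sum(axis=1, keepdims=True)
    return u, centers

# Synthetic "CT slice": a brighter block stands in for glandular tissue.
rng = np.random.default_rng(1)
ct_slice = np.clip(rng.normal(0.3, 0.05, (64, 64)), 0, 1)
ct_slice[20:44, 20:44] += 0.3

smooth = denoise_bilateral(ct_slice, sigma_color=0.05, sigma_spatial=3)
memberships, centers = fuzzy_cmeans(smooth.ravel(), n_clusters=2)
labels = memberships.argmax(axis=1).reshape(ct_slice.shape)
print(centers, labels.sum())
```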

  11. Automatic classification of delphinids based on the representative frequencies of whistles.

    PubMed

    Lin, Tzu-Hao; Chou, Lien-Siang

    2015-08-01

    Classification of odontocete species remains a challenging task for passive acoustic monitoring. Classifiers that have been developed use spectral features extracted from echolocation clicks and whistle contours. Most of these contour-based classifiers require complete contours to reduce measurement errors. Therefore, overlapping contours and partially detected contours in an automatic detection algorithm may increase the bias for contour-based classifiers. In this study, classification was conducted on each recording section without extracting individual contours. The local-max detector was used to extract representative frequencies of delphinid whistles and each section was divided into multiple non-overlapping fragments. Three acoustical parameters were measured from the distribution of representative frequencies in each fragment. By using the statistical features of the acoustical parameters and the percentage of overlapping whistles, correct classification rate of 70.3% was reached for the recordings of seven species (Tursiops truncatus, Delphinus delphis, Delphinus capensis, Peponocephala electra, Grampus griseus, Stenella longirostris longirostris, and Stenella attenuata) archived in MobySound.org. In addition, correct classification rate was not dramatically reduced in various simulated noise conditions. This algorithm can be employed in acoustic observatories to classify different delphinid species and facilitate future studies on the community ecology of odontocetes. PMID:26328716

  12. Automatic classification of protein structures using low-dimensional structure space mappings

    PubMed Central

    2014-01-01

    Low-dimensional embeddings are next computed using an embedding technique called multidimensional scaling (MDS). Next, by using the remotely homologous Superfamily and Fold levels of the hierarchical SCOP database, a distance threshold is determined to relate adjacency in the low-dimensional map to functional relationships. In our approach, the optimal threshold is determined as the value that maximizes the total true classification rate vis-a-vis the SCOP classification. We also show that determining such a threshold is often straightforward once the structural relationships are represented using MPSS. Results and conclusion: We demonstrate that MPSS constitute highly accurate representations of protein fold space and enable automatic classification of SCOP Superfamily and Fold-level relationships. The results from our automatic classification approach are remarkably similar to those found in the distantly homologous Superfamily level and the quite remotely homologous Fold levels of SCOP. The significance of our results is underlined by the fact that most automated methods developed thus far have only managed to match the closest-homology Family level of the SCOP hierarchy and tend to differ considerably at the Superfamily and Fold levels. Furthermore, our research demonstrates that projection into a low-dimensional space using MDS constitutes a superior noise-reducing transformation of pairwise distances compared with the variety of probability- and alignment-length-based transformations currently used by structure alignment algorithms. PMID:24564500
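
    The core of the mapping step, embedding a pairwise structural-distance matrix with multidimensional scaling and then choosing the distance threshold that maximizes agreement with known classes, can be sketched as follows. The distance matrix and labels are synthetic stand-ins for the alignment-based distances and SCOP categories; the threshold sweep is illustrative.

```python
import numpy as np
from sklearn.manifold import MDS

# Synthetic pairwise "structural distances" between 30 hypothetical domains.
rng = np.random.default_rng(0)
pts = rng.normal(size=(30, 5))
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
labels = rng.integers(0, 3, size=30)                 # stand-in class labels

# Embed the precomputed distances into a low-dimensional structure-space map.
emb = MDS(n_components=3, dissimilarity="precomputed", random_state=0).fit_transform(D)

def true_rate(threshold):
    """Fraction of domain pairs whose map adjacency agrees with shared class."""
    d_emb = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    mask = ~np.eye(len(labels), dtype=bool)
    return ((d_emb < threshold) == same)[mask].mean()

best = max(np.linspace(0.1, 5, 50), key=true_rate)
print(round(best, 2), round(true_rate(best), 3))
```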

  13. Enhanced information retrieval from narrative German-language clinical text documents using automated document classification.

    PubMed

    Spat, Stephan; Cadonna, Bruno; Rakovac, Ivo; Gütl, Christian; Leitner, Hubert; Stark, Günther; Beck, Peter

    2008-01-01

    The amount of narrative clinical text documents stored in Electronic Patient Records (EPR) of Hospital Information Systems is increasing. Physicians spend a lot of time finding relevant patient-related information for medical decision making in these clinical text documents. Thus, efficient and topical retrieval of relevant patient-related information is an important task in an EPR system. This paper describes the prototype of a medical information retrieval system (MIRS) for clinical text documents. The open-source information retrieval framework Apache Lucene has been used to implement the prototype of the MIRS. Additionally, a multi-label classification system based on the open-source data mining framework WEKA generates metadata from the clinical text document set. The metadata is used for influencing the rank order of documents retrieved by physicians. Combining information retrieval and automated document classification offers an enhanced approach to let physicians and in the near future patients define their information needs for information stored in an EPR. The system has been designed as a J2EE Web-application. First findings are based on a sample of 18,000 unstructured, clinical text documents written in German. PMID:18487776

  14. Automatic classification of acetowhite temporal patterns to identify precursor lesions of cervical cancer

    NASA Astrophysics Data System (ADS)

    Gutiérrez-Fragoso, K.; Acosta-Mesa, H. G.; Cruz-Ramírez, N.; Hernández-Jiménez, R.

    2013-12-01

    Cervical cancer remains a serious public health problem in developing countries. The most common method of screening is the Pap test, or cytology. When abnormalities are reported in the result, the patient is referred to a dysplasia clinic for colposcopy. During this test, a solution of acetic acid is applied, which produces a color change in the tissue known as the acetowhitening phenomenon. This reaction guides the collection of a tissue sample, whose histological analysis establishes the final diagnosis. During the colposcopy test, digital images can be acquired to analyze the behavior of the acetowhitening reaction from a temporal perspective. In this way, we try to identify precursor lesions of cervical cancer through a process of automatic classification of acetowhite temporal patterns. In this paper, we present a performance analysis of three classification methods: kNN, Naïve Bayes and C4.5. The results showed that there is similarity between some acetowhite temporal patterns of normal and abnormal tissues. We therefore conclude that considering only the temporal dynamics of the acetowhitening reaction is not sufficient for an automatic method to establish a diagnosis. Information from the cytologic, colposcopic and histopathologic disciplines should be integrated as well.

  15. Automatic Training Sample Selection for a Multi-Evidence Based Crop Classification Approach

    NASA Astrophysics Data System (ADS)

    Chellasamy, M.; Ferre, P. A. Ty; Humlekrog Greve, M.

    2014-09-01

    An approach that uses the available agricultural parcel information to automatically select training samples for crop classification is investigated. Previous research addressed the multi-evidence crop classification approach using an ensemble classifier. This first produced confidence measures using three Multi-Layer Perceptron (MLP) neural networks trained separately with spectral, texture and vegetation indices; classification labels were then assigned based on Endorsement Theory. The present study proposes an approach to feed this ensemble classifier with automatically selected training samples. The available vector data representing crop boundaries with corresponding crop codes are used as a source of training samples. These vector data are created by farmers to support subsidy claims and are, therefore, prone to errors such as mislabeling of crop codes and boundary digitization errors. The proposed approach is named ECRA (Ensemble based Cluster Refinement Approach). ECRA first automatically removes mislabeled samples and then selects the refined training samples in an iterative training-reclassification scheme. Mislabel removal is based on the expectation that mislabels in each class will be far from the cluster centroid. However, this must be a soft constraint, especially when working with a hypothesis space that does not contain a good approximation of the target classes. Difficulty in finding a good approximation often exists either due to less informative data or a large hypothesis space. Thus, this approach uses the spectral, texture and indices domains in an ensemble framework to iteratively remove the mislabeled pixels from the crop clusters declared by the farmers. Once the clusters are refined, the selected border samples are used for final learning and the unknown samples are classified using the multi-evidence approach. The study is implemented with WorldView-2 multispectral imagery acquired for a study area containing 10 crop classes. The proposed
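
    The centroid-based mislabel filter at the heart of this idea can be sketched in a few lines. This is a one-shot simplification of the iterative, multi-domain ensemble refinement described above; the keep fraction and the synthetic data are assumptions.

```python
import numpy as np

def remove_mislabels(X, y, keep_fraction=0.8):
    """Per declared class, drop the samples farthest from the class centroid."""
    keep = np.zeros(len(y), dtype=bool)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        centroid = X[idx].mean(axis=0)
        dist = np.linalg.norm(X[idx] - centroid, axis=1)
        n_keep = max(1, int(keep_fraction * len(idx)))
        keep[idx[np.argsort(dist)[:n_keep]]] = True
    return X[keep], y[keep]

# Toy usage: two well-separated crop "clusters" with a few mislabeled parcels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(5, 1, (50, 4))])
y = np.array([0] * 50 + [1] * 50)
y[:5] = 1                                   # deliberately mislabeled samples
X_clean, y_clean = remove_mislabels(X, y, keep_fraction=0.8)
print(len(y), "->", len(y_clean), "samples kept")
```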

  16. Automatic generation of 3D motifs for classification of protein binding sites

    PubMed Central

    Nebel, Jean-Christophe; Herzyk, Pawel; Gilbert, David R

    2007-01-01

    Background Since many of the new protein structures delivered by high-throughput processes do not have any known function, there is a need for structure-based prediction of protein function. Protein 3D structures can be clustered according to their fold or secondary structures to produce classes of some functional significance. A recent alternative has been to detect specific 3D motifs which are often associated to active sites. Unfortunately, there are very few known 3D motifs, which are usually the result of a manual process, compared to the number of sequential motifs already known. In this paper, we report a method to automatically generate 3D motifs of protein structure binding sites based on consensus atom positions and evaluate it on a set of adenine based ligands. Results Our new approach was validated by generating automatically 3D patterns for the main adenine based ligands, i.e. AMP, ADP and ATP. Out of the 18 detected patterns, only one, the ADP4 pattern, is not associated with well defined structural patterns. Moreover, most of the patterns could be classified as binding site 3D motifs. Literature research revealed that the ADP4 pattern actually corresponds to structural features which show complex evolutionary links between ligases and transferases. Therefore, all of the generated patterns prove to be meaningful. Each pattern was used to query all PDB proteins which bind either purine based or guanine based ligands, in order to evaluate the classification and annotation properties of the pattern. Overall, our 3D patterns matched 31% of proteins with adenine based ligands and 95.5% of them were classified correctly. Conclusion A new metric has been introduced allowing the classification of proteins according to the similarity of atomic environment of binding sites, and a methodology has been developed to automatically produce 3D patterns from that classification. A study of proteins binding adenine based ligands showed that these 3D patterns are not

  17. AUTOMATIC UNSUPERVISED CLASSIFICATION OF ALL SLOAN DIGITAL SKY SURVEY DATA RELEASE 7 GALAXY SPECTRA

    SciTech Connect

    Almeida, J. Sanchez; Aguerri, J. A. L.; Munoz-Tunon, C.; De Vicente, A.

    2010-05-01

    Using the k-means cluster analysis algorithm, we carry out an unsupervised classification of all galaxy spectra in the seventh and final Sloan Digital Sky Survey data release (SDSS/DR7). Except for the shift to rest-frame wavelengths and the normalization to the g-band flux, no manipulation is applied to the original spectra. The algorithm guarantees that galaxies with similar spectra belong to the same class. We find that 99% of the galaxies can be assigned to only 17 major classes, with 11 additional minor classes including the remaining 1%. The classification is not unique since many galaxies appear in between classes; however, our rendering of the algorithm overcomes this weakness with a tool to identify borderline galaxies. Each class is characterized by a template spectrum, which is the average of all the spectra of the galaxies in the class. These low-noise template spectra vary smoothly and continuously along a sequence labeled from 0 to 27, from the reddest class to the bluest class. Our Automatic Spectroscopic K-means-based (ASK) classification separates galaxies in colors, with classes characteristic of the red sequence, the blue cloud, as well as the green valley. When red sequence galaxies and green valley galaxies present emission lines, they are characteristic of active galactic nucleus activity. Blue galaxy classes have emission lines corresponding to star formation regions. We find the expected correlation between spectroscopic class and Hubble type, but this relationship exhibits a high intrinsic scatter. Several potential uses of the ASK classification are identified and sketched, including fast determination of physical properties by interpolation, classes as templates in redshift determinations, and target selection in follow-up works (we find classes of Seyfert galaxies, green valley galaxies, as well as a significant number of outliers). The ASK classification is publicly accessible through various Web sites.
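
    The classification step itself amounts to running k-means on flux-normalized rest-frame spectra and averaging each class into a template. The snippet below illustrates this on synthetic spectra; the per-spectrum mean normalization is a simplification of the paper's g-band flux normalization, and the cluster count simply mirrors the 17 major classes reported.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic rest-frame spectra stand in for SDSS/DR7 galaxy spectra.
rng = np.random.default_rng(0)
spectra = rng.normal(1.0, 0.1, size=(1000, 200))
spectra[:500, 50:60] += 2.0                  # an "emission line" in half the sample

# Normalize each spectrum and cluster: similar spectra end up in the same class.
norm = spectra / spectra.mean(axis=1, keepdims=True)
km = KMeans(n_clusters=17, n_init=10, random_state=0).fit(norm)

# Each class is characterized by its template spectrum (the class average).
templates = np.vstack([norm[km.labels_ == k].mean(axis=0) for k in range(17)])
print(templates.shape)
```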

  18. Automatic segmentation and classification of gestational sac based on mean sac diameter using medical ultrasound image

    NASA Astrophysics Data System (ADS)

    Khazendar, Shan; Farren, Jessica; Al-Assam, Hisham; Sayasneh, Ahmed; Du, Hongbo; Bourne, Tom; Jassim, Sabah A.

    2014-05-01

    Ultrasound is an effective multipurpose imaging modality that has been widely used for monitoring and diagnosing early pregnancy events. Technology developments coupled with wide public acceptance have made ultrasound an ideal tool for better understanding and diagnosis of early pregnancy. The first measurable signs of an early pregnancy are the geometric characteristics of the Gestational Sac (GS). Currently, the size of the GS is manually estimated from ultrasound images. The manual measurement involves multiple subjective decisions, in which dimensions are taken in three planes to establish what is known as the Mean Sac Diameter (MSD). The manual measurement results in inter- and intra-observer variations, which may lead to difficulties in diagnosis. This paper proposes a fully automated diagnosis solution to accurately identify miscarriage cases in the first trimester of pregnancy based on automatic quantification of the MSD. Our study shows a strong positive correlation between the manual and the automatic MSD estimations. Our experimental results, based on a dataset of 68 ultrasound images and a K nearest neighbor classifier applied to the automatically estimated MSDs, illustrate the effectiveness of the proposed scheme in identifying early miscarriage cases with classification accuracies comparable to those of domain experts.

  19. Automatic classification of volcanic earthquakes using multi-station waveforms and dynamic neural networks

    NASA Astrophysics Data System (ADS)

    Bruton, Christopher Patrick

    Earthquakes and seismicity have long been used to monitor volcanoes. In addition to the time, location, and magnitude of an earthquake, the characteristics of the waveform itself are important. For example, low-frequency or hybrid type events could be generated by magma rising toward the surface. A rockfall event could indicate a growing lava dome. Classification of earthquake waveforms is thus a useful tool in volcano monitoring. A procedure to perform such classification automatically could flag certain event types immediately, instead of waiting for a human analyst's review. Inspired by speech recognition techniques, we have developed a procedure to classify earthquake waveforms using artificial neural networks. A neural network can be "trained" with an existing set of input and desired output data; in this case, we use a set of earthquake waveforms (input) that has been classified by a human analyst (desired output). After training the neural network, new sets of waveforms can be classified automatically as they are presented. Our procedure uses waveforms from multiple stations, making it robust to seismic network changes and outages. The use of a dynamic time-delay neural network allows waveforms to be presented without precise alignment in time, and thus could be applied to continuous data or to seismic events without clear start and end times. We have evaluated several different training algorithms and neural network structures to determine their effects on classification performance. We apply this procedure to earthquakes recorded at Mount Spurr and Katmai in Alaska, and Uturuncu Volcano in Bolivia. The procedure can successfully distinguish between slab and volcanic events at Uturuncu, between events from four different volcanoes in the Katmai region, and between volcano-tectonic and long-period events at Spurr. Average recall and overall accuracy were greater than 80% in all three cases.

  20. Concept Recognition in an Automatic Text-Processing System for the Life Sciences.

    ERIC Educational Resources Information Center

    Vleduts-Stokolov, Natasha

    1987-01-01

    Describes a system developed for the automatic recognition of biological concepts in titles of scientific articles; reports results of several pilot experiments which tested the system's performance; analyzes typical ambiguity problems encountered by the system; describes a disambiguation technique that was developed; and discusses future plans…

  1. Automatic Training Site Selection for Agricultural Crop Classification: a Case Study on Karacabey Plain, Turkey

    NASA Astrophysics Data System (ADS)

    Ozdarici Ok, A.; Akyurek, Z.

    2011-09-01

    This study applies a traditional supervised classification method to an optical image of agricultural crops in a novel way, by selecting the training samples automatically. Panchromatic (1m) and multispectral (4m) Kompsat-2 images (July 2008) of the Karacabey Plain (~100km2), located in the Marmara region, are used to evaluate the proposed approach. Owing to its rich, loamy soils combined with favourable weather conditions, the Karacabey Plain is one of the most valuable agricultural regions of Turkey. The analysis starts by applying an image fusion algorithm to the panchromatic and multispectral images. As a result of this process, a 1m spatial resolution colour image is produced. In the next step, the four-band fused (1m) image and the multispectral (4m) image are orthorectified. Next, the fused image (1m) is segmented using a popular segmentation method, Mean-Shift. Mean-Shift is a method based on kernel density estimation that shifts each pixel to the mode of its cluster. In the segmentation procedure, three parameters must be defined: (i) spatial domain (hs), (ii) range domain (hr), and (iii) minimum region (MR). In this study, a total of 176 parameter combinations (hs, hr, and MR) are tested on a small part of the area (~10km2) to find an optimum segmentation result, and a final parameter combination (hs=18, hr=20, and MR=1000) is determined after evaluating multiple goodness measures. The final segmentation output is then fed into the classification framework. The classification operation is applied to the four-band multispectral image (4m) to minimize the mixed-pixel effect. Before the image classification, each segment is overlaid with the bands of the fused image, and several descriptive statistics of each segment are computed for each band. To select the potential homogeneous regions that are eligible for the selection of training samples, a user-defined threshold is applied. After finding those potential regions, the

  2. Scaling up the evaluation of psychotherapy: evaluating motivational interviewing fidelity via statistical text classification

    PubMed Central

    2014-01-01

    Background Behavioral interventions such as psychotherapy are leading, evidence-based practices for a variety of problems (e.g., substance abuse), but the evaluation of provider fidelity to behavioral interventions is limited by the need for human judgment. The current study evaluated the accuracy of statistical text classification in replicating human-based judgments of provider fidelity in one specific psychotherapy—motivational interviewing (MI). Method Participants (n = 148) came from five previously conducted randomized trials and were either primary care patients at a safety-net hospital or university students. To be eligible for the original studies, participants met criteria for either problematic drug or alcohol use. All participants received a type of brief motivational interview, an evidence-based intervention for alcohol and substance use disorders. The Motivational Interviewing Skills Code is a standard measure of MI provider fidelity based on human ratings that was used to evaluate all therapy sessions. A text classification approach called a labeled topic model was used to learn associations between human-based fidelity ratings and MI session transcripts. It was then used to generate codes for new sessions. The primary comparison was the accuracy of model-based codes with human-based codes. Results Receiver operating characteristic (ROC) analyses of model-based codes showed reasonably strong sensitivity and specificity with those from human raters (range of area under ROC curve (AUC) scores: 0.62 – 0.81; average AUC: 0.72). Agreement with human raters was evaluated based on talk turns as well as code tallies for an entire session. Generated codes had higher reliability with human codes for session tallies and also varied strongly by individual code. Conclusion To scale up the evaluation of behavioral interventions, technological solutions will be required. The current study demonstrated preliminary, encouraging findings regarding the utility

  3. Fidelity of Automatic Speech Processing for Adult and Child Talker Classifications

    PubMed Central

    2016-01-01

    Automatic speech processing (ASP) has recently been applied to very large datasets of naturalistically collected, daylong recordings of child speech via an audio recorder worn by young children. The system developed by the LENA Research Foundation analyzes children's speech for research and clinical purposes, with special focus on identifying and tagging family speech dynamics and the at-home acoustic environment from the auditory perspective of the child. A primary issue for researchers, clinicians, and families using the Language ENvironment Analysis (LENA) system is to what degree the segment labels are valid. This classification study evaluates the performance of the computer ASP output against 23 trained human judges who made about 53,000 classification judgements of segments tagged by the LENA ASP. Results indicate performance consistent with modern ASP systems such as those using HMM methods, with acoustic characteristics of fundamental frequency and segment duration most important for both human and machine classifications. Results are likely to be important for interpreting and improving ASP output. PMID:27529813

  4. Towards the Development of a Mobile Phonopneumogram: Automatic Breath-Phase Classification Using Smartphones.

    PubMed

    Reyes, Bersain A; Reljin, Natasa; Kong, Youngsun; Nam, Yunyoung; Ha, Sangho; Chon, Ki H

    2016-09-01

    Correct labeling of breath phases is useful in the automatic analysis of respiratory sounds, where airflow or volume signals are commonly used as a temporal reference. However, such signals are not always available. The development of a smartphone-based respiratory sound analysis system has received increased attention. In this study, we propose an optical approach that takes advantage of a smartphone's camera and provides a chest movement signal useful for classification of the breath phases when simultaneously recording tracheal sounds. Spirometer and smartphone-based signals were acquired from N = 13 healthy volunteers breathing at different frequencies, airflow levels, and volume levels. We found that the smartphone-acquired chest movement signal was highly correlated with the reference volume (ρ = 0.960 ± 0.025, mean ± SD). A simple linear regression on the chest signal was used to label the breath phases according to the slope between consecutive onsets. An accuracy of 100% was found for the classification of the analyzed breath phases. We found that the proposed classification scheme can be used to correctly classify breath phases in more challenging breathing patterns, such as those that include non-breath events like swallowing, talking, and coughing, and alternating or irregular breathing. These results show the feasibility of developing a portable and inexpensive phonopneumogram for the analysis of respiratory sounds based on smartphones. PMID:26847825
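
    The slope-based labeling rule lends itself to a short sketch; the one below fits a line to the chest-movement signal between consecutive onsets with NumPy and labels each interval by the sign of the slope. The onset indices and the synthetic trace are assumptions for illustration only.

        import numpy as np

        def label_breath_phases(chest_signal, onsets):
            """Label each interval between consecutive onsets as inspiration or
            expiration from the sign of a fitted slope."""
            labels = []
            for start, end in zip(onsets[:-1], onsets[1:]):
                t = np.arange(start, end)
                slope = np.polyfit(t, chest_signal[start:end], 1)[0]
                labels.append("inspiration" if slope > 0 else "expiration")
            return labels

        # Synthetic chest-movement trace: rising = inspiration, falling = expiration
        t = np.linspace(0, 4 * np.pi, 400)
        signal = np.sin(t)
        onsets = [0, 50, 150, 250, 350]   # assumed phase-onset sample indices
        print(label_breath_phases(signal, onsets))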

  5. Performance analysis of distributed applications using automatic classification of communication inefficiencies

    DOEpatents

    Vetter, Jeffrey S.

    2005-02-01

    The method and system described herein present a technique for performance analysis that helps users understand the communication behavior of their message passing applications. The method and system described herein may automatically classify individual communication operations and reveal the cause of communication inefficiencies in the application. This classification allows the developer to quickly focus on the culprits of truly inefficient behavior, rather than manually foraging through massive amounts of performance data. Specifically, the method and system described herein trace the message operations of Message Passing Interface (MPI) applications and then classify each individual communication event using a supervised learning technique: decision tree classification. The decision tree may be trained using microbenchmarks that demonstrate both efficient and inefficient communication. Since the method and system described herein adapt to the target system's configuration through these microbenchmarks, they simultaneously automate the performance analysis process and improve classification accuracy. The method and system described herein may improve the accuracy of performance analysis and dramatically reduce the amount of data that users must encounter.
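
    A toy sketch of the decision-tree step with scikit-learn is given below; the per-event features (message size, wait time, synchronization delay) and the class labels are hypothetical stand-ins for what microbenchmark traces might provide, not the patented system itself.

        import numpy as np
        from sklearn.tree import DecisionTreeClassifier

        # Hypothetical per-event features derived from microbenchmark traces:
        # [message size (bytes), wait time (ms), synchronization delay (ms)]
        X_train = np.array([[1024,   0.1, 0.0],    # efficient point-to-point
                            [1024,   5.0, 4.8],    # late sender
                            [65536,  0.2, 0.1],    # efficient large message
                            [65536,  9.5, 9.0]])   # late receiver
        y_train = ["efficient", "late_sender", "efficient", "late_receiver"]

        clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
        print(clf.predict([[2048, 6.1, 5.9]]))     # classify a newly traced event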

  6. Fidelity of Automatic Speech Processing for Adult and Child Talker Classifications.

    PubMed

    VanDam, Mark; Silbert, Noah H

    2016-01-01

    Automatic speech processing (ASP) has recently been applied to very large datasets of naturalistically collected, daylong recordings of child speech via an audio recorder worn by young children. The system developed by the LENA Research Foundation analyzes children's speech for research and clinical purposes, with special focus on identifying and tagging family speech dynamics and the at-home acoustic environment from the auditory perspective of the child. A primary issue for researchers, clinicians, and families using the Language ENvironment Analysis (LENA) system is to what degree the segment labels are valid. This classification study evaluates the performance of the computer ASP output against 23 trained human judges who made about 53,000 judgements of classification of segments tagged by the LENA ASP. Results indicate performance consistent with modern ASP systems such as those using HMM methods, with acoustic characteristics of fundamental frequency and segment duration most important for both human and machine classifications. Results are likely to be important for interpreting and improving ASP output. PMID:27529813

  7. Automatic classification of background EEG activity in healthy and sick neonates

    NASA Astrophysics Data System (ADS)

    Löfhede, Johan; Thordstein, Magnus; Löfgren, Nils; Flisberg, Anders; Rosa-Zurera, Manuel; Kjellmer, Ingemar; Lindecrantz, Kaj

    2010-02-01

    The overall aim of our research is to develop methods for a monitoring system to be used at neonatal intensive care units. When monitoring a baby, a range of different types of background activity needs to be considered. In this work, we have developed a scheme for automatic classification of background EEG activity in newborn babies. EEG from six full-term babies who were displaying a burst suppression pattern while suffering from the after-effects of asphyxia during birth was included along with EEG from 20 full-term healthy newborn babies. The signals from the healthy babies were divided into four behavioural states: active awake, quiet awake, active sleep and quiet sleep. By using a number of features extracted from the EEG together with Fisher's linear discriminant classifier, we achieved 100% correct classification when separating burst suppression EEG from all four healthy EEG types and 93% true positive classification when separating quiet sleep from the other types. The other three behavioural states could not be classified reliably. When the pathological burst suppression pattern was detected, the analysis was taken one step further and the signal was segmented into burst and suppression, allowing clinically relevant parameters such as suppression length and burst suppression ratio to be calculated. The segmentation of the burst suppression EEG works well, with a probability of error around 4%.
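
    A minimal sketch of the Fisher linear discriminant step with scikit-learn follows; the EEG features and their values are synthetic placeholders, not the features used in the study.

        import numpy as np
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

        # Synthetic per-epoch features (e.g. spectral edge frequency, variance,
        # zero-crossing rate) for two of the background activity types.
        rng = np.random.default_rng(0)
        burst_suppression = rng.normal([3.0, 0.2, 0.1], 0.05, size=(20, 3))
        quiet_sleep       = rng.normal([8.0, 1.0, 0.5], 0.05, size=(20, 3))

        X = np.vstack([burst_suppression, quiet_sleep])
        y = ["burst suppression"] * 20 + ["quiet sleep"] * 20

        lda = LinearDiscriminantAnalysis().fit(X, y)
        print(lda.predict(rng.normal([3.0, 0.2, 0.1], 0.05, size=(1, 3))))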

  8. Notes for the improvement of the spatial and spectral data classification method. [automatic classification and mapping of earth resources satellite data

    NASA Technical Reports Server (NTRS)

    Dalton, C. C.

    1974-01-01

    This report examines the spatial and spectral clustering technique for the unsupervised automatic classification and mapping of earth resources satellite data, and makes a theoretical analysis of the decision rules and tests in order to suggest how the method might best be applied to other flight data, such as that from Skylab and Spacelab.

  9. Automatic classification of prostate cancer Gleason scores from multiparametric magnetic resonance images

    PubMed Central

    Fehr, Duc; Veeraraghavan, Harini; Wibmer, Andreas; Gondo, Tatsuo; Matsumoto, Kazuhiro; Vargas, Herbert Alberto; Sala, Evis; Hricak, Hedvig; Deasy, Joseph O.

    2015-01-01

    Noninvasive, radiological image-based detection and stratification of Gleason patterns can impact clinical outcomes, treatment selection, and the determination of disease status at diagnosis without subjecting patients to surgical biopsies. We present machine learning-based automatic classification of prostate cancer aggressiveness by combining apparent diffusion coefficient (ADC) and T2-weighted (T2-w) MRI-based texture features. Our approach achieved reasonably accurate classification of Gleason scores (GS) 6(3+3) vs. ≥7 and 7(3+4) vs. 7(4+3) despite the presence of highly unbalanced samples by using two different sample augmentation techniques followed by feature selection-based classification. Our method distinguished between GS 6(3+3) and ≥7 cancers with 93% accuracy for cancers occurring in both peripheral (PZ) and transition (TZ) zones and 92% for cancers occurring in the PZ alone. Our approach distinguished the GS 7(3+4) from GS 7(4+3) with 92% accuracy for cancers occurring in both the PZ and TZ and with 93% for cancers occurring in the PZ alone. In comparison, a classifier using only the ADC mean achieved a top accuracy of 58% for distinguishing GS 6(3+3) vs. GS ≥7 for cancers occurring in PZ and TZ and 63% for cancers occurring in PZ alone. The same classifier achieved an accuracy of 59% for distinguishing GS 7(3+4) from GS 7(4+3) occurring in the PZ and TZ and 60% for cancers occurring in PZ alone. Separate analysis of the cancers occurring in TZ alone was not performed owing to the limited number of samples. Our results suggest that texture features derived from ADC and T2-w MRI together with sample augmentation can help to obtain reasonably accurate classification of Gleason patterns. PMID:26578786

  10. Automatic classification of endoscopic images for premalignant conditions of the esophagus

    NASA Astrophysics Data System (ADS)

    Boschetto, Davide; Gambaretto, Gloria; Grisan, Enrico

    2016-03-01

    Barrett's esophagus (BE) is a precancerous complication of gastroesophageal reflux disease in which the normal stratified squamous epithelium lining the esophagus is replaced by intestinal metaplastic columnar epithelium. Repeated endoscopies and multiple biopsies are often necessary to establish the presence of intestinal metaplasia. Narrow Band Imaging (NBI) is an imaging technique commonly used with endoscopies that enhances the contrast of the vascular pattern on the mucosa. We present a computer-based method for the automatic normal/metaplastic classification of endoscopic NBI images. Superpixel segmentation is used to identify and cluster pixels belonging to uniform regions. From each uniform clustered region of pixels, eight features maximizing differences between normal and metaplastic epithelium are extracted for the classification step. For each superpixel, the mean intensities of the three color channels are first selected as features. Three further features are the mean intensities of each superpixel after separately applying three morphological filters (top-hat filtering, entropy filtering and range filtering) to the red-channel image. The last two features require the computation of the Grey-Level Co-Occurrence Matrix (GLCM) and reflect the contrast and the homogeneity of each superpixel. The classification step is performed using an ensemble of 50 classification trees, with a 10-fold cross-validation scheme, training the classifier at each step on a random 70% of the images and testing on the remaining 30% of the dataset. Sensitivity and specificity are 79.2% and 87.3%, respectively, with an overall accuracy of 83.9%.
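
    The two GLCM-derived features can be computed with scikit-image's graycomatrix and graycoprops (spelled greycomatrix/greycoprops in older releases); the patch below is random data and the distance/angle settings are illustrative, not those of the paper.

        import numpy as np
        from skimage.feature import graycomatrix, graycoprops

        def glcm_contrast_homogeneity(gray_patch):
            """Contrast and homogeneity of one superpixel patch (uint8 grey levels)."""
            glcm = graycomatrix(gray_patch, distances=[1],
                                angles=[0, np.pi / 2], levels=256,
                                symmetric=True, normed=True)
            return (graycoprops(glcm, "contrast").mean(),
                    graycoprops(glcm, "homogeneity").mean())

        patch = np.random.default_rng(0).integers(0, 256, (32, 32)).astype(np.uint8)
        print(glcm_contrast_homogeneity(patch))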

  11. Effective Key Parameter Determination for an Automatic Approach to Land Cover Classification Based on Multispectral Remote Sensing Imagery

    PubMed Central

    Wang, Yong; Jiang, Dong; Zhuang, Dafang; Huang, Yaohuan; Wang, Wei; Yu, Xinfang

    2013-01-01

    The classification of land cover based on satellite data is important for many areas of scientific research. Unfortunately, some traditional land cover classification methods (e.g. supervised classification) are very labor-intensive and subjective because of the required human involvement. Jiang et al. proposed a simple but robust method for land cover classification using a prior classification map and a current multispectral remote sensing image. This new method has proven to be a suitable classification method; however, its drawback is that it is a semi-automatic method because the key parameters cannot be selected automatically. In this study, we propose an approach in which the two key parameters are chosen automatically. The proposed method consists primarily of the following three interdependent parts: the selection procedure for the pure-pixel training-sample dataset, the method to determine the key parameters, and the optimal combination model. In this study, the proposed approach employs overall accuracy, the Kappa Coefficient (KC), and time consumption (TC, in seconds) to select the two key parameters automatically instead of relying on a test decision, which avoids subjective bias. A case study of Weichang District of Hebei Province, China, using Landsat-5/TM data of 2010 with 30 m spatial resolution and a prior classification map of 2005, regarded as relatively precise data, was conducted to test the performance of this method. The experimental results show that the methodology for determining the key parameters, based on the portfolio optimisation model, increases the degree of automation of Jiang et al.'s classification method, which may have a wide scope of scientific application. PMID:24204582
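
    The three selection criteria (overall accuracy, KC and TC) can be computed as in the short sketch below with scikit-learn; the label lists are invented stand-ins, and in practice the loop would run over many candidate parameter combinations.

        import time
        from sklearn.metrics import accuracy_score, cohen_kappa_score

        reference = ["forest", "crop", "water", "crop", "urban", "forest"]

        start = time.perf_counter()
        predicted = ["forest", "crop", "water", "urban", "urban", "forest"]  # stand-in for one classification run
        tc = time.perf_counter() - start

        print("OA:", accuracy_score(reference, predicted),
              "KC:", round(cohen_kappa_score(reference, predicted), 3),
              "TC (s):", round(tc, 6))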

  12. Automatic Galaxy Classification via Machine Learning Techniques: Parallelized Rotation/Flipping INvariant Kohonen Maps (PINK)

    NASA Astrophysics Data System (ADS)

    Polsterer, K. L.; Gieseke, F.; Igel, C.

    2015-09-01

    In recent decades, more and more all-sky surveys have created an enormous amount of data which is publicly available on the Internet. Crowd-sourcing projects such as Galaxy-Zoo and Radio-Galaxy-Zoo encouraged users from all over the world to manually conduct various classification tasks. The combination of the pattern-recognition capabilities of thousands of volunteers enabled scientists to finish the data analysis within an acceptable time. For upcoming surveys with billions of sources, however, this approach is not feasible anymore. In this work, we present an unsupervised method that can automatically process large amounts of galaxy data and which generates a set of prototypes. The resulting model can be used both to visualize the given galaxy data and to classify previously unseen images.

  13. Automatic segmentation and classification of mycobacterium tuberculosis with conventional light microscopy

    NASA Astrophysics Data System (ADS)

    Xu, Chao; Zhou, Dongxiang; Zhai, Yongping; Liu, Yunhui

    2015-12-01

    This paper realizes the automatic segmentation and classification of Mycobacterium tuberculosis with conventional light microscopy. First, the candidate bacillus objects are segmented by the marker-based watershed transform. The markers are obtained by an adaptive threshold segmentation based on an adaptive-scale Gaussian filter. The scale of the Gaussian filter is determined according to the color model of the bacillus objects. Then the candidate objects are extracted integrally after region merging and contamination elimination. Second, the shape features of the bacillus objects are characterized by the Hu moments, compactness, eccentricity, and roughness, which are used to classify the single, touching and non-bacillus objects. We evaluated logistic regression, random forest, and intersection kernel support vector machine classifiers for classifying the bacillus objects. Experimental results demonstrate that the proposed method yields high robustness and accuracy. The logistic regression classifier performs best, with an accuracy of 91.68%.

  14. Automatic segmentation and classification of tendon nuclei from IHC stained images

    NASA Astrophysics Data System (ADS)

    Kuok, Chan-Pang; Wu, Po-Ting; Jou, I.-Ming; Su, Fong-Chin; Sun, Yung-Nien

    2015-12-01

    Immunohistochemical (IHC) staining is commonly used for detecting cells in microscopy. It is used for analyzing many types of diseases, e.g. breast cancer. A dispersion problem often exists in cell staining, which affects the accuracy of automatic counting. In this paper, we introduce a new method to overcome this problem. Otsu's thresholding method is first applied to exclude the background, so that only cells with dispersed staining are left in the foreground; refinement is then applied with a local adaptive thresholding method according to the irregularity index of the segmented foreground shape. The refined segmentation results are also compared to those obtained using Otsu's thresholding method alone. Cell classification based on the shape and color indices obtained from the segmentation result is applied to categorize cells as normal, abnormal or suspected abnormal.
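
    A minimal sketch of the global-Otsu-plus-local-refinement idea with scikit-image is shown below; the test image, block_size and offset are illustrative assumptions, and the irregularity-index criterion of the paper is not reproduced here.

        from skimage import data
        from skimage.filters import threshold_otsu, threshold_local

        image = data.camera()                    # stand-in for an IHC-stained image

        # Global Otsu pass to separate foreground from background
        global_mask = image > threshold_otsu(image)

        # Local adaptive refinement (block_size must be odd; values are illustrative)
        refined_mask = image > threshold_local(image, block_size=35, offset=5)

        print(global_mask.sum(), refined_mask.sum())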

  15. Field demonstration of an instrument performing automatic classification of geologic surfaces.

    PubMed

    Bekker, Dmitriy L; Thompson, David R; Abbey, William J; Cabrol, Nathalie A; Francis, Raymond; Manatt, Ken S; Ortega, Kevin F; Wagstaff, Kiri L

    2014-06-01

    This work presents a method with which to automate simple aspects of geologic image analysis during space exploration. Automated image analysis on board the spacecraft can make operations more efficient by generating compressed maps of long traverses for summary downlink. It can also enable immediate automatic responses to science targets of opportunity, improving the quality of targeted measurements collected with each command cycle. In addition, automated analyses on Earth can process large image catalogs, such as the growing database of Mars surface images, permitting more timely and quantitative summaries that inform tactical mission operations. We present TextureCam, a new instrument that incorporates real-time image analysis to produce texture-sensitive classifications of geologic surfaces in mesoscale scenes. A series of tests at the Cima Volcanic Field in the Mojave Desert, California, demonstrated mesoscale surficial mapping at two distinct sites of geologic interest. PMID:24886217

  16. Automatic intraductal breast carcinoma classification using a neural network-based recognition system.

    PubMed

    Reigosa, A; Hernández, L; Torrealba, V; Barrios, V; Montilla, G; Bosnjak, A; Araez, M; Turiaf, M; Leon, A

    1998-07-01

    A contour-based automatic recognition system was applied to classify intraductal breast carcinoma into high nuclear grade and low nuclear grade in a digitized histologic image. The discriminating image characteristics were selected for their invariance to rotation and translation and were derived from cellular contour information. A fully interconnected multilayer perceptron network architecture was selected and trained with the error back-propagation algorithm. Forty cases were analyzed by the system and the diagnoses were compared with those of the pathologist consensus, obtaining agreement in 97.5% of cases (p < .00001). The system may become a very useful tool for the pathologist in the definitive classification of intraductal carcinoma. PMID:21223442

  17. Automatic Detection and Classification of Unsafe Events During Power Wheelchair Use

    PubMed Central

    Moghaddam, Athena K.; Yuen, Hiu Kim; Archambault, Philippe S.; Routhier, François; Michaud, François; Boissy, Patrick

    2014-01-01

    Using a powered wheelchair (PW) is a complex task requiring advanced perceptual and motor control skills. Unfortunately, PW incidents and accidents are not uncommon and their consequences can be serious. The objective of this paper is to develop technological tools that can be used to characterize a wheelchair user’s driving behavior under various settings. In the experiments conducted, PWs are outfitted with a datalogging platform that records, in real-time, the 3-D acceleration of the PW. Data collection was conducted over 35 different activities, designed to capture a spectrum of PW driving events performed at different speeds (collisions with fixed or moving objects, rolling on an inclined plane, and rolling across multiple types of obstacles). The data was processed using time-series analysis and data mining techniques to automatically detect and identify the different events. We compared the classification accuracy using four different types of time-series features: 1) time-delay embeddings; 2) time-domain characterization; 3) frequency-domain features; and 4) wavelet transforms. In the analysis, we compared the classification accuracy obtained when distinguishing between safe and unsafe events during each of the 35 different activities. For the purposes of this study, unsafe events were defined as activities containing collisions against objects at different speeds, and the remainder were defined as safe events. We were able to accurately detect 98% of unsafe events, with a low (12%) false positive rate, using only five examples of each activity. This proof-of-concept study shows that the proposed approach has the potential of capturing, based on limited input from embedded sensors, contextual information on PW use, and of automatically characterizing a user’s PW driving behavior. PMID:27170879

  18. Automatic Detection and Classification of Unsafe Events During Power Wheelchair Use.

    PubMed

    Pineau, Joelle; Moghaddam, Athena K; Yuen, Hiu Kim; Archambault, Philippe S; Routhier, François; Michaud, François; Boissy, Patrick

    2014-01-01

    Using a powered wheelchair (PW) is a complex task requiring advanced perceptual and motor control skills. Unfortunately, PW incidents and accidents are not uncommon and their consequences can be serious. The objective of this paper is to develop technological tools that can be used to characterize a wheelchair user's driving behavior under various settings. In the experiments conducted, PWs are outfitted with a datalogging platform that records, in real-time, the 3-D acceleration of the PW. Data collection was conducted over 35 different activities, designed to capture a spectrum of PW driving events performed at different speeds (collisions with fixed or moving objects, rolling on an inclined plane, and rolling across multiple types of obstacles). The data was processed using time-series analysis and data mining techniques to automatically detect and identify the different events. We compared the classification accuracy using four different types of time-series features: 1) time-delay embeddings; 2) time-domain characterization; 3) frequency-domain features; and 4) wavelet transforms. In the analysis, we compared the classification accuracy obtained when distinguishing between safe and unsafe events during each of the 35 different activities. For the purposes of this study, unsafe events were defined as activities containing collisions against objects at different speeds, and the remainder were defined as safe events. We were able to accurately detect 98% of unsafe events, with a low (12%) false positive rate, using only five examples of each activity. This proof-of-concept study shows that the proposed approach has the potential of capturing, based on limited input from embedded sensors, contextual information on PW use, and of automatically characterizing a user's PW driving behavior. PMID:27170879

  19. Automatic Detection of Cervical Cancer Cells by a Two-Level Cascade Classification System.

    PubMed

    Su, Jie; Xu, Xuan; He, Yongjun; Song, Jinming

    2016-01-01

    We proposed a method for automatic detection of cervical cancer cells in images captured from thin liquid based cytology slides. We selected 20,000 cells in images derived from 120 different thin liquid based cytology slides, which include 5000 epithelial cells (normal 2500, abnormal 2500), lymphoid cells, neutrophils, and junk cells. We first proposed 28 features, including 20 morphologic features and 8 texture features, based on the characteristics of each cell type. We then used a two-level cascade integration system of two classifiers to classify the cervical cells into normal and abnormal epithelial cells. The results showed that the recognition rates for abnormal cervical epithelial cells were 92.7% and 93.2%, respectively, when the C4.5 classifier or the LR (logistic regression) classifier was used individually, while the recognition rate was significantly higher (95.642%) when our two-level cascade integrated classifier system was used. The false negative rate and false positive rate (both 1.44%) of the proposed automatic two-level cascade classification system are also much lower than those of traditional Pap smear review. PMID:27298758

  20. Automatic Detection of Cervical Cancer Cells by a Two-Level Cascade Classification System

    PubMed Central

    Su, Jie; Xu, Xuan; He, Yongjun; Song, Jinming

    2016-01-01

    We proposed a method for automatic detection of cervical cancer cells in images captured from thin liquid based cytology slides. We selected 20,000 cells in images derived from 120 different thin liquid based cytology slides, which include 5000 epithelial cells (normal 2500, abnormal 2500), lymphoid cells, neutrophils, and junk cells. We first proposed 28 features, including 20 morphologic features and 8 texture features, based on the characteristics of each cell type. We then used a two-level cascade integration system of two classifiers to classify the cervical cells into normal and abnormal epithelial cells. The results showed that the recognition rates for abnormal cervical epithelial cells were 92.7% and 93.2%, respectively, when the C4.5 classifier or the LR (logistic regression) classifier was used individually, while the recognition rate was significantly higher (95.642%) when our two-level cascade integrated classifier system was used. The false negative rate and false positive rate (both 1.44%) of the proposed automatic two-level cascade classification system are also much lower than those of traditional Pap smear review. PMID:27298758

  1. Automatic classification of sulcal regions of the human brain cortex using pattern recognition

    NASA Astrophysics Data System (ADS)

    Behnke, Kirsten J.; Rettmann, Maryam E.; Pham, Dzung L.; Shen, Dinggang; Resnick, Susan M.; Davatzikos, Christos; Prince, Jerry L.

    2003-05-01

    Parcellation of the cortex has received a great deal of attention in magnetic resonance (MR) image analysis, but its usefulness has been limited by time-consuming algorithms that require manual labeling. An automatic labeling scheme is necessary to accurately and consistently parcellate a large number of brains. The large variation of cortical folding patterns makes automatic labeling a challenging problem, which cannot be solved by deformable atlas registration alone. In this work, an automated classification scheme that consists of a mix of both atlas driven and data driven methods is proposed to label the sulcal regions, which are defined as the gray matter regions of the cortical surface surrounding each sulcus. The premise for this algorithm is that sulcal regions can be classified according to the pattern of anatomical features (e.g. supramarginal gyrus, cuneus, etc.) associated with each region. Using a nearest-neighbor approach, a sulcal region is classified as being in the same class as the sulcus from a set of training data which has the nearest pattern of anatomical features. Using just one subject as training data, the algorithm correctly labeled 83% of the regions that make up the main sulci of the cortex.

  2. Application of the AutoClass Automatic Bayesian Classification System to HMI Solar Images

    NASA Astrophysics Data System (ADS)

    Parker, D. G.; Beck, J. G.; Ulrich, R. K.

    2011-12-01

    When applied to a sample set of observed data, the Bayesian automatic classification system known as AutoClass finds a set of class definitions based on specified attributes of the data, such as magnetic field and intensity, without human supervision. These class definitions can then be applied to new data sets to automatically identify in them the classes found in the sample set. AutoClass can be applied to solar magnetic and intensity images to identify surface features associated with different values of magnetic and intensity fields in a consistent manner without the need for human judgment. AutoClass has been applied to Mt. Wilson magnetograms and intensity-grams to identify solar surface features associated with variations in total solar irradiance (TSI) and, using those identifications, to improve modeling of TSI variations over time (Ulrich et al., 2010). Here, we apply AutoClass to observables derived from the high resolution 4096 x 4096 HMI magnetic, intensity continuum, line width and line depth images to identify solar surface regions which may be associated with variations in TSI and other solar irradiance measurements. To prevent small instrument artifacts from interfering with class identification, we apply a flat-field correction and a rotationally shifted temporal average to the HMI images prior to processing with AutoClass. This pre-processing also allows an investigation of the sensitivity of AutoClass to instrumental artifacts. The ability to automatically categorize surface features in the HMI images holds out the promise of consistent, relatively quick and manageable analysis of the large quantity of data available in these highly resolved images and the use of that analysis to enhance understanding of the physical processes at work in solar surface features and their implications for the solar-terrestrial environment. Reference: Ulrich, R. K., Parker, D., Bertello, L., and Boyden, J. 2010, Solar Phys., 261, 11.

  3. Automatic retinal vessel classification using a Least Square-Support Vector Machine in VAMPIRE.

    PubMed

    Relan, D; MacGillivray, T; Ballerini, L; Trucco, E

    2014-01-01

    It is important to classify retinal blood vessels into arterioles and venules for computerised analysis of the vasculature and to aid discovery of disease biomarkers. For instance, zone B is the standardised region of a retinal image utilised for the measurement of the arteriole-to-venule width ratio (AVR), a parameter indicative of microvascular health and systemic disease. We introduce a Least Square-Support Vector Machine (LS-SVM) classifier for the first time (to the best of our knowledge) to automatically label arterioles and venules. We use only 4 image features and consider vessels inside zone B (802 vessels from 70 fundus camera images) and in an extended zone (1,207 vessels, 70 fundus camera images). We achieve an accuracy of 94.88% and 93.96% in zone B and the extended zone, respectively, with a training set of 10 images and a testing set of 60 images. With a smaller training set of only 5 images and the same testing set we achieve an accuracy of 94.16% and 93.95%, respectively. This experiment was repeated five times by randomly choosing 10 and 5 images for the training set. Mean classification accuracies are close to the results mentioned above. We conclude that the performance of our system is very promising and outperforms most recently reported systems. Our approach requires smaller training data sets compared to others but still results in a similar or higher classification rate. PMID:25569917

  4. Support Vector Machine Model for Automatic Detection and Classification of Seismic Events

    NASA Astrophysics Data System (ADS)

    Barros, Vesna; Barros, Lucas

    2016-04-01

    The automated processing of multiple seismic signals to detect, localize and classify seismic events is a central tool in both natural hazards monitoring and nuclear treaty verification. However, false detections and missed detections caused by station noise and incorrect classification of arrivals are still an issue, and events are often unclassified or poorly classified. Thus, machine learning techniques can be used in automatic processing to classify the huge database of seismic recordings and provide more confidence in the final output. Applied in the context of the International Monitoring System (IMS) - a global sensor network developed for the Comprehensive Nuclear-Test-Ban Treaty (CTBT) - we propose a fully automatic method for seismic event detection and classification based on a supervised pattern recognition technique called the Support Vector Machine (SVM). According to Kortström et al. (2015), the advantages of using SVM are the ability to handle a large number of features and effectiveness in high-dimensional spaces. Our objective is to detect seismic events from one IMS seismic station located in an area of high seismicity and mining activity and classify them as earthquakes or quarry blasts. The aim is to create a flexible and easily adjustable SVM method that can be applied to different regions and datasets. Taken a step further, accurate results for seismic stations could lead to a modification of the model and its parameters to make it applicable to other waveform technologies used to monitor nuclear explosions, such as infrasound and hydroacoustic waveforms. As an authorized user, we have direct access to all IMS data and bulletins through a secure signatory account. A set of significant seismic waveforms containing different types of events (e.g. earthquakes, quarry blasts) and noise is being analysed to train the model and learn the typical pattern of the signal from these events. Moreover, comparing the performance of the support
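
    A toy version of the SVM classification step is sketched below with scikit-learn; the waveform-derived features and their values are synthetic placeholders, not IMS data.

        import numpy as np
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import SVC

        # Synthetic per-event features (e.g. spectral ratio, dominant frequency,
        # local hour of origin time)
        rng = np.random.default_rng(1)
        earthquakes   = rng.normal([0.8, 6.0, 3.0],  0.3, size=(30, 3))
        quarry_blasts = rng.normal([2.5, 12.0, 13.0], 0.3, size=(30, 3))

        X = np.vstack([earthquakes, quarry_blasts])
        y = ["earthquake"] * 30 + ["quarry blast"] * 30

        model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)).fit(X, y)
        print(model.predict(rng.normal([2.5, 12.0, 13.0], 0.3, size=(1, 3))))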

  5. Unsupervised method for automatic construction of a disease dictionary from a large free text collection.

    PubMed

    Xu, Rong; Supekar, Kaustubh; Morgan, Alex; Das, Amar; Garber, Alan

    2008-01-01

    Concept-specific lexicons (e.g. diseases, drugs, anatomy) are a critical source of background knowledge for many medical language-processing systems. However, the rapid pace of biomedical research and the lack of constraints on usage ensure that such dictionaries are incomplete. Focusing on disease terminology, we have developed an automated, unsupervised, iterative pattern learning approach for constructing a comprehensive medical dictionary of disease terms from randomized clinical trial (RCT) abstracts, and we compared different ranking methods for automatically extracting contextual patterns and concept terms. When used to identify disease concepts from 100 randomly chosen, manually annotated clinical abstracts, our disease dictionary shows significant performance improvement (F1 increased by 35-88%) over available, manually created disease terminologies. PMID:18999169

  6. Generating Automated Text Complexity Classifications That Are Aligned with Targeted Text Complexity Standards. Research Report. ETS RR-10-28

    ERIC Educational Resources Information Center

    Sheehan, Kathleen M.; Kostin, Irene; Futagi, Yoko; Flor, Michael

    2010-01-01

    The Common Core Standards call for students to be exposed to a much greater level of text complexity than has been the norm in schools for the past 40 years. Textbook publishers, teachers, and assessment developers are being asked to refocus materials and methods to ensure that students are challenged to read texts at steadily increasing…

  7. Experimenting with Automatic Text-to-Diagram Conversion: A Novel Teaching Aid for the Blind People

    ERIC Educational Resources Information Center

    Mukherjee, Anirban; Garain, Utpal; Biswas, Arindam

    2014-01-01

    Diagram-describing texts are an integral part of science and engineering subjects, including geometry, physics, engineering drawing, etc. To understand such text, one first tries to draw or perceive the underlying diagram. For blind students to perceive it, such diagrams need to be drawn in some non-visual accessible form like tactile…

  8. [Situation models in text comprehension: will emotionally relieving information be automatically activated?].

    PubMed

    Wentura, D; Nüsing, J

    1999-01-01

    It was tested whether the "situation model" framework can be applied to research on coping processes. Therefore, subjects (N = 80) were presented with short episodes (formulated in a self-referent manner) about everyday situations which potentially ended in a negative way (e.g., failures in achievement situations; losses etc.). The first half of each episode contained a critical sentence with emotionally relieving information. Given a negative ending, this information should be automatically activated due to its relieving effect. A two-factorial design was used. First, a phrase from the critical sentence was presented for recognition either after a negative ending, a positive ending, or before the ending. Second, with minor changes a control sentence (with an additionally distressing character) was constructed for each potentially relieving sentence. As hypothesized, an interaction emerged: Given a negative ending, the error rate was significantly lower for relieving information than for the control version, whereas there was no difference if the test phrase was presented before the end or after a positive end. PMID:10474322

  9. Improvement in accuracy of defect size measurement by automatic defect classification

    NASA Astrophysics Data System (ADS)

    Samir, Bhamidipati; Pereira, Mark; Paninjath, Sankaranarayanan; Jeon, Chan-Uk; Chung, Dong-Hoon; Yoon, Gi-Sung; Jung, Hong-Yul

    2015-10-01

    The blank mask defect review process involves detailed analysis of defects observed across a substrate's multiple preparation stages, such as cleaning and resist-coating. Detailed knowledge of these defects plays an important role in the eventual yield obtained by using the blank. Defect knowledge predominantly comprises details such as the number of defects observed and their accurate sizes. Mask usability assessment at the start of the preparation process is crudely based on the number of defects. Similarly, defect size gives an idea of eventual wafer defect printability. Furthermore, monitoring defect characteristics, specifically size and shape, aids in obtaining process-related information such as cleaning or coating process efficiencies. The blank mask defect review process is largely manual in nature. However, the large number of defects observed for the latest technology nodes with shrinking half-pitch sizes, and the associated amount of information, together make the process increasingly inefficient in terms of review time, accuracy and consistency. The usage of additional tools such as CDSEM may be required to further aid the review process, resulting in increasing costs. Calibre® MDPAutoClassify™ provides an automated software alternative, in the form of a powerful analysis tool for fast, accurate, consistent and automatic classification of blank defects. Elaborate post-processing algorithms are applied on defect images generated by inspection machines, to extract and report significant defect information such as defect size, affecting defect printability and mask usability. The algorithm's capabilities are challenged by the variety and complexity of defects encountered, in terms of defect nature, size, shape and composition, and the optical phenomena occurring around the defect [1]. This paper mainly focuses on the results from the evaluation of the Calibre® MDPAutoClassify™ product. The main objective of this evaluation is to assess the capability of

  10. Automatic classification of small bowel mucosa alterations in celiac disease for confocal laser endomicroscopy

    NASA Astrophysics Data System (ADS)

    Boschetto, Davide; Di Claudio, Gianluca; Mirzaei, Hadis; Leong, Rupert; Grisan, Enrico

    2016-03-01

    Celiac disease (CD) is an immune-mediated enteropathy triggered by exposure to gluten and similar proteins, affecting genetically susceptible persons and increasing their risk of different complications. Small bowel mucosa damage due to CD involves various degrees of endoscopically relevant lesions, which are not easily recognized: their overall sensitivity and positive predictive values are poor even when zoom-endoscopy is used. Confocal Laser Endomicroscopy (CLE) allows skilled and trained experts to qualitatively evaluate mucosa alterations such as a decrease in goblet cell density, presence of villous atrophy or crypt hypertrophy. We present a method for automatically classifying CLE images into three different classes: normal regions, villous atrophy and crypt hypertrophy. This classification is performed after a feature selection process, in which four features are extracted from each image through the application of homomorphic filtering and border identification with Canny and Sobel operators. Three different classifiers have been tested on a dataset of 67 different images labeled by experts in three classes (normal, VA and CH): a linear approach, a Naive-Bayes quadratic approach and a standard quadratic analysis, all validated with ten-fold cross-validation. Linear classification achieves 82.09% accuracy (class accuracies: 90.32% for normal villi, 82.35% for VA and 68.42% for CH; sensitivity: 0.68, specificity: 1.00), the Naive Bayes analysis returns 83.58% accuracy (90.32% for normal villi, 70.59% for VA and 84.21% for CH; sensitivity: 0.84, specificity: 0.92), while the quadratic analysis achieves a final accuracy of 94.03% (96.77% accuracy for normal villi, 94.12% for VA and 89.47% for CH; sensitivity: 0.89, specificity: 0.98).
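
    The comparison of the three classifiers under ten-fold cross-validation can be sketched with scikit-learn as below; the four-feature vectors are synthetic stand-ins for the CLE features, not the study's data.

        import numpy as np
        from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                                   QuadraticDiscriminantAnalysis)
        from sklearn.model_selection import cross_val_score
        from sklearn.naive_bayes import GaussianNB

        # Synthetic 4-feature vectors for the three mucosa classes
        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(c, 0.3, size=(22, 4)) for c in (0.0, 1.5, 3.0)])
        y = np.repeat(["normal", "VA", "CH"], 22)

        for clf in (LinearDiscriminantAnalysis(),
                    GaussianNB(),
                    QuadraticDiscriminantAnalysis()):
            scores = cross_val_score(clf, X, y, cv=10)
            print(type(clf).__name__, round(scores.mean(), 3))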

  11. A Guide to the Library of Congress Classification. Fifth Edition. Library and Information Science Text Series.

    ERIC Educational Resources Information Center

    Chan, Lois Mai

    This fifth edition continues to provide an exposition of the Library of Congress Classification and to act as a tool for studying and for staff training in the use of the scheme. It reflects significant electronic developments that have taken place with regard to the Library of Congress Classification since the last edition. Chapter 1 contains an…

  12. Improved chemical text mining of patents with infinite dictionaries and automatic spelling correction.

    PubMed

    Sayle, Roger; Xie, Paul Hongxing; Muresan, Sorel

    2012-01-23

    The text mining of patents of pharmaceutical interest poses a number of unique challenges not encountered in other fields of text mining. Unlike fields such as bioinformatics, where the number of terms of interest is enumerable and essentially static, systematic chemical nomenclature can describe an infinite number of molecules. Hence, the dictionary- and ontology-based techniques that are commonly used for gene names, diseases, species, etc., have limited utility when searching for novel therapeutic compounds in patents. Additionally, the length and the composition of IUPAC-like names make them more susceptible to typographic problems: OCR failures, human spelling errors, and hyphenation and line breaking issues. This work describes a novel technique, called CaffeineFix, designed to efficiently identify chemical names in free text, even in the presence of typographical errors. Corrected chemical names are generated as input for name-to-structure software. This forms a preprocessing pass, independent of the name-to-structure software used, and is shown to greatly improve the results of chemical text mining in our study. PMID:22148717

  13. BROWSER: An Automatic Indexing On-Line Text Retrieval System. Annual Progress Report.

    ERIC Educational Resources Information Center

    Williams, J. H., Jr.

    The development and testing of the Browsing On-line With Selective Retrieval (BROWSER) text retrieval system allowing a natural language query statement and providing on-line browsing capabilities through an IBM 2260 display terminal is described. The prototype system contains data bases of 25,000 German language patent abstracts, 9,000 English…

  14. The Automatic Assessment of Free Text Answers Using a Modified BLEU Algorithm

    ERIC Educational Resources Information Center

    Noorbehbahani, F.; Kardan, A. A.

    2011-01-01

    e-Learning plays an undoubtedly important role in today's education, and assessment is one of the most essential parts of any instruction-based learning process. Assessment is a common way to evaluate a student's knowledge regarding the concepts related to learning objectives. In this paper, a new method for assessing the free text answers of…
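
    As a rough stand-in for the modified BLEU scoring described in the paper, the sketch below scores a free-text answer against a reference answer with NLTK's smoothed sentence-level BLEU; the sentences are invented examples.

        from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

        references = [["photosynthesis", "converts", "light", "energy",
                       "into", "chemical", "energy"]]
        answer = ["photosynthesis", "turns", "light", "energy",
                  "into", "chemical", "energy"]

        score = sentence_bleu(references, answer,
                              smoothing_function=SmoothingFunction().method1)
        print(round(score, 3))   # higher scores suggest closer agreement with the reference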

  15. Semi-Automatic Grading of Students' Answers Written in Free Text

    ERIC Educational Resources Information Center

    Escudeiro, Nuno; Escudeiro, Paula; Cruz, Augusto

    2011-01-01

    The correct grading of free text answers to exam questions during an assessment process is time consuming and subject to fluctuations in the application of evaluation criteria, particularly when the number of answers is high (in the hundreds). In consequence of these fluctuations, inherent to human nature, and largely determined by emotional…

  16. Automatic classification of endogenous seismic sources within a landslide body using random forest algorithm

    NASA Astrophysics Data System (ADS)

    Provost, Floriane; Hibert, Clément; Malet, Jean-Philippe; Stumpf, André; Doubre, Cécile

    2016-04-01

    Different studies have shown the presence of microseismic activity in soft-rock landslides. The seismic signals exhibit significantly different features in the time and frequency domains, which allow their classification and interpretation. Most of the classes could be associated with different mechanisms of deformation occurring within the landslide body and at its surface (e.g. rockfall, slide-quake, fissure opening, fluid circulation). However, some signals remain not fully understood, and some classes contain too few examples to permit any interpretation. To move toward a more complete interpretation of the links between the dynamics of soft-rock landslides and the physical processes controlling their behaviour, a complete catalog of the endogenous seismicity is needed. We propose a multi-class detection method based on the random forests algorithm to automatically classify the source of seismic signals. Random forests is a supervised machine learning technique that is based on the computation of a large number of decision trees. The multiple decision trees are constructed from training sets including each of the target classes. In the case of seismic signals, these attributes may encompass spectral features but also waveform characteristics, multi-station observations and other relevant information. The random forests classifier is used because it provides state-of-the-art performance when compared with other machine learning techniques (e.g. SVM, neural networks) and requires no fine tuning. Furthermore, it is relatively fast, robust, easy to parallelize, and inherently suitable for multi-class problems. In this work, we present the first results of the classification method applied to the seismicity recorded at the Super-Sauze landslide between 2013 and 2015. We selected a dozen seismic signal features that precisely characterize its spectral content (e.g. central frequency, spectrum width, energy in several frequency bands, spectrogram shape, spectrum local and global maxima
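
    A compact sketch of the random forest step with scikit-learn is shown below; the spectral attributes and the two event classes are synthetic placeholders, not the Super-Sauze catalog.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        # Synthetic per-event attributes: [central frequency (Hz), spectrum width (Hz), duration (s)]
        rng = np.random.default_rng(2)
        rockfalls    = rng.normal([25.0, 30.0, 2.0], 1.0, size=(40, 3))
        slide_quakes = rng.normal([8.0, 10.0, 6.0],  1.0, size=(40, 3))

        X = np.vstack([rockfalls, slide_quakes])
        y = ["rockfall"] * 40 + ["slide-quake"] * 40

        rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
        print(rf.predict(rng.normal([8.0, 10.0, 6.0], 1.0, size=(1, 3))),
              rf.feature_importances_)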

  17. Progress toward automatic classification of human brown adipose tissue using biomedical imaging

    NASA Astrophysics Data System (ADS)

    Gifford, Aliya; Towse, Theodore F.; Walker, Ronald C.; Avison, Malcom J.; Welch, E. B.

    2015-03-01

    Brown adipose tissue (BAT) is a small but significant tissue, which may play an important role in obesity and the pathogenesis of metabolic syndrome. Interest in studying BAT in adult humans is increasing, but in order to quantify BAT volume in a single measurement or to detect changes in BAT over the time course of a longitudinal experiment, BAT needs to first be reliably differentiated from surrounding tissue. Although the uptake of the radiotracer 18F-Fluorodeoxyglucose (18F-FDG) in adipose tissue on positron emission tomography (PET) scans following cold exposure is accepted as an indication of BAT, it is not a definitive indicator, and to date there exists no standardized method for segmenting BAT. Consequently, there is a strong need for robust automatic classification of BAT based on properties measured with biomedical imaging. In this study we begin the process of developing an automated segmentation method based on properties obtained from fat-water MRI and PET-CT scans acquired on ten healthy adult subjects.

  18. Automatic type classification and speaker identification of african elephant (Loxodonta africana) vocalizations

    NASA Astrophysics Data System (ADS)

    Clemins, Patrick J.; Johnson, Michael T.

    2003-04-01

    This paper presents a system for automatically classifying African elephant vocalizations based on systems used for human speech recognition and speaker identification. The experiments are performed on vocalizations collected from captive elephants in a naturalistic environment. Features used for classification include Mel-Frequency Cepstral Coefficients (MFCCs) and log energy which are the most common features used in human speech processing. Since African elephants use lower frequencies than humans in their vocalizations, the MFCCs are computed using a shifted Mel-Frequency filter bank to emphasize the infrasound range of the frequency spectrum. In addition to these features, the use of less traditional features such as those based on fundamental frequency and the phase of the frequency spectrum is also considered. A Hidden Markov Model with Gaussian mixture state probabilities is used to model each type of vocalization. Vocalizations are classified based on type, speaker and estrous cycle. Experiments on continuous call type recognition, which can classify multiple vocalizations in the same utterance, are also performed. The long-term goal of this research is to develop a universal analysis framework and robust feature set for animal vocalizations that can be applied to many species.
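
    The low-frequency-shifted cepstral features can be approximated with librosa by restricting the mel filter bank to a low fmin/fmax range, as in the sketch below; the synthetic tone and all parameter values are illustrative assumptions, not those of the study.

        import numpy as np
        import librosa

        sr = 4000
        t = np.linspace(0, 2.0, 2 * sr, endpoint=False)
        signal = np.sin(2 * np.pi * 20 * t).astype(np.float32)   # stand-in for a low-frequency rumble

        # Mel filter bank confined to low frequencies to emphasize the infrasound range
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=12,
                                    n_mels=26, fmin=5, fmax=150)
        rms_energy = librosa.feature.rms(y=signal)   # frame energy (log energy can be derived from it)
        print(mfcc.shape, rms_energy.shape)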

  19. A Hessian-based methodology for automatic surface crack detection and classification from pavement images

    NASA Astrophysics Data System (ADS)

    Ghanta, Sindhu; Shahini Shamsabadi, Salar; Dy, Jennifer; Wang, Ming; Birken, Ralf

    2015-04-01

    Around three trillion vehicle miles are traveled annually on US transportation systems alone. In addition to road traffic safety, maintaining the road infrastructure in a sound condition promotes a more productive and competitive economy. Due to the significant amounts of financial and human resources required to detect surface cracks by visual inspection, detection of these surface defects is often delayed, resulting in deferred maintenance operations. This paper introduces an automatic system for acquisition, detection, classification, and evaluation of pavement surface cracks by unsupervised analysis of images collected from a camera mounted on the rear of a moving vehicle. A Hessian-based multi-scale filter has been utilized to detect ridges in these images at various scales. Post-processing on the extracted features has been implemented to produce statistics of the length, width, and area covered by cracks, which are crucial for roadway agencies to assess pavement quality. This process has been realized on three sets of roads with different pavement conditions in the city of Brockton, MA. A manually labeled ground-truth dataset is made available to evaluate this algorithm, and the results show more than 90% segmentation accuracy, demonstrating the feasibility of employing this approach at a larger scale.
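
    scikit-image ships Hessian-based multi-scale ridge filters (e.g. sato and frangi); the sketch below uses sato on a stand-in image, and the sigma range and threshold are illustrative assumptions rather than the paper's settings.

        from skimage import data, util
        from skimage.filters import sato

        image = util.img_as_float(data.camera())   # stand-in for a pavement image

        # Cracks appear as dark ridges, hence black_ridges=True
        ridge_response = sato(image, sigmas=range(1, 6), black_ridges=True)
        crack_mask = ridge_response > 0.02         # illustrative threshold

        print("crack pixels:", int(crack_mask.sum()),
              "covered fraction:", round(float(crack_mask.mean()), 4))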

  20. Automatic detection and classification of EOL-concrete and resulting recovered products by hyperspectral imaging

    NASA Astrophysics Data System (ADS)

    Palmieri, Roberta; Bonifazi, Giuseppe; Serranti, Silvia

    2014-05-01

    The recovery of materials from Demolition Waste (DW) represents one of the main targets of the recycling industry, and its characterization is important in order to set up efficient sorting and/or quality control systems. End-of-Life (EOL) concrete materials identification is necessary to maximize DW conversion into useful secondary raw materials, so it is fundamental to develop strategies for the implementation of an automatic recognition system for the recovered products. In this paper, the HyperSpectral Imaging (HSI) technique was applied in order to detect DW composition. Hyperspectral images were acquired by a laboratory device equipped with an HSI sensing device working in the near infrared range (1000-1700 nm): NIR Spectral Camera™, embedding an ImSpector™ N17E (SPECIM Ltd, Finland). Acquired spectral data were analyzed adopting the PLS_Toolbox (Version 7.5, Eigenvector Research, Inc.) under the Matlab® environment (Version 7.11.1, The Mathworks, Inc.), applying different chemometric methods: Principal Component Analysis (PCA) for an exploratory data approach and Partial Least Squares-Discriminant Analysis (PLS-DA) to build classification models. Results showed that it is possible to recognize DW materials, distinguishing recycled aggregates from contaminants (e.g. bricks, gypsum, plastics, wood, foam, etc.). The developed procedure is cheap, fast and non-destructive: it could be used to make some steps of the recycling process more efficient and less expensive.

  1. Automatic Classification of the Vestibulo-Ocular Reflex Nystagmus: Integration of Data Clustering and System Identification.

    PubMed

    Ranjbaran, Mina; Smith, Heather L H; Galiana, Henrietta L

    2016-04-01

    The vestibulo-ocular reflex (VOR) plays an important role in our daily activities by enabling us to fixate on objects during head movements. Modeling and identification of the VOR improves our insight into the system behavior and improves diagnosis of various disorders. However, the switching nature of eye movements (nystagmus), including the VOR, makes dynamic analysis challenging. The first step in such analysis is to segment data into its subsystem responses (here slow and fast segment intervals). Misclassification of segments results in biased analysis of the system of interest. Here, we develop a novel three-step algorithm to classify the VOR data into slow and fast intervals automatically. The proposed algorithm is initialized using a K-means clustering method. The initial classification is then refined using system identification approaches and prediction error statistics. The performance of the algorithm is evaluated on simulated and experimental data. It is shown that the new algorithm performance is much improved over the previous methods, in terms of higher specificity. PMID:26357393

  2. Using Fractal And Morphological Criteria For Automatic Classification Of Lung Diseases

    NASA Astrophysics Data System (ADS)

    Vehel, Jacques Levy

    1989-11-01

    Medical images are difficult to analyze by means of classical image processing tools because they are very complex and irregular. Such shapes are obtained, for instance, in Nuclear Medicine with the spatial distribution of activity for organs such as the lungs, liver, and heart. We have tried to apply two different theories to these signals: - Fractal Geometry deals with the analysis of complex irregular shapes which cannot be well described by classical Euclidean geometry. - Integral Geometry treats sets globally and allows robust measures to be introduced. We have computed three parameters on three kinds of lung SPECT images: normal, pulmonary embolism and chronic disease: - The commonly used fractal dimension (FD), which gives a measurement of the irregularity of the 3D shape. - The generalized lacunarity dimension (GLD), defined as the variance of the ratio of the local activity to the mean activity, which is only sensitive to the distribution and the size of gaps in the surface. - The Favard length, which gives an approximation of the surface of a 3-D shape. The results show that each slice of the lung, considered as a 3D surface, is fractal and that the fractal dimension is the same for each slice and for the three kinds of lungs; as for the lacunarity and Favard length, they are clearly different for normal lungs, pulmonary embolisms and chronic diseases. These results indicate that automatic classification of lung SPECT can be achieved, and that a quantitative measurement of the evolution of the disease could be made.
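
    The fractal dimension (FD) can be estimated by box counting, as in the NumPy sketch below applied to a binary mask; the mask, the box sizes and the filled-square example are illustrative, and real SPECT slices would need thresholding first.

        import numpy as np

        def box_counting_dimension(mask, box_sizes=(2, 4, 8, 16)):
            """Estimate the box-counting (fractal) dimension of a binary mask."""
            counts = []
            h, w = mask.shape
            for size in box_sizes:
                count = 0
                for i in range(0, h, size):
                    for j in range(0, w, size):
                        if mask[i:i + size, j:j + size].any():
                            count += 1
                counts.append(count)
            # slope of log(count) versus log(1/size) estimates the dimension
            return np.polyfit(np.log(1.0 / np.array(box_sizes)), np.log(counts), 1)[0]

        mask = np.zeros((64, 64), dtype=bool)
        mask[16:48, 16:48] = True                  # a filled square: dimension ~2
        print(round(box_counting_dimension(mask), 2))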

  3. Development of a rapid method for the automatic classification of biological agents' fluorescence spectral signatures

    NASA Astrophysics Data System (ADS)

    Carestia, Mariachiara; Pizzoferrato, Roberto; Gelfusa, Michela; Cenciarelli, Orlando; Ludovici, Gian Marco; Gabriele, Jessica; Malizia, Andrea; Murari, Andrea; Vega, Jesus; Gaudio, Pasquale

    2015-11-01

    Biosecurity and biosafety are key concerns of modern society. Although nanomaterials are improving the capacities of point detectors, standoff detection still appears to be an open issue. Laser-induced fluorescence of biological agents (BAs) has proved to be one of the most promising optical techniques to achieve early standoff detection, but its strengths and weaknesses are still to be fully investigated. In particular, different BAs tend to have similar fluorescence spectra due to the ubiquity of biological endogenous fluorophores producing a signal in the UV range, making data analysis extremely challenging. The Universal Multi Event Locator (UMEL), a general method based on support vector regression, is commonly used to identify characteristic structures in arrays of data. In the first part of this work, we investigate fluorescence emission spectra of different simulants of BAs and apply UMEL for their automatic classification. In the second part of this work, we elaborate a strategy for the application of UMEL to the discrimination of different BAs' simulants spectra. Through this strategy, it has been possible to discriminate between these BAs' simulants despite the high similarity of their fluorescence spectra. These preliminary results support the use of SVR methods to classify BAs' spectral signatures.

  4. Statistical Comparison of Classifiers Applied to the Interferential Tear Film Lipid Layer Automatic Classification

    PubMed Central

    Remeseiro, B.; Penas, M.; Mosquera, A.; Novo, J.; Penedo, M. G.; Yebra-Pimentel, E.

    2012-01-01

    The tear film lipid layer is heterogeneous among the population. Its classification depends on its thickness and can be done using the interference pattern categories proposed by Guillon. The interference phenomena can be characterised as a colour texture pattern, which can be automatically classified into one of these categories. From a photograph of the eye, a region of interest is detected and its low-level features are extracted, generating a feature vector that describes it, to be finally classified into one of the target categories. This paper presents an exhaustive study of the problem at hand using different texture analysis methods in three colour spaces and different machine learning algorithms. All these methods and classifiers have been tested on a dataset composed of 105 images from healthy subjects, and the results have been statistically analysed. As a result, the manual process performed by experts can be automated with the benefits of being faster and unaffected by subjective factors, with maximum accuracy over 95%. PMID:22567040
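    As a hedged sketch of the texture-plus-classifier pipeline described above, the code below computes a few grey-level co-occurrence (GLCM) descriptors for an image region and trains an SVM on them. The study evaluates several texture methods in three colour spaces; this single-channel GLCM example and the toy patches are illustrative assumptions only.

```python
# Illustrative GLCM texture features + SVM (not the paper's exact pipeline).
import numpy as np
from sklearn.svm import SVC

def glcm_features(gray, levels=16):
    """Contrast, energy and homogeneity from a horizontal-offset GLCM."""
    q = (gray.astype(float) / gray.max() * (levels - 1)).astype(int)
    glcm = np.zeros((levels, levels))
    for a, b in zip(q[:, :-1].ravel(), q[:, 1:].ravel()):
        glcm[a, b] += 1
    glcm /= glcm.sum()
    i, j = np.indices(glcm.shape)
    contrast = np.sum(glcm * (i - j) ** 2)
    energy = np.sum(glcm ** 2)
    homogeneity = np.sum(glcm / (1.0 + np.abs(i - j)))
    return np.array([contrast, energy, homogeneity])

# Toy stand-ins for two interference-pattern categories: smooth vs. noisy.
rng = np.random.default_rng(1)
smooth = [glcm_features(rng.normal(128, 2, (32, 32)).clip(0, 255)) for _ in range(20)]
noisy = [glcm_features(rng.normal(128, 60, (32, 32)).clip(0, 255)) for _ in range(20)]
X, y = np.vstack(smooth + noisy), np.array([0] * 20 + [1] * 20)
clf = SVC().fit(X, y)
print(clf.predict([glcm_features(rng.normal(128, 60, (32, 32)).clip(0, 255))]))
```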

  5. An Improved Automatic Classification of a Landsat/TM Image from Kansas (FIFE)

    NASA Technical Reports Server (NTRS)

    Kanefsky, Bob; Stutz, John; Cheeseman, Peter; Taylor, Will

    1994-01-01

    This research note shows the results of applying a new massively parallel version of the automatic classification program (AutoClass IV) to a particular Landsat/TM image. The previous results for this image were produced using a "subsampling" technique because of the image size. The new massively parallel version of AutoClass allows the complete image to be classified without "subsampling", thus yielding improved results. The area in question is the FIFE study area in Kansas, and the classes AutoClass found show many interesting subtle variations in types of ground cover. Displays of the spatial distributions of these classes make up the bulk of this report. While the spatial distribution of some of these classes makes their interpretation easy, most of the classes require detailed knowledge of the area for their full interpretation. We hope that some who receive this document can help us in understanding these classes. One of the motivations of this exercise was to test the new version of AutoClass (IV), which allows for correlation among the variables within a class. The scatter plots associated with the classes show that this correlation information is important in separating the classes. The fact that the spatial distribution of each of these classes is far from uniform, even though AutoClass was not given information about pixel positions, shows that the classes are due to real differences in the image.
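    AutoClass performs unsupervised Bayesian classification, and AutoClass IV additionally models correlations among variables within a class. A rough modern analogue, not the AutoClass implementation, is a full-covariance Gaussian mixture fitted to multispectral pixel vectors, as sketched below on synthetic data.

```python
# Rough analogue of the idea (not AutoClass itself): unsupervised
# full-covariance Gaussian mixture over multiband pixel vectors.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic 100x100 "image" with 6 bands and two cover types whose
# bands are correlated within each class.
n = 100 * 100 // 2
cov_a = 0.6 * np.ones((6, 6)) + 0.4 * np.eye(6)
cov_b = 0.2 * np.ones((6, 6)) + 0.8 * np.eye(6)
pixels = np.vstack([
    rng.multivariate_normal(np.full(6, 0.3), 0.01 * cov_a, n),
    rng.multivariate_normal(np.full(6, 0.6), 0.01 * cov_b, n),
])

# No positions or labels are supplied; classes emerge from the spectral
# statistics alone, as in the experiment described above.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
labels = gmm.fit_predict(pixels).reshape(100, 100)
print(np.bincount(labels.ravel()))
```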

  6. EnvMine: A text-mining system for the automatic extraction of contextual information

    PubMed Central

    2010-01-01

    Background For ecological studies, it is crucial to have adequate descriptions of the environments and samples being studied. Such a description must be given in terms of their physicochemical characteristics, allowing a direct comparison between different environments that would otherwise be difficult. The characterization must also include the precise geographical location, to make possible the study of geographical distributions and biogeographical patterns. Currently, there is no schema for annotating these environmental features, and these data have to be extracted from textual sources (published articles). So far, this has had to be done by manual inspection of the corresponding documents. To facilitate this task, we have developed EnvMine, a set of text-mining tools devoted to retrieving contextual information (physicochemical variables and geographical locations) from textual sources of any kind. Results EnvMine is capable of retrieving the physicochemical variables cited in the text by means of the accurate identification of their associated units of measurement. In this task, the system achieves a recall (percentage of items retrieved) of 92% with less than 1% error. A Bayesian classifier was also tested for distinguishing parts of the text describing environmental characteristics from others dealing with, for instance, experimental settings. Regarding the identification of geographical locations, the system takes advantage of existing databases such as GeoNames to achieve 86% recall with 92% precision. The identification of a location also includes the determination of its exact coordinates (latitude and longitude), thus allowing the calculation of distances between individual locations. Conclusion EnvMine is a very efficient method for extracting contextual information from different text sources, like published articles or web pages. This tool can help in determining the precise location and physicochemical variables of sampling sites, thus
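    The unit-driven identification of physicochemical variables described above can be pictured with a simple pattern-matching sketch. EnvMine's actual rules, dictionaries and classifiers are far richer; the regular expressions below are illustrative assumptions only.

```python
# Toy illustration of unit-based variable spotting (not EnvMine's rules).
import re

UNIT_PATTERN = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>°C|mM|µM|mg/L|g/L|ppm)\b"
)
PH_PATTERN = re.compile(r"\bpH\s*(?:of\s*)?(?P<value>\d+(?:\.\d+)?)")

def extract_variables(sentence):
    hits = [(m.group("value"), m.group("unit")) for m in UNIT_PATTERN.finditer(sentence)]
    hits += [(m.group("value"), "pH") for m in PH_PATTERN.finditer(sentence)]
    return hits

text = ("Samples were collected at 65.3 °C and pH 2.1, "
        "with sulphate concentrations of 1.2 g/L.")
print(extract_variables(text))   # [('65.3', '°C'), ('1.2', 'g/L'), ('2.1', 'pH')]
```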

  7. A Grid Service for Automatic Land Cover Classification Using Hyperspectral Images

    NASA Astrophysics Data System (ADS)

    Jasso, H.; Shin, P.; Fountain, T.; Pennington, D.; Ding, L.; Cotofana, N.

    2004-12-01

    Hyperspectral images are collected using Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) optical sensors [1]. A total of 224 contiguous channels are measured across the spectral range from 400 to 2500 nanometers. We present a system for the automatic classification of land cover using hyperspectral images, and propose an architecture for deploying the system in a grid environment that harnesses distributed file storage and CPU resources for the task. Originally, we ran the following data mining algorithms on a 300x300 image of a section of the Sevilleta National Wildlife Refuge in New Mexico [2]: Maximum Likelihood, Naive Bayes Classifier, Minimum Distance, and Support Vector Machine (SVM). For this, ground truth for 673 pixels was manually collected according to eight possible land cover types: river, riparian, agriculture, arid upland, semi-arid upland, barren, pavement, or clouds. The classification accuracies for these algorithms were 96.4%, 90.9%, 88.4%, and 77.6%, respectively [3]. In this study, we noticed that the slope between adjacent frequencies produces specific patterns across the whole spectrum, giving a good indication of the pixel's land cover type. Wavelet analysis makes these global patterns explicit by breaking down the signal into variable-sized windows, where long windows capture low-frequency information and short windows capture high-frequency information. High-frequency information reflects variation among close neighbors, while low-frequency information captures the overall trend of the features. We pre-processed the data using different families of wavelets, resulting in an increase in the performance of the Naive Bayesian Classifier and SVM to 94.2% and 90.1%, respectively. Classification accuracy with SVM was further increased to 97.1% by modifying the mechanism by which multi-class classification is achieved from basic two-class SVMs. The original winner-take-all SVM scheme was replaced with a one-against-one scheme, in which k(k-1)/2 pairwise binary classifiers are trained.
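    A hedged sketch of the two ingredients highlighted above, wavelet preprocessing of each pixel's spectrum and one-against-one multi-class SVM classification, is given below. The band count, wavelet choice, and synthetic labelled pixels are assumptions for illustration, not the study's data or configuration.

```python
# Illustrative only: wavelet features per pixel + one-against-one SVM.
import numpy as np
import pywt                                   # PyWavelets
from sklearn.svm import SVC
from sklearn.multiclass import OneVsOneClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_bands, n_classes, n_per_class = 224, 8, 40

# Synthetic stand-in for labelled hyperspectral pixels.
centers = rng.random((n_classes, n_bands))
X = np.vstack([c + 0.3 * rng.standard_normal((n_per_class, n_bands)) for c in centers])
y = np.repeat(np.arange(n_classes), n_per_class)

def wavelet_features(spectrum, wavelet="db4", level=3):
    # Concatenate approximation and detail coefficients of a multilevel DWT.
    return np.concatenate(pywt.wavedec(spectrum, wavelet, level=level))

Xw = np.apply_along_axis(wavelet_features, 1, X)

# One binary SVM per class pair, i.e. k(k-1)/2 classifiers for k classes.
ovo_svm = OneVsOneClassifier(SVC(kernel="rbf", C=10.0))
print(cross_val_score(ovo_svm, Xw, y, cv=5).mean())
```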

  8. Automatic target classification of man-made objects in synthetic aperture radar images using Gabor wavelet and neural network

    NASA Astrophysics Data System (ADS)

    Vasuki, Perumal; Roomi, S. Mohamed Mansoor

    2013-01-01

    Processing of synthetic aperture radar (SAR) images has led to the development of automatic target classification approaches. These approaches help to classify individual and massed military ground vehicles. This work aims to develop an automatic target classification technique to classify military targets such as trucks, tanks, armored cars, cannons, and bulldozers. The proposed method consists of three stages: preprocessing, feature extraction, and neural network (NN) classification. The first stage removes speckle noise from the SAR image with a Frost filter and enhances the image by histogram equalization. The second stage uses a Gabor wavelet to extract image features. The third stage classifies the target with an NN classifier using the image features. On the moving and stationary target acquisition and recognition database, the proposed work performs better than earlier methods such as K-nearest neighbor (KNN) classification.
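    The feature-extraction stage described above can be sketched with a small Gabor filter bank; the kernel parameters, the striped toy chips, and the tiny multilayer perceptron below are assumptions for illustration, not the paper's configuration.

```python
# Illustrative Gabor-filter-bank features + small neural network classifier.
import numpy as np
from scipy.signal import fftconvolve
from sklearn.neural_network import MLPClassifier

def gabor_kernel(ksize, theta, wavelength, sigma, gamma=0.5):
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
    return envelope * np.cos(2 * np.pi * xr / wavelength)

def gabor_features(image):
    feats = []
    for theta in np.arange(0, np.pi, np.pi / 4):      # 4 orientations
        for wavelength in (4, 8):                      # 2 scales
            resp = fftconvolve(image, gabor_kernel(15, theta, wavelength, 4.0), mode="same")
            feats += [np.abs(resp).mean(), resp.std()]
    return np.array(feats)

# Toy targets: horizontally vs. vertically striped chips as two classes.
rng = np.random.default_rng(0)
def chip(vertical):
    stripes = np.sin(np.arange(64) / 3.0)
    img = np.tile(stripes, (64, 1)) if vertical else np.tile(stripes[:, None], (1, 64))
    return img + 0.2 * rng.standard_normal((64, 64))

X = np.array([gabor_features(chip(i % 2 == 0)) for i in range(40)])
y = np.array([i % 2 for i in range(40)])
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y)
print(clf.score(X, y))
```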

  9. Automatic extraction of reference gene from literature in plants based on text mining.

    PubMed

    He, Lin; Shen, Gengyu; Li, Fei; Huang, Shuiqing

    2015-01-01

    Real-Time Quantitative Polymerase Chain Reaction (qRT-PCR) is widely used in biological research. Selecting a stable reference gene is key to the validity of a qRT-PCR experiment. However, selecting an appropriate reference gene usually requires rigorous biological experiments for verification, which makes the selection process costly. The scientific literature has accumulated many results on the selection of reference genes. Therefore, mining reference genes for specific experimental conditions from the literature can provide reliable reference genes for similar qRT-PCR experiments, with the advantages of reliability, economy, and efficiency. This paper proposes an auxiliary method for discovering reference genes from the literature that integrates machine learning, natural language processing, and text mining approaches. Validity tests showed that the new method achieves better precision and recall in extracting reference genes and their experimental contexts. PMID:26510294

  10. Regional Image Features Model for Automatic Classification between Normal and Glaucoma in Fundus and Scanning Laser Ophthalmoscopy (SLO) Images.

    PubMed

    Haleem, Muhammad Salman; Han, Liangxiu; Hemert, Jano van; Fleming, Alan; Pasquale, Louis R; Silva, Paolo S; Song, Brian J; Aiello, Lloyd Paul

    2016-06-01

    Glaucoma is one of the leading causes of blindness worldwide. There is no cure for glaucoma, but detection at its earliest stage and subsequent treatment can help patients avoid blindness. Currently, optic disc and retinal imaging facilitates glaucoma detection, but this method requires manual post-imaging analysis that is time-consuming and subject to the subjective assessment of human observers. Therefore, it is necessary to automate this process. In this work, we propose a novel computer-aided approach for automatic glaucoma detection based on a Regional Image Features Model (RIFM), which can automatically classify images as normal or glaucomatous on the basis of regional information. Unlike existing methods, our approach can extract both geometric properties (e.g. morphometric properties) and non-geometric properties (e.g. pixel appearance/intensity values, texture) from images, significantly increasing classification performance. Our proposed approach consists of three major contributions: automatic localisation of the optic disc, automatic segmentation of the disc, and classification between normal and glaucoma based on geometric and non-geometric properties of different regions of an image. We have compared our method with existing approaches and tested it on both fundus and scanning laser ophthalmoscopy (SLO) images. The experimental results show that our proposed approach outperforms state-of-the-art approaches that use either geometric or non-geometric properties alone. The overall glaucoma classification accuracy for fundus images is 94.4%, and the accuracy of detecting suspected glaucoma in SLO images is 93.9%. PMID:27086033
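    The central idea, combining geometric and non-geometric descriptors of image regions in one feature vector before classification, is sketched below on toy data. The specific descriptors, the synthetic regions, and the random-forest classifier are illustrative assumptions, not the RIFM implementation.

```python
# Illustrative fusion of geometric (shape) and non-geometric (intensity)
# region descriptors for a two-class decision; not the RIFM method itself.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def region_features(image, mask):
    """Simple geometric + appearance descriptors of a masked region."""
    ys, xs = np.nonzero(mask)
    area = mask.sum()
    aspect = (np.ptp(ys) + 1) / (np.ptp(xs) + 1)      # crude shape descriptor
    vals = image[mask]
    return np.array([area, aspect, vals.mean(), vals.std()])

rng = np.random.default_rng(0)
yy, xx = np.mgrid[:96, :96]

def toy_sample(abnormal):
    radius = 30 if abnormal else 20                    # e.g. an enlarged region
    mask = (xx - 48) ** 2 + (yy - 48) ** 2 < radius ** 2
    image = rng.normal(0.4 + 0.2 * abnormal, 0.05, (96, 96))
    return region_features(image, mask)

X = np.array([toy_sample(i % 2 == 1) for i in range(40)])
y = np.array([i % 2 for i in range(40)])
clf = RandomForestClassifier(random_state=0).fit(X, y)
print(clf.score(X, y))
```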