Science.gov

Sample records for automatic text classification

  1. Information fusion for automatic text classification

    SciTech Connect

    Dasigi, V.; Mann, R.C.; Protopopescu, V.A.

    1996-08-01

    Analysis and classification of free text documents encompass decision-making processes that rely on several clues derived from text and other contextual information. When using multiple clues, it is generally not known a priori how these should be integrated into a decision. An algorithmic sensor based on Latent Semantic Indexing (LSI) (a recent successful method for text retrieval rather than classification) is the primary sensor used in our work, but its utility is limited by the reference library of documents. Thus, there is an important need to complement or at least supplement this sensor. We have developed a system that uses a neural network to integrate the LSI-based sensor with other clues derived from the text. This approach allows for systematic fusion of several information sources in order to determine a combined best decision about the category to which a document belongs.
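
    A minimal sketch of this kind of sensor fusion, assuming scikit-learn and the 20 Newsgroups corpus as a stand-in: the LSI "sensor" is approximated by a truncated SVD of the TF-IDF term-document matrix, one extra clue (log document length) is added, and a small neural network learns how to combine them. This illustrates the idea, not the authors' implementation.

    ```python
    # Illustrative fusion of an LSI-style sensor with an extra textual clue;
    # corpus, features, and network size are assumptions, not the paper's setup.
    import numpy as np
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.preprocessing import StandardScaler
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import train_test_split

    data = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])
    tfidf = TfidfVectorizer(max_features=5000).fit_transform(data.data)

    # "Algorithmic sensor": LSI, i.e. a truncated SVD of the term-document matrix.
    lsi = TruncatedSVD(n_components=100, random_state=0).fit_transform(tfidf)

    # A second clue derived from the text (here simply log document length).
    doc_len = np.log1p([len(d) for d in data.data]).reshape(-1, 1)

    # Fuse both information sources and let a neural net learn the combination.
    X = StandardScaler().fit_transform(np.hstack([lsi, doc_len]))
    X_tr, X_te, y_tr, y_te = train_test_split(X, data.target, random_state=0)
    net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
    net.fit(X_tr, y_tr)
    print("fused-sensor accuracy:", net.score(X_te, y_te))
    ```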

  2. Automatic Text Classification of English Newswire Articles Based on Statistical Classification Techniques

    NASA Astrophysics Data System (ADS)

    Zu, Guowei; Ohyama, Wataru; Wakabayashi, Tetsushi; Kimura, Fumitaka

    The basic process of automatic text classification is learning a classification scheme from training examples and then using it to classify unseen textual documents. It is essentially the same process as graphic or character pattern recognition, so pattern recognition approaches can be used for automatic text categorization. In this research, several statistical classification techniques, employing Euclidean distance, various similarity measures, the linear discriminant function, projection distance, modified projection distance, SVM, and the nearest-neighbor rule, were used for automatic text classification. Principal component analysis was used to reduce the dimensionality of the feature vector. Comparative experiments were conducted on the Reuters-21578 test collection of English newswire articles. The results show that modified projection distance consistently outperforms the other methods and that principal component analysis is suitable for reducing the dimensionality of the text features.
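
    The comparison can be sketched with scikit-learn as below; only the techniques with stock implementations (Euclidean nearest centroid, nearest neighbour, linear SVM) are shown, projection distance is omitted, and 20 Newsgroups stands in for Reuters-21578.

    ```python
    # PCA-reduced text features compared across several distance-based
    # classifiers; dataset and settings are illustrative assumptions.
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import PCA
    from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    data = fetch_20newsgroups(subset="train", categories=["rec.autos", "sci.space"])
    X = TfidfVectorizer(max_features=3000).fit_transform(data.data).toarray()
    X = PCA(n_components=50, random_state=0).fit_transform(X)  # reduce dimensionality

    for name, clf in [("nearest centroid (Euclidean)", NearestCentroid()),
                      ("k-nearest neighbour", KNeighborsClassifier(5)),
                      ("linear SVM", LinearSVC())]:
        print(name, cross_val_score(clf, X, data.target, cv=5).mean())
    ```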

  3. Finding keywords amongst noise: automatic text classification without parsing

    NASA Astrophysics Data System (ADS)

    Allison, Andrew G.; Pearce, Charles E. M.; Abbott, Derek

    2007-06-01

    The amount of text stored on the Internet, and in our libraries, continues to expand at an exponential rate. There is a great practical need to locate relevant content. This requires quick automated methods for classifying textual information according to subject. We propose a quick statistical approach that can distinguish 'keywords' from 'noisewords', such as 'the' and 'a', without the need to parse the text into its parts of speech. Our classification is based on an F-statistic, which compares the observed Word Recurrence Interval (WRI) with a simple null hypothesis. We also propose a model to account for the observed distribution of WRI statistics and we subject this model to a number of tests.
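
    The burstiness idea behind the WRI statistic can be sketched as a variance ratio against a shuffled null model; the paper's exact F-statistic and null hypothesis differ, so treat the details below as illustrative assumptions.

    ```python
    # Keywords occur in bursts, so the variance of their recurrence intervals
    # exceeds that of a random (shuffled) null model; this ratio plays the
    # role of the paper's F-statistic but is not the authors' exact test.
    import random
    import numpy as np

    def recurrence_intervals(tokens, word):
        positions = [i for i, t in enumerate(tokens) if t == word]
        return np.diff(positions)

    def wri_f_statistic(tokens, word, shuffles=50, seed=0):
        obs = recurrence_intervals(tokens, word)
        if len(obs) < 2:
            return float("nan")
        rng = random.Random(seed)
        shuffled, null_vars = list(tokens), []
        for _ in range(shuffles):
            rng.shuffle(shuffled)          # null: same counts, random placement
            null_vars.append(recurrence_intervals(shuffled, word).var())
        return obs.var() / np.mean(null_vars)  # >> 1 suggests a bursty keyword

    text = ("the cat sat on the mat . the quantum cat measured the quantum "
            "state and the quantum detector clicked . the mat stayed put").split()
    for w in ["the", "quantum"]:
        print(w, wri_f_statistic(text, w))
    ```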

  4. Toward a multi-sensor neural net approach to automatic text classification

    SciTech Connect

    Dasigi, V.; Mann, R.

    1996-01-26

    Many automatic text indexing and retrieval methods use a term-document matrix that is automatically derived from the text in question. Latent Semantic Indexing, a recent method for approximating large term-document matrices, appears to be quite useful in the problem of text information retrieval, rather than text classification. Here we outline a method that attempts to combine the strength of the LSI method with that of neural networks, in addressing the problem of text classification. In doing so, we also indicate ways to improve performance by adding additional "logical sensors" to the neural network, something that is hard to do with the LSI method when employed by itself. Preliminary results are summarized, but much work remains to be done.

  5. Portable Automatic Text Classification for Adverse Drug Reaction Detection via Multi-corpus Training

    PubMed Central

    Gonzalez, Graciela

    2014-01-01

    Objective Automatic detection of Adverse Drug Reaction (ADR) mentions from text has recently received significant interest in pharmacovigilance research. Current research focuses on various sources of text-based information, including social media, where enormous amounts of user-posted data are available, which have the potential for use in pharmacovigilance if collected and filtered accurately. The aims of this study are: (i) to explore natural language processing approaches for generating useful features from text, and utilizing them in optimized machine learning algorithms for automatic classification of ADR assertive text segments; (ii) to present two data sets that we prepared for the task of ADR detection from user-posted internet data; and (iii) to investigate if combining training data from distinct corpora can improve automatic classification accuracies. Methods One of our three data sets contains annotated sentences from clinical reports, and the two other data sets, built in-house, consist of annotated posts from social media. Our text classification approach relies on generating a large set of features, representing semantic properties (e.g., sentiment, polarity, and topic), from short text nuggets. Importantly, using our expanded feature sets, we combine training data from different corpora in attempts to boost classification accuracies. Results Our feature-rich classification approach performs significantly better than previously published approaches, with ADR class F-scores of 0.812 (previously reported best: 0.770), 0.538 and 0.678 for the three data sets. Combining training data from multiple compatible corpora further improves the ADR F-scores for the in-house data sets to 0.597 (improvement of 5.9 units) and 0.704 (improvement of 2.6 units) respectively. Conclusions Our research results indicate that using advanced NLP techniques for generating information-rich features from text can significantly improve classification accuracies over existing

  6. Toward a multi-sensor-based approach to automatic text classification

    SciTech Connect

    Dasigi, V.R.; Mann, R.C.

    1995-10-01

    Many automatic text indexing and retrieval methods use a term-document matrix that is automatically derived from the text in question. Latent Semantic Indexing is a method, recently proposed in the Information Retrieval (IR) literature, for approximating a large and sparse term-document matrix with a relatively small number of factors, and is based on a solid mathematical foundation. LSI appears to be quite useful in the problem of text information retrieval, rather than text classification. In this report, we outline a method that attempts to combine the strength of the LSI method with that of neural networks, in addressing the problem of text classification. In doing so, we also indicate ways to improve performance by adding additional "logical sensors" to the neural network, something that is hard to do with the LSI method when employed by itself. The various programs that can be used in testing the system with the TIPSTER data set are described. Preliminary results are summarized, but much work remains to be done.

  7. Automatic classification of diseases from free-text death certificates for real-time surveillance.

    PubMed

    Koopman, Bevan; Karimi, Sarvnaz; Nguyen, Anthony; McGuire, Rhydwyn; Muscatello, David; Kemp, Madonna; Truran, Donna; Zhang, Ming; Thackway, Sarah

    2015-07-15

    Death certificates provide an invaluable source for mortality statistics, which can be used for surveillance and early warnings of increases in disease activity and to support the development and monitoring of prevention or response strategies. However, their value can be realised only if accurate, quantitative data can be extracted from death certificates, an aim hampered by both the volume and variable nature of certificates written in natural language. This study aims to develop a set of machine learning and rule-based methods to automatically classify death certificates according to four high-impact diseases of interest: diabetes, influenza, pneumonia and HIV. Two classification methods are presented: i) a machine learning approach, where detailed features (terms, term n-grams and SNOMED CT concepts) are extracted from death certificates and used to train a set of supervised machine learning models (Support Vector Machines); and ii) a set of keyword-matching rules. These methods were used to identify the presence of diabetes, influenza, pneumonia and HIV in a death certificate. An empirical evaluation was conducted using 340,142 death certificates, divided between training and test sets, covering deaths from 2000-2007 in New South Wales, Australia. Precision and recall (positive predictive value and sensitivity) were used as evaluation measures, with F-measure providing a single, overall measure of effectiveness. A detailed error analysis was performed on classification errors. Classification of diabetes, influenza, pneumonia and HIV was highly accurate (F-measure 0.96). More fine-grained ICD-10 classification effectiveness was more variable but still high (F-measure 0.80). The error analysis revealed that word variations as well as certain word combinations adversely affected classification. In addition, anomalies in the ground truth likely led to an underestimation of the effectiveness. The high accuracy and low cost of the classification methods allow for an
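
    Both strands of the method, a supervised n-gram classifier and keyword-matching rules, can be sketched with scikit-learn as below; the toy certificates and the keyword list are invented for illustration, and the study's SNOMED CT features are omitted.

    ```python
    # Two toy detectors for one disease: an n-gram SVM and a keyword rule.
    # Certificates and keywords are invented; SNOMED CT features are omitted.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    certificates = ["type 2 diabetes mellitus with renal complications",
                    "influenza a with secondary bacterial pneumonia",
                    "aspiration pneumonia following stroke",
                    "ischaemic heart disease"]
    has_pneumonia = [0, 1, 1, 0]

    # (i) machine learning: term and n-gram features feeding a linear SVM
    svm = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    svm.fit(certificates, has_pneumonia)

    # (ii) keyword-matching rule
    def rule_pneumonia(text):
        return int(any(k in text.lower() for k in ("pneumonia", "pneumonitis")))

    query = "lobar pneumonia, congestive heart failure"
    print(svm.predict([query])[0], rule_pneumonia(query))
    ```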

  8. Automatic ICD-10 classification of cancers from free-text death certificates.

    PubMed

    Koopman, Bevan; Zuccon, Guido; Nguyen, Anthony; Bergheim, Anton; Grayson, Narelle

    2015-11-01

    Death certificates provide an invaluable source for cancer mortality statistics; however, this value can only be realised if accurate, quantitative data can be extracted from certificates, an aim hampered by both the volume and variable nature of certificates written in natural language. This paper proposes an automatic classification system for identifying cancer related causes of death from death certificates. Detailed features, including terms, n-grams and SNOMED CT concepts were extracted from a collection of 447,336 death certificates. These features were used to train Support Vector Machine classifiers (one classifier for each cancer type). The classifiers were deployed in a cascaded architecture: the first level identified the presence of cancer (i.e., binary cancer/no-cancer) and the second level identified the type of cancer (according to the ICD-10 classification system). A held-out test set was used to evaluate the effectiveness of the classifiers according to precision, recall and F-measure. In addition, detailed feature analysis was performed to reveal the characteristics of a successful cancer classification model. The system was highly effective at identifying cancer as the underlying cause of death (F-measure 0.94). The system was also effective at determining the type of cancer for common cancers (F-measure 0.7). Rare cancers, for which there was little training data, were difficult to classify accurately (F-measure 0.12). Factors influencing performance were the amount of training data and certain ambiguous cancers (e.g., those in the stomach region). The feature analysis revealed that a combination of features was important for cancer type classification, with SNOMED CT concept and oncology specific morphology features proving the most valuable. The system proposed in this study provides automatic identification and characterisation of cancers from large collections of free-text death certificates. This allows organisations such as Cancer Registries
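
    A minimal sketch of the cascaded architecture, assuming scikit-learn: a first-level binary cancer/no-cancer SVM gates a second-level cancer-type SVM. The toy certificates stand in for the study's 447,336 documents, and ICD-10 code assignment is reduced to a type label.

    ```python
    # Cascaded classification sketch: level 1 detects cancer, level 2 types it.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    texts = ["metastatic carcinoma of the lung",
             "carcinoma of the stomach with peritoneal spread",
             "acute myocardial infarction",
             "chronic obstructive pulmonary disease",
             "small cell lung cancer",
             "gastric adenocarcinoma"]
    is_cancer = [1, 1, 0, 0, 1, 1]
    cancer_type = ["lung", "stomach", None, None, "lung", "stomach"]

    vec = TfidfVectorizer(ngram_range=(1, 2)).fit(texts)
    X = vec.transform(texts)

    level1 = LinearSVC().fit(X, is_cancer)                  # cancer yes/no
    rows = [i for i, y in enumerate(is_cancer) if y == 1]
    level2 = LinearSVC().fit(X[rows],                        # which cancer
                             [cancer_type[i] for i in rows])

    def classify(text):
        x = vec.transform([text])
        if level1.predict(x)[0] == 0:
            return "no cancer"
        return level2.predict(x)[0]   # ICD-10 code lookup would follow here

    print(classify("poorly differentiated carcinoma of the stomach"))
    ```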

  9. Automatic Classification of Free-Text Radiology Reports to Identify Limb Fractures using Machine Learning and the SNOMED CT Ontology

    PubMed Central

    Zuccon, Guido; Wagholikar, Amol S; Nguyen, Anthony N; Butt, Luke; Chu, Kevin; Martin, Shane; Greenslade, Jaimi

    Objective To develop and evaluate machine learning techniques that identify limb fractures and other abnormalities (e.g. dislocations) from radiology reports. Materials and Methods 99 free-text reports of limb radiology examinations were acquired from an Australian public hospital. Two clinicians were employed to identify fractures and abnormalities from the reports; a third senior clinician resolved disagreements. These assessors found that, of the 99 reports, 48 referred to fractures or abnormalities of limb structures. Automated methods were then used to extract features from these reports that could be useful for their automatic classification. The Naive Bayes classification algorithm and two implementations of the support vector machine algorithm were formally evaluated using cross-fold validation over the 99 reports. Results Results show that the Naive Bayes classifier accurately identifies fractures and other abnormalities from the radiology reports. These results were achieved when extracting stemmed token bigram and negation features, as well as using these features in combination with SNOMED CT concepts related to abnormalities and disorders. The latter feature has not been used in previous works that attempted to classify free-text radiology reports. Discussion Automated classification methods have proven effective at identifying fractures and other abnormalities from radiology reports (F-Measure up to 92.31%). Key to the success of these techniques are features such as stemmed token bigrams, negations, and SNOMED CT concepts associated with morphologic abnormalities and disorders. Conclusion This investigation shows early promising results and future work will further validate and strengthen the proposed approaches. PMID:24303284
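
    A hedged sketch of the feature ideas named above: token bigrams plus a simple negation-marking scheme feeding a Naive Bayes classifier. The NEG_ prefixing is a common stand-in, not the paper's exact negation handling; stemming and SNOMED CT features are omitted for brevity.

    ```python
    # Bigram + negation features for toy radiology reports (illustrative only).
    import re
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    def mark_negation(text):
        out, negated = [], False
        for tok in re.findall(r"[a-z]+", text.lower()):
            if tok in ("no", "not", "without"):
                negated = True
                continue
            out.append("NEG_" + tok if negated else tok)
        return " ".join(out)

    reports = ["transverse fracture of the distal radius",
               "no fracture or dislocation seen",
               "dislocation of the elbow joint",
               "no acute bony abnormality"]
    labels = [1, 0, 1, 0]

    clf = make_pipeline(CountVectorizer(preprocessor=mark_negation,
                                        ngram_range=(1, 2)),
                        MultinomialNB())
    clf.fit(reports, labels)
    print(clf.predict(["no fracture of the radius"]))
    ```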

  10. Automatic Text Structuring and Summarization.

    ERIC Educational Resources Information Center

    Salton, Gerard; And Others

    1997-01-01

    Discussion of the use of information retrieval techniques for automatic generation of semantic hypertext links focuses on automatic text summarization. Topics include World Wide Web links, text segmentation, and evaluation of text summarization by comparing automatically generated abstracts with manually prepared abstracts. (Author/LRW)

  11. Text Classification for Automatic Detection of E-Cigarette Use and Use for Smoking Cessation from Twitter: A Feasibility Pilot.

    PubMed

    Aphinyanaphongs, Yin; Lulejian, Armine; Brown, Duncan Penfold; Bonneau, Richard; Krebs, Paul

    2016-01-01

    Rapid increases in e-cigarette use and potential exposure to harmful byproducts have shifted public health focus to e-cigarettes as a possible drug of abuse. Effective surveillance of use and prevalence would allow appropriate regulatory responses. An ideal surveillance system would collect usage data in real time, focus on populations of interest, include populations unable to take the survey, allow a breadth of questions to answer, and enable geo-location analysis. Social media streams may provide this ideal system. To realize this use case, a foundational question is whether we can detect e-cigarette use at all. This work reports two pilot tasks using text classification to automatically identify tweets that indicate e-cigarette use and/or e-cigarette use for smoking cessation. We build and define both datasets and compare the performance of four state-of-the-art classifiers and a keyword search for each task. Our results demonstrate excellent classifier performance of up to 0.90 and 0.94 area under the curve in each category. These promising initial results form the foundation for further studies to realize the ideal surveillance solution.

  12. Text Classification for Automatic Detection of E-Cigarette Use and Use for Smoking Cessation from Twitter: A Feasibility Pilot

    PubMed Central

    Aphinyanaphongs, Yin; Lulejian, Armine; Brown, Duncan Penfold; Bonneau, Richard; Krebs, Paul

    2015-01-01

    Rapid increases in e-cigarette use and potential exposure to harmful byproducts have shifted public health focus to e-cigarettes as a possible drug of abuse. Effective surveillance of use and prevalence would allow appropriate regulatory responses. An ideal surveillance system would collect usage data in real time, focus on populations of interest, include populations unable to take the survey, allow a breadth of questions to answer, and enable geo-location analysis. Social media streams may provide this ideal system. To realize this use case, a foundational question is whether we can detect e-cigarette use at all. This work reports two pilot tasks using text classification to automatically identify tweets that indicate e-cigarette use and/or e-cigarette use for smoking cessation. We build and define both datasets and compare the performance of four state-of-the-art classifiers and a keyword search for each task. Our results demonstrate excellent classifier performance of up to 0.90 and 0.94 area under the curve in each category. These promising initial results form the foundation for further studies to realize the ideal surveillance solution. PMID:26776211

  13. Automatic Classification in Information Retrieval.

    ERIC Educational Resources Information Center

    van Rijsbergen, C. J.

    1978-01-01

    Addresses the application of automatic classification methods to the problems associated with computerized document retrieval. Different kinds of classifications are described, and both document and term clustering methods are discussed. References and notes are provided. (Author/JD)

  14. Combining automatic table classification and relationship extraction in extracting anticancer drug-side effect pairs from full-text articles.

    PubMed

    Xu, Rong; Wang, QuanQiu

    2015-02-01

    Anticancer drug-associated side effect knowledge often exists in multiple heterogeneous and complementary data sources. A comprehensive anticancer drug-side effect (drug-SE) relationship knowledge base is important for computation-based drug target discovery, drug toxicity prediction and drug repositioning. In this study, we present a two-step approach by combining table classification and relationship extraction to extract drug-SE pairs from a large number of high-profile oncological full-text articles. The data consists of 31,255 tables downloaded from the Journal of Oncology (JCO). We first trained a statistical classifier to classify tables into SE-related and -unrelated categories. We then extracted drug-SE pairs from SE-related tables. We compared drug side effect knowledge extracted from JCO tables to that derived from FDA drug labels. Finally, we systematically analyzed relationships between anticancer drug-associated side effects and drug-associated gene targets, metabolism genes, and disease indications. The statistical table classifier is effective in classifying tables into SE-related and -unrelated (precision: 0.711; recall: 0.941; F1: 0.810). We extracted a total of 26,918 drug-SE pairs from SE-related tables with a precision of 0.605, a recall of 0.460, and an F1 of 0.520. Drug-SE pairs extracted from JCO tables are largely complementary to those derived from FDA drug labels; as many as 84.7% of the pairs extracted from JCO tables have not been included in a side effect database constructed from FDA drug labels. Side effects associated with anticancer drugs positively correlate with drug target genes, drug metabolism genes, and disease indications.

  15. Automatic lexical classification: bridging research and practice.

    PubMed

    Korhonen, Anna

    2010-08-13

    Natural language processing (NLP), the automatic analysis, understanding and generation of human language by computers, is vitally dependent on accurate knowledge about words. Because words change their behaviour between text types, domains and sub-languages, a fully accurate static lexical resource (e.g. a dictionary, word classification) is unattainable. Researchers are now developing techniques that could be used to automatically acquire or update lexical resources from textual data. If successful, the automatic approach could considerably enhance the accuracy and portability of language technologies, such as machine translation, text mining and summarization. This paper reviews recent and ongoing research in automatic lexical acquisition. Focusing on lexical classification, it discusses the many challenges that still need to be met before the approach can benefit NLP on a large scale.

  16. Injury narrative text classification using factorization model

    PubMed Central

    2015-01-01

    Narrative text is a useful way of identifying injury circumstances from routine emergency department data collections. Automatically classifying narratives based on machine learning techniques is a promising approach, which can consequently reduce the tedious manual classification process. Existing works focus on using Naive Bayes, which does not always offer the best performance. This paper proposes Matrix Factorization approaches along with a learning enhancement process for this task. The results are compared with the performance of various other classification approaches. The impact of parameter settings on the classification results for a medical text dataset is discussed. With the right choice of dimension k, the Non-negative Matrix Factorization model achieves a 10-fold cross-validation accuracy of 0.93. PMID:26043671
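
    A minimal sketch, assuming scikit-learn: NMF-reduced TF-IDF features with the dimension k chosen empirically, a logistic-regression classifier standing in for the paper's learning-enhancement process, and 20 Newsgroups standing in for the emergency-department narratives.

    ```python
    # NMF as a feature reduction for text classification; k swept empirically.
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    data = fetch_20newsgroups(subset="train", categories=["sci.med", "rec.autos"])
    X_tfidf = TfidfVectorizer(max_features=3000).fit_transform(data.data)

    for k in (10, 30, 50):   # the "right dimension k" is picked empirically
        W = NMF(n_components=k, init="nndsvd", max_iter=400,
                random_state=0).fit_transform(X_tfidf)
        acc = cross_val_score(LogisticRegression(max_iter=1000),
                              W, data.target, cv=10).mean()
        print(f"k={k}: 10-fold CV accuracy {acc:.3f}")
    ```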

  17. Autoclass: An automatic classification system

    NASA Technical Reports Server (NTRS)

    Stutz, John; Cheeseman, Peter; Hanson, Robin

    1991-01-01

    The task of inferring a set of classes and class descriptions most likely to explain a given data set can be placed on a firm theoretical foundation using Bayesian statistics. Within this framework, and using various mathematical and algorithmic approximations, the AutoClass System searches for the most probable classifications, automatically choosing the number of classes and complexity of class descriptions. A simpler version of AutoClass has been applied to many large real data sets, has discovered new independently-verified phenomena, and has been released as a robust software package. Recent extensions allow attributes to be selectively correlated within particular classes, and allow classes to inherit, or share, model parameters through a class hierarchy. The mathematical foundations of AutoClass are summarized.
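
    AutoClass itself is not reproduced here, but its core idea, searching over the number of classes and scoring each candidate model probabilistically, can be illustrated with a Gaussian mixture selected by BIC (an analogy under stated assumptions, not the AutoClass algorithm):

    ```python
    # Choose the number of classes automatically via a model-selection score.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Synthetic data with three latent classes.
    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal((0, 0), 1.0, (100, 2)),
                      rng.normal((5, 0), 1.0, (100, 2)),
                      rng.normal((0, 6), 1.0, (100, 2))])

    # Search over class counts; BIC trades data fit against model complexity.
    models = [GaussianMixture(n_components=k, random_state=0).fit(data)
              for k in range(1, 8)]
    best = min(models, key=lambda m: m.bic(data))
    print("most probable number of classes:", best.n_components)
    ```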

  18. An Experiment in Automatic Hierarchical Document Classification.

    ERIC Educational Resources Information Center

    Garland, Kathleen

    1983-01-01

    Describes method of automatic document classification in which documents classed as QA by Library of Congress classification system were clustered at six thresholds by keyword using single link technique. Automatically generated clusters were compared to Library of Congress subclasses, and partial classified hierarchy was formed. Twelve references…

  20. Automatic Classification of Marine Mammals with Speaker Classification Methods.

    PubMed

    Kreimeyer, Roman; Ludwig, Stefan

    2016-01-01

    We present an automatic acoustic classifier for marine mammals based on human speaker classification methods, as an element of a passive acoustic monitoring (PAM) tool. This work is part of the Protection of Marine Mammals (PoMM) project under the framework of the European Defense Agency (EDA), carried out jointly by the Research Department for Underwater Acoustics and Geophysics (FWG), Bundeswehr Technical Centre (WTD 71) and Kiel University. The automatic classification is intended to support sonar operators in the risk mitigation process before and during sonar exercises with a reliable automatic classification result.

  1. Automatic Classification of Interplanetary Dust Particles

    NASA Astrophysics Data System (ADS)

    Lasue, J.; Stepinski, T. F.; Bell, S. W.

    2010-03-01

    We present an automatic classification of the IDPs collected by NASA-JSC based on their EDS spectra. Agglomerative clustering and Sammon's map algorithms are used to visualize relationships between the clusters.

  2. Spatial Text Visualization Using Automatic Typographic Maps.

    PubMed

    Afzal, S; Maciejewski, R; Jang, Yun; Elmqvist, N; Ebert, D S

    2012-12-01

    We present a method for automatically building typographic maps that merge text and spatial data into a visual representation where text alone forms the graphical features. We further show how to use this approach to visualize spatial data such as traffic density, crime rate, or demographic data. The technique accepts a vector representation of a geographic map and spatializes the textual labels in the space onto polylines and polygons based on user-defined visual attributes and constraints. Our sample implementation runs as a Web service, spatializing shape files from the OpenStreetMap project into typographic maps for any region.

  3. Automatic discourse connective detection in biomedical text.

    PubMed

    Ramesh, Balaji Polepalli; Prasad, Rashmi; Miller, Tim; Harrington, Brian; Yu, Hong

    2012-01-01

    Relation extraction in biomedical text mining systems has largely focused on identifying clause-level relations, but increasing sophistication demands the recognition of relations at discourse level. A first step in identifying discourse relations involves the detection of discourse connectives: words or phrases used in text to express discourse relations. In this study supervised machine-learning approaches were developed and evaluated for automatically identifying discourse connectives in biomedical text. Two supervised machine-learning models (support vector machines and conditional random fields) were explored for identifying discourse connectives in biomedical literature. In-domain supervised machine-learning classifiers were trained on the Biomedical Discourse Relation Bank, an annotated corpus of discourse relations over 24 full-text biomedical articles (~112,000 word tokens), a subset of the GENIA corpus. Novel domain adaptation techniques were also explored to leverage the larger open-domain Penn Discourse Treebank (~1 million word tokens). The models were evaluated using the standard evaluation metrics of precision, recall and F1 scores. Supervised machine-learning approaches can automatically identify discourse connectives in biomedical text, and the novel domain adaptation techniques yielded the best performance: 0.761 F1 score. A demonstration version of the fully implemented classifier BioConn is available at: http://bioconn.askhermes.org.

  4. Automatic discourse connective detection in biomedical text

    PubMed Central

    Polepalli Ramesh, Balaji; Prasad, Rashmi; Miller, Tim; Harrington, Brian

    2012-01-01

    Objective Relation extraction in biomedical text mining systems has largely focused on identifying clause-level relations, but increasing sophistication demands the recognition of relations at discourse level. A first step in identifying discourse relations involves the detection of discourse connectives: words or phrases used in text to express discourse relations. In this study supervised machine-learning approaches were developed and evaluated for automatically identifying discourse connectives in biomedical text. Materials and Methods Two supervised machine-learning models (support vector machines and conditional random fields) were explored for identifying discourse connectives in biomedical literature. In-domain supervised machine-learning classifiers were trained on the Biomedical Discourse Relation Bank, an annotated corpus of discourse relations over 24 full-text biomedical articles (∼112 000 word tokens), a subset of the GENIA corpus. Novel domain adaptation techniques were also explored to leverage the larger open-domain Penn Discourse Treebank (∼1 million word tokens). The models were evaluated using the standard evaluation metrics of precision, recall and F1 scores. Results and Conclusion Supervised machine-learning approaches can automatically identify discourse connectives in biomedical text, and the novel domain adaptation techniques yielded the best performance: 0.761 F1 score. A demonstration version of the fully implemented classifier BioConn is available at: http://bioconn.askhermes.org. PMID:22744958

  5. Experiments in Automatic Library of Congress Classification.

    ERIC Educational Resources Information Center

    Larson, Ray R.

    1992-01-01

    Presents the results of research into the automatic selection of Library of Congress Classification numbers based on the titles and subject headings in MARC records from a test database at the University of California at Berkeley Library School library. Classification clustering and matching techniques are described. (44 references) (LRW)

  6. Automatic Figure Classification in Bioscience Literature

    PubMed Central

    Kim, Daehyun; Ramesh, Balaji Polepalli; Yu, Hong

    2011-01-01

    Millions of figures appear in biomedical articles, and it is important to develop an intelligent figure search engine to return relevant figures based on user entries. In this study we report a figure classifier that automatically classifies biomedical figures into five predefined figure types: Gel-image, Image-of-thing, Graph, Model, and Mix. The classifier explored rich image features and integrated them with text features. We performed feature selection and explored different classification models, including a rule-based figure classifier, a supervised machine-learning classifier, and a multi-model classifier, the latter of which integrated the first two classifiers. Our results show that feature selection improved figure classification and the novel image features we explored were the best among image features that we have examined. Our results also show that integrating text and image features achieved better performance than using either of them individually. The best system is a multi-model classifier which combines the rule-based hierarchical classifier and a support vector machine (SVM) based classifier, achieving a 76.7% F1-score for five-type classification. We demonstrated our system at http://figureclassification.askhermes.org/. PMID:21645638

  7. Automatic figure classification in bioscience literature.

    PubMed

    Kim, Daehyun; Ramesh, Balaji Polepalli; Yu, Hong

    2011-10-01

    Millions of figures appear in biomedical articles, and it is important to develop an intelligent figure search engine to return relevant figures based on user entries. In this study we report a figure classifier that automatically classifies biomedical figures into five predefined figure types: Gel-image, Image-of-thing, Graph, Model, and Mix. The classifier explored rich image features and integrated them with text features. We performed feature selection and explored different classification models, including a rule-based figure classifier, a supervised machine-learning classifier, and a multi-model classifier, the latter of which integrated the first two classifiers. Our results show that feature selection improved figure classification and the novel image features we explored were the best among image features that we have examined. Our results also show that integrating text and image features achieved better performance than using either of them individually. The best system is a multi-model classifier which combines the rule-based hierarchical classifier and a support vector machine (SVM) based classifier, achieving a 76.7% F1-score for five-type classification. We demonstrated our system at http://figureclassification.askhermes.org/.

  8. Towards Automatic Classification of Neurons

    PubMed Central

    Armañanzas, Rubén; Ascoli, Giorgio A.

    2015-01-01

    The classification of neurons into types has been much debated since the inception of modern neuroscience. Recent experimental advances are accelerating the pace of data collection. The resulting information growth of morphological, physiological, and molecular properties encourages efforts to automate neuronal classification by powerful machine learning techniques. We review state-of-the-art analysis approaches and availability of suitable data and resources, highlighting prominent challenges and opportunities. The effective solution of the neuronal classification problem will require continuous development of computational methods, high-throughput data production, and systematic metadata organization to enable cross-lab integration. PMID:25765323

  9. Automatic classification of animal vocalizations

    NASA Astrophysics Data System (ADS)

    Clemins, Patrick J.

    2005-11-01

    Bioacoustics, the study of animal vocalizations, has begun to use increasingly sophisticated analysis techniques in recent years. Some common tasks in bioacoustics are repertoire determination, call detection, individual identification, stress detection, and behavior correlation. Each research study, however, uses a wide variety of different measured variables, called features, and classification systems to accomplish these tasks. The well-established field of human speech processing has developed a number of different techniques to perform many of the aforementioned bioacoustics tasks. Mel-frequency cepstral coefficients (MFCCs) and perceptual linear prediction (PLP) coefficients are two popular feature sets. The hidden Markov model (HMM), a statistical model similar to a finite automaton, is the most commonly used supervised classification model and is capable of modeling both temporal and spectral variations. This research designs a framework that applies models from human speech processing for bioacoustic analysis tasks. The development of the generalized perceptual linear prediction (gPLP) feature extraction model is one of the more important novel contributions of the framework. Perceptual information from the species under study can be incorporated into the gPLP feature extraction model to represent the vocalizations as the animals might perceive them. By including this perceptual information and modifying parameters of the HMM classification system, this framework can be applied to a wide range of species. The effectiveness of the framework is shown by analyzing African elephant and beluga whale vocalizations. The features extracted from the African elephant data are used as input to a supervised classification system and compared to results from traditional statistical tests. The gPLP features extracted from the beluga whale data are used in an unsupervised classification system and the results are compared to labels assigned by experts.

  10. A Sequential Method for Automatic Document Classification.

    ERIC Educational Resources Information Center

    White, Lee J.; And Others

    The major advantage of sequential classification, a technique for automatically classifying documents into previously selected categories, is that the entire document need not be processed before it is classified. This method assumes the availability of a priori categories, a selection of keywords representative of these categories, and the a…

  12. What Makes an Automatic Keyword Classification Effective?

    ERIC Educational Resources Information Center

    Jones, K. Sparck; Barber, E. O.

    1971-01-01

    The substitution information contained in automatically obtained keyword classification is most effectively exploited when: (1) strong similarity connectives only are utilized, (2) grouping is confined to non-frequent terms, (3) term groups are used to provide additional and not alternative descriptive items and (4) descriptor collection frequency…

  13. Metric learning for automatic sleep stage classification.

    PubMed

    Phan, Huy; Do, Quan; Do, The-Luan; Vu, Duc-Lung

    2013-01-01

    We introduce in this paper a metric learning approach for automatic sleep stage classification based on single-channel EEG data. We show that by learning a global metric from training data instead of using the default Euclidean metric, the k-nearest-neighbor classification rule outperforms state-of-the-art methods on the Sleep-EDF dataset under various classification settings. The overall accuracies for the Awake/Sleep and 4-class classification settings are 98.32% and 94.49%, respectively. Furthermore, the superior accuracy is achieved by performing classification on a low-dimensional feature space derived from the time and frequency domains and without the need for artifact removal as a preprocessing step.
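
    A minimal sketch of the metric-learning idea, assuming scikit-learn: NeighborhoodComponentsAnalysis learns a global linear metric before k-NN, compared against plain Euclidean k-NN. The digits dataset stands in for the paper's single-channel EEG features; this is not the authors' method or data.

    ```python
    # Learned metric vs. default Euclidean metric for k-NN classification.
    from sklearn.datasets import load_digits
    from sklearn.neighbors import (KNeighborsClassifier,
                                   NeighborhoodComponentsAnalysis)
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    X, y = load_digits(return_X_y=True)

    knn_euclidean = KNeighborsClassifier(5)
    knn_learned = make_pipeline(
        NeighborhoodComponentsAnalysis(n_components=16, random_state=0),
        KNeighborsClassifier(5))

    for name, model in [("Euclidean k-NN", knn_euclidean),
                        ("learned-metric k-NN", knn_learned)]:
        print(name, cross_val_score(model, X, y, cv=5).mean())
    ```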

  14. Presentation video retrieval using automatically recovered slide and spoken text

    NASA Astrophysics Data System (ADS)

    Cooper, Matthew

    2013-03-01

    Video is becoming a prevalent medium for e-learning. Lecture videos contain text information in both the presentation slides and lecturer's speech. This paper examines the relative utility of automatically recovered text from these sources for lecture video retrieval. To extract the visual information, we automatically detect slides within the videos and apply optical character recognition to obtain their text. Automatic speech recognition is used similarly to extract spoken text from the recorded audio. We perform controlled experiments with manually created ground truth for both the slide and spoken text from more than 60 hours of lecture video. We compare the automatically extracted slide and spoken text in terms of accuracy relative to ground truth, overlap with one another, and utility for video retrieval. Results reveal that automatically recovered slide text and spoken text contain different content with varying error profiles. Experiments demonstrate that automatically extracted slide text enables higher precision video retrieval than automatically recovered spoken text.

  15. Bayesian Automatic Classification Of HMI Images

    NASA Astrophysics Data System (ADS)

    Ulrich, R. K.; Beck, John G.

    2011-05-01

    The Bayesian automatic classification system known as "AutoClass" finds a set of class definitions based on a set of observed data and assigns data to classes without human supervision. It has been applied to Mt Wilson data to improve modeling of total solar irradiance variations (Ulrich et al., 2010). We apply AutoClass to HMI observables to automatically identify regions of the solar surface. To prevent small instrument artifacts from interfering with class identification, we apply a flat-field correction and a rotationally shifted temporal average to the HMI images prior to processing with AutoClass. Additionally, the sensitivity of AutoClass to instrumental artifacts is investigated.

  16. Automatic Text Summarization for Indonesian Language Using TextTeaser

    NASA Astrophysics Data System (ADS)

    Gunawan, D.; Pasaribu, A.; Rahmat, R. F.; Budiarto, R.

    2017-04-01

    Text summarization is one solution to information overload. Reducing text without losing its meaning not only saves reading time but also preserves the reader's understanding. One of many algorithms to summarize text is TextTeaser. Originally, this algorithm was intended for text in English; however, since the TextTeaser algorithm does not consider the meaning of the text, we implement it for text in the Indonesian language. The algorithm calculates four features: the title feature, sentence length, sentence position and keyword frequency. We utilize TextRank, an unsupervised and language-independent text summarization algorithm, to evaluate the summarized text yielded by TextTeaser. The result shows that the TextTeaser algorithm needs more improvement to obtain better accuracy.
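
    The four TextTeaser features named above can be sketched as a per-sentence score; the weights, normalisations, and the toy text below are illustrative guesses, not the TextTeaser implementation.

    ```python
    # Per-sentence scoring from title words, length, position, and keywords.
    from collections import Counter
    import re

    def summarize(title, text, n=2, ideal_len=20):
        sents = re.split(r"(?<=[.!?])\s+", text.strip())
        freq = Counter(re.findall(r"\w+", text.lower()))
        keywords = {w for w, _ in freq.most_common(10)}
        title_words = set(re.findall(r"\w+", title.lower()))

        def score(sent, pos):
            toks = re.findall(r"\w+", sent.lower())
            if not toks:
                return 0.0
            title_f = len(title_words & set(toks)) / max(len(title_words), 1)
            length_f = 1 - abs(ideal_len - len(toks)) / ideal_len
            position_f = 1 - pos / len(sents)       # earlier sentences favoured
            keyword_f = sum(freq[t] for t in toks if t in keywords) / len(toks)
            return title_f + length_f + position_f + keyword_f

        top = sorted(enumerate(sents), key=lambda p: score(p[1], p[0]),
                     reverse=True)[:n]
        return " ".join(s for _, s in sorted(top))  # restore document order

    print(summarize("Flood hits city",
                    "Heavy rain caused a flood in the city. Residents were "
                    "evacuated quickly. The flood damaged roads. Officials "
                    "promised repairs soon."))
    ```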

  17. Automatic morphological classification of galaxy images

    PubMed Central

    Shamir, Lior

    2009-01-01

    We describe an image analysis supervised learning algorithm that can automatically classify galaxy images. The algorithm is first trained using manually classified images of elliptical, spiral, and edge-on galaxies. A large set of image features is extracted from each image, and the most informative features are selected using Fisher scores. Test images can then be classified using a simple Weighted Nearest Neighbor rule such that the Fisher scores are used as the feature weights. Experimental results show that galaxy images from Galaxy Zoo can be classified automatically into spiral, elliptical and edge-on galaxies with an accuracy of ~90% compared to classifications carried out by the author. Full compilable source code of the algorithm is available for free download, and its general-purpose nature makes it suitable for other uses that involve automatic image analysis of celestial objects. PMID:20161594
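
    A minimal sketch of the Fisher-score-weighted nearest-neighbour rule, using a stock tabular dataset in place of the paper's extracted image features; the feature extraction itself is not reproduced.

    ```python
    # Fisher scores as feature weights in a weighted nearest-neighbour rule.
    import numpy as np
    from sklearn.datasets import load_wine

    X, y = load_wine(return_X_y=True)

    def fisher_scores(X, y):
        mu = X.mean(axis=0)
        num = np.zeros(X.shape[1])
        den = np.zeros(X.shape[1])
        for c in np.unique(y):
            Xc = X[y == c]
            num += len(Xc) * (Xc.mean(axis=0) - mu) ** 2
            den += len(Xc) * Xc.var(axis=0)
        return num / den

    w = fisher_scores(X, y)

    def weighted_nn_predict(x):
        d = (w * (X - x) ** 2).sum(axis=1)   # Fisher-weighted squared distance
        d[d == 0] = np.inf                   # ignore the query itself
        return y[np.argmin(d)]

    preds = np.array([weighted_nn_predict(x) for x in X])
    print("leave-one-out accuracy:", (preds == y).mean())
    ```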

  18. Automatic morphological classification of galaxy images

    NASA Astrophysics Data System (ADS)

    Shamir, Lior

    2009-11-01

    We describe an image analysis supervised learning algorithm that can automatically classify galaxy images. The algorithm is first trained using manually classified images of elliptical, spiral and edge-on galaxies. A large set of image features is extracted from each image, and the most informative features are selected using Fisher scores. Test images can then be classified using a simple Weighted Nearest Neighbour rule such that the Fisher scores are used as the feature weights. Experimental results show that galaxy images from Galaxy Zoo can be classified automatically into spiral, elliptical and edge-on galaxies with an accuracy of ~90 per cent compared to classifications carried out by the author. Full compilable source code of the algorithm is available for free download, and its general-purpose nature makes it suitable for other uses that involve automatic image analysis of celestial objects.

  19. Automatic classification of blank substrate defects

    NASA Astrophysics Data System (ADS)

    Boettiger, Tom; Buck, Peter; Paninjath, Sankaranarayanan; Pereira, Mark; Ronald, Rob; Rost, Dan; Samir, Bhamidipati

    2014-10-01

    Mask preparation stages are crucial in mask manufacturing, since the mask will later act as a template for a considerable number of dies on a wafer. Defects on the initial blank substrate, and on subsequent cleaned and coated substrates, can have a profound impact on the usability of the finished mask. This emphasizes the need for early and accurate identification of blank substrate defects and the risk they pose to the patterned reticle. While Automatic Defect Classification (ADC) is a well-developed technology for inspection and analysis of defects on patterned wafers and masks in the semiconductor industry, ADC for mask blanks is still in the early stages of adoption and development. Calibre ADC is a powerful analysis tool for fast, accurate, consistent and automatic classification of defects on mask blanks. Accurate, automated classification of mask blanks leads to better usability of blanks by enabling defect avoidance technologies during mask writing. Detailed information on blank defects can help to select appropriate job-decks to be written on the mask by defect avoidance tools [1][4][5]. Smart algorithms separate critical defects from the potentially large number of non-critical or false defects detected at various stages during mask blank preparation. Mechanisms used by Calibre ADC to identify and characterize defects include defect location and size, signal polarity (dark, bright) in both transmitted and reflected review images, and the separation of defect signals from background noise in defect images. The Calibre ADC engine then uses a decision tree to translate this information into a defect classification code. Using this automated process improves classification accuracy, repeatability and speed, while avoiding the subjectivity of human judgment inherent in manual defect classification by trained personnel [2]. This paper focuses on the results from the evaluation of the Automatic Defect Classification (ADC) product at MP Mask

  20. Toward text understanding: classification of text documents by word map

    NASA Astrophysics Data System (ADS)

    Visa, Ari J. E.; Toivanen, Jarmo; Back, Barbro; Vanharanta, Hannu

    2000-04-01

    In many fields, for example in business, engineering, and law, there is interest in the search and classification of text documents in large databases. Methods exist for information retrieval purposes; they are mainly based on keywords. In cases where keywords are lacking, information retrieval is problematic. One approach is to use the whole text document as a search key. Neural networks offer an adaptive tool for this purpose. This paper suggests a new adaptive approach to the problem of clustering and search in large text document databases. The approach is a multilevel one based on word, sentence, and paragraph level maps. Here only the word map level is reported. The reported approach is based on smart encoding, on Self-Organizing Maps, and on document histograms. The results are very promising.

  1. Automatic detection and classification of odontocete whistles.

    PubMed

    Gillespie, Douglas; Caillat, Marjolaine; Gordon, Jonathan; White, Paul

    2013-09-01

    Methods for the fully automatic detection and species classification of odontocete whistles are described. The detector applies a number of noise cancellation techniques to a spectrogram of sound data and then searches for connected regions of data which rise above a pre-determined threshold. When tested on a dataset of recordings which had been carefully annotated by a human operator, the detector was able to detect (recall) 79.6% of the human-identified sounds that had a signal-to-noise ratio above 10 dB, with 88% of the detections being valid. A significant problem with automatic detectors is that they tend to partially detect whistles or break whistles into several parts. A classifier has been developed specifically to work with fragmented whistle detections. By accumulating statistics over many whistle fragments, correct classification rates of over 94% have been achieved for four species. The success rate is, however, heavily dependent on the number of species included in the classifier mix, with the mean correct classification rate dropping to 58.5% when 12 species were included.
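
    The detector's core step, thresholding a noise-normalised spectrogram and collecting connected regions, can be sketched with SciPy on a synthetic chirp; the real system's noise-cancellation chain and fragment classifier are far more elaborate, and the threshold below is an illustrative guess.

    ```python
    # Threshold a spectrogram and extract connected time-frequency regions.
    import numpy as np
    from scipy.signal import chirp, spectrogram
    from scipy.ndimage import label

    fs = 48000
    t = np.linspace(0, 1, fs, endpoint=False)
    sig = (chirp(t, f0=5000, f1=12000, t1=1)
           + 0.5 * np.random.default_rng(0).normal(size=fs))

    f, times, S = spectrogram(sig, fs=fs, nperseg=1024)
    S = S / np.median(S, axis=1, keepdims=True)  # crude per-band normalisation
    mask = S > 20                                 # pre-determined threshold
    regions, n = label(mask)                      # connected regions = fragments
    print(f"{n} candidate whistle fragments detected")
    ```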

  2. Towards automatic classification of all WISE sources

    NASA Astrophysics Data System (ADS)

    Kurcz, A.; Bilicki, M.; Solarz, A.; Krupa, M.; Pollo, A.; Małek, K.

    2016-07-01

    Context. The Wide-field Infrared Survey Explorer (WISE) has detected hundreds of millions of sources over the entire sky. Classifying them reliably is, however, a challenging task owing to degeneracies in WISE multicolour space and low levels of detection in its two longest-wavelength bandpasses. Simple colour cuts are often not sufficient; for satisfactory levels of completeness and purity, more sophisticated classification methods are needed. Aims: Here we aim to obtain comprehensive and reliable star, galaxy, and quasar catalogues based on automatic source classification in full-sky WISE data. This means that the final classification will employ only parameters available from WISE itself, in particular those which are reliably measured for the majority of sources. Methods: For the automatic classification we applied a supervised machine learning algorithm, support vector machines (SVM). It requires a training sample with relevant classes already identified, and we chose to use the SDSS spectroscopic dataset (DR10) for that purpose. We tested the performance of two kernels used by the classifier, and determined the minimum number of sources in the training set required to achieve stable classification, as well as the minimum dimension of the parameter space. We also tested SVM classification accuracy as a function of extinction and apparent magnitude. Thus, the calibrated classifier was finally applied to all-sky WISE data, flux-limited to 16 mag (Vega) in the 3.4 μm channel. Results: By calibrating on the test data drawn from SDSS, we first established that a polynomial kernel is preferred over a radial one for this particular dataset. Next, using three classification parameters (W1 magnitude, W1-W2 colour, and a differential aperture magnitude) we obtained very good classification efficiency in all the tests. At the bright end, the completeness for stars and galaxies reaches ~95%, deteriorating to ~80% at W1 = 16 mag, while for quasars it stays at a level of
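
    A minimal sketch of the classification stage, assuming scikit-learn: an SVM with a polynomial kernel over three photometric parameters. The synthetic blobs below only mimic the shape of the problem; they are not WISE data.

    ```python
    # Three-class SVM (polynomial kernel) on W1, W1-W2, and an aperture-
    # magnitude difference; all values here are synthetic stand-ins.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    def blob(w1, w1w2, dmag, n):
        return np.column_stack([rng.normal(w1, 0.8, n),
                                rng.normal(w1w2, 0.15, n),
                                rng.normal(dmag, 0.2, n)])

    X = np.vstack([blob(13, 0.0, 0.4, 300),    # "stars"
                   blob(15, 0.3, 1.0, 300),    # "galaxies"
                   blob(15, 0.9, 0.5, 300)])   # "quasars"
    y = np.repeat([0, 1, 2], 300)

    svm = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3, C=1.0))
    print("CV accuracy:", cross_val_score(svm, X, y, cv=5).mean())
    ```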

  3. Stemming Malay Text and Its Application in Automatic Text Categorization

    NASA Astrophysics Data System (ADS)

    Yasukawa, Michiko; Lim, Hui Tian; Yokoo, Hidetoshi

    In the Malay language there are no conjugations or declensions, and affixes have important grammatical functions. In Malay, the same word may function as a noun, an adjective, an adverb, or a verb, depending on its position in the sentence. Although simple root words are used extensively in informal conversation, it is essential to use the precise words in formal speech or written texts. In Malay, derivative words are used to make sentences clear. Derivation is achieved mainly by the use of affixes. There are approximately a hundred possible derivative forms of a root word in the written language of educated Malay speakers. Therefore, the composition of Malay words may be complicated. Although there are several types of stemming algorithms available for text processing in English and some other languages, they cannot be used to overcome the difficulties of Malay word stemming. Stemming is the process of reducing various words to their root forms in order to improve the effectiveness of text processing in information systems. It is essential to avoid both over-stemming and under-stemming errors. We have developed a new Malay stemmer (stemming algorithm) for removing inflectional and derivational affixes. Our stemmer uses a set of affix rules and two types of dictionaries: a root-word dictionary and a derivative-word dictionary. The use of the set of rules is aimed at reducing the occurrence of under-stemming errors, while that of the dictionaries is believed to reduce the occurrence of over-stemming errors. We performed an experiment to evaluate the application of our stemmer in text mining software. For the experiment, the text data used were actual web pages collected from the World Wide Web, chosen to demonstrate the effectiveness of our Malay stemming algorithm. The experimental results showed that our stemmer can effectively increase the precision of the extracted Boolean expressions for text categorization.
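
    The dictionary-plus-rules design can be sketched as below; the affix rules and the two tiny dictionaries are illustrative stand-ins for the paper's full resources.

    ```python
    # Dictionary-backed affix stripping in the style described above.
    ROOTS = {"ajar", "makan", "main"}             # root-word dictionary
    DERIVATIVES = {"pelajaran": "ajar"}           # derivative-word dictionary
    PREFIXES = ["mem", "pel", "ber", "me", "pe"]
    SUFFIXES = ["kan", "an", "i"]

    def stem(word):
        if word in ROOTS:
            return word
        if word in DERIVATIVES:                   # dictionary lookup avoids
            return DERIVATIVES[word]              # over-stemming
        for p in [""] + PREFIXES:                 # rule-based affix removal
            for s in [""] + SUFFIXES:
                if not (p or s):
                    continue
                if word.startswith(p) and word.endswith(s):
                    candidate = word[len(p):len(word) - len(s)]
                    if candidate in ROOTS:        # root check avoids
                        return candidate          # under-stemming
        return word

    for w in ["pelajaran", "makanan", "bermain", "makan"]:
        print(w, "->", stem(w))
    ```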

  4. Designing a Knowledge Base for Automatic Book Classification.

    ERIC Educational Resources Information Center

    Kim, Jeong-Hyen; Lee, Kyung-Ho

    2002-01-01

    Reports on the design of a knowledge base for an automatic classification in the library science field by using the facet classification principles of colon classification. Discusses inputting titles or key words into the computer to create class numbers through automatic subject recognition and processing title key words. (Author/LRW)

  6. Automatic Genre Classification of Musical Signals

    NASA Astrophysics Data System (ADS)

    Barbedo, Jayme Garcia sArnal; Lopes, Amauri

    2006-12-01

    We present a strategy to perform automatic genre classification of musical signals. The technique divides the signals into 21.3-millisecond frames, from which four features are extracted. The values of each feature are treated over 1-second analysis segments. Some statistical results of the features along each analysis segment are used to determine a vector of summary features that characterizes the respective segment. Next, a classification procedure uses those vectors to differentiate between genres. The classification procedure has two main characteristics: (1) a very wide and deep taxonomy, which allows a very meticulous comparison between different genres, and (2) a wide pairwise comparison of genres, which allows emphasizing the differences between each pair of genres. The procedure points out the genre that best fits the characteristics of each segment. The final classification of the signal is given by the genre that appears most often across all signal segments. The approach has shown very good accuracy even for the lowest layers of the hierarchical structure.

  7. Automatic extraction of angiogenesis bioprocess from text

    PubMed Central

    Wang, Xinglong; McKendrick, Iain; Barrett, Ian; Dix, Ian; French, Tim; Tsujii, Jun'ichi; Ananiadou, Sophia

    2011-01-01

    Motivation: Understanding key biological processes (bioprocesses) and their relationships with constituent biological entities and pharmaceutical agents is crucial for drug design and discovery. One way to harvest such information is searching the literature. However, bioprocesses are difficult to capture because they may occur in text in a variety of textual expressions. Moreover, a bioprocess is often composed of a series of bioevents, where a bioevent denotes changes to one or a group of cells involved in the bioprocess. Such bioevents are often used to refer to bioprocesses in text, which current techniques, relying solely on specialized lexicons, struggle to find. Results: This article presents a range of methods for finding bioprocess terms and events. To facilitate the study, we built a gold standard corpus in which terms and events related to angiogenesis, a key biological process of the growth of new blood vessels, were annotated. Statistics of the annotated corpus revealed that over 36% of the text expressions that referred to angiogenesis appeared as events. The proposed methods respectively employed domain-specific vocabularies, a manually annotated corpus and unstructured domain-specific documents. Evaluation results showed that, while a supervised machine-learning model yielded the best precision, recall and F1 scores, the other methods achieved reasonable performance and less cost to develop. Availability: The angiogenesis vocabularies, gold standard corpus, annotation guidelines and software described in this article are available at http://text0.mib.man.ac.uk/~mbassxw2/angiogenesis/ Contact: xinglong.wang@gmail.com PMID:21821664

  8. Multidimensional text classification for drug information.

    PubMed

    Lertnattee, Verayuth; Theeramunkong, Thanaruk

    2004-09-01

    This paper proposes a multidimensional model for classifying drug information text documents. The concept of a multidimensional category model is introduced for representing classes. In contrast with traditional flat and hierarchical category models, the multidimensional category model classifies each document using multiple predefined sets of categories, where each set corresponds to a dimension. Since a multidimensional model can be converted to flat and hierarchical models, three classification approaches are possible, i.e., classifying directly based on the multidimensional model or classifying with the equivalent flat or hierarchical models. The efficiency of these three approaches is investigated using a drug information collection with two different dimensions: 1) drug topics and 2) primary therapeutic classes. In the experiments, k-nearest neighbor, naive Bayes, and two centroid-based methods are selected as classifiers. The comparisons among the three approaches are made using two-way analysis of variance, followed by Scheffé's test for post hoc comparison. The experimental results show that multidimensional-based classification performs better than the others, especially in the presence of a relatively small training set. As one application, a category-based search engine using the multidimensional category concept was developed to help users retrieve drug information.
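
    The model conversions described can be sketched with toy documents, assuming scikit-learn: one classifier per dimension for the multidimensional approach, and a single classifier over the cross-product label set for the equivalent flat model. Labels and documents below are invented for illustration.

    ```python
    # Multidimensional vs. equivalent flat category models on toy drug texts.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    docs = ["adverse reactions of beta blockers",
            "dosage of beta blockers in hypertension",
            "adverse reactions of statins",
            "dosage of statins for hyperlipidemia"]
    topic = ["adverse", "dosage", "adverse", "dosage"]     # dimension 1
    ther = ["cardio", "cardio", "lipid", "lipid"]          # dimension 2

    # Multidimensional approach: one classifier per dimension.
    per_dim = {name: make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(docs, y)
               for name, y in [("topic", topic), ("class", ther)]}

    # Equivalent flat model: the cross-product of the two label sets.
    flat = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(
        docs, [f"{t}/{c}" for t, c in zip(topic, ther)])

    query = ["adverse reactions of statins in elderly patients"]
    print({d: m.predict(query)[0] for d, m in per_dim.items()},
          flat.predict(query)[0])
    ```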

  9. Using statistical text classification to identify health information technology incidents.

    PubMed

    Chai, Kevin E K; Anthony, Stephen; Coiera, Enrico; Magrabi, Farah

    2013-01-01

    To examine the feasibility of using statistical text classification to automatically identify health information technology (HIT) incidents in the USA Food and Drug Administration (FDA) Manufacturer and User Facility Device Experience (MAUDE) database. We used a subset of 570 272 incidents including 1534 HIT incidents reported to MAUDE between 1 January 2008 and 1 July 2010. Text classifiers using regularized logistic regression were evaluated with both 'balanced' (50% HIT) and 'stratified' (0.297% HIT) datasets for training, validation, and testing. Dataset preparation, feature extraction, feature selection, cross-validation, classification, performance evaluation, and error analysis were performed iteratively to further improve the classifiers. Feature-selection techniques such as removing short words and stop words, stemming, lemmatization, and principal component analysis were examined. The measures used were the κ statistic, F1 score, precision, and recall. Classification performance was similar on both the stratified (0.954 F1 score) and balanced (0.995 F1 score) datasets. Stemming was the most effective technique, reducing the feature set size to 79% while maintaining comparable performance. Training with balanced datasets improved recall (0.989) but reduced precision (0.165). Statistical text classification appears to be a feasible method for identifying HIT reports within large databases of incidents. Automated identification should enable more HIT problems to be detected, analyzed, and addressed in a timely manner. Semi-supervised learning may be necessary when applying machine learning to big data analysis of patient safety incidents and requires further investigation.
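
    A sketch of the core classifier, assuming a toy stand-in for the MAUDE narratives: L2-regularized logistic regression over TF-IDF features, with class weighting playing the role of the 'balanced' training set. The texts, labels and parameter values below are illustrative, not the study's:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Invented stand-ins for incident narratives: 1 = HIT incident, 0 = other.
texts = ["software froze during order entry", "infusion pump battery failed",
         "wrong patient record displayed on screen", "catheter tip fractured",
         "interface dropped lab results", "lead wire insulation damaged"] * 50
labels = [1, 0, 1, 0, 1, 0] * 50

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

vectorizer = TfidfVectorizer(stop_words="english")
Xtr = vectorizer.fit_transform(X_train)
Xte = vectorizer.transform(X_test)

# Regularized logistic regression; class_weight="balanced" plays the
# role of training on a rebalanced (50% HIT) dataset.
clf = LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000)
clf.fit(Xtr, y_train)
print(classification_report(y_test, clf.predict(Xte)))
```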

  10. Automatic lymphoma classification with sentence subgraph mining from pathology reports.

    PubMed

    Luo, Yuan; Sohani, Aliyah R; Hochberg, Ephraim P; Szolovits, Peter

    2014-01-01

    Pathology reports are rich in narrative statements that encode a complex web of relations among medical concepts. These relations are routinely used by doctors to reason on diagnoses, but often require hand-crafted rules or supervised learning to extract into prespecified forms for computational disease modeling. We aim to automatically capture relations from narrative text without supervision. We design a novel framework that translates sentences into graph representations, automatically mines sentence subgraphs, reduces redundancy in mined subgraphs, and automatically generates subgraph features for subsequent classification tasks. To ensure meaningful interpretations over the sentence graphs, we use the Unified Medical Language System Metathesaurus to map token subsequences to concepts, and in turn sentence graph nodes. We test our system with multiple lymphoma classification tasks that together mimic the differential diagnosis by a pathologist. To this end, we prevent our classifiers from looking at explicit mentions or synonyms of lymphomas in the text. We compare our system with three baseline classifiers using standard n-grams, full MetaMap concepts, and filtered MetaMap concepts. Our system achieves high F-measures on multiple binary classifications of lymphoma (Burkitt lymphoma, 0.8; diffuse large B-cell lymphoma, 0.909; follicular lymphoma, 0.84; Hodgkin lymphoma, 0.912). Significance tests show that our system outperforms all three baselines. Moreover, feature analysis identifies subgraph features that contribute to improved performance; these features agree with the state-of-the-art knowledge about lymphoma classification. We also highlight how these unsupervised relation features may provide meaningful insights into lymphoma classification. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.
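
    A minimal sketch of the sentence-graph representation, with a hand-built graph standing in for the UMLS Metathesaurus concept mapping and the parsing step; the actual system mines larger subgraphs and prunes redundant ones:

```python
import networkx as nx

# Hand-mapped example: token subsequences are mapped to concept nodes
# (standing in for UMLS Metathesaurus lookups) and linked by their
# grammatical relations to form a sentence graph.
sentence = "Large atypical cells express CD30"
concepts = {"large atypical cells": "Abnormal cell",
            "express": "Gene expression",
            "CD30": "CD30 antigen"}

g = nx.DiGraph()
g.add_edge(concepts["large atypical cells"], concepts["express"], rel="subj")
g.add_edge(concepts["express"], concepts["CD30"], rel="obj")

# The simplest subgraph features are the labelled edges themselves;
# each mined subgraph becomes a feature for the lymphoma classifiers.
features = {(u, d["rel"], v) for u, v, d in g.edges(data=True)}
print(features)
# {('Abnormal cell', 'subj', 'Gene expression'),
#  ('Gene expression', 'obj', 'CD30 antigen')}
```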

  11. Automatic lymphoma classification with sentence subgraph mining from pathology reports

    PubMed Central

    Luo, Yuan; Sohani, Aliyah R; Hochberg, Ephraim P; Szolovits, Peter

    2014-01-01

    Objective Pathology reports are rich in narrative statements that encode a complex web of relations among medical concepts. These relations are routinely used by doctors to reason on diagnoses, but often require hand-crafted rules or supervised learning to extract into prespecified forms for computational disease modeling. We aim to automatically capture relations from narrative text without supervision. Methods We design a novel framework that translates sentences into graph representations, automatically mines sentence subgraphs, reduces redundancy in mined subgraphs, and automatically generates subgraph features for subsequent classification tasks. To ensure meaningful interpretations over the sentence graphs, we use the Unified Medical Language System Metathesaurus to map token subsequences to concepts, and in turn sentence graph nodes. We test our system with multiple lymphoma classification tasks that together mimic the differential diagnosis by a pathologist. To this end, we prevent our classifiers from looking at explicit mentions or synonyms of lymphomas in the text. Results and Conclusions We compare our system with three baseline classifiers using standard n-grams, full MetaMap concepts, and filtered MetaMap concepts. Our system achieves high F-measures on multiple binary classifications of lymphoma (Burkitt lymphoma, 0.8; diffuse large B-cell lymphoma, 0.909; follicular lymphoma, 0.84; Hodgkin lymphoma, 0.912). Significance tests show that our system outperforms all three baselines. Moreover, feature analysis identifies subgraph features that contribute to improved performance; these features agree with the state-of-the-art knowledge about lymphoma classification. We also highlight how these unsupervised relation features may provide meaningful insights into lymphoma classification. PMID:24431333

  12. Automatic extraction of relations between medical concepts in clinical texts

    PubMed Central

    Harabagiu, Sanda; Roberts, Kirk

    2011-01-01

    Objective A supervised machine learning approach to discover relations between medical problems, treatments, and tests mentioned in electronic medical records. Materials and methods A single support vector machine classifier was used to identify relations between concepts and to assign their semantic type. Several resources such as Wikipedia, WordNet, General Inquirer, and a relation similarity metric inform the classifier. Results The techniques reported in this paper were evaluated in the 2010 i2b2 Challenge and obtained the highest F1 score for the relation extraction task. When gold standard data for concepts and assertions were available, F1 was 73.7, precision was 72.0, and recall was 75.3. F1 is defined as 2*Precision*Recall/(Precision+Recall). Alternatively, when concepts and assertions were discovered automatically, F1 was 48.4, precision was 57.6, and recall was 41.7. Discussion Although a rich set of features was developed for the classifiers presented in this paper, little knowledge mining was performed from medical ontologies such as those found in UMLS. Future studies should incorporate features extracted from such knowledge sources, which we expect to further improve the results. Moreover, each relation discovery was treated independently. Joint classification of relations may further improve the quality of results. Also, joint learning of the discovery of concepts, assertions, and relations may also improve the results of automatic relation extraction. Conclusion Lexical and contextual features proved to be very important in relation extraction from medical texts. When they are not available to the classifier, the F1 score decreases by 3.7%. In addition, features based on similarity contribute to a decrease of 1.1% when they are not available. PMID:21846787
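
    As a quick check, the F1 definition stated in the abstract reproduces the reported figures from the listed precision and recall (the small difference in the first case is rounding in the published values):

```python
def f1(precision, recall):
    """F1 as defined in the abstract: 2*P*R / (P + R)."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(72.0, 75.3), 1))  # -> 73.6, matching the reported 73.7
print(round(f1(57.6, 41.7), 1))  # -> 48.4, matching the automatic setting
```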

  13. Automatic image classification for the urinoculture screening.

    PubMed

    Andreini, Paolo; Bonechi, Simone; Bianchini, Monica; Garzelli, Andrea; Mecocci, Alessandro

    2016-03-01

    Urinary tract infections (UTIs) are considered the most common bacterial infection; it is estimated that about 150 million UTIs occur worldwide each year, giving rise to roughly $6 billion in healthcare expenditures and resulting in 100,000 hospitalizations. Nevertheless, it is difficult to assess the incidence of UTIs accurately, since an accurate diagnosis depends both on the presence of symptoms and on a positive urinoculture, whereas in most outpatient settings this diagnosis is made without an ad hoc analysis protocol. In the traditional urinoculture test, a sample of midstream urine is put onto a Petri dish, where a growth medium favors the proliferation of germ colonies; the infection severity is then evaluated by visual inspection by a human expert, an error-prone and lengthy process. In this paper, we propose a fully automated system for urinoculture screening that can provide quick and easily traceable results for UTIs. Based on advanced image processing and machine learning tools, the infection type recognition, together with the estimation of the bacterial load, can be carried out automatically, yielding accurate diagnoses. The proposed AID (Automatic Infection Detector) system provides support during the whole analysis process: first, digital color images of Petri dishes are automatically captured, then specific preprocessing and spatial clustering algorithms are applied to isolate the colonies from the culture ground and, finally, an accurate classification of the infections and their severity evaluation are performed. The AID system speeds up the analysis, contributes to the standardization of the process, allows result repeatability, and reduces costs. Moreover, the continuous transition between sterile and external environments (typical of the standard analysis procedure) is completely avoided. Copyright © 2016 Elsevier Ltd. All rights reserved.

  14. Automatic Approach to Vhr Satellite Image Classification

    NASA Astrophysics Data System (ADS)

    Kupidura, P.; Osińska-Skotak, K.; Pluto-Kossakowska, J.

    2016-06-01

    In this paper, we present a proposal for fully automatic classification of VHR satellite images. Unlike the most widespread approaches, supervised classification, which requires prior definition of class signatures, and unsupervised classification, which must be followed by interpretation of its results, the proposed method requires no human intervention except for setting the initial parameters. The presented approach is based on both spectral and textural analysis of the image and consists of 3 steps. The first step, the analysis of spectral data, relies on NDVI values. Its purpose is to distinguish between basic classes, such as water, vegetation and non-vegetation, which all differ significantly in their spectra and thus can easily be extracted by spectral analysis. The second step relies on granulometric maps. These are the product of local granulometric analysis of an image and carry information on the texture of each pixel's neighbourhood, depending on the texture grain. The purpose of texture analysis is to distinguish between classes that are spectrally similar but of different texture, e.g. bare soil from a built-up area, or low vegetation from a wooded area. Because the granulometric analysis is based on mathematical morphology opening and closing, the results are resistant to the border effect (qualifying borders of objects in an image as spaces of high texture), which affects other methods of texture analysis such as GLCM statistics or fractal analysis. Therefore, the effectiveness of the analysis is relatively high. Several indices based on values of different granulometric maps have been developed to simplify the extraction of classes of different texture. The third and final step of the process relies on a vegetation index, based on the near-infrared and blue bands. Its purpose is to correct partially misclassified pixels. All the indices used in the classification model developed relate to reflectance values, so the preliminary step
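
    The NDVI-based first step can be sketched as follows; the formula is the standard NDVI, but the threshold values below are illustrative placeholders, not the paper's settings:

```python
import numpy as np

def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    nir = nir.astype(float)
    red = red.astype(float)
    return (nir - red) / (nir + red + 1e-12)

def basic_classes(nir, red, water_thr=0.0, veg_thr=0.4):
    """Step 1 of the pipeline: split pixels into water / non-vegetation /
    vegetation by NDVI. The thresholds here are illustrative only."""
    v = ndvi(nir, red)
    classes = np.full(v.shape, "non-vegetation", dtype=object)
    classes[v < water_thr] = "water"
    classes[v >= veg_thr] = "vegetation"
    return classes

nir = np.array([[0.6, 0.1], [0.5, 0.3]])
red = np.array([[0.2, 0.3], [0.1, 0.3]])
print(basic_classes(nir, red))
# [['vegetation' 'water']
#  ['vegetation' 'non-vegetation']]
```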

  15. Method: automatic segmentation of mitochondria utilizing patch classification, contour pair classification, and automatically seeded level sets.

    PubMed

    Giuly, Richard J; Martone, Maryann E; Ellisman, Mark H

    2012-02-09

    While progress has been made to develop automatic segmentation techniques for mitochondria, there remains a need for more accurate and robust techniques to delineate mitochondria in serial block-face scanning electron microscopic data. Previously developed texture-based methods are limited for solving this problem because texture alone is often not sufficient to identify mitochondria. This paper presents a new three-step method, the Cytoseg process, for automated segmentation of mitochondria contained in 3D electron microscopic volumes generated through serial block-face scanning electron microscopic imaging. The method consists of three steps. The first is a random forest patch classification step operating directly on 2D image patches. The second step consists of contour-pair classification. At the final step, we introduce a method to automatically seed a level set operation with output from previous steps. We report accuracy of the Cytoseg process on three types of tissue and compare it to a previous method based on Radon-Like Features. At step 1, we show that the patch classifier identifies mitochondria texture but creates many false positive pixels. At step 2, our contour processing step produces contours and then filters them with a second classification step, helping to improve overall accuracy. We show that our final level set operation, which is automatically seeded with output from previous steps, helps to smooth the results. Overall, our results show that the use of contour-pair classification and level set operations improves segmentation accuracy beyond patch classification alone. We show that the Cytoseg process performs well compared to another modern technique based on Radon-Like Features. We demonstrated that texture-based methods for mitochondria segmentation can be enhanced with multiple steps that form an image processing pipeline. While we used a random-forest based patch classifier to recognize texture, it would be possible to replace this with

  16. Automatic classification of seismo-volcanic signatures

    NASA Astrophysics Data System (ADS)

    Malfante, Marielle; Dalla Mura, Mauro; Mars, Jérôme; Macedo, Orlando; Inza, Adolfo; Métaxian, Jean-Philippe

    2017-04-01

    The prediction of volcanic eruptions and the evaluation of their associated risks is still a timely and open issue. For this purpose, several types of signals are recorded in the proximity of volcanoes and then analysed by experts. Typically, seismic signals that are considered as precursors or indicators of an active volcanic phase are detected and manually classified. In this work, we propose an architecture for automatic classification of seismo-volcanic waves. The system we propose is based on supervised machine learning. Specifically, a prediction model is built from a large dataset of labelled examples by means of a learning algorithm (Support Vector Machine or Random Forest). Four main steps are involved: (i) preprocess the signals, (ii) from each signal, extract features that are useful for class discrimination, (iii) use an automatic learning algorithm to train a prediction model and (iv) classify (i.e., assign a semantic label to) newly recorded and unlabelled examples. Our main contribution lies in the definition of the feature space used to represent the signals (i.e., in the choice of the features to extract from the data). Feature vectors describe the data in a space of lower dimension with respect to the original one. Ideally, signals are separable in the feature space depending on their classes. For this work, we consider a large set of features (79) gathered from an extensive review of the state of the art in both the acoustic and seismic fields. An analysis of this feature set shows that for the application of interest, 11 features are sufficient to discriminate the data. The architecture is tested on 4725 seismic events recorded between June 2006 and September 2011 at Ubinas, the most active volcano of Peru. Six main classes of signals are considered: volcanic tremors (TR), long period (LP), volcano-tectonic (VT), explosion (EXP), hybrids (HIB) and tornillo (TOR). Our model reaches above 90% accuracy, thereby validating the proposed architecture and the
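
    Steps (ii) and (iii) of the pipeline can be sketched as follows, assuming synthetic signals and a handful of placeholder descriptors (the study uses 79 features, of which 11 suffice). The labels here are random, so the cross-validation score is chance level; the point is the shape of the pipeline, not its performance:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def extract_features(signal):
    """A few generic illustrative descriptors; placeholders for the
    paper's feature set."""
    spectrum = np.abs(np.fft.rfft(signal))
    return np.array([signal.std(), np.abs(signal).max(),
                     spectrum.argmax(), spectrum.mean()])

# Synthetic stand-ins for labelled seismo-volcanic events.
signals = [rng.normal(scale=s, size=512) for s in rng.uniform(0.5, 2.0, 200)]
labels = rng.integers(0, 6, size=200)  # 6 classes: TR, LP, VT, EXP, HIB, TOR

X = np.array([extract_features(s) for s in signals])
clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, labels, cv=5).mean())  # chance level on toy data
```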

  17. Automatic extraction of corollaries from semantic structure of text

    NASA Astrophysics Data System (ADS)

    Nurtazin, Abyz T.; Khisamiev, Zarif G.

    2016-11-01

    The aim of this study is to develop an algorithm that automatically represents a natural-language text as a formal system, from which reasonable answers to deep questions about the content of the text, as well as the deeper logical consequences of the text and of the related areas of knowledge to which it refers, can subsequently be extracted automatically. The most universal way to construct algorithms for the automatic processing of text for a particular purpose is to represent knowledge as a graph expressing the semantic values of the text. This paper presents an algorithm that automatically represents a sufficiently strict text, such as a legal text, together with its associated knowledge, as a formal logic programming theory. The representation is semantic-syntactic, in that the cause-and-effect relationships between the various parts are both logical and semantic. This representation of the text makes it possible to resolve questions about the cause-and-effect relationships between the concepts present, using the methods of the theory and practice of logic programming as well as the methods of model theory. In particular, these means of classical branches of mathematics can be used to address such issues as the definition and determination of consequences and questions of the consistency of the theory.

  18. Using statistical text classification to identify health information technology incidents

    PubMed Central

    Chai, Kevin E K; Anthony, Stephen; Coiera, Enrico; Magrabi, Farah

    2013-01-01

    Objective To examine the feasibility of using statistical text classification to automatically identify health information technology (HIT) incidents in the USA Food and Drug Administration (FDA) Manufacturer and User Facility Device Experience (MAUDE) database. Design We used a subset of 570 272 incidents including 1534 HIT incidents reported to MAUDE between 1 January 2008 and 1 July 2010. Text classifiers using regularized logistic regression were evaluated with both ‘balanced’ (50% HIT) and ‘stratified’ (0.297% HIT) datasets for training, validation, and testing. Dataset preparation, feature extraction, feature selection, cross-validation, classification, performance evaluation, and error analysis were performed iteratively to further improve the classifiers. Feature-selection techniques such as removing short words and stop words, stemming, lemmatization, and principal component analysis were examined. Measurements κ statistic, F1 score, precision and recall. Results Classification performance was similar on both the stratified (0.954 F1 score) and balanced (0.995 F1 score) datasets. Stemming was the most effective technique, reducing the feature set size to 79% while maintaining comparable performance. Training with balanced datasets improved recall (0.989) but reduced precision (0.165). Conclusions Statistical text classification appears to be a feasible method for identifying HIT reports within large databases of incidents. Automated identification should enable more HIT problems to be detected, analyzed, and addressed in a timely manner. Semi-supervised learning may be necessary when applying machine learning to big data analysis of patient safety incidents and requires further investigation. PMID:23666777

  19. Profiling School Shooters: Automatic Text-Based Analysis

    PubMed Central

    Neuman, Yair; Assaf, Dan; Cohen, Yochai; Knoll, James L.

    2015-01-01

    School shooters present a challenge to both forensic psychiatry and law enforcement agencies. The relatively small number of school shooters, their various characteristics, and the lack of in-depth analysis of all of the shooters prior to the shooting add complexity to our understanding of this problem. In this short paper, we introduce a new methodology for automatically profiling school shooters. The methodology involves automatic analysis of texts and the production of several measures relevant for the identification of the shooters. Comparing texts written by 6 school shooters to 6056 texts written by a comparison group of male subjects, we found that the shooters' texts scored significantly higher on the Narcissistic Personality dimension as well as on the Humiliated and Revengeful dimensions. Using a ranking/prioritization procedure, similar to the one used for the automatic identification of sexual predators, we provide support for the validity and relevance of the proposed methodology. PMID:26089804

  20. Research on Classification of Chinese Text Data Based on SVM

    NASA Astrophysics Data System (ADS)

    Lin, Yuan; Yu, Hongzhi; Wan, Fucheng; Xu, Tao

    2017-09-01

    Data mining has important application value in today's industry and academia, and text classification is a very important technology within it. At present, there are many mature algorithms for text classification: kNN, NB, AB, SVM, decision trees and other classification methods all show good classification performance. The Support Vector Machine (SVM) is a well-established classifier in machine learning research. This paper studies the classification of Chinese text data using the SVM method, applying it to classify Chinese texts so as to bring together academic research and practical application.
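
    A minimal sketch of the SVM approach on Chinese text, assuming a toy corpus; character n-grams stand in for proper Chinese word segmentation here, and the labels and query are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy Chinese snippets; character n-grams avoid the need for a
# separate word segmenter in this sketch.
docs = ["今天股市大涨", "球队赢得比赛冠军", "央行下调利率", "运动员打破世界纪录"]
labels = ["finance", "sports", "finance", "sports"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 2)),
    LinearSVC())
model.fit(docs, labels)
print(model.predict(["利率市场波动"]))  # expected: ['finance']
```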

  1. PDF text classification to leverage information extraction from publication reports.

    PubMed

    Bui, Duy Duc An; Del Fiol, Guilherme; Jonnalagadda, Siddhartha

    2016-06-01

    Data extraction from original study reports is a time-consuming, error-prone process in systematic review development. Information extraction (IE) systems have the potential to assist humans in the extraction task; however, the majority of IE systems were not designed to work on Portable Document Format (PDF) documents, an important and common extraction source for systematic reviews. In a PDF document, narrative content is often mixed with publication metadata or semi-structured text, which adds challenges for the underlying natural language processing algorithm. Our goal is to categorize PDF texts for strategic use by IE systems. We used an open-source tool to extract raw texts from a PDF document and developed a text classification algorithm that follows a multi-pass sieve framework to automatically classify PDF text snippets (for brevity, texts) into TITLE, ABSTRACT, BODYTEXT, SEMISTRUCTURE, and METADATA categories. To validate the algorithm, we developed a gold standard of PDF reports that were included in the development of previous systematic reviews by the Cochrane Collaboration. In a two-step procedure, we evaluated (1) classification performance, compared with a machine learning classifier, and (2) the effects of the algorithm on an IE system that extracts clinical outcome mentions. The multi-pass sieve algorithm achieved an accuracy of 92.6%, which was 9.7% (p<0.001) higher than the best performing machine learning classifier that used a logistic regression algorithm. F-measure improvements were observed in the classification of TITLE (+15.6%), ABSTRACT (+54.2%), BODYTEXT (+3.7%), SEMISTRUCTURE (+34%), and METADATA (+14.2%). In addition, use of the algorithm to filter semi-structured texts and publication metadata improved performance of the outcome extraction system (F-measure +4.1%, p=0.002). It also reduced the number of sentences to be processed by 44.9% (p<0.001), which corresponds to a processing time reduction of 50% (p=0.005). The rule-based multi
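
    The multi-pass sieve idea can be sketched as ordered rule passes, each labelling only the snippets it is confident about and deferring the rest to later passes; the rules below are invented for illustration and are not the paper's actual sieves:

```python
import re

# Each sieve returns a label when confident, otherwise None.
def sieve_title(snippet, position):
    if position == 0 and len(snippet.split()) < 25:
        return "TITLE"

def sieve_metadata(snippet, position):
    if re.search(r"doi:|©|\bvol\.\s*\d+|\bpp\.\s*\d+", snippet, re.I):
        return "METADATA"

def sieve_semistructure(snippet, position):
    lines = snippet.splitlines()
    if len(lines) > 2 and sum(len(l.split()) < 6 for l in lines) > len(lines) / 2:
        return "SEMISTRUCTURE"

SIEVES = [sieve_title, sieve_metadata, sieve_semistructure]

def classify(snippets):
    labels = []
    for pos, s in enumerate(snippets):
        label = None
        for sieve in SIEVES:            # earlier passes take precedence
            label = sieve(s, pos)
            if label:
                break
        labels.append(label or "BODYTEXT")  # fall-through category
    return labels

print(classify(["A Trial of Drug X", "doi:10.1000/xyz",
                "Patients were randomized..."]))
# ['TITLE', 'METADATA', 'BODYTEXT']
```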

  2. Automatic Mapping of Martian Landforms Using Segmentation-based Classification

    NASA Astrophysics Data System (ADS)

    Ghosh, S.; Stepinski, T. F.; Vilalta, R.

    2007-03-01

    We use terrain segmentation and classification techniques to automatically map landforms on Mars. The method is applied to six sites to obtain geomorphic maps geared toward rapid characterization of impact craters.

  3. A text classification algorithm based on feature weighting

    NASA Astrophysics Data System (ADS)

    Yang, Han; Cui, Honggang; Tang, Hao

    2017-08-01

    Text classification comes down to matching documents to categories according to certain characteristics of the data to be classified. Of course, a complete match is not possible, so the optimal matching result must be selected to complete the classification. To address the shortcomings of the traditional KNN text classification algorithm, a KNN text classification algorithm based on feature weighting is proposed. The algorithm considers the contribution of each dimension to the classification, gives different weights to different features, strengthens the role of important features, and improves the classification accuracy of the algorithm.
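
    A minimal sketch of feature-weighted kNN: dimensions with larger weights contribute more to the distance, strengthening important features. The weights here are set by hand for illustration; in the paper's setting they would come from the proposed weighting scheme:

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, weights, k=3):
    """kNN with a feature-weighted Euclidean distance; the weight
    values here are illustrative placeholders."""
    d = np.sqrt(((X_train - x) ** 2 * weights).sum(axis=1))
    nearest = np.argsort(d)[:k]
    vals, counts = np.unique(y_train[nearest], return_counts=True)
    return vals[counts.argmax()]

X = np.array([[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.1, 0.8]])
y = np.array(["sports", "sports", "finance", "finance"])
w = np.array([2.0, 0.5])   # emphasize the first feature
print(weighted_knn_predict(X, y, np.array([0.6, 0.6]), w))  # -> 'sports'
```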

  4. Use of Automatic Text Analyzer in Preparation of SDI Profiles

    ERIC Educational Resources Information Center

    Carroll, John M.; Tague, Jean M.

    1973-01-01

    This research shows that by submitting samples of the client's recent professional reading material to automatic text analysis, Selective Dissemination of Information (SDI) profiles can be prepared that result in significantly higher initial recall scores than do those prepared by conventional techniques; relevance scores are not significantly…

  5. A Distance Measure for Automatic Document Classification by Sequential Analysis.

    ERIC Educational Resources Information Center

    Kar, Gautam; White, Lee J.

    1978-01-01

    Investigates the feasibility of using a distance measure for automatic sequential document classification. This property of the distance measure is used to design a sequential classification algorithm which classifies key words and analyzes them separately in order to assign primary and secondary classes to a document. (VT)

  7. Multi-sensor text classification experiments -- a comparison

    SciTech Connect

    Dasigi, V.R.; Mann, R.C.; Protopopescu, V.

    1997-01-01

    In this paper, the authors report recent results on the automatic classification of free text documents into a given number of categories. The method uses multiple sensors to derive informative clues about patterns of interest in the input text, and fuses this information using a neural network. Encouraging preliminary results were obtained by applying this approach to a set of free text documents from the Associated Press (AP) news wire. New free text documents have been made available by the Reuters news agency. The advantages of this collection compared to the AP data are that the Reuters stories were already manually classified and included sufficiently high numbers of stories per category. The results indicate the usefulness of the new method: after the network is fully trained, if data belonging to only one category are used for testing, correctness is about 90%, nearly 15% above the best results for the AP data. Based on the performance of the method with the AP and Reuters collections, the authors now have conclusive evidence that the approach is viable and practical. More work remains to be done on handling data belonging to multiple categories.
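
    A minimal sketch of the fusion step, assuming synthetic per-category sensor scores plus a few extra clues; a small neural network learns to combine them into a category decision. The data, network size and "sensors" are all placeholders for the system's actual inputs:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)

# Synthetic stand-in: each document is described by scores from several
# "sensors" (e.g., an LSI-based similarity per category plus other clues),
# which a small neural network fuses into a category decision.
n_docs, n_categories = 300, 4
lsi_scores = rng.random((n_docs, n_categories))   # one score per category
other_clues = rng.random((n_docs, 3))             # e.g., length, keyword flags
X = np.hstack([lsi_scores, other_clues])
y = lsi_scores.argmax(axis=1)                     # toy ground truth

fusion_net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                           random_state=0)
fusion_net.fit(X, y)
print(fusion_net.score(X, y))  # high on this toy data, since y follows X
```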

  8. Super pixel density based clustering automatic image classification method

    NASA Astrophysics Data System (ADS)

    Xu, Mingxing; Zhang, Chuan; Zhang, Tianxu

    2015-12-01

    Image classification is an important means of image segmentation and data mining, and achieving rapid automated image classification has been a focus of research. In this paper, an automatic image classification and outlier identification method is proposed, based on the superpixel density of cluster centers. The locations and gray values of the image pixels are used to compute density and distance, from which automatic image classification and outlier extraction are achieved. Since a large number of pixels dramatically increases the computational complexity, the image is preprocessed into a small number of superpixel sub-blocks before the density and distance calculations, and a normalized density-distance discrimination rule is designed to select cluster centers automatically, whereby the image is classified automatically and outliers are identified. Extensive experiments show that our method requires no human intervention, categorizes images faster than the density clustering algorithm, and performs automated image classification and outlier extraction effectively.

  9. Automatic inpainting scheme for video text detection and removal.

    PubMed

    Mosleh, Ali; Bouguila, Nizar; Ben Hamza, Abdessamad

    2013-11-01

    We present a two-stage framework for automatic video text removal, which detects and removes embedded video texts and fills in the remaining regions with appropriate data. In the video text detection stage, text locations in each frame are found via an unsupervised clustering performed on the connected components produced by the stroke width transform (SWT). Since SWT needs an accurate edge map, we develop a novel edge detector which benefits from the geometric features revealed by the bandlet transform. Next, the motion patterns of the text objects of each frame are analyzed to localize video texts. The detected video text regions are removed, then the video is restored by an inpainting scheme. The proposed video inpainting approach applies spatio-temporal geometric flows extracted by bandlets to reconstruct the missing data. A 3D volume regularization algorithm, which takes advantage of bandlet bases in exploiting the anisotropic regularities, is introduced to carry out the inpainting task. The method does not need extra processes to satisfy visual consistency. The experimental results demonstrate the effectiveness of both our proposed video text detection approach and the video completion technique, and consequently the entire automatic video text removal and restoration process.

  10. Document Exploration and Automatic Knowledge Extraction for Unstructured Biomedical Text

    NASA Astrophysics Data System (ADS)

    Chu, S.; Totaro, G.; Doshi, N.; Thapar, S.; Mattmann, C. A.; Ramirez, P.

    2015-12-01

    We describe our work on building a web-browser based document reader with a built-in exploration tool and automatic concept extraction of medical entities for biomedical text. Vast amounts of biomedical information are offered in unstructured text form through scientific publications and R&D reports. Utilizing text mining can help us to mine information and extract relevant knowledge from a plethora of biomedical text. The ability to employ such technologies to aid researchers in coping with information overload is greatly desirable. In recent years, there has been an increased interest in automatic biomedical concept extraction [1, 2] and intelligent PDF reader tools with the ability to search on content and find related articles [3]. Such reader tools are typically desktop applications and are limited to specific platforms. Our goal is to provide researchers with a simple tool to aid them in finding, reading, and exploring documents. Thus, we propose a web-based document explorer, which we call Shangri-Docs, which combines a document reader with automatic concept extraction and highlighting of relevant terms. Shangri-Docs also provides the ability to evaluate a wide variety of document formats (e.g. PDF, Word, PPT, text, etc.) and to exploit the linked nature of the Web and personal content by performing searches on content from public sites (e.g. Wikipedia, PubMed) and private cataloged databases simultaneously. Shangri-Docs utilizes Apache cTAKES (clinical Text Analysis and Knowledge Extraction System) [4] and the Unified Medical Language System (UMLS) to automatically identify and highlight terms and concepts, such as specific symptoms, diseases, drugs, and anatomical sites, mentioned in the text. cTAKES was originally designed specifically to extract information from clinical medical records. Our investigation leads us to extend the automatic knowledge extraction process of cTAKES for the biomedical research domain by improving the ontology guided information extraction

  11. A scheme for automatic text rectification in real scene images

    NASA Astrophysics Data System (ADS)

    Wang, Baokang; Liu, Changsong; Ding, Xiaoqing

    2015-03-01

    Digital cameras are gradually replacing traditional flat-bed scanners as the main means of obtaining text information, owing to their usability, low cost and high resolution, and a large amount of research has been done on camera-based text understanding. Unfortunately, the arbitrary position of the camera lens relative to the text area frequently causes perspective distortion, which most OCR systems at present cannot manage, creating demand for automatic text rectification. Current rectification-related research has mainly focused on document images; distortion of natural scene text is seldom considered. In this paper, a scheme for automatic text rectification in natural scene images is proposed. It relies on geometric information extracted from the characters themselves as well as their surroundings. In the first step, linear segments are extracted from the region of interest, and a J-Linkage based clustering is performed, followed by customized refinement, to estimate the primary vanishing points (VPs). To achieve a more comprehensive VP estimation, a second stage inspects the internal structure of characters, involving analysis of pixels and connected components of text lines. Finally, VPs are verified and used to perform perspective rectification. Experiments demonstrate an increase in recognition rate and improvement compared with some related algorithms.

  12. Automatic Spectral Classification of Unresolved Binary Stars

    NASA Astrophysics Data System (ADS)

    Weaver, W. B.

    2000-12-01

    An artificial neural network (ANN) technique has been developed to perform two-dimensional classification of the components of binary stars of any temperature or luminosity classification. Using 15 Angstrom-resolution spectra, a single ANN can classify the unresolved components with an average accuracy of 2.5 subclasses in temperature and about 0.45 classes in luminosity for up to 3 magnitudes difference in luminosity. The use of two ANNs, the first providing coarse classification while the second provides specialist classification, reduces the mean absolute errors to about 0.5 subclasses in temperature and 0.33 classes in luminosity. The system operates with no human intervention except initial wavelength registration and can classify about 20 binaries per second on a Pentium-class computer. This research was supported by the Friends of MIRA.

  13. Machine Learning Algorithms for Automatic Classification of Marmoset Vocalizations

    PubMed Central

    Ribeiro, Sidarta; Pereira, Danillo R.; Papa, João P.; de Albuquerque, Victor Hugo C.

    2016-01-01

    Automatic classification of vocalization type could potentially become a useful tool for the acoustic monitoring of captive colonies of highly vocal primates. However, for classification to be useful in practice, a reliable algorithm that can be successfully trained on small datasets is necessary. In this work, we consider seven different classification algorithms with the goal of finding a robust classifier that can be successfully trained on small datasets. We found good classification performance (accuracy > 0.83 and F1-score > 0.84) using the Optimum Path Forest classifier. Dataset and algorithms are made publicly available. PMID:27654941

  14. Text Classification by Combining Different Distance Functions with Weights

    NASA Astrophysics Data System (ADS)

    Yamada, Takahiro; Ishii, Naohiro; Nakashima, Toyoshiro

    Text classification is an important subject in data mining. Several methods have been developed for it, such as nearest neighbor analysis and latent semantic analysis. The k-nearest neighbor (kNN) classification is a well-known, simple and effective method for the classification of data in many domains. When using kNN, the distance function is important for measuring the distance and the similarity between data. To improve the performance of the kNN classifier, a new approach that combines multiple distance functions is proposed here. The weighting factors of elements in the distance function are computed by a genetic algorithm (GA) to make the measurement more effective. Further, ensemble processing was developed to improve the classification accuracy. Finally, experiments show that the methods developed here are effective in text classification.

  15. Automatic breast density classification using neural network

    NASA Astrophysics Data System (ADS)

    Arefan, D.; Talebpour, A.; Ahmadinejhad, N.; Kamali Asl, A.

    2015-12-01

    According to studies, the risk of breast cancer is directly associated with breast density, and much research has been done on the automatic diagnosis of breast density from mammography. In the current study, artifacts in mammograms are removed using image processing techniques, and with the method presented here, which detects points on the pectoral muscle edges and estimates the muscle boundary using regression techniques, the pectoral muscle is detected with high accuracy and the breast tissue is extracted fully automatically. To classify mammography images into three categories (Fatty, Glandular, Dense), a feature based on the difference of gray levels between hard tissue and soft tissue in mammograms is used in addition to the statistical features, with a neural network classifier with a hidden layer. The image database used in this research is the mini-MIAS database, and the maximum accuracy of the system in classifying images was reported as 97.66% with 8 hidden layers in the neural network.

  16. Automatic UXO classification for fully polarimetric GPR data

    NASA Astrophysics Data System (ADS)

    Youn, Hyoung-Sun; Chen, Chi-Chih

    2003-09-01

    This paper presents an automatic UXO classification system using a neural network and fuzzy inference, based on the classification rules developed at OSU. These rules incorporate scattering pattern, polarization and resonance features extracted from an ultra-wide-bandwidth, fully polarimetric radar system. These features allow one to discriminate an elongated object. The algorithm consists of two stages. The first stage classifies objects into clutter (groups A and D), horizontal linear objects (group B) and vertical linear objects (group C) according to the spatial distribution of the Estimated Linear Factor (ELF) values. The second stage then discriminates UXO-LIKE targets from clutter within groups B and C. The rule in the first stage was implemented by a neural network, and the rules in the second stage were realized by fuzzy inference with quantitative variables, i.e. ELF level, flatness of the Estimated Target Orientation (ETO), the consistency of the target orientation, and the magnitude of the target response. It was found that the classification performance of this automatic algorithm is comparable with or superior to that obtained from a trained expert. Moreover, the automatic classification procedure does not require the involvement of the operator and assigns an unbiased quantitative confidence level (or quality factor) to each classification. Classification errors and inconsistency associated with fatigue, memory fading or complex features should be greatly reduced.

  17. Generalized minimum dominating set and application in automatic text summarization

    NASA Astrophysics Data System (ADS)

    Xu, Yi-Zhi; Zhou, Hai-Jun

    2016-03-01

    For a graph formed by vertices and weighted edges, a generalized minimum dominating set (MDS) is a vertex set of smallest cardinality such that the summed weight of edges from each outside vertex to vertices in this set is equal to or larger than a certain threshold value. This generalized MDS problem reduces to the conventional MDS problem in the limiting case of all edge weights being equal to the threshold value. We treat the generalized MDS problem in the present paper with a replica-symmetric spin glass theory and derive a set of belief-propagation equations. As a practical application we consider the problem of extracting a set of sentences that best summarizes a given input text document. We carry out a preliminary test of this statistical physics-inspired method on the automatic text summarization problem.
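
    Restating the definition above in symbols (with V the vertex set, ∂i the neighbours of vertex i, w_ij the edge weights and θ the threshold):

```latex
% Generalized minimum dominating set: the smallest vertex set S such
% that every vertex outside S receives total edge weight at least the
% threshold \theta from its neighbours inside S.
\begin{equation*}
  S^{*} \;=\; \arg\min_{S \subseteq V} \; |S|
  \quad\text{subject to}\quad
  \sum_{j \in S \,\cap\, \partial i} w_{ij} \;\ge\; \theta
  \qquad \forall\, i \in V \setminus S .
\end{equation*}
% Setting w_{ij} = \theta on every edge recovers the conventional MDS.
```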

  18. Features identification for automatic burn classification.

    PubMed

    Serrano, Carmen; Boloix-Tortosa, Rafael; Gómez-Cía, Tomás; Acha, Begoña

    2015-12-01

    In this paper an automatic system to diagnose burn depths based on colour digital photographs is presented. There is a low success rate in the determination of burn depth for inexperienced surgeons (around 50%), which rises to between 64 and 76% for experienced surgeons. Determining the burn depth is one of the main steps in establishing the first treatment, which is crucial for the patient's evolution. As the cost of maintaining a Burn Unit is very high, it would be desirable to have an automatic system to give a first assessment in local medical centres or at the emergency department, where there is a lack of specialists. To this aim, a psychophysical experiment to determine the physical characteristics that physicians employ to diagnose a burn depth is described. A Multidimensional Scaling Analysis (MDS) is then applied to the data obtained from the experiment in order to identify these physical features. Subsequently, these characteristics are translated into mathematical features. Finally, via a classifier (Support Vector Machine) and a feature selection method, the discriminant power of these mathematical features to distinguish among burn depths is analysed, and the subset of features that best estimates the burn depth is selected. A success rate of 79.73% was obtained when burns were classified into those which needed grafts and those which did not. The results validate the ability of the features extracted from the psychophysical experiment to classify burns by depth. Copyright © 2015 Elsevier Ltd and ISBI. All rights reserved.

  19. Research on Automatic Classification, Indexing and Extracting. Annual Progress Report.

    ERIC Educational Resources Information Center

    Baker, F.T.; And Others

    In order to contribute to the success of several studies for automatic classification, indexing and extracting currently in progress, as well as to further the theoretical and practical understanding of textual item distributions, the development of a frequency program capable of supplying these types of information was undertaken. The program…

  20. Text Passage Retrieval Based on Colon Classification: Retrieval Performance.

    ERIC Educational Resources Information Center

    Shepherd, Michael A.

    1981-01-01

    Reports the results of experiments using colon classification for the analysis, representation, and retrieval of primary information from the full text of documents. Recall, precision, and search length measures indicate colon classification did not perform significantly better than Boolean or simple word occurrence systems. Thirteen references…

  1. The tool for the automatic analysis of text cohesion (TAACO): Automatic assessment of local, global, and text cohesion.

    PubMed

    Crossley, Scott A; Kyle, Kristopher; McNamara, Danielle S

    2016-12-01

    This study introduces the Tool for the Automatic Analysis of Cohesion (TAACO), a freely available text analysis tool that is easy to use, works on most operating systems (Windows, Mac, and Linux), is housed on a user's hard drive (rather than having an Internet interface), allows for the batch processing of text files, and incorporates over 150 classic and recently developed indices related to text cohesion. The study validates TAACO by investigating how its indices related to local, global, and overall text cohesion can predict expert judgments of text coherence and essay quality. The findings of this study provide predictive validation of TAACO and support the notion that expert judgments of text coherence and quality are either negatively correlated or not predicted by local and overall text cohesion indices, but are positively predicted by global indices of cohesion. Combined, these findings provide supporting evidence that coherence for expert raters is a property of global cohesion and not of local cohesion, and that expert ratings of text quality are positively related to global cohesion.

  2. Automatically classifying sentences in full-text biomedical articles into Introduction, Methods, Results and Discussion.

    PubMed

    Agarwal, Shashank; Yu, Hong

    2009-12-01

    Biomedical texts can be typically represented by four rhetorical categories: Introduction, Methods, Results and Discussion (IMRAD). Classifying sentences into these categories can benefit many other text-mining tasks. Although many studies have applied different approaches for automatically classifying sentences in MEDLINE abstracts into the IMRAD categories, few have explored the classification of sentences that appear in full-text biomedical articles. We first evaluated whether sentences in full-text biomedical articles could be reliably annotated into the IMRAD format and then explored different approaches for automatically classifying these sentences into the IMRAD categories. Our results show an overall annotation agreement of 82.14% with a Kappa score of 0.756. The best classification system is a multinomial naïve Bayes classifier trained on manually annotated data that achieved 91.95% accuracy and an average F-score of 91.55%, which is significantly higher than baseline systems. A web version of this system is available online at http://wood.ims.uwm.edu/full_text_classifier/.
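
    A minimal sketch of the best-performing configuration, a multinomial naive Bayes sentence classifier, assuming a four-sentence toy training set in place of the manually annotated corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labelled sentences standing in for the annotated corpus.
sentences = [
    "Lung cancer is a leading cause of death worldwide.",
    "We measured gene expression using RT-PCR.",
    "Expression was significantly higher in tumour samples.",
    "These findings suggest a role for the gene in progression.",
]
labels = ["INTRODUCTION", "METHODS", "RESULTS", "DISCUSSION"]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(sentences, labels)
print(model.predict(["We measured protein levels using ELISA."]))
# expected: ['METHODS']
```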

  3. Automatically classifying sentences in full-text biomedical articles into Introduction, Methods, Results and Discussion

    PubMed Central

    Agarwal, Shashank; Yu, Hong

    2009-01-01

    Biomedical texts can be typically represented by four rhetorical categories: Introduction, Methods, Results and Discussion (IMRAD). Classifying sentences into these categories can benefit many other text-mining tasks. Although many studies have applied different approaches for automatically classifying sentences in MEDLINE abstracts into the IMRAD categories, few have explored the classification of sentences that appear in full-text biomedical articles. We first evaluated whether sentences in full-text biomedical articles could be reliably annotated into the IMRAD format and then explored different approaches for automatically classifying these sentences into the IMRAD categories. Our results show an overall annotation agreement of 82.14% with a Kappa score of 0.756. The best classification system is a multinomial naïve Bayes classifier trained on manually annotated data that achieved 91.95% accuracy and an average F-score of 91.55%, which is significantly higher than baseline systems. A web version of this system is available online at http://wood.ims.uwm.edu/full_text_classifier/. Contact: hongyu@uwm.edu PMID:19783830

  4. Semi-automatic classification of textures in thoracic CT scans

    NASA Astrophysics Data System (ADS)

    Kockelkorn, Thessa T. J. P.; de Jong, Pim A.; Schaefer-Prokop, Cornelia M.; Wittenberg, Rianne; Tiehuis, Audrey M.; Gietema, Hester A.; Grutters, Jan C.; Viergever, Max A.; van Ginneken, Bram

    2016-08-01

    The textural patterns in the lung parenchyma, as visible on computed tomography (CT) scans, are essential to make a correct diagnosis in interstitial lung disease. We developed one automatic and two interactive protocols for classification of normal and seven types of abnormal lung textures. Lungs were segmented and subdivided into volumes of interest (VOIs) with homogeneous texture using a clustering approach. In the automatic protocol, VOIs were classified automatically by an extra-trees classifier that was trained using annotations of VOIs from other CT scans. In the interactive protocols, an observer iteratively trained an extra-trees classifier to distinguish the different textures, by correcting mistakes the classifier makes in a slice-by-slice manner. The difference between the two interactive methods was whether or not training data from previously annotated scans was used in classification of the first slice. The protocols were compared in terms of the percentages of VOIs that observers needed to relabel. Validation experiments were carried out using software that simulated observer behavior. In the automatic classification protocol, observers needed to relabel on average 58% of the VOIs. During interactive annotation without the use of previous training data, the average percentage of relabeled VOIs decreased from 64% for the first slice to 13% for the second half of the scan. Overall, 21% of the VOIs were relabeled. When previous training data was available, the average overall percentage of VOIs requiring relabeling was 20%, decreasing from 56% in the first slice to 13% in the second half of the scan.

  5. Automatic Spectral Classification of Galaxies in the Infrared

    NASA Astrophysics Data System (ADS)

    Navarro, S. G.; Guzmán, V.; Dafonte, C.; Kemp, S. N.; Corral, L. J.

    2016-10-01

    Multi-object spectroscopy (MOS) provides us with numerous spectral data, and the projected new facilities and survey missions will increase the number of available spectra from stars and galaxies. In order to better understand this huge amount of data we need to develop new techniques of analysis and classification. Over the past decades it has been demonstrated that artificial neural networks are excellent tools for automatic spectral classification and identification, being robust and highly resistant to the presence of noise. We present here the results of applying unsupervised neural networks, competitive neural networks (CNN) and self-organized maps (SOM), to a sample of 747 galaxy spectra from the Infrared Spectrograph (IRS) of Spitzer. We obtained an automatic classification into 17 groups with the CNN, and we compare the results with those obtained with SOMs. The final goal of the project is to develop an automatic spectral classification tool for galaxies in the infrared, making use of artificial neural networks with unsupervised training, and to analyze the spectral characteristics of the galaxies that can give us clues to the physical processes taking place inside them.

  6. Automatic lung nodule classification with radiomics approach

    NASA Astrophysics Data System (ADS)

    Ma, Jingchen; Wang, Qian; Ren, Yacheng; Hu, Haibo; Zhao, Jun

    2016-03-01

    Lung cancer is the leading cause of cancer deaths. Malignant lung nodules have extremely high mortality, while some benign nodules do not need any treatment. Thus, accurate diagnosis of whether a nodule is benign or malignant is necessary. Notably, although an additional invasive biopsy or a second CT scan 3 months later may currently help radiologists to make judgments, easier diagnosis approaches are urgently needed. In this paper, we propose a novel CAD method to distinguish benign from malignant lung cancer directly from CT images, which can not only improve the efficiency of tumor diagnosis but also greatly decrease the pain and risk of patients in the biopsy collection process. Briefly, following the state-of-the-art radiomics approach, 583 features were first used to measure the nodules' intensity, shape, heterogeneity and multi-frequency information. Then, with the Random Forest method, we distinguish benign nodules from malignant nodules by analyzing all these features. Notably, our proposed scheme was tested on all 79 CT scans with diagnosis data available in The Cancer Imaging Archive (TCIA), which contain 127 nodules, each annotated by at least one of four radiologists participating in the project. Satisfactorily, this method achieved 82.7% accuracy in the classification of malignant primary lung nodules and benign nodules. We believe it would bring much value to routine lung cancer diagnosis in CT imaging and provide improved decision support at much lower cost.

  7. Automatic grade classification of Barretts Esophagus through feature enhancement

    NASA Astrophysics Data System (ADS)

    Ghatwary, Noha; Ahmed, Amr; Ye, Xujiong; Jalab, Hamid

    2017-03-01

    Barrett's Esophagus (BE) is a precancerous condition that affects the esophagus and carries the risk of developing into esophageal adenocarcinoma. BE is the process of metaplastic intestinal epithelium developing and replacing the normal cells in the esophageal area. The detection of BE is considered difficult due to its appearance and properties. The diagnosis is usually done through both endoscopy and biopsy. Recently, Computer Aided Diagnosis systems have been developed to support physicians' opinions when they face difficulty in detection/classification of different types of diseases. In this paper, an automatic classification of the Barrett's Esophagus condition is introduced. The presented method enhances the internal features of a Confocal Laser Endomicroscopy (CLE) image by utilizing a proposed enhancement filter. This filter depends on fractional differentiation and integration, which improve the features in the discrete wavelet transform of an image. Subsequently, various features are extracted from each enhanced image at different levels for the multi-classification process. Our approach is validated on a dataset consisting of 32 patients and 262 images with different histology grades. The experimental results demonstrate the efficiency of the proposed technique. Our method helps clinicians achieve more accurate classification. This potentially helps to reduce the number of biopsies needed for diagnosis, facilitates the regular monitoring of treatment/development of the patient's case, and can help train doctors with the new endoscopy technology. Accurate automatic classification is particularly important for the Intestinal Metaplasia (IM) type, which could develop into deadly cancer. Hence, this work contributes automatic classification that facilitates early intervention/treatment and decreases the biopsy samples needed.

  8. Automatic classification of time-variable X-ray sources

    SciTech Connect

    Lo, Kitty K.; Farrell, Sean; Murphy, Tara; Gaensler, B. M.

    2014-05-01

    To maximize the discovery potential of future synoptic surveys, especially in the field of transient science, it will be necessary to use automatic classification to identify some of the astronomical sources. The data mining technique of supervised classification is suitable for this problem. Here, we present a supervised learning method to automatically classify variable X-ray sources in the Second XMM-Newton Serendipitous Source Catalog (2XMMi-DR2). Random Forest is our classifier of choice since it is one of the most accurate learning algorithms available. Our training set consists of 873 variable sources and their features are derived from time series, spectra, and other multi-wavelength contextual information. The 10-fold cross-validation accuracy on the training data is ∼97% on a 7-class data set. We applied the trained classification model to 411 unknown variable 2XMM sources to produce a probabilistically classified catalog. Using the classification margin and the Random Forest derived outlier measure, we identified 12 anomalous sources, of which 2XMM J180658.7–500250 appears to be the most unusual source in the sample. Its X-ray spectrum is suggestive of an ultraluminous X-ray source, but its variability makes it highly unusual. Machine-learned classification and anomaly detection will facilitate scientific discoveries in the era of all-sky surveys.

  9. Automatic detection and classification of leukocytes using convolutional neural networks.

    PubMed

    Zhao, Jianwei; Zhang, Minshu; Zhou, Zhenghua; Chu, Jianjun; Cao, Feilong

    2017-08-01

    The detection and classification of white blood cells (WBCs, also known as leukocytes) is an active research topic because of its important applications in disease diagnosis. Currently, the morphological analysis of blood cells is performed manually by skilled operators, which makes the analysis slow, yields non-standard accuracy, and depends on the operator's skill. Although many papers have studied the detection of WBCs or the classification of WBCs independently, few consider them together. This paper proposes an automatic detection and classification system for WBCs from peripheral blood images. It first proposes an algorithm to detect WBCs in microscope images based on the simple relation of the R and B color channels and morphological operations. A granularity feature (pairwise rotation invariant co-occurrence local binary pattern, PRICoLBP) and an SVM are then applied to separate eosinophils and basophils from the other WBCs. Lastly, convolutional neural networks are used to automatically extract high-level features from the WBCs, and a random forest is applied to these features to recognize the other three kinds of WBCs: neutrophils, monocytes and lymphocytes. Detection experiments on the CellaVision and ALL-IDB databases show that the proposed detection method outperforms the iterative threshold method at lower computational cost, and classification experiments show that the proposed classification method achieves higher accuracy than several other methods.
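
    A skeletal version of the two-stage scheme, with random placeholder arrays standing in for the PRICoLBP features and CNN activations; it shows only how the stages fit together, not the paper's actual feature extraction.

    ```python
    # Stage 1: SVM separates eosinophils/basophils from other WBCs.
    # Stage 2: random forest on (assumed precomputed) CNN features recognizes
    # the remaining three classes. All arrays are synthetic placeholders.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier

    lbp_feats = np.random.rand(500, 590)      # PRICoLBP-like granularity features
    stage1_y = np.random.randint(0, 2, 500)   # 1 = eosinophil/basophil, 0 = other
    svm = SVC().fit(lbp_feats, stage1_y)

    cnn_feats = np.random.rand(300, 256)      # CNN activations for "other" WBCs
    stage2_y = np.random.randint(0, 3, 300)   # neutrophil / monocyte / lymphocyte
    rf = RandomForestClassifier(n_estimators=200).fit(cnn_feats, stage2_y)
    ```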

  10. Enhancing navigation in biomedical databases by community voting and database-driven text classification

    PubMed Central

    Duchrow, Timo; Shtatland, Timur; Guettler, Daniel; Pivovarov, Misha; Kramer, Stefan; Weissleder, Ralph

    2009-01-01

    Background The breadth of biological databases and their information content continues to increase exponentially. Unfortunately, our ability to query such sources is still often suboptimal. Here, we introduce and apply community voting, database-driven text classification, and visual aids as a means to incorporate distributed expert knowledge, to automatically classify database entries and to efficiently retrieve them. Results Using a previously developed peptide database as an example, we compared several machine learning algorithms in their ability to classify abstracts of published literature results into categories relevant to peptide research, such as related or not related to cancer, angiogenesis, molecular imaging, etc. Ensembles of bagged decision trees met the requirements of our application best. No other algorithm consistently performed better in comparative testing. Moreover, we show that the algorithm produces meaningful class probability estimates, which can be used to visualize the confidence of automatic classification during the retrieval process. To allow viewing long lists of search results enriched by automatic classifications, we added a dynamic heat map to the web interface. We take advantage of community knowledge by enabling users to cast votes in Web 2.0 style in order to correct automated classification errors, which triggers reclassification of all entries. We used a novel framework in which the database "drives" the entire vote aggregation and reclassification process to increase speed while conserving computational resources and keeping the method scalable. In our experiments, we simulate community voting by adding various levels of noise to nearly perfectly labelled instances, and show that, under such conditions, classification can be improved significantly. Conclusion Using PepBank as a model database, we show how to build a classification-aided retrieval system that gathers training data from the community, is completely controlled
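
    The classifier family the study settled on, sketched with scikit-learn on synthetic data; the class-probability estimates are the quantity the paper visualizes in its confidence heat map.

    ```python
    # Ensemble of bagged decision trees with class-probability output.
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
    bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
    bag.fit(X, y)
    confidence = bag.predict_proba(X)[:, 1]   # probabilities behind the heat map
    ```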

  11. Using LSI and its variants in Text Classification

    NASA Astrophysics Data System (ADS)

    Batra, Shalini; Bawa, Seema

    Latent Semantic Indexing (LSI), a well-known technique in Information Retrieval, has been partially successful in text retrieval, but no major breakthrough has been achieved in text classification as yet. A significant step forward in this regard was made by Hofmann [3], who presented the probabilistic LSI (PLSI) model as an alternative to LSI. PLSI, however, does not provide exchangeable representations for documents and words, a limitation that led to the Latent Dirichlet Allocation (LDA) model [4]. A new local Latent Semantic Indexing method called "Local Relevancy Ladder-Weighted LSI" (LRLW-LSI) has also been proposed to improve text classification [5]. In this paper we study LSI and its variants in detail, analyze the role they play in text classification, and conclude with future directions in this area.
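
    A compact illustration of plain LSI as used for classification features: TF-IDF vectors reduced by truncated SVD. The toy corpus is invented.

    ```python
    # LSI = TF-IDF followed by truncated SVD; the reduced document vectors
    # can feed any downstream classifier.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["latent semantic indexing for retrieval",
            "text classification with topic models",
            "dirichlet allocation of latent topics"]
    tfidf = TfidfVectorizer().fit_transform(docs)
    lsi = TruncatedSVD(n_components=2, random_state=0)
    doc_vectors = lsi.fit_transform(tfidf)    # documents in the latent space
    ```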

  12. Automatic classification of seismic events within a regional seismograph network

    NASA Astrophysics Data System (ADS)

    Tiira, Timo; Kortström, Jari; Uski, Marja

    2015-04-01

    A fully automatic method for seismic event classification within a sparse regional seismograph network is presented. The tool is based on a supervised pattern recognition technique, the Support Vector Machine (SVM), trained here to distinguish weak local earthquakes from a bulk of human-made or spurious seismic events. The classification rules rely on differences in signal energy distribution between natural and artificial seismic sources. Seismic records are divided into four windows: P, P coda, S, and S coda. For each signal window, the short-term average (STA) is computed in 20 narrow frequency bands between 1 and 41 Hz. The resulting 80 discrimination parameters are used as training data for the SVM. SVM models are calculated for 19 on-line seismic stations in Finland. The event data are compiled mainly from fully automatic event solutions that are manually classified after the automatic location process. The station-specific SVM training events include 11-302 positive (earthquake) and 227-1048 negative (non-earthquake) examples. The best voting rules for combining results from different stations are determined during an independent testing period. Finally, the network processing rules are applied to an independent evaluation period comprising 4681 fully automatic event determinations, of which 98% have been manually identified as explosions or noise and 2% as earthquakes. The SVM method correctly identifies 94% of the non-earthquakes and all of the earthquakes. The results imply that the SVM tool can identify and filter out blasts and spurious events from fully automatic event solutions with a high level of confidence. The tool helps reduce the workload in manual seismic analysis by leaving only ~5% of the automatic event determinations, i.e. the probable earthquakes, for more detailed seismological analysis. The approach presented is easy to adjust to the requirements of a denser or wider high-frequency network, once enough training examples for building a station-specific data set are available.
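
    A hedged sketch of the feature construction: average spectral amplitude in 20 narrow bands per signal window, with four windows concatenated into the 80-parameter vector, then an SVM. Window lengths, sampling rate, and band edges are placeholders, and Welch's method stands in for the paper's STA computation.

    ```python
    # Band-energy features per window, concatenated across P, P coda, S, S coda.
    import numpy as np
    from scipy.signal import welch
    from sklearn.svm import SVC

    def band_features(window, fs=100.0, n_bands=20, fmin=1.0, fmax=41.0):
        f, pxx = welch(window, fs=fs, nperseg=256)
        edges = np.linspace(fmin, fmax, n_bands + 1)
        return np.array([pxx[(f >= lo) & (f < hi)].mean()
                         for lo, hi in zip(edges[:-1], edges[1:])])

    # 4 windows x 20 bands = 80 discrimination parameters per event
    event = [np.random.randn(1024) for _ in range(4)]
    x = np.concatenate([band_features(w) for w in event])

    X = np.tile(x, (50, 1)) + np.random.randn(50, 80)   # synthetic training set
    y = np.random.randint(0, 2, 50)                     # earthquake vs other
    svm = SVC(kernel="rbf").fit(X, y)
    ```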

  13. Improving classification in protein structure databases using text mining.

    PubMed

    Koussounadis, Antonis; Redfern, Oliver C; Jones, David T

    2009-05-05

    The classification of protein domains in the CATH resource is primarily based on structural comparisons, sequence similarity and manual analysis. One of the main bottlenecks in the processing of new entries is the evaluation of 'borderline' cases by human curators with reference to the literature, and better tools for helping both expert and non-expert users quickly identify relevant functional information from text are urgently needed. A text-based method for protein classification is presented, which complements the existing sequence- and structure-based approaches, especially in cases exhibiting low similarity to existing members and requiring manual intervention. The method is based on the assumption that textual similarity between sets of documents relating to proteins reflects similarities in biological function and can be exploited to make classification decisions. An optimal strategy for the text comparisons was identified using an established gold-standard enzyme dataset. Filtering the abstracts with a machine learning approach that discriminates sentences containing functional, structural and classification information relevant to the protein classification task improved performance. Testing this classification scheme on a dataset of 'borderline' protein domains that lack significant sequence or structure similarity to classified proteins showed that although, as expected, the structural similarity classifiers perform better on average, there is a significant benefit in incorporating text similarity in logistic regression models, indicating significant orthogonality in this additional information. Coverage was significantly increased, especially at low error rates, which is important for routine classification tasks: 15.3% for the combined structure and text classifier compared to 10% for the structural classifier alone, at a 10^-3 error rate. Finally, when only the highest scoring predictions were used to infer classification, an extra 4.2% of

  14. Automatic classification of killer whale vocalizations using dynamic time warping.

    PubMed

    Brown, Judith C; Miller, Patrick J O

    2007-08-01

    A set of killer whale sounds from Marineland were recently classified automatically [Brown et al., J. Acoust. Soc. Am. 119, EL34-EL40 (2006)] into call types using dynamic time warping (DTW), multidimensional scaling, and k-means clustering, giving near-perfect agreement with a perceptual classification. Here the effectiveness of four DTW algorithms is examined on a larger and much more challenging set of calls by Northern Resident whales, with each call consisting of two independently modulated pitch contours and with considerable overlap in contours for several of the perceptual call types. Classification results are given for each of the four algorithms for the low frequency contour (LFC), the high frequency contour (HFC), their derivatives, and weighted sums of the distances corresponding to LFC with HFC, LFC with its derivative, and HFC with its derivative. The best agreement with the perceptual classification was 90%, attained by the Sakoe-Chiba algorithm on the low frequency contours alone.
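
    The core DTW computation in plain dynamic-programming form; the paper compares four DTW variants, and this is the unconstrained baseline, shown on toy contours.

    ```python
    # Unconstrained DTW distance between two pitch contours.
    import numpy as np

    def dtw_distance(a, b):
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    lfc1 = np.array([500.0, 520.0, 540.0, 530.0])        # toy LFC contours (Hz)
    lfc2 = np.array([505.0, 525.0, 545.0, 550.0, 535.0])
    print(dtw_distance(lfc1, lfc2))
    ```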

  15. AUTOMATIC CLASSIFICATION OF VARIABLE STARS IN CATALOGS WITH MISSING DATA

    SciTech Connect

    Pichara, Karim; Protopapas, Pavlos

    2013-11-10

    We present an automatic classification method for astronomical catalogs with missing data. We use Bayesian networks and a probabilistic graphical model that allows us to perform inference to predict missing values given observed data and dependency relationships between variables. To learn a Bayesian network from incomplete data, we use an iterative algorithm that utilizes sampling methods and expectation maximization to estimate the distributions and probabilistic dependencies of variables from data with missing values. To test our model, we use three catalogs with missing data (SAGE, Two Micron All Sky Survey, and UBVI) and one complete catalog (MACHO). We examine how classification accuracy changes when information from missing data catalogs is included, how our method compares to traditional missing data approaches, and at what computational cost. Integrating these catalogs with missing data, we find that classification of variable objects improves by a few percent and by 15% for quasar detection while keeping the computational cost the same.

  16. Automatic sleep stage classification using ear-EEG.

    PubMed

    Stochholm, Andreas; Mikkelsen, Kaare; Kidmose, Preben

    2016-08-01

    Sleep assessment is of great importance in the diagnosis and treatment of sleep disorders. In clinical practice it is typically performed based on polysomnography recordings and manual sleep staging by experts. This procedure has the disadvantages that the measurements are cumbersome, may have a negative influence on sleep, and the clinical assessment is labor intensive. Addressing the latter, there has recently been encouraging progress in the field of automatic sleep staging [1]. Furthermore, a minimally obtrusive method for recording EEG from electrodes in the ear (ear-EEG) has recently been proposed [2]. The objective of this study was to investigate the feasibility of automatic sleep stage classification based on ear-EEG. This paper presents a preliminary study based on recordings from a total of 18 subjects. Sleep scoring was performed by a clinical expert based on frontal, central and occipital region EEG, as well as EOG and EMG. Five subjects were excluded from the study because of alpha wave contamination. In one subject the standard polysomnography was supplemented by ear-EEG. A single-channel EEG sleep stage classifier was implemented using the same features and the same classifier as proposed in [1]. The single-channel sleep classifier based on the scalp recordings showed an 85.7 % agreement with the manual expert scoring through 10-fold inter-subject cross validation, while the ear-EEG recordings, evaluated with 10-fold intra-subject cross validation, showed an 82 % agreement with the manual scoring. These results suggest that automatic sleep stage classification based on ear-EEG recordings may provide performance similar to single-channel scalp EEG sleep stage classification. Ear-EEG may thereby be a feasible technology for future minimally intrusive sleep stage classification.

  17. Lexical Inference Mechanisms for Text Understanding and Classification.

    ERIC Educational Resources Information Center

    Figa, Elizabeth; Tarau, Paul

    2003-01-01

    Describes a framework for building "story traces" (compact global views of a narrative) and "story projects" (selections of key elements of a narrative) and their applications in text understanding and classification. The resulting "abstract story traces" provide a compact view of the underlying narrative's key…

  18. PADMA: PArallel Data Mining Agents for scalable text classification

    SciTech Connect

    Kargupta, H.; Hamzaoglu, I.; Stafford, B.

    1997-03-01

    This paper introduces PADMA (PArallel Data Mining Agents), a parallel agent based system for scalable text classification. PADMA contains modules for (1) parallel data accessing operations, (2) parallel hierarchical clustering, and (3) web-based data visualization. This paper introduces the general architecture of PADMA and presents a detailed description of its different modules.

  19. Automatic Cataract Hardness Classification Ex Vivo by Ultrasound Techniques.

    PubMed

    Caixinha, Miguel; Santos, Mário; Santos, Jaime

    2016-04-01

    To demonstrate the feasibility of a new methodology for cataract hardness characterization and automatic classification using ultrasound techniques, different cataract degrees were induced in 210 porcine lenses. A 25-MHz ultrasound transducer was used to obtain acoustical parameters (velocity and attenuation) and backscattering signals. B-scan and parametric Nakagami images were constructed. Ninety-seven parameters were extracted and subjected to a Principal Component Analysis. Bayes, K-Nearest-Neighbours, Fisher Linear Discriminant and Support Vector Machine (SVM) classifiers were used to automatically classify the different cataract severities. Statistically significant increases with cataract formation were found for velocity, attenuation, mean brightness intensity of the B-scan images and the mean Nakagami m parameter (p < 0.01). The four classifiers showed good performance for healthy versus cataractous lenses (F-measure ≥ 92.68%), while for initial versus severe cataracts the SVM classifier showed the highest performance (90.62%). The results show that ultrasound techniques can be used for non-invasive cataract hardness characterization and automatic classification.

  20. Comparative Analysis of Document level Text Classification Algorithms using R

    NASA Astrophysics Data System (ADS)

    Syamala, Maganti; Nalini, N. J., Dr; Maguluri, Lakshamanaphaneendra; Ragupathy, R., Dr.

    2017-08-01

    Over the past few decades, tremendous volumes of data have become available on the Internet, in both structured and unstructured form. This exponential growth of information creates an urgent need for text classifiers. Text mining is an interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics and computational linguistics. To handle this situation, a wide range of supervised learning algorithms has been introduced. Among these, K-Nearest Neighbor (KNN) is the simplest and one of the most efficient classifiers in the text classification family, but it suffers from imbalanced class distributions and noisy term features. To cope with this challenge, we use document-based centroid dimensionality reduction (CentroidDR), implemented in R. By combining these two text classification techniques, the KNN and centroid classifiers, we propose a scalable and effective flat classifier, called MCenKNN, which performs substantially better than CenKNN.
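
    The abstract's experiments were run in R; for consistency with the other sketches in this listing, here is the centroid-reduction idea in Python: documents are re-expressed by their similarity to class centroids before KNN classification. All data are synthetic.

    ```python
    # Centroid-based dimensionality reduction followed by KNN.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics.pairwise import cosine_similarity

    X = np.abs(np.random.rand(200, 5000))       # term vectors (synthetic)
    y = np.random.randint(0, 4, 200)            # 4 classes
    centroids = np.vstack([X[y == c].mean(axis=0) for c in np.unique(y)])
    X_reduced = cosine_similarity(X, centroids) # 5000 dims -> 4 dims
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_reduced, y)
    ```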

  1. Simple-random-sampling-based multiclass text classification algorithm.

    PubMed

    Liu, Wuying; Wang, Lin; Yi, Mianzhu

    2014-01-01

    Multiclass text classification (MTC) is a challenging issue and the corresponding MTC algorithms can be used in many applications. The space-time overhead of these algorithms is a major concern in the era of big data. Through an investigation of the token frequency distribution in a Chinese web document collection, this paper reexamines the power law and proposes a simple-random-sampling-based MTC (SRSMTC) algorithm. Supported by a token-level memory to store labeled documents, the SRSMTC algorithm uses a text retrieval approach to solve text classification problems. Experimental results on the TanCorp data set show that the SRSMTC algorithm can achieve state-of-the-art performance at greatly reduced space-time requirements.

  2. LADAR And FLIR Based Sensor Fusion For Automatic Target Classification

    NASA Astrophysics Data System (ADS)

    Selzer, Fred; Gutfinger, Dan

    1989-01-01

    The purpose of this report is to show results of automatic target classification and sensor fusion for forward looking infrared (FLIR) and laser radar (LADAR) sensors. The sensor fusion database was acquired from the Naval Weapons Center and consists of coregistered LADAR (range and reflectance images), FLIR (raw and preprocessed images) and TV data. Using this database we have developed techniques to extract relevant object edges from the FLIR and LADAR, which are correlated to wireframe models. The resulting correlation coefficients from both the LADAR and FLIR are fused using either the Bayesian or the Dempster-Shafer combination method so as to provide a higher-confidence target classification output. Finally, to minimize the correlation process, the wireframe models are modified to reflect target range (size of target) and target orientation, which is extracted from the LADAR reflectance image.
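
    A small illustration of the Dempster-Shafer combination step on made-up correlation-derived masses over singleton target hypotheses; the normalization divides out the conflicting mass.

    ```python
    # Dempster's rule for two mass functions whose focal elements are singletons.
    def dempster_combine(m1, m2):
        prod = {h: m1.get(h, 0.0) * m2.get(h, 0.0) for h in set(m1) | set(m2)}
        agreement = sum(prod.values())        # 1 - K, the non-conflicting mass
        return {h: v / agreement for h, v in prod.items()}

    flir = {"tank": 0.6, "truck": 0.3, "clutter": 0.1}    # FLIR-derived masses
    ladar = {"tank": 0.5, "truck": 0.4, "clutter": 0.1}   # LADAR-derived masses
    print(dempster_combine(flir, ladar))                  # fused target beliefs
    ```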

  3. Automatic Detection and Classification of Coronal Mass Ejections

    NASA Astrophysics Data System (ADS)

    Qu, Ming; Shih, Frank Y.; Jing, Ju; Wang, Haimin

    2006-09-01

    We present an automatic algorithm to detect, characterize, and classify coronal mass ejections (CMEs) in Large Angle Spectrometric Coronagraph (LASCO) C2 and C3 images. The algorithm includes three steps: (1) production of running difference images of LASCO C2 and C3; (2) characterization of CME properties such as intensity, height, angular width of span, and speed; and (3) classification of strong, medium, and weak CMEs on the basis of the CME characterization. In this work, image enhancement, segmentation, and morphological methods are used to detect and characterize CME regions. In addition, Support Vector Machine (SVM) classifiers are incorporated with the CME properties to distinguish strong CMEs from weaker ones. The real-time CME detection and classification results are recorded in a database available to the public. Compared with the two available CME catalogs, the SOHO/LASCO and CACTus CME catalogs, we have achieved accurate and fast detection of strong CMEs and most weak CMEs.

  4. Scanning electron microscope automatic defect classification of process induced defects

    NASA Astrophysics Data System (ADS)

    Wolfe, Scott; McGarvey, Steve

    2017-03-01

    With the integration of high speed Scanning Electron Microscope (SEM) based Automated Defect Redetection (ADR) in both high volume semiconductor manufacturing and Research and Development (R&D), the need for reliable SEM Automated Defect Classification (ADC) has grown tremendously in the past few years. In many high volume manufacturing facilities and R&D operations, defect inspection is performed on E-Beam (EB), Bright Field (BF) or Dark Field (DF) defect inspection equipment. A comma separated value (CSV) file is created by both the patterned and non-patterned defect inspection tools. The defect inspection result file contains a list of the inspection anomalies detected during the inspection tool's examination of each structure, or the examination of an entire wafer's surface for non-patterned applications. This file is imported into the Defect Review Scanning Electron Microscope (DRSEM). Following the defect inspection result file import, the DRSEM automatically moves the wafer to each defect coordinate and performs ADR. During ADR the DRSEM operates in a reference mode, capturing a SEM image at the exact position of the anomaly's coordinates and capturing a SEM image of a reference location in the center of the wafer. A defect reference image is created by subtracting the defect image from the reference image. The exact coordinates of the defect are calculated from the computed defect position and the anomaly's stage coordinate recorded when the high magnification SEM defect image is captured. The captured SEM image is processed through DRSEM ADC binning, exported to a Yield Analysis System (YAS), or a combination of both. Process Engineers, Yield Analysis Engineers or Failure Analysis Engineers manually review the captured images to ensure that either the YAS defect binning or the DRSEM defect binning is accurately classifying the defects. This paper is an exploration of the feasibility of the

  5. Neural net learning issues in classification of free text documents

    SciTech Connect

    Dasigi, V.R.; Mann, R.C.

    1996-03-01

    In intelligent analysis of large amounts of text, no single clue reliably indicates that a pattern of interest has been found. When using multiple clues, it is not known how these should be integrated into a decision. In the context of this investigation, we have been using neural nets as parameterized mappings that allow for fusion of higher-level clues extracted from free text. By using higher-level clues and features, we avoid very large networks. By using the dominant singular values computed by Latent Semantic Indexing (LSI) and applying neural network algorithms to integrate these values and the outputs from other "sensors," we have obtained encouraging preliminary results with text classification.
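
    A sketch of the fusion architecture, assuming precomputed LSI projections and a few scalar "sensor" outputs concatenated and fed to a small neural network; all arrays are synthetic placeholders.

    ```python
    # Fuse LSI-derived features with other sensor scores in a small MLP.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    lsi_feats = np.random.rand(300, 10)     # projections onto top singular vectors
    other_sensors = np.random.rand(300, 3)  # e.g. keyword- or length-based clues
    X = np.hstack([lsi_feats, other_sensors])
    y = np.random.randint(0, 2, 300)        # document category labels
    net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000).fit(X, y)
    ```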

  6. Clinically-inspired automatic classification of ovarian carcinoma subtypes

    PubMed Central

    BenTaieb, Aïcha; Nosrati, Masoud S; Li-Chang, Hector; Huntsman, David; Hamarneh, Ghassan

    2016-01-01

    Context: It has been shown that ovarian carcinoma subtypes are distinct pathologic entities with differing prognostic and therapeutic implications. Histotyping by pathologists has good reproducibility, but occasional cases are challenging and require immunohistochemistry and subspecialty consultation. Motivated by the need for more accurate and reproducible diagnoses and to facilitate pathologists' workflow, we propose an automatic framework for ovarian carcinoma classification. Materials and Methods: Our method is inspired by pathologists' workflow. We analyse imaged tissues at two magnification levels and extract clinically-inspired color, texture, and segmentation-based shape descriptors using image-processing methods. We propose a carefully designed machine learning technique composed of four modules: a dissimilarity matrix, dimensionality reduction, feature selection and a support vector machine classifier to separate the five ovarian carcinoma subtypes using the extracted features. Results: This paper presents the details of our implementation and its validation on a clinically derived dataset of eighty high-resolution histopathology images. The proposed system achieved a multiclass classification accuracy of 95.0% when classifying unseen tissues. Assessment of the classifier's confusion (confusion matrix) between the five different ovarian carcinoma subtypes agrees with clinicians' confusion and reflects the difficulty in diagnosing endometrioid and serous carcinomas. Conclusions: Our results from this first study highlight the difficulty of ovarian carcinoma diagnosis, which originates from the intrinsic class imbalance observed among subtypes, and suggest that automatic analysis of ovarian carcinoma subtypes could be valuable to clinicians' diagnostic procedures by providing a second opinion. PMID:27563487

  7. Image classification approach for automatic identification of grassland weeds

    NASA Astrophysics Data System (ADS)

    Gebhardt, Steffen; Kühbauch, Walter

    2006-08-01

    The potential of digital image processing for weed mapping in arable crops has been widely investigated in recent decades. In grassland farming these techniques have rarely been applied so far. The project presented here focuses on the automatic identification of one of the most invasive and persistent grassland weed species, the broad-leaved dock (Rumex obtusifolius L.), in complex mixtures of grass and herbs. A total of 108 RGB images were acquired at close range from a field experiment under constant illumination conditions using a commercial digital camera. The objects of interest were separated from the background by transforming the 24-bit RGB images into 8-bit intensity images and then calculating local homogeneity images. These images were binarised by applying a dynamic grey value threshold. Finally, morphological opening was applied to the binary images, and the remaining contiguous regions were considered to be objects. In order to classify these objects into three different weed species, a soil class and a residue class, a total of 17 object features related to shape, color and texture of the weeds were extracted. Using MANOVA, 12 features contributing to classification were identified. Maximum-likelihood classification was conducted to discriminate the weed species. The total classification rate across all classes ranged from 76% to 83%. The classification of Rumex obtusifolius achieved detection rates between 85% and 93% with misclassification below 10%. Furthermore, Rumex obtusifolius distribution and density maps were generated based on the classification results and the transformation of image coordinates into the Gauss-Krueger system. These promising results show the high potential of image analysis for weed mapping in grassland and for the implementation of site-specific herbicide spraying.

  8. Rationale-Augmented Convolutional Neural Networks for Text Classification

    PubMed Central

    Zhang, Ye; Marshall, Iain; Wallace, Byron C.

    2016-01-01

    We present a new Convolutional Neural Network (CNN) model for text classification that jointly exploits labels on documents and their constituent sentences. Specifically, we consider scenarios in which annotators explicitly mark sentences (or snippets) that support their overall document categorization, i.e., they provide rationales. Our model exploits such supervision via a hierarchical approach in which each document is represented by a linear combination of the vector representations of its component sentences. We propose a sentence-level convolutional model that estimates the probability that a given sentence is a rationale, and we then scale the contribution of each sentence to the aggregate document representation in proportion to these estimates. Experiments on five classification datasets that have document labels and associated rationales demonstrate that our approach consistently outperforms strong baselines. Moreover, our model naturally provides explanations for its predictions. PMID:28191551
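
    The aggregation step in miniature: sentence vectors are combined into a document vector in proportion to estimated rationale probabilities. The embeddings and probabilities below are placeholders rather than actual CNN outputs.

    ```python
    # Rationale-weighted document representation from sentence vectors.
    import numpy as np

    sent_vecs = np.random.rand(8, 100)   # 8 sentence embeddings (placeholder)
    p_rationale = np.random.rand(8)      # sentence-level rationale probabilities
    weights = p_rationale / p_rationale.sum()
    doc_vec = (weights[:, None] * sent_vecs).sum(axis=0)   # document vector
    ```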

  9. Classification of protein-protein interaction full-text documents using text and citation network features.

    PubMed

    Kolchinsky, Artemy; Abi-Haidar, Alaa; Kaur, Jasleen; Hamed, Ahmed Abdeen; Rocha, Luis M

    2010-01-01

    We participated (as Team 9) in the Article Classification Task of the BioCreative II.5 Challenge: binary classification of full-text documents relevant to protein-protein interaction. We used two distinct classifiers for the online and offline challenges: (1) the lightweight Variable Trigonometric Threshold (VTT) linear classifier we successfully introduced in BioCreative 2 for binary classification of abstracts, and (2) a novel Naive Bayes classifier using features from the citation network of the relevant literature. We supplemented the supplied training data with full-text documents from the MIPS database. The lightweight VTT classifier was very competitive in this new full-text scenario: it was a top-performing submission in this task, taking into account the rank product of the area under the interpolated precision-recall curve, accuracy, balanced F-score, and Matthews correlation coefficient performance measures. The novel citation network classifier for the biomedical text mining domain, while not a top performer in the challenge, performed above the central tendency of all submissions, and therefore indicates a promising new avenue to investigate further in bibliome informatics.

  10. Adaptive automatic sleep stage classification under covariate shift.

    PubMed

    Khalighi, Sirvan; Sousa, Teresa; Nunes, Urbano

    2012-01-01

    Current automatic sleep stage classification (ASSC) methods that rely on polysomnographic (PSG) signals suffer from inter-subject differences that make them unreliable when facing new and different subjects. A novel adaptive sleep scoring method based on unsupervised domain adaptation, aiming to be robust to inter-subject variability, is proposed. We assume that sleep quality variants follow a covariate shift model, where only the distribution of the sleep features changes between the training and test phases. The maximum overlap discrete wavelet transform (MODWT) is applied to extract relevant features from EEG, EOG and EMG signals. A set of significant features is selected by minimum-redundancy maximum-relevance (mRMR), a powerful feature selection method. Finally, an instance-weighting method, namely importance weighted kernel logistic regression (IWKLR), is applied for the purpose of obtaining adaptation in classification. The classification results using leave-one-out cross-validation (LOOCV) show that the proposed method performs at the state of the art in the field of ASSC.
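
    IWKLR itself is not available in scikit-learn, so the following sketch substitutes a common alternative for obtaining covariate-shift importance weights: a probabilistic domain discriminator whose odds ratio approximates p_test/p_train (up to a constant factor), used as per-instance training weights. All data are synthetic.

    ```python
    # Importance weighting under covariate shift via a domain discriminator.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X_train = np.random.randn(400, 30)         # features from training subjects
    X_test = np.random.randn(200, 30) + 0.5    # shifted features, new subject

    # Discriminate train vs test samples; the odds give the density ratio.
    X_dom = np.vstack([X_train, X_test])
    d = np.r_[np.zeros(len(X_train)), np.ones(len(X_test))]
    dom = LogisticRegression(max_iter=1000).fit(X_dom, d)
    p = dom.predict_proba(X_train)[:, 1]
    w = p / (1.0 - p)                          # ~ p_test(x) / p_train(x)

    y_train = np.random.randint(0, 5, 400)     # sleep stage labels
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train, sample_weight=w)
    ```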

  11. Automatic classification of spectra from the Infrared Astronomical Satellite (IRAS)

    NASA Technical Reports Server (NTRS)

    Cheeseman, Peter; Stutz, John; Self, Matthew; Taylor, William; Goebel, John; Volk, Kevin; Walker, Helen

    1989-01-01

    A new classification of infrared spectra collected by the Infrared Astronomical Satellite (IRAS) is presented. The spectral classes were discovered automatically by a program called AutoClass 2, a method for discovering (inducing) classes from a database using a Bayesian probability approach. These classes can be used to give insight into the patterns that occur in the particular domain, in this case, infrared astronomical spectroscopy. The classified spectra are the entire Low Resolution Spectra (LRS) Atlas of 5,425 sources. There are seventy-seven classes in this classification, and these in turn were meta-classified to produce nine meta-classes. The classification is presented as spectral plots, IRAS color-color plots, galactic distribution plots and class commentaries. Cross-reference tables, listing the sources by IRAS name and by AutoClass class, are also given. These classes include some of the well known classes, such as the black-body class and the silicate emission classes, but many other classes were unsuspected, while others show important subtle differences within the well known classes.

  12. Automated Classification of Selected Data Elements from Free-text Diagnostic Reports for Clinical Research.

    PubMed

    Löpprich, Martin; Krauss, Felix; Ganzinger, Matthias; Senghas, Karsten; Riezler, Stefan; Knaup, Petra

    2016-08-05

    In the Multiple Myeloma clinical registry at Heidelberg University Hospital, most data are extracted from discharge letters. Our aim was to analyze whether the manual documentation process can be made more efficient by using natural language processing methods for multiclass classification of free-text diagnostic reports, in order to automatically document the diagnosis and state of disease of myeloma patients. The first objective was to create a corpus consisting of free-text diagnosis paragraphs of patients with multiple myeloma from German diagnostic reports, manually annotated with the relevant data elements by documentation specialists. The second objective was to construct and evaluate a framework using different NLP methods to enable automatic multiclass classification of relevant data elements from free-text diagnostic reports. The main diagnosis paragraph was extracted from the clinical reports of one third of the patients of the multiple myeloma research database from Heidelberg University Hospital, selected at random (in total 737 patients). An EDC system was set up, and two data entry specialists independently performed manual documentation of at least nine specific data elements for multiple myeloma characterization. Both data entries were compared and assessed by a third specialist, and an annotated text corpus was created. A framework was constructed, consisting of a self-developed package to split multiple diagnosis sequences into several subsequences, four different preprocessing steps to normalize the input data, and two classifiers: a maximum entropy classifier (MEC) and a support vector machine (SVM). In total 15 different pipelines were examined and assessed by ten-fold cross-validation, reiterated 100 times. As quality indicators, the average error rate and the average F1-score were computed. For significance testing the approximate randomization test was used. The created annotated corpus consists of 737 different diagnosis paragraphs with a

  13. Visual affective classification by combining visual and text features.

    PubMed

    Liu, Ningning; Wang, Kai; Jin, Xin; Gao, Boyang; Dellandréa, Emmanuel; Chen, Liming

    2017-01-01

    Affective analysis of images in social networks has drawn much attention, and the texts surrounding images have proven to provide valuable semantic meanings about image content, which can hardly be represented by low-level visual features. In this paper, we propose a novel approach for the visual affective classification (VAC) task. This approach combines visual representations with novel text features through a fusion scheme based on Dempster-Shafer (D-S) Evidence Theory. Specifically, we not only investigate different types of visual features and fusion methods for VAC, but also propose textual features to effectively capture emotional semantics from the short text associated with images based on word similarity. Experiments are conducted on three publicly available databases: the International Affective Picture System (IAPS), the Artistic Photos and the MirFlickr Affect set. The results demonstrate that the proposed approach combining visual and textual features provides promising results for the VAC task.

  15. Comparison of Document Index Graph Using TextRank and HITS Weighting Method in Automatic Text Summarization

    NASA Astrophysics Data System (ADS)

    Hadyan, Fadhlil; Shaufiah; Arif Bijaksana, Moch.

    2017-01-01

    Automatic summarization is a system that can help someone grasp the core information of a long text instantly by summarizing it automatically. Many summarization systems have already been developed, but many problems remain. This work proposes a summarization method using a document index graph. The method adapts the PageRank and HITS formulas, originally used to assess web pages, to score the words in the sentences of a text document. The expected outcome is a system that can summarize a single document by utilizing a document index graph with TextRank and HITS to automatically improve the quality of the summary results.
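
    A compact TextRank-style scorer, as a sketch of the graph-ranking idea: a sentence-similarity graph ranked with PageRank (HITS could be swapped in via networkx.hits). The sentences are invented.

    ```python
    # Rank sentences by centrality in a similarity graph (TextRank style).
    import networkx as nx
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    sentences = ["Automatic summarization extracts core information.",
                 "Graph ranking scores sentences by centrality.",
                 "Top sentences are selected as the summary."]
    sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
    graph = nx.from_numpy_array(sim)
    scores = nx.pagerank(graph)
    best = max(scores, key=scores.get)   # index of the highest-ranked sentence
    print(sentences[best])
    ```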

  16. An Approach for Automatic Classification of Radiology Reports in Spanish.

    PubMed

    Cotik, Viviana; Filippo, Darío; Castaño, José

    2015-01-01

    Automatic detection of relevant terms in medical reports is useful for educational purposes and for clinical research. Natural language processing (NLP) techniques can be applied in order to identify them. In this work we present an approach to classify radiology reports written in Spanish into two sets: those that indicate pathological findings and those that do not. In addition, the entities corresponding to pathological findings are identified in the reports. We use RadLex, a lexicon of English radiology terms, and NLP techniques to identify the occurrence of pathological findings. Reports are classified using a simple algorithm based on the presence of pathological findings, negation and hedge terms. The implemented algorithms were tested on a test set of 248 reports annotated by an expert, obtaining a best F1 measure of 0.72. The output of the classification task can be used to look for specific occurrences of pathological findings.
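
    A toy version of the rule-based decision, with invented Spanish term lists standing in for RadLex lookups: a report is flagged pathological if a finding term occurs without a nearby preceding negation or hedge cue.

    ```python
    # Hypothetical finding and negation lists; real systems would use RadLex.
    FINDINGS = {"nodulo", "derrame", "fractura"}    # pathological finding terms
    NEGATIONS = {"sin", "no", "descarta"}           # negation/hedge cues

    def is_pathological(report: str, window: int = 3) -> bool:
        tokens = report.lower().split()
        for i, tok in enumerate(tokens):
            if tok in FINDINGS:
                scope = tokens[max(0, i - window):i]   # preceding context
                if not any(neg in scope for neg in NEGATIONS):
                    return True
        return False

    print(is_pathological("se observa nodulo pulmonar"))   # True
    print(is_pathological("sin derrame pleural"))          # False
    ```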

  17. Automatic music genres classification as a pattern recognition problem

    NASA Astrophysics Data System (ADS)

    Ul Haq, Ihtisham; Khan, Fauzia; Sharif, Sana; Shaukat, Arsalan

    2013-12-01

    Music genres are the simplest and most effective descriptors for searching music libraries, stores or catalogues. This paper compares the results of two automatic music genre classification systems implemented using two different yet simple classifiers (K-Nearest Neighbor and Naïve Bayes). First a 10-12 second sample is selected and features are extracted from it; then, based on those features, the results of both classifiers are represented in the form of an accuracy table and a confusion matrix. Experiments show that a test sample taken from the middle of a song represents the true essence of its genre better than samples taken from the beginning or end of the song. The techniques achieved accuracies of 91% and 78% using the Naïve Bayes and KNN classifiers, respectively.

  18. Automatic classification of visual evoked potentials based on wavelet decomposition

    NASA Astrophysics Data System (ADS)

    Stasiakiewicz, Paweł; Dobrowolski, Andrzej P.; Tomczykiewicz, Kazimierz

    2017-04-01

    Diagnosis of the part of the visual system that is responsible for conducting compound action potentials is generally based on visual evoked potentials (VEPs) generated as a result of stimulation of the eye by an external light source. The condition of the patient's visual path is assessed through a set of parameters that describe the extremes, called waves, of the time domain characteristic. The decision process is complex, so the diagnosis depends significantly on the experience of the doctor. The authors developed a procedure, based on wavelet decomposition and linear discriminant analysis, that ensures automatic classification of visual evoked potentials. The algorithm assigns an individual case to the normal or pathological class. The proposed classifier has a sensitivity of 96.4% at a 10.4% probability of false alarm in a group of 220 cases, and the area under the ROC curve equals 0.96, which, from the medical point of view, is a very good result.
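
    A sketch of the pipeline under stated assumptions: subband energies from a wavelet decomposition as features, classified with linear discriminant analysis. Signals, wavelet choice, and labels below are synthetic.

    ```python
    # Wavelet-decomposition energies + LDA classification of VEP recordings.
    import numpy as np
    import pywt
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def vep_features(signal, wavelet="db4", level=5):
        coeffs = pywt.wavedec(signal, wavelet, level=level)
        return np.array([np.sum(c ** 2) for c in coeffs])  # subband energies

    X = np.array([vep_features(np.random.randn(512)) for _ in range(220)])
    y = np.random.randint(0, 2, 220)        # 0 = normal, 1 = pathological
    lda = LinearDiscriminantAnalysis().fit(X, y)
    ```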

  19. Automatic Text Structuring and Categorization As a First Step in Summarizing Legal Cases.

    ERIC Educational Resources Information Center

    Moens, Marie-Francine; Uyttendaele, Caroline

    1997-01-01

    Describes SALOMON (Summary and Analysis of Legal texts for Managing Online Needs), a system which automatically summarizes Belgian criminal cases to improve access to court decisions. Highlights include a text grammar represented as a semantic network; automatic abstracting; knowledge acquisition and representation; parsing; evaluation, including…

  20. Automatic Coding of Short Text Responses via Clustering in Educational Assessment

    ERIC Educational Resources Information Center

    Zehner, Fabian; Sälzer, Christine; Goldhammer, Frank

    2016-01-01

    Automatic coding of short text responses opens new doors in assessment. We implemented and integrated baseline methods of natural language processing and statistical modelling by means of software components that are available under open licenses. The accuracy of automatic text coding is demonstrated by using data collected in the "Programme…

  2. Automatic classification of ceramic sherds with relief motifs

    NASA Astrophysics Data System (ADS)

    Debroutelle, Teddy; Treuillet, Sylvie; Chetouani, Aladine; Exbrayat, Matthieu; Martin, Lionel; Jesset, Sebastien

    2017-03-01

    A large corpus of ceramic sherds dating from the High Middle Ages has been excavated in Saran (France). The sherds have an engraved frieze made by the potter with a carved wooden wheel. These relief patterns can be used to date the sherds in order to study the diffusion of ceramic production. The aim of the ARCADIA project was to develop an automatic classification of this archaeological heritage. The sherds were scanned using a three-dimensional (3-D) laser scanner. After projecting the 3-D point cloud onto a depth map, the local variance highlighted the shallow relief patterns. The saliency region focused on the motif was extracted by density-based spatial clustering of FAST points. An adaptive thresholding was then applied to the depth map to obtain a binary pattern close to manual sampling. The five most representative types of motif were classified by training an SVM model with a pyramid histogram of visual words descriptor. Compared with other state-of-the-art methods, the proposed approach succeeded in classifying up to 84% of the binary patterns on a dataset of 377 scanned sherds. The automatic method is extremely time-saving compared to manual stamping.

  3. Automatic classification of spatial signatures on semiconductor wafermaps

    SciTech Connect

    Tobin, K.W.; Gleason, S.S.; Karnowski, T.P.; Cohen, S.L.; Lakhani, F.

    1997-03-01

    This paper describes Spatial Signature Analysis (SSA), a cooperative research project between SEMATECH and Oak Ridge National Laboratory for automatically analyzing and reducing semiconductor wafermap defect data to useful information. Trends toward larger wafer formats and smaller critical dimensions have caused an exponential increase in the volume of visual and parametric defect data which must be analyzed and stored, therefore necessitating the development of automated tools for wafer defect analysis. Contamination particles that did not create problems with 1 micron design rules can now be categorized as killer defects. SSA is an automated wafermap analysis procedure which performs a sophisticated defect clustering and signature classification of electronic wafermaps. This procedure has been realized in a software system that contains a signature classifier that is user-trainable. Known examples of historically problematic process signatures are added to a training database for the classifier. Once a suitable training set has been established, the software can automatically segment and classify multiple signatures from a standard electronic wafermap file into user-defined categories. It is anticipated that successful integration of this technology with other wafer monitoring strategies will result in reduced time-to-discovery and ultimately improved product yield.

  4. Supervised and traditional term weighting methods for automatic text categorization.

    PubMed

    Lan, Man; Tan, Chew Lim; Su, Jian; Lu, Yue

    2009-04-01

    In the vector space model (VSM), text representation is the task of transforming the content of a textual document into a vector in the term space so that the document can be recognized and classified by a computer or a classifier. Different terms (i.e. words, phrases, or any other indexing units used to identify the contents of a text) have different importance in a text. Term weighting methods assign appropriate weights to the terms to improve the performance of text categorization. In this study, we investigate several widely-used unsupervised (traditional) and supervised term weighting methods on benchmark data collections in combination with SVM and kNN algorithms. In consideration of the distribution of relevant documents in the collection, we propose a new simple supervised term weighting method, tf.rf, to improve the terms' discriminating power for the text categorization task. In the controlled experiments, the supervised term weighting methods have mixed performance. Specifically, our proposed supervised term weighting method, tf.rf, consistently performs better than other term weighting methods, while other supervised term weighting methods based on information theory or statistical metrics perform the worst in all experiments. On the other hand, the popularly used tf.idf method has not shown uniformly good performance across different data sets.
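
    A small sketch of the proposed tf.rf weight as defined by the authors: the relevance frequency rf = log2(2 + a / max(1, c)), where a and c count positive- and negative-category documents containing the term, scales the raw term frequency.

    ```python
    # tf.rf supervised term weight (Lan et al.).
    import math

    def tf_rf(tf, docs_with_term_pos, docs_with_term_neg):
        rf = math.log2(2.0 + docs_with_term_pos / max(1.0, docs_with_term_neg))
        return tf * rf

    # A term appearing 3 times, in 40 positive and 5 negative documents:
    print(tf_rf(tf=3, docs_with_term_pos=40, docs_with_term_neg=5))
    ```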

  5. Automatic classification of sentences to support Evidence Based Medicine.

    PubMed

    Kim, Su Nam; Martinez, David; Cavedon, Lawrence; Yencken, Lars

    2011-03-29

    Given a set of pre-defined medical categories used in Evidence Based Medicine, we aim to automatically annotate sentences in medical abstracts with these labels. We constructed a corpus of 1,000 medical abstracts annotated by hand with specified medical categories (e.g. Intervention, Outcome). We explored the use of various features based on lexical, semantic, structural, and sequential information in the data, using Conditional Random Fields (CRF) for classification. For the classification tasks over all labels, our systems achieved micro-averaged f-scores of 80.9% and 66.9% over datasets of structured and unstructured abstracts respectively, using sequential features. In labeling only the key sentences, our systems produced f-scores of 89.3% and 74.0% over structured and unstructured abstracts respectively, using the same sequential features. The results over an external dataset were lower (f-scores of 63.1% for all labels, and 83.8% for key sentences). Of the features we used, the best for classifying any given sentence in an abstract were based on unigrams, section headings, and sequential information from preceding sentences. These features resulted in improved performance over a simple bag-of-words approach, and outperformed feature sets used in previous work.

  6. Network patterns recognition for automatic dermatologic images classification

    NASA Astrophysics Data System (ADS)

    Grana, Costantino; Daniele, Vanini; Pellacani, Giovanni; Seidenari, Stefania; Cucchiara, Rita

    2007-03-01

    In this paper we focus on the problem of automatic classification of melanocytic lesions, aiming to identify the presence of reticular patterns. The recognition of reticular lesions is an important step in the description of the pigmented network, in order to obtain meaningful diagnostic information. Parameters like color, size or symmetry could benefit from the knowledge of whether a lesion is reticular or non-reticular. The detection of network patterns is performed with a three-step procedure. The first step is the localization of line points by means of the line point detection algorithm first described by Steger. The second step links such points into lines, considering the direction of the line at its endpoints and the number of line points connected to them. A third step discards the meshes that could not be closed at the end of the linking procedure and those characterized by anomalous values of area or circularity. The number of valid meshes left and their area with respect to the whole area of the lesion are the inputs of a discriminant function which classifies the lesions into reticular and non-reticular. This approach was tested on balanced training and testing sets (each formed by 50 reticular and 50 non-reticular images). We obtained above 86% correct classification of the reticular and non-reticular lesions on real skin images, with a specificity value never lower than 92%.

  7. Automatic Classification of Specific Melanocytic Lesions Using Artificial Intelligence

    PubMed Central

    Jaworek-Korjakowska, Joanna; Kłeczek, Paweł

    2016-01-01

    Background. Given its propensity to metastasize and the lack of effective therapies for most patients with advanced disease, early detection of melanoma is a clinical imperative. Different computer-aided diagnosis (CAD) systems have been proposed to increase the specificity and sensitivity of melanoma detection. Although such computer programs have been developed for different diagnostic algorithms, to the best of our knowledge, a system to classify different melanocytic lesions has not been proposed yet. Method. In this research we present a new approach to the classification of melanocytic lesions. This work focuses not only on categorizing skin lesions as benign or malignant but also on specifying the exact type of a skin lesion, including melanoma, Clark nevus, Spitz/Reed nevus, and blue nevus. The proposed automatic algorithm contains the following steps: image enhancement, lesion segmentation, feature extraction and selection, and classification. Results. The algorithm has been tested on 300 dermoscopic images and achieved an accuracy of 92%, indicating that the proposed approach classified most of the melanocytic lesions correctly. Conclusions. The proposed system can not only help to precisely diagnose the type of a skin mole but can also decrease the number of biopsies and reduce the morbidity related to skin lesion excision. PMID:26885520

  8. Acoustic censusing using automatic vocalization classification and identity recognition.

    PubMed

    Adi, Kuntoro; Johnson, Michael T; Osiejuk, Tomasz S

    2010-02-01

    This paper presents an advanced method to acoustically assess animal abundance. The framework combines supervised classification (song-type and individual identity recognition), unsupervised classification (individual identity clustering), and the mark-recapture model of abundance estimation. The underlying algorithm is based on clustering using hidden Markov models (HMMs) and Gaussian mixture models (GMMs) similar to methods used in the speech recognition community for tasks such as speaker identification and clustering. Initial experiments using a Norwegian ortolan bunting (Emberiza hortulana) data set show the feasibility and effectiveness of the approach. Individually distinct acoustic features have been observed in a wide range of animal species, and this combined with the widespread success of speaker identification and verification methods for human speech suggests that robust automatic identification of individuals from their vocalizations is attainable. Only a few studies, however, have yet attempted to use individual acoustic distinctiveness to directly assess population density and structure. The approach introduced here offers a direct mechanism for using individual vocal variability to create simpler and more accurate population assessment tools in vocally active species.

  9. Semi-automatic indexing of full text biomedical articles.

    PubMed

    Gay, Clifford W; Kayaalp, Mehmet; Aronson, Alan R

    2005-01-01

    The main application of U.S. National Library of Medicine's Medical Text Indexer (MTI) is to provide indexing recommendations to the Library's indexing staff. The current input to MTI consists of the titles and abstracts of articles to be indexed. This study reports on an extension of MTI to the full text of articles appearing in online medical journals that are indexed for Medline. Using a collection of 17 journal issues containing 500 articles, we report on the effectiveness of the contribution of terms by the whole article and also by each section. We obtain the best results using a model consisting of the sections Results, Results and Discussion, and Conclusions together with the article's title and abstract, the captions of tables and figures, and sections that have no titles. The resulting model provides indexing significantly better (7.4%) than what is currently achieved using only titles and abstracts.

  11. (Almost) Automatic Semantic Feature Extraction from Technical Text

    DTIC Science & Technology

    1994-01-01

    The research described in this paper is part of a larger ongoing project called KUDZU (Knowledge Under Development from Zero Understanding), an NLP system developed at Mississippi State University aimed at exploring the automation of information extraction from technical texts.

  12. Automatic text extraction in news images using morphology

    NASA Astrophysics Data System (ADS)

    Jang, InYoung; Ko, ByoungChul; Byun, HyeRan; Choi, Yeongwoo

    2002-01-01

    In this paper we present a new method to extract both superimposed and embedded graphical text in a freeze-frame of news video. The algorithm is summarized in the following three steps. In the first step, we convert a color image into a gray-level image and apply contrast stretching to enhance the contrast of the input image. Then, a modified local adaptive thresholding is applied to the contrast-stretched image. The second step is divided into three processes: eliminating text-like components by applying erosion, dilation, and (OpenClose + CloseOpen)/2 morphological operations; maintaining text components using the (OpenClose + CloseOpen)/2 operation with a new Geo-correction method; and subtracting the two resulting images to further eliminate false-positive components. In the third, filtering step, the characteristics of each component, such as the ratio of the number of pixels in each candidate component to the number of its boundary pixels and the ratio of the minor to the major axis of each bounding box, are used. Acceptable results have been obtained using the proposed method on 300 news images, with a recognition rate of 93.6%. Our method also performs well on various other kinds of images when the size of the structuring element is adjusted.
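
    The first steps of the pipeline (contrast stretching, local adaptive thresholding, and the (OpenClose + CloseOpen)/2 smoothing) can be sketched with OpenCV as below. The file name, kernel size, and threshold parameters are assumptions for illustration; the paper's exact operator sizes and its Geo-correction step are not reproduced.

```python
import cv2
import numpy as np

img = cv2.imread("news_frame.png")            # placeholder file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Step 1a: contrast stretching to the full 0-255 range.
stretched = cv2.normalize(gray, None, 0, 255, cv2.NORM_MINMAX)

# Step 1b: local adaptive thresholding (plain version of the modified step).
binary = cv2.adaptiveThreshold(stretched, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                               cv2.THRESH_BINARY, 15, 5)

# Step 2: (OpenClose + CloseOpen) / 2 with a small structuring element.
k = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
open_close = cv2.morphologyEx(cv2.morphologyEx(binary, cv2.MORPH_OPEN, k),
                              cv2.MORPH_CLOSE, k)
close_open = cv2.morphologyEx(cv2.morphologyEx(binary, cv2.MORPH_CLOSE, k),
                              cv2.MORPH_OPEN, k)
smoothed = ((open_close.astype(np.uint16) + close_open) // 2).astype(np.uint8)
```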

  13. Automatic semantic interpretation of anatomic spatial relationships in clinical text.

    PubMed

    Bean, C A; Rindflesch, T C; Sneiderman, C A

    1998-01-01

    A set of semantic interpretation rules to link the syntax and semantics of locative relationships among anatomic entities was developed and implemented in a natural language processing system. Two experiments assessed the ability of the system to identify and characterize physico-spatial relationships in coronary angiography reports. Branching relationships were by far the most commonly observed (75%), followed by PATH (20%) and PART/WHOLE relationships. Recall and precision scores were 0.78 and 0.67 overall, suggesting the viability of this approach in semantic processing of clinical text.

  14. On multilabel classification methods of incompletely labeled biomedical text data.

    PubMed

    Kolesov, Anton; Kamyshenkov, Dmitry; Litovchenko, Maria; Smekalova, Elena; Golovizin, Alexey; Zhavoronkov, Alex

    2014-01-01

    Multilabel classification is often hindered by incompletely labeled training datasets; for some items of such a dataset (or even for all of them), some labels may be omitted. In this case, we cannot know whether any given item is fully and correctly labeled. When a classifier is trained directly on an incompletely labeled dataset, it performs poorly. To overcome this problem, we added an extra step, training set modification, before training the classifier. In this paper, we try two algorithms for training set modification: weighted k-nearest neighbor (WkNN) and soft supervised learning (SoftSL). Both approaches are based on similarity measurements between data vectors. We performed the experiments on AgingPortfolio (a text dataset) and then rechecked the results on the Yeast dataset (nontext genetic data). We tried SVM and RF classifiers on the original datasets and then on the modified ones. For each dataset, our experiments demonstrated that both classification algorithms performed considerably better when preceded by the training set modification step.
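
    A minimal sketch of the training-set-modification idea using a weighted k-nearest-neighbor vote: labels strongly supported by similar items are added to the incomplete label matrix before the final classifier is trained. The similarity measure, k, and the vote threshold are illustrative assumptions, not the authors' exact WkNN formulation.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def complete_labels(X, Y, k=5, threshold=0.5):
    """X: (n, d) feature matrix; Y: (n, L) 0/1 incomplete label matrix."""
    sim = cosine_similarity(X)
    np.fill_diagonal(sim, 0.0)            # an item does not vote for itself
    Y_new = Y.astype(float).copy()
    for i in range(X.shape[0]):
        nn = np.argsort(sim[i])[-k:]      # indices of the k most similar items
        w = sim[i, nn]
        if w.sum() == 0:
            continue
        votes = (w[:, None] * Y[nn]).sum(axis=0) / w.sum()
        Y_new[i] = np.maximum(Y[i], votes >= threshold)   # only add labels
    return Y_new                          # then train SVM/RF on (X, Y_new)
```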

  15. Automatic theory generation from analyst text files using coherence networks

    NASA Astrophysics Data System (ADS)

    Shaffer, Steven C.

    2014-05-01

    This paper describes a three-phase process of extracting knowledge from analyst textual reports. Phase 1 involves performing natural language processing on the source text to extract subject-predicate-object triples. In phase 2, these triples are fed into a coherence network analysis process, using a genetic algorithm optimization. Finally, the highest-value subnetworks are processed into a semantic network graph for display. Initial work on a well-known data set (a Wikipedia article on Abraham Lincoln) has shown excellent results without any specific tuning. Next, we ran the process on the SYNthetic Counter-INsurgency (SYNCOIN) data set, developed at Penn State, yielding interesting and potentially useful results.

  16. Text Classification Using the Sum of Frequency Ratios of Word and N-gram Over Categories

    NASA Astrophysics Data System (ADS)

    Suzuki, Makoto; Hirasawa, Shigeichi

    In this paper, we consider automatic text classification as a series of information processing steps and propose a new classification technique, namely, the “Frequency Ratio Accumulation Method (FRAM)”. This is a simple technique that calculates the sum of the ratios of term frequency in each category. However, it has the desirable property that feature terms can be used without an extraction procedure. We exploit this property by using “character N-grams” and “word N-grams” as feature terms. We then evaluate our technique experimentally, classifying newspaper articles from the Japanese “CD-Mainichi 2002” and the English “Reuters-21578” collections using Naive Bayes (the baseline method) and the proposed method. The results show that the classification accuracy of the proposed method improves greatly over the baseline, reaching 89.6% for Mainichi and 87.8% for Reuters. Though simple, the proposed method thus achieves very high performance; it offers a new viewpoint, is language-independent, and has considerable potential for further development.
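
    Because FRAM is fully specified by the abstract (each term contributes, per category, the ratio of its in-category frequency to its total frequency, and the document goes to the category with the largest accumulated sum), it can be sketched directly; the toy categories and documents below are invented for illustration.

```python
from collections import Counter, defaultdict

def train_fram(docs_by_category):
    """docs_by_category: {category: [list of term lists]} -> ratio tables."""
    freq = defaultdict(Counter)
    for cat, docs in docs_by_category.items():
        for doc in docs:
            freq[cat].update(doc)
    totals = Counter()
    for cat in freq:
        totals.update(freq[cat])
    return {cat: {t: freq[cat][t] / totals[t] for t in freq[cat]}
            for cat in freq}

def classify_fram(ratios, doc):
    """Accumulate each term's frequency ratio and pick the top category."""
    scores = {cat: sum(table.get(t, 0.0) for t in doc)
              for cat, table in ratios.items()}
    return max(scores, key=scores.get)

# Terms could equally be character or word N-grams, as in the paper.
ratios = train_fram({"sports": [["goal", "match"], ["team", "goal"]],
                     "finance": [["stock", "market"], ["market", "goal"]]})
print(classify_fram(ratios, ["goal", "team"]))  # -> sports
```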

  17. Why discourse structures in medical reports matter for the validity of automatically generated text knowledge bases.

    PubMed

    Hahn, U; Romacker, M; Schulz, S

    1998-01-01

    The automatic analysis of medical full texts currently suffers from neglecting text coherence phenomena such as reference relations between discourse units. This has unwarranted effects on the descriptive adequacy of medical knowledge bases automatically generated from texts. The resulting representation bias can be characterized in terms of artificially fragmented, incomplete and invalid knowledge structures. We discuss three types of textual phenomena (pronominal and nominal anaphora, as well as textual ellipsis) and outline basic methodologies for dealing with them.

  18. One Approach to Classification of Users and Automatic Clustering of Documents.

    ERIC Educational Resources Information Center

    Frants, Valery I.; And Others

    1993-01-01

    Shows how to automatically construct a classification of users and a clustering of documents and cross-references among clusters based on users' information needs. Feedback in the construction of this classification and clustering that allows for the classification to be changed to reflect changing needs of users is also described. (22 references)…

  19. Text classification for assisting moderators in online health communities.

    PubMed

    Huh, Jina; Yetisgen-Yildiz, Meliha; Pratt, Wanda

    2013-12-01

    Patients increasingly visit online health communities to get help with managing their health. The large scale of these communities makes it impossible for the moderators to engage in all conversations, yet some conversations need their expertise. Our work applies low-cost text classification methods to the new domain of determining whether a thread in an online health forum needs moderators' help. We employed a binary classifier on WebMD's online diabetes community data. To train the classifier, we considered three feature types: (1) word unigrams, (2) sentiment analysis features, and (3) thread length. We applied feature selection methods based on χ² statistics and undersampling to account for unbalanced data. We then performed a qualitative error analysis to investigate the appropriateness of the gold standard. Using sentiment analysis features, feature selection methods, and balanced training data increased the AUC value up to 0.75 and the F1-score up to 0.54, compared to the baseline of using word unigrams with no feature selection on unbalanced data (0.65 AUC and 0.40 F1-score). The error analysis uncovered additional reasons why moderators respond to patients' posts. We showed how feature selection methods and balanced training data can improve overall classification performance, and we present implications of weighing precision versus recall for assisting moderators of online health communities. Our error analysis uncovered social, legal, and ethical issues around addressing community members' needs. We also note challenges in producing a gold standard and discuss potential solutions for addressing them. Social media environments provide popular venues in which patients gain health-related information. Our work contributes to understanding scalable solutions for providing moderators' expertise in these large-scale social media environments. Copyright © 2013 Elsevier Inc. All rights reserved.

  20. Text Classification for Assisting Moderators in Online Health Communities

    PubMed Central

    Huh, Jina; Yetisgen-Yildiz, Meliha; Pratt, Wanda

    2013-01-01

    Objectives Patients increasingly visit online health communities to get help with managing their health. The large scale of these communities makes it impossible for the moderators to engage in all conversations, yet some conversations need their expertise. Our work applies low-cost text classification methods to the new domain of determining whether a thread in an online health forum needs moderators’ help. Methods We employed a binary classifier on WebMD’s online diabetes community data. To train the classifier, we considered three feature types: (1) word unigrams, (2) sentiment analysis features, and (3) thread length. We applied feature selection methods based on χ² statistics and undersampling to account for unbalanced data. We then performed a qualitative error analysis to investigate the appropriateness of the gold standard. Results Using sentiment analysis features, feature selection methods, and balanced training data increased the AUC value up to 0.75 and the F1-score up to 0.54, compared to the baseline of using word unigrams with no feature selection on unbalanced data (0.65 AUC and 0.40 F1-score). The error analysis uncovered additional reasons why moderators respond to patients’ posts. Discussion We showed how feature selection methods and balanced training data can improve the overall classification performance. We present implications of weighing precision versus recall for assisting moderators of online health communities. Our error analysis uncovered social, legal, and ethical issues around addressing community members’ needs. We also note challenges in producing a gold standard, and discuss potential solutions for addressing these challenges. Conclusion Social media environments provide popular venues in which patients gain health-related information. Our work contributes to understanding scalable solutions for providing moderators’ expertise in these large-scale social media environments. PMID:24025513
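
    A hedged sketch of the kind of pipeline both records describe: word unigrams, χ²-based feature selection, and random undersampling of the majority class. The classifier (logistic regression), the number of selected features, and the variable names are assumptions, not the study's exact configuration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def undersample(texts, labels, seed=0):
    """Randomly drop majority-class items until the two classes balance."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    minority = int(labels.sum() <= len(labels) / 2)   # 1 if positives are rarer
    keep_min = np.flatnonzero(labels == minority)
    majority = np.flatnonzero(labels != minority)
    keep_maj = rng.choice(majority, size=len(keep_min), replace=False)
    idx = np.concatenate([keep_min, keep_maj])
    return [texts[i] for i in idx], labels[idx]

model = make_pipeline(CountVectorizer(),            # word unigram counts
                      SelectKBest(chi2, k=1000),    # chi-squared selection
                      LogisticRegression(max_iter=1000))
# X_bal, y_bal = undersample(thread_texts, needs_moderator)  # placeholders
# model.fit(X_bal, y_bal)
```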

  1. Text mining for the Vaccine Adverse Event Reporting System: medical text classification using informative feature selection

    PubMed Central

    Botsis, Taxiarchis; Nguyen, Michael D; Woo, Emily Jane; Markatou, Marianthi; Ball, Robert

    2011-01-01

    Objective The US Vaccine Adverse Event Reporting System (VAERS) collects spontaneous reports of adverse events following vaccination. Medical officers review the reports and often apply standardized case definitions, such as those developed by the Brighton Collaboration. Our objective was to demonstrate a multi-level text mining approach for automated text classification of VAERS reports that could potentially reduce human workload. Design We selected 6034 VAERS reports for H1N1 vaccine that were classified by medical officers as potentially positive (Npos=237) or negative for anaphylaxis. We created a categorized corpus of text files that included the class label and the symptom text field of each report. A validation set of 1100 labeled text files was also used. Text mining techniques were applied to extract three feature sets: important keywords, low-level patterns, and high-level patterns. A rule-based classifier processed the high-level feature representation, while several machine learning classifiers were trained for the remaining two feature representations. Measurements Classifiers' performance was evaluated by macro-averaging recall, precision, and F-measure, and Friedman's test; misclassification error rate analysis was also performed. Results The rule-based classifier, boosted trees, and weighted support vector machines performed well in terms of macro-recall, though at the expense of a higher mean misclassification error rate. The rule-based classifier performed very well in terms of average sensitivity and specificity (79.05% and 94.80%, respectively). Conclusion Our validated results showed the possibility of developing effective medical text classifiers for VAERS reports by combining text mining with informative feature selection; this strategy has the potential to reduce reviewer workload considerably. PMID:21709163

  2. Text mining for the Vaccine Adverse Event Reporting System: medical text classification using informative feature selection.

    PubMed

    Botsis, Taxiarchis; Nguyen, Michael D; Woo, Emily Jane; Markatou, Marianthi; Ball, Robert

    2011-01-01

    The US Vaccine Adverse Event Reporting System (VAERS) collects spontaneous reports of adverse events following vaccination. Medical officers review the reports and often apply standardized case definitions, such as those developed by the Brighton Collaboration. Our objective was to demonstrate a multi-level text mining approach for automated text classification of VAERS reports that could potentially reduce human workload. We selected 6034 VAERS reports for H1N1 vaccine that were classified by medical officers as potentially positive (N(pos)=237) or negative for anaphylaxis. We created a categorized corpus of text files that included the class label and the symptom text field of each report. A validation set of 1100 labeled text files was also used. Text mining techniques were applied to extract three feature sets: important keywords, low-level patterns, and high-level patterns. A rule-based classifier processed the high-level feature representation, while several machine learning classifiers were trained for the remaining two feature representations. Classifiers' performance was evaluated by macro-averaging recall, precision, and F-measure, and Friedman's test; misclassification error rate analysis was also performed. The rule-based classifier, boosted trees, and weighted support vector machines performed well in terms of macro-recall, though at the expense of a higher mean misclassification error rate. The rule-based classifier performed very well in terms of average sensitivity and specificity (79.05% and 94.80%, respectively). Our validated results showed the possibility of developing effective medical text classifiers for VAERS reports by combining text mining with informative feature selection; this strategy has the potential to reduce reviewer workload considerably.

  3. Deep transfer learning for automatic target classification: MWIR to LWIR

    NASA Astrophysics Data System (ADS)

    Ding, Zhengming; Nasrabadi, Nasser; Fu, Yun

    2016-05-01

    When dealing with sparse or no labeled data in the target domain, transfer learning shows appealing performance by borrowing supervised knowledge from external domains. Recently, deep structure learning has been exploited in transfer learning due to its power in extracting effective knowledge through a multi-layer strategy, so deep transfer learning is a promising way to address cross-domain mismatch. In general, cross-domain disparity can result from differences between the source and target distributions or from different modalities, e.g., Midwave IR (MWIR) and Longwave IR (LWIR). In this paper, we propose a Weighted Deep Transfer Learning framework for automatic target classification in a task-driven fashion. Specifically, deep features and classifier parameters are obtained simultaneously for optimal classification performance. In this way, the proposed deep structures can extract more effective features with the guidance of classifier performance; on the other hand, classifier performance is further improved since it is optimized on more discriminative features. Furthermore, we build a weighted scheme to couple the source and target outputs by assigning pseudo labels to target data, so that knowledge can be transferred from the source (i.e., MWIR) to the target (i.e., LWIR). Experimental results on real databases demonstrate the superiority of the proposed algorithm in comparison with others.

  4. Construction accident narrative classification: An evaluation of text mining techniques.

    PubMed

    Goh, Yang Miang; Ubeynarayana, C U

    2017-11-01

    Learning from past accidents is fundamental to accident prevention. Thus, accident and near-miss reporting are encouraged by organizations and regulators. However, for organizations managing large safety databases, the time taken to accurately classify accident and near-miss narratives is very significant. This study evaluates the utility of various text mining classification techniques in classifying 1000 publicly available construction accident narratives obtained from the US OSHA website. The study evaluated six machine learning algorithms: support vector machine (SVM), linear regression (LR), random forest (RF), k-nearest neighbor (KNN), decision tree (DT) and Naive Bayes (NB), and found that SVM produced the best performance in classifying the test set of 251 cases. Further experimentation with tokenization of the processed text and non-linear SVM was also conducted, and a grid search was conducted on the hyperparameters of the SVM models. The best-performing classifiers were the linear SVM with unigram tokenization and the radial basis function (RBF) SVM with unigram tokenization. In view of its relative simplicity, the linear SVM is recommended. Across the 11 labels of accident causes or types, the precision of the linear SVM ranged from 0.5 to 1, recall ranged from 0.36 to 0.9, and the F1 score was between 0.45 and 0.92. The reasons for misclassification are discussed and suggestions on ways to improve performance are provided. Copyright © 2017 Elsevier Ltd. All rights reserved.
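
    The best-performing configuration reported above (a linear SVM over unigram tokens, tuned by grid search) can be sketched with scikit-learn as follows. The TF-IDF weighting, the C grid, and the placeholder variable names are assumptions for illustration, not the study's exact setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline([("tfidf", TfidfVectorizer(ngram_range=(1, 1))),  # unigrams
                 ("svm", LinearSVC())])
grid = GridSearchCV(pipe, {"svm__C": [0.01, 0.1, 1, 10]},
                    cv=5, scoring="f1_macro")
# grid.fit(narratives, cause_labels)   # placeholders for the OSHA data
# print(grid.best_params_, grid.best_score_)
```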

  5. Industry survey of automatic defect classification technologies, methods, and performance

    NASA Astrophysics Data System (ADS)

    Tobin, Kenneth W., Jr.; Lakhani, Fred; Karnowski, Thomas P.

    2002-07-01

    To be productive and profitable in a modern semiconductor fabrication environment, large amounts of manufacturing data must be collected, analyzed, and maintained. These data are increasingly being used to design new processes, control and maintain tools, and provide the information needed for rapid yield learning and prediction. Towards this end, a significant level of investment has been made over the past decade to bring to maturity viable technologies for Automatic Defect Classification (ADC) as a means of automating the recognition and analysis of defect imagery captured during in-line inspection and off-line review. ADC has been developed to automate the tedious manual inspection processes associated with defect review. Although significant advances have been achieved in the capabilities of ADC systems, concerns persist regarding the effective integration, maintenance, and usability of commercial ADC technologies. During the summer of 2001, the Oak Ridge National Laboratory and International SEMATECH performed an industry survey of eight major semiconductor device manufacturers to address the issues of ADC integration, usability, and maintenance for the various in-line inspection and review applications available today. The purpose of the survey was to determine and prioritize the issues that inhibit the effective adoption, integration, and application of ADC technology in today's fabrication environment. In this paper, we review the various ADC technologies available to the semiconductor industry and discuss the results of the survey.

  6. Automatic classification for pathological prostate images based on fractal analysis.

    PubMed

    Huang, Po-Whei; Lee, Cheng-Hsiung

    2009-07-01

    Accurate grading of prostatic carcinoma in pathological images is important for prognosis and treatment planning. Since human grading is time-consuming and subjective, this paper presents a computer-aided system to automatically grade pathological images according to the Gleason grading system, the most widespread method for the histological grading of prostate tissues. We propose two feature extraction methods based on fractal dimension to analyze variations of intensity and texture complexity in regions of interest. Each image can be classified into an appropriate grade using Bayesian, k-NN, and support vector machine (SVM) classifiers, respectively. Leave-one-out and k-fold cross-validation procedures were used to estimate the correct classification rates (CCR). Experimental results show that 91.2%, 93.7%, and 93.7% CCR can be achieved by the Bayesian, k-NN, and SVM classifiers, respectively, for a set of 205 pathological prostate images. If our fractal-based feature set is optimized by the sequential floating forward selection method, the CCR can be improved to 94.6%, 94.2%, and 94.6%, respectively, using each of the above three classifiers. Experimental results also show that our feature set is better than feature sets extracted from multiwavelets, Gabor filters, and gray-level co-occurrence matrix methods, because it is much smaller and still retains the most powerful discriminating capability for grading prostate images.

  7. EUCLID: automatic classification of proteins in functional classes by their database annotations.

    PubMed

    Tamames, J; Ouzounis, C; Casari, G; Sander, C; Valencia, A

    1998-01-01

    A tool is described for the automatic classification of sequences into functional classes using their database annotations. The Euclid system is based on a simple procedure for learning from examples provided by human experts. Euclid is freely available for academics at http://www.gredos.cnb.uam.es/EUCLID, with the corresponding dictionaries for the generation of three, eight and 14 functional classes. E-mail: valencia@cnb.uam.es. The results of the EUCLID classification of different genomes are available at http://www.sander.ebi.ac.uk/genequiz/. A detailed description of the different applications mentioned in the text is available at http://www.gredos.cnb.uam.es/EUCLID/Full_Paper.

  8. An automatic agricultural zone classification procedure for crop inventory satellite images

    NASA Technical Reports Server (NTRS)

    Parada, N. D. J. (Principal Investigator); Kux, H. J.; Velasco, F. R. D.; Deoliveira, M. O. B.

    1982-01-01

    A classification procedure for assessing crop areal proportion in multispectral scanner images is discussed. The procedure is divided into four parts: labeling, classification, proportion estimation, and evaluation. The procedure also has the following characteristics: multitemporal classification, the need for only minimal field information, and a verification capability between automatic classification and analyst labeling. The processing steps and the main algorithms involved are discussed. An outlook on the future of this technology is also presented.

  9. [Automatic classification method of star spectrum data based on constrained concept lattice].

    PubMed

    Zhang, Ji-Fu; Ma, Yang

    2010-02-01

    The concept lattice is an effective formal tool for data analysis and knowledge extraction. The constrained concept lattice, with the advantages of higher construction efficiency, practicability and pertinency, is a new concept lattice structure. For the task of automatic star spectrum classification, a classification rule mining method based on the constrained concept lattice is presented, using the notions of partition and extent supports. Experimental results, obtained by taking star spectrum data as the formal context, validate the higher classification efficiency and correctness of the method, providing an effective way to automatically classify massive star spectrum data.

  10. Automatic sleep classification according to Rechtschaffen and Kales.

    PubMed

    Anderer, Peter; Gruber, Georg; Parapatics, Silvia; Dorffner, Georg

    2007-01-01

    Conventionally, polysomnographic recordings are classified according to the rules published in 1968 by Rechtschaffen and Kales (R&K). The present paper describes an automatic classification system, embedded in an e-health solution, that has been developed and validated on a large database of healthy controls and sleep-disturbed patients. The Somnolyzer 24x7 adheres to the decision rules for visual scoring as closely as possible and includes a structured quality control procedure by a human expert. The final system consists of a raw data quality check, a feature extraction algorithm (density and intensity of sleep/wake-related patterns such as sleep spindles, delta waves, and slow and rapid eye movements), a feature matrix plausibility check, a classifier designed as an expert system, and a rule-based smoothing procedure for the start and the end of stages REM and 2. The validation, based on 286 recordings from both normal healthy subjects aged 20 to 95 years and patients suffering from organic or nonorganic sleep disorders, demonstrated an overall epoch-by-epoch agreement of 80% (Cohen's kappa: 0.72) between the Somnolyzer 24x7 and the human expert scoring, as compared with an inter-rater reliability of 77% (Cohen's kappa: 0.68) between two human experts. Two Somnolyzer 24x7 analyses (each including a structured quality control by a human expert) revealed an inter-rater reliability close to 1 (Cohen's kappa: 0.991). Moreover, correlation analysis of R&K-derived target variables revealed similar (in 36 out of 38 variables even higher) agreement between Somnolyzer 24x7 and expert evaluations as compared to the concordance between two human experts. The validation study thus proved the high reliability and validity of the Somnolyzer 24x7, both at the epoch-by-epoch and at the target-variable level. These results demonstrate the applicability of the Somnolyzer 24x7 in clinical routine and sleep studies.
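
    The two headline figures in this validation, epoch-by-epoch agreement and Cohen's kappa, can be computed as below; the stage sequences are toy placeholders, not Somnolyzer output.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

expert    = np.array(["W", "S2", "S2", "REM", "REM", "W"])
automatic = np.array(["W", "S2", "S1", "REM", "REM", "W"])

agreement = (expert == automatic).mean()       # fraction of matching epochs
kappa = cohen_kappa_score(expert, automatic)   # chance-corrected agreement
print(f"agreement={agreement:.2f}, kappa={kappa:.2f}")
```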

  11. Automatic detection of adverse events to predict drug label changes using text and data mining techniques.

    PubMed

    Gurulingappa, Harsha; Toldo, Luca; Rajput, Abdul Mateen; Kors, Jan A; Taweel, Adel; Tayrouz, Yorki

    2013-11-01

    The aim of this study was to assess the impact of automatically detected adverse event signals from text and open-source data on the prediction of drug label changes. Open-source adverse effect data were collected from the FAERS, Yellow Card and SIDER databases. A shallow linguistic relation extraction system (JSRE) was applied to extract adverse effects from MEDLINE case reports. A statistical approach was applied to the extracted datasets for signal detection and the subsequent prediction of label changes issued for 29 drugs by the UK Regulatory Authority in 2009. 76% of drug label changes were automatically predicted. Of these, 6% of drug label changes were detected only by text mining. JSRE enabled the precise identification of four adverse drug events from MEDLINE that were otherwise undetectable. Changes in drug labels can be predicted automatically using data and text mining techniques. Text mining technology is mature and well placed to support pharmacovigilance tasks. Copyright © 2013 John Wiley & Sons, Ltd.

  12. Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text

    PubMed Central

    2013-01-01

    Background Most of the institutional and research information in the biomedical domain is available in the form of English text. Even in countries where English is an official language, such as the United States, language can be a barrier to accessing biomedical information for non-native speakers. Recent progress in machine translation suggests that this technique could help make English texts accessible to speakers of other languages. However, the lack of adequate specialized corpora needed to train statistical models currently limits the quality of automatic translations in the biomedical domain. Results We show how a large parallel corpus can be obtained automatically for the biomedical domain, using the MEDLINE database. The corpus generated in this work comprises article titles obtained from MEDLINE and abstract text automatically retrieved from journal websites, which substantially extends the corpora used in previous work. After assessing the quality of the corpus for two language pairs (English/French and English/Spanish), we use the Moses package to train a statistical machine translation model that outperforms previous models for the automatic translation of biomedical text. Conclusions We have built translation data sets in the biomedical domain that can easily be extended to other languages available in MEDLINE. These sets can successfully be applied to train statistical machine translation models. While further progress should be made by incorporating out-of-domain corpora and domain-specific lexicons, we believe that this work improves the automatic translation of biomedical texts. PMID:23631733

  13. Automatic Classification of Book Material Represented by Back-of-the-Book Index.

    ERIC Educational Resources Information Center

    Enser, P. G. B.

    1985-01-01

    Investigates techniques for automatic classification of book material focusing on: computer-based surrogation of monographic material, book surrogate clustering on basis of content association, evaluation of resultant classifications. Test collection (250 books) is described with surrogation by means of back-of-the-book index, table of contents,…

  14. Automatic Method of Supernovae Classification by Modeling Human Procedure of Spectrum Analysis

    NASA Astrophysics Data System (ADS)

    Módolo, Marcelo; Rosa, Reinaldo; Guimaraes, Lamartine N. F.

    2016-07-01

    The classification of a recently discovered supernova must be done as quickly as possible in order to define what information will be captured and analyzed in the following days. This classification is not trivial, and only a few expert astronomers are able to perform it. This paper proposes an automatic method that models the human classification procedure, using Multilayer Perceptron Neural Networks to analyze the supernova spectra. Experiments were performed using different pre-processing steps and multiple neural network configurations to identify the classic types of supernovae. Significant results were obtained, indicating the viability of using this method in places that have no specialist or that require an automatic analysis.

  15. Automated ancillary cancer history classification for mesothelioma patients from free-text clinical reports

    PubMed Central

    Wilson, Richard A.; Chapman, Wendy W.; DeFries, Shawn J.; Becich, Michael J.; Chapman, Brian E.

    2010-01-01

    Background: Clinical records are often unstructured, free-text documents that create information extraction challenges and costs. Healthcare delivery and research organizations, such as the National Mesothelioma Virtual Bank, require the aggregation of both structured and unstructured data types. Natural language processing offers techniques for automatically extracting information from unstructured, free-text documents. Methods: Five hundred and eight history and physical reports from mesothelioma patients were split into development (208) and test sets (300). A reference standard was developed and each report was annotated by experts with regard to the patient’s personal history of ancillary cancer and family history of any cancer. The Hx application was developed to process reports, extract relevant features, perform reference resolution and classify them with regard to cancer history. Two methods, Dynamic-Window and ConText, for extracting information were evaluated. Hx’s classification responses using each of the two methods were measured against the reference standard. The average Cohen’s weighted kappa served as the human benchmark in evaluating the system. Results: Hx had a high overall accuracy with each method, scoring 96.2%. F-measures using the Dynamic-Window and ConText methods were 91.8% and 91.6%, which were comparable to the human benchmark of 92.8%. For the personal history classification, Dynamic-Window scored highest with 89.2%, and for the family history classification, ConText scored highest with 97.6%; both methods were comparable to the human benchmarks of 88.3% and 97.2%, respectively. Conclusion: We evaluated an automated application’s performance in classifying a mesothelioma patient’s personal and family history of cancer from clinical reports. To do so, the Hx application must process reports, identify cancer concepts, distinguish the known mesothelioma from ancillary cancers, recognize negation, perform reference resolution and determine the experiencer.

  16. Automated ancillary cancer history classification for mesothelioma patients from free-text clinical reports.

    PubMed

    Wilson, Richard A; Chapman, Wendy W; Defries, Shawn J; Becich, Michael J; Chapman, Brian E

    2010-10-11

    Clinical records are often unstructured, free-text documents that create information extraction challenges and costs. Healthcare delivery and research organizations, such as the National Mesothelioma Virtual Bank, require the aggregation of both structured and unstructured data types. Natural language processing offers techniques for automatically extracting information from unstructured, free-text documents. Five hundred and eight history and physical reports from mesothelioma patients were split into development (208) and test sets (300). A reference standard was developed and each report was annotated by experts with regard to the patient's personal history of ancillary cancer and family history of any cancer. The Hx application was developed to process reports, extract relevant features, perform reference resolution and classify them with regard to cancer history. Two methods, Dynamic-Window and ConText, for extracting information were evaluated. Hx's classification responses using each of the two methods were measured against the reference standard. The average Cohen's weighted kappa served as the human benchmark in evaluating the system. Hx had a high overall accuracy with each method, scoring 96.2%. F-measures using the Dynamic-Window and ConText methods were 91.8% and 91.6%, which were comparable to the human benchmark of 92.8%. For the personal history classification, Dynamic-Window scored highest with 89.2%, and for the family history classification, ConText scored highest with 97.6%; both methods were comparable to the human benchmarks of 88.3% and 97.2%, respectively. We evaluated an automated application's performance in classifying a mesothelioma patient's personal and family history of cancer from clinical reports. To do so, the Hx application must process reports, identify cancer concepts, distinguish the known mesothelioma from ancillary cancers, recognize negation, perform reference resolution and determine the experiencer.

  17. Automatic classification of sleep stages based on the time-frequency image of EEG signals.

    PubMed

    Bajaj, Varun; Pachori, Ram Bilas

    2013-12-01

    In this paper, a new method for automatic sleep stage classification based on the time-frequency image (TFI) of electroencephalogram (EEG) signals is proposed. Automatic classification of sleep stages is an important part of the diagnosis and treatment of sleep disorders. The smoothed pseudo Wigner-Ville distribution (SPWVD) based time-frequency representation (TFR) of the EEG signal is used to obtain the TFI. The TFI is segmented based on the frequency bands of the EEG rhythms. The features derived from the histogram of the segmented TFI are used as the input feature set to multiclass least squares support vector machines (MC-LS-SVM), together with radial basis function (RBF), Mexican hat wavelet, and Morlet wavelet kernel functions, for automatic classification of sleep stages from EEG signals. Experimental results are presented to show the effectiveness of the proposed method.

  18. Automatic Cataloguing and Searching for Retrospective Data by Use of OCR Text.

    ERIC Educational Resources Information Center

    Tseng, Yuen-Hsien

    2001-01-01

    Describes efforts in supporting information retrieval from OCR (optical character recognition) degraded text. Reports on approaches used in an automatic cataloging and searching contest for books in multiple languages, including a vector space retrieval model, an n-gram indexing method, and a weighting scheme; and discusses problems of Asian…

  20. Investigation into Text Classification With Kernel Based Schemes

    DTIC Science & Technology

    2010-03-01

    ...are represented as a term-document matrix, common evaluation metrics, and the software package Text to Matrix Generator (TMG). The classifier...AND METRICS This chapter introduces the indexing capabilities of the Text to Matrix Generator (TMG) Toolbox. Specific attention is placed on the

  1. A case-comparison study of automatic document classification utilizing both serial and parallel approaches

    NASA Astrophysics Data System (ADS)

    Wilges, B.; Bastos, R. C.; Mateus, G. P.; Dantas, M. A. R.

    2014-10-01

    A well-known problem faced by any organization nowadays is the high volume of available data and the processing required to transform this volume into differential information. In this study, a case-comparison of an automatic document classification (ADC) approach is presented, utilizing both serial and parallel paradigms. The serial approach was implemented by adopting the RapidMiner software tool, which is recognized as the world-leading open-source system for data mining. On the other hand, considering the MapReduce programming model, the Hadoop software environment was used. The main goal of this case-comparison study is to explore the differences between these two paradigms, especially when large volumes of data, such as Web text documents, are utilized to build a category database. In the literature, many studies report that distributed processing of unstructured documents using Hadoop yields efficient results. Results from our research indicate a threshold for such efficiency.

  2. Automatic extraction of gene/protein biological functions from biomedical text.

    PubMed

    Koike, Asako; Niwa, Yoshiki; Takagi, Toshihisa

    2005-04-01

    With the rapid advancement of biomedical science and the development of high-throughput analysis methods, the extraction of various types of information from biomedical text has become critical. Since automatic functional annotations of genes are quite useful for interpreting large amounts of high-throughput data efficiently, the demand for automatically extracting information related to gene functions from text has been increasing. We have developed a method for automatically extracting the biological process functions of genes/proteins/families, based on Gene Ontology (GO), from text using a shallow parser and sentence structure analysis techniques. When gene/protein/family names and their functions are described in ACTOR (doer of action) and OBJECT (receiver of action) relationships, the corresponding GO-IDs are assigned to the genes/proteins/families. The gene/protein/family names are recognized using the dictionaries developed by our group. To achieve wide recognition of gene/protein/family functions, we semi-automatically gather functional terms based on GO using co-occurrence, collocation similarities and rule-based techniques. A preliminary experiment demonstrated that our method has an estimated recall of 54-64% with a precision of 91-94% for functions actually described in abstracts. When applied to PubMed, it extracted over 190,000 gene-GO relationships and 150,000 family-GO relationships for major eukaryotes.

  3. Automatic classification of protein structures using physicochemical parameters.

    PubMed

    Mohan, Abhilash; Rao, M Divya; Sunderrajan, Shruthi; Pennathur, Gautam

    2014-09-01

    Protein classification is the first step towards functional annotation; the SCOP and Pfam databases are currently the most relevant protein classification schemes. However, the disproportion between the number of three-dimensional (3D) protein structures generated and their classification into relevant superfamilies/families emphasizes the need for automated classification schemes. Predicting the function of novel proteins based on sequence information alone has proven to be a major challenge. The present study focuses on the use of physicochemical parameters in conjunction with machine learning algorithms (Naive Bayes, Decision Trees, Random Forest and Support Vector Machines) to classify proteins into their respective SCOP superfamily/Pfam family, using sequence-derived information. Spectrophores™, a 1D descriptor of the 3D molecular field surrounding a structure, were used as a benchmark against which to compare the performance of the physicochemical parameters. The machine learning algorithms were modified to select features based on information gain for each SCOP superfamily/Pfam family. The effect of combining physicochemical parameters and spectrophores on classification accuracy (CA) was studied. Machine learning algorithms trained with the physicochemical parameters consistently classified SCOP superfamilies and Pfam families with a classification accuracy above 90%, while spectrophores performed with a CA of around 85%. Feature selection improved classification accuracy for both the physicochemical-parameter and spectrophore based machine learning algorithms, while combining both attributes resulted in a marginal loss of performance. Physicochemical parameters were able to classify proteins from both schemes with classification accuracy ranging from 90% to 96%. These results suggest the usefulness of this method in classifying proteins from amino acid sequences.
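
    A small sketch of the general recipe: sequence-derived physicochemical descriptors, information-gain-style feature selection (approximated here by mutual information), and a standard classifier. The toy descriptor function and parameter values are stand-ins; the study's actual feature set and the Spectrophore descriptor are not reproduced.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline

def physicochemical_features(seq):
    """Toy descriptors: length plus fractions of a few residue groups."""
    hydrophobic, charged = set("AVLIMFWY"), set("DEKRH")
    n = len(seq)
    return [n,
            sum(c in hydrophobic for c in seq) / n,
            sum(c in charged for c in seq) / n,
            seq.count("C") / n]                       # cysteine fraction

model = make_pipeline(SelectKBest(mutual_info_classif, k=3),
                      RandomForestClassifier(n_estimators=200, random_state=0))
# X = [physicochemical_features(s) for s in sequences]   # placeholders
# model.fit(X, superfamily_labels)
```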

  4. Automatic counting and classification of bacterial colonies using hyperspectral imaging

    USDA-ARS?s Scientific Manuscript database

    Detection and counting of bacterial colonies on agar plates is a routine microbiology practice to get a rough estimate of the number of viable cells in a sample. There have been a variety of different automatic colony counting systems and software algorithms mainly based on color or gray-scale pictu...

  5. Automatic detection of inconsistencies between free text and coded data in Sarcoma discharge letters.

    PubMed

    Rinott, Ruty; Torresani, Michele; Bertulli, Rossella; Goldsteen, Abigail; Casali, Paolo; Carmeli, Boaz; Slonim, Noam

    2012-01-01

    Discordance between data stored in Electronic Health Records (EHR) may have a harmful effect on patient care. Automatic identification of such situations is an important yet challenging task, especially when the discordance involves information stored in free text fields. Here we present a method to automatically detect inconsistencies between data stored in free text and related coded fields. Using EHR data we train an ensemble of classifiers to predict the value of coded fields from the free text fields. Cases in which the classifiers predict with high confidence a code different from the clinicians' choice are marked as potential inconsistencies. Experimental results over discharge letters of sarcoma patients, verified by a domain expert, demonstrate the validity of our method.
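
    The detection idea lends itself to a short sketch: train a classifier to predict the coded field from the free text, then flag records where a high-confidence prediction disagrees with the stored code. The model choice and the 0.9 confidence threshold are assumptions, not the authors' ensemble.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def find_inconsistencies(free_texts, stored_codes, threshold=0.9):
    """Flag records where the text model confidently contradicts the code."""
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(free_texts, stored_codes)   # in practice, use held-out predictions
    flagged = []
    for i, probs in enumerate(clf.predict_proba(free_texts)):
        best = probs.argmax()
        predicted = clf.classes_[best]
        if predicted != stored_codes[i] and probs[best] >= threshold:
            flagged.append((i, stored_codes[i], predicted, probs[best]))
    return flagged
```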

  6. A Feature Selection Method Based on Fisher's Discriminant Ratio for Text Sentiment Classification

    NASA Astrophysics Data System (ADS)

    Wang, Suge; Li, Deyu; Wei, Yingjie; Li, Hongxia

    With the rapid growth of e-commerce, product reviews on the Web have become an important information source for customers' decision making when they intend to buy a product. As the reviews are often too many for customers to go through, how to automatically classify them into different sentiment orientation categories (i.e., positive/negative) has become a research problem. In this paper, an effective feature selection method based on Fisher's discriminant ratio is proposed for product review text sentiment classification. To validate the effectiveness of the proposed method, we compared it with methods based on information gain and mutual information, with a support vector machine adopted as the classifier. Six subexperiments were conducted by combining the different feature selection methods with two kinds of candidate feature sets. On 1006 car review documents, the experimental results indicate that Fisher's discriminant ratio based on word frequency estimation achieves the best performance, with an F value of 83.3%, when the candidate features are the words that appear in both positive and negative texts.
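
    For a single feature over two classes, Fisher's discriminant ratio is the squared difference of the class means divided by the sum of the class variances; ranking words by this quantity gives the feature selection step described above. The word-frequency inputs in the usage comment are illustrative placeholders.

```python
import numpy as np

def fisher_discriminant_ratio(x_pos, x_neg):
    """FDR of one feature: (mean difference)^2 / (sum of class variances)."""
    m1, m2 = np.mean(x_pos), np.mean(x_neg)
    v1, v2 = np.var(x_pos), np.var(x_neg)
    return (m1 - m2) ** 2 / (v1 + v2 + 1e-12)   # eps guards against 0/0

# Rank candidate words by FDR and keep the top k as features, e.g.:
# fdr = {w: fisher_discriminant_ratio(freq_pos[w], freq_neg[w]) for w in vocab}
# selected = sorted(fdr, key=fdr.get, reverse=True)[:k]
```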

  7. Methods for automatic cloud classification from MODIS data

    NASA Astrophysics Data System (ADS)

    Astafurov, V. G.; Kuriyanovich, K. V.; Skorokhodov, A. V.

    2016-12-01

    In this paper, different texture-analysis methods are used to describe different cloud types in MODIS satellite images. A universal technique is suggested for forming efficient sets of textural features, based on a truncated feature-scanning algorithm, for different classifiers built on neural networks and cluster-analysis methods. Efficient sets of textural features are given for the classifiers considered, and the cloud-image classification results are discussed. The classification methods used in this work are described: a probabilistic neural network, K-nearest neighbors, a self-organizing Kohonen network, fuzzy C-means, and a density clustering algorithm. The algorithm based on a probabilistic neural network is shown to be the most efficient: it provides the best classification reliability for 25 cloud types and recognizes 11 cloud types with a probability greater than 0.7. As an example, cloud classification results are given for the Tomsk region. The classifications were carried out on full-size satellite cloud images using the different methods; the results agree with each other and agree well with observational data from ground-based weather stations.

  8. Automatic apical view classification of echocardiograms using a discriminative learning dictionary.

    PubMed

    Khamis, Hanan; Zurakhov, Grigoriy; Azar, Vered; Raz, Adi; Friedman, Zvi; Adam, Dan

    2017-02-01

    As a step towards fully automatic cardiac functional assessment of echocardiograms, automatic classification of their standard views is essential as a pre-processing stage. The similarity among three of the routinely acquired longitudinal scans, apical two-chamber (A2C), apical four-chamber (A4C) and apical long-axis (ALX), together with the noise commonly inherent to these scans, makes the classification a challenge. Here we introduce a multi-stage classification algorithm that employs spatio-temporal feature extraction (Cuboid Detector) and supervised dictionary learning (LC-KSVD) to enhance the automatic recognition and classification accuracy of echocardiograms. The algorithm incorporates both discrimination and labelling information to allow a discriminative and sparse representation of each view. The advantage of spatio-temporal feature extraction over purely spatial processing is then validated. A set of 309 clinical clips (103 for each view) was labeled by two experts. A subset of 70 clips of each class was used as a training set and the rest as a test set. The recognition accuracies achieved were 97%, 91% and 97% for A2C, A4C and ALX respectively, with an average recognition rate of 95%. Thus, automatic classification of echocardiogram views seems promising, despite the inter-view similarity between the classes and the intra-view variability among clips belonging to the same class.

  9. A simple semi-automatic approach for land cover classification from multispectral remote sensing imagery.

    PubMed

    Jiang, Dong; Huang, Yaohuan; Zhuang, Dafang; Zhu, Yunqiang; Xu, Xinliang; Ren, Hongyan

    2012-01-01

    Land cover data represent a fundamental data source for various types of scientific research. The classification of land cover based on satellite data is a challenging task, and an efficient classification method is needed. In this study, an automatic scheme is proposed for the classification of land use from multispectral remote sensing images based on change detection and a semi-supervised classifier. The satellite image can be automatically classified using only the prior land cover map and existing images; human involvement is therefore reduced to a minimum, ensuring the operability of the method. The method was tested in the Qingpu District of Shanghai, China. Using Environment Satellite 1 (HJ-1) images of 2009 with 30 m spatial resolution, the areas were classified into five main types of land cover based on previous land cover data and spectral features. The results agreed well with validation land cover maps, with a Kappa value of 0.79 and statistical area biases of less than 6%. This study proposed a simple semi-automatic approach for land cover classification using prior maps with satisfactory accuracy, which integrates the accuracy of visual interpretation with the performance of automatic classification methods. The method can be used for land cover mapping in areas lacking ground reference information, or for conveniently identifying regions of rapid land cover change (such as rapid urbanization).

  10. A Simple Semi-Automatic Approach for Land Cover Classification from Multispectral Remote Sensing Imagery

    PubMed Central

    Jiang, Dong; Huang, Yaohuan; Zhuang, Dafang; Zhu, Yunqiang; Xu, Xinliang; Ren, Hongyan

    2012-01-01

    Land cover data represent a fundamental data source for various types of scientific research. The classification of land cover based on satellite data is a challenging task, and an efficient classification method is needed. In this study, an automatic scheme is proposed for the classification of land use from multispectral remote sensing images based on change detection and a semi-supervised classifier. The satellite image can be automatically classified using only the prior land cover map and existing images; human involvement is therefore reduced to a minimum, ensuring the operability of the method. The method was tested in the Qingpu District of Shanghai, China. Using Environment Satellite 1 (HJ-1) images of 2009 with 30 m spatial resolution, the areas were classified into five main types of land cover based on previous land cover data and spectral features. The results agreed well with validation land cover maps, with a Kappa value of 0.79 and statistical area biases of less than 6%. This study proposed a simple semi-automatic approach for land cover classification using prior maps with satisfactory accuracy, which integrates the accuracy of visual interpretation with the performance of automatic classification methods. The method can be used for land cover mapping in areas lacking ground reference information, or for conveniently identifying regions of rapid land cover change (such as rapid urbanization). PMID:23049886
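
    A schematic sketch of the prior-map idea shared by both records: pixels whose spectra changed little between dates keep their previous label and serve as training data for relabeling the changed pixels. The change threshold and the k-NN classifier are assumptions, not the paper's exact semi-supervised scheme.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def classify_with_prior(prev_img, curr_img, prior_labels, change_thresh=0.1):
    """prev_img/curr_img: (n_pixels, n_bands); prior_labels: (n_pixels,)."""
    diff = np.linalg.norm(curr_img - prev_img, axis=1)
    stable = diff < change_thresh                     # presumed unchanged pixels
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(curr_img[stable], prior_labels[stable])   # train on stable pixels
    labels = prior_labels.copy()
    labels[~stable] = clf.predict(curr_img[~stable])  # relabel changed pixels
    return labels
```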

  11. Automatic Source Classification in Digitised First Byurakan Survey

    NASA Astrophysics Data System (ADS)

    Topinka, Martin; Mickaelian, Areg; Nesci, Roberto; Rossi, Corinne

    2017-06-01

    The Digitised First Byurakan Survey (DFBS) provides low-dispersion optical spectra for about 24 million sources. A two-step machine learning algorithm based on similarities to predefined templates is applied to automatically select different classes of rare objects in the dataset, for example late-type stars, quasars and white dwarfs. Identifying outliers from the groups of common astrophysical objects may lead to the discovery of rare objects, such as gamma-ray burst afterglows.

  12. Low-cost real-time automatic wheel classification system

    NASA Astrophysics Data System (ADS)

    Shabestari, Behrouz N.; Miller, John W. V.; Wedding, Victoria

    1992-11-01

    This paper describes the design and implementation of a low-cost machine vision system for identifying various types of automotive wheels, which are manufactured in several styles and sizes. In this application, a variety of wheels travel on a conveyor in random order through a number of processing steps. One of these processes requires identification of the wheel type, which had previously been performed manually by an operator. A vision system was designed to provide the required identification. The system consisted of an annular illumination source, a CCD TV camera, a frame grabber, and a 386-compatible computer. Statistical pattern recognition techniques were used to provide robust classification as well as a simple means for adding new wheel designs to the system. Maintenance of the system can be performed by plant personnel with minimal training. The basic steps of identification are image acquisition, segmentation of the regions of interest, extraction of selected features, and classification. The vision system has been installed in a plant and has proven to be extremely effective. The system correctly identifies wheels at rates of up to 30 per minute, regardless of rotational orientation in the camera's field of view. Correct classification can even be achieved if a portion of the wheel is blocked off from the camera. Significant cost savings have been achieved through a reduction in scrap associated with incorrect manual classification, as well as a reduction of labor in a tedious task.

  13. Automatic parquet block sorting using real-time spectral classification

    NASA Astrophysics Data System (ADS)

    Astrom, Anders; Astrand, Erik; Johansson, Magnus

    1999-03-01

    This paper presents a real-time spectral classification system based on the PGP spectrograph and a smart image sensor. The PGP is a spectrograph which extracts the spectral information from a scene and projects the information onto an image sensor, a method often referred to as imaging spectroscopy. The classification is based on linear models and categorizes a number of pixels along a line. Previous systems adopting this method have used standard sensors, which often resulted in poor performance. The new system, however, is based on a patented near-sensor classification method which exploits analogue features of the smart image sensor. The method reduces the enormous amount of data to be processed at an early stage, thus making true real-time spectral classification possible. The system has been evaluated on hardwood parquet boards with very good results. The color defects considered in the experiments were blue stain, white sapwood, yellow decay and red decay. In addition to these four defect classes, a reference class was used to indicate correct surface color. The system calculates a statistical measure for each parquet block, giving the percentage of defective pixels. The patented method makes it possible to run at very high speeds with high spectral discrimination ability. Using a powerful illuminator, the system can run at line frequencies exceeding 2000 lines/s. This opens up the possibility of maintaining high production speed while still measuring with good resolution.

  14. Automatic Classification of Cetacean Vocalizations Using an Aural Classifier

    DTIC Science & Technology

    2013-09-30

    The classification features were inspired by research directed at discriminating the timbre of different musical instruments, a passive classification problem, which suggests the method should be able to classify marine mammal vocalizations since these calls possess many of the acoustic attributes of music.

  15. Automatic Classification of Cetacean Vocalizations Using an Aural Classifier

    DTIC Science & Technology

    2012-09-30

    The classification approach was inspired by research directed at discriminating the timbre of different musical instruments, a passive classification problem, which suggests it should be able to classify marine mammal vocalizations since these calls possess many of the acoustic attributes of music.

  16. The Fractal Patterns of Words in a Text: A Method for Automatic Keyword Extraction.

    PubMed

    Najafi, Elham; Darooneh, Amir H

    2015-01-01

    A text can be considered as a one-dimensional array of words. The locations of each word type in this array form a fractal pattern with a certain fractal dimension. We observe that important words responsible for conveying the meaning of a text have dimensions considerably different from one, while the fractal dimensions of unimportant words are close to one. We introduce an index quantifying the importance of the words in a given text using their fractal dimensions and then rank them according to their importance. This index measures the difference between the fractal pattern of a word in the original text and in a shuffled version. Because the shuffled text is meaningless (i.e., words have no importance), the difference between the original and shuffled text can be used to ascertain the degree of fractality. The degree of fractality may be used for automatic keyword detection. Words with a degree of fractality higher than a threshold value are taken as the retrieved keywords of the text. We measure the efficiency of our method for keyword extraction, comparing our proposed method with two other well-known methods of automatic keyword extraction.

  17. The Fractal Patterns of Words in a Text: A Method for Automatic Keyword Extraction

    PubMed Central

    Najafi, Elham; Darooneh, Amir H.

    2015-01-01

    A text can be considered as a one-dimensional array of words. The locations of each word type in this array form a fractal pattern with a certain fractal dimension. We observe that important words responsible for conveying the meaning of a text have dimensions considerably different from one, while the fractal dimensions of unimportant words are close to one. We introduce an index quantifying the importance of the words in a given text using their fractal dimensions and then rank them according to their importance. This index measures the difference between the fractal pattern of a word in the original text and in a shuffled version. Because the shuffled text is meaningless (i.e., words have no importance), the difference between the original and shuffled text can be used to ascertain the degree of fractality. The degree of fractality may be used for automatic keyword detection. Words with a degree of fractality higher than a threshold value are taken as the retrieved keywords of the text. We measure the efficiency of our method for keyword extraction, comparing our proposed method with two other well-known methods of automatic keyword extraction. PMID:26091207

  18. Realizing parameterless automatic classification of remote sensing imagery using ontology engineering and cyberinfrastructure techniques

    NASA Astrophysics Data System (ADS)

    Sun, Ziheng; Fang, Hui; Di, Liping; Yue, Peng

    2016-09-01

    Fully automatic image classification, with no input parameter values, has long been an unattainable goal for remote sensing experts, who usually spend hours tuning the input parameters of classification algorithms in order to obtain the best results. With the rapid development of knowledge engineering and cyberinfrastructure, many data processing and knowledge reasoning capabilities have become accessible, shareable, and interoperable online. Building on these recent improvements, this paper presents the idea of parameterless automatic classification, which requires only an image and automatically outputs a labeled vector; no parameters or operations are needed from end consumers. An approach is proposed to realize the idea. It adopts an ontology database to store the experience of tuning values for classifiers. A sample database is used to record training samples of image segments. Geoprocessing Web services are used as functional blocks to carry out the basic classification steps. Workflow technology turns the overall image classification into a fully automatic process. A Web-based prototype system named PACS (Parameterless Automatic Classification System) was implemented, and a number of images were fed into it for evaluation. The results show that the approach can automatically classify remote sensing images with fairly good average accuracy, and they indicate that the classified results will be more accurate if the two databases are of higher quality. Once the databases accumulate as much experience and as many samples as a human expert possesses, the approach should produce results of quality similar to what a human expert can achieve. Since the approach is fully automatic and parameterless, it can not only relieve remote sensing workers from the heavy and time-consuming parameter tuning work, but also significantly shorten the waiting time for consumers and facilitate them to engage in image

  19. Drug related webpages classification using images and text information based on multi-kernel learning

    NASA Astrophysics Data System (ADS)

    Hu, Ruiguang; Xiao, Liping; Zheng, Wenjuan

    2015-12-01

    In this paper, multi-kernel learning (MKL) is used for drug-related webpage classification. First, body text and image-label text are extracted through HTML parsing, and valid images are chosen by the FOCARSS algorithm. Second, a text-based bag-of-words (BOW) model is used to generate the text representation, and an image-based BOW model is used to generate the image representation. Finally, the text and image representations are fused using several methods. Experimental results demonstrate that the classification accuracy of MKL is higher than that of all other fusion methods at both the decision level and the feature level, and much higher than the accuracy of single-modal classification.
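
    As a rough sketch of the two fusion baselines mentioned above, the snippet below contrasts feature-level fusion (concatenating the text and image representations) with decision-level fusion (averaging per-modality posteriors) using scikit-learn. MKL itself is not part of scikit-learn, and the feature matrices and labels here are hypothetical stand-ins for the BOW representations.

        import numpy as np
        from sklearn.svm import SVC

        rng = np.random.default_rng(0)
        X_text = rng.random((40, 50))    # hypothetical text BOW features
        X_img = rng.random((40, 30))     # hypothetical image BOW features
        y = rng.integers(0, 2, 40)       # 1 = drug-related, 0 = benign

        # Feature-level fusion: concatenate both representations, train one SVM.
        clf_feat = SVC(kernel="linear", probability=True)
        clf_feat.fit(np.hstack([X_text, X_img]), y)

        # Decision-level fusion: train one SVM per modality, average posteriors.
        clf_t = SVC(kernel="linear", probability=True).fit(X_text, y)
        clf_i = SVC(kernel="linear", probability=True).fit(X_img, y)
        posterior = (clf_t.predict_proba(X_text) + clf_i.predict_proba(X_img)) / 2
        fused_pred = posterior.argmax(axis=1)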

  20. Automatic Fibrosis Quantification By Using a k-NN Classificator

    DTIC Science & Technology

    2001-10-25

    This work presents an automatic algorithm to measure fibrosis in muscle sections of mdx mice, a mutant strain used as a model of Duchenne muscular dystrophy.

  1. On the automatic classification of rain patterns on radar images

    NASA Astrophysics Data System (ADS)

    Pawlina Bonati, Apolonia

    The automation of the process of identification and classification of rain patterns on radar-derived images is approached using tools of digital image interpretation adapted to the specific application. The formal characterization of rain patterns and their partition into classes related to the type of precipitation is the main problem addressed in the paper, since no standard, well-established criteria for such a classification exist. The digital maps of rain on a horizontal plane derived from three-dimensional radar scans are processed by the interpretation package, which identifies and classifies the rain structures present on the map. The results generated by this package are illustrated in the paper and offered for discussion. The interpretation procedure is tailored for radio-meteorology applications, but the method is adaptable to the requirements of other fields.

  2. [Automatic classification method of star spectrum data based on classification pattern tree].

    PubMed

    Zhao, Xu-Jun; Cai, Jiang-Hui; Zhang, Ji-Fu; Yang, Hai-Feng; Ma, Yang

    2013-10-01

    Frequent patterns, which appear often in a data set, play an important role in data mining. For stellar spectrum classification tasks, a classification rule mining method based on a classification pattern tree is presented on the basis of frequent patterns. The procedure is as follows. First, a new tree structure, the classification pattern tree, is introduced, based on the different frequencies of stellar spectral attributes in the database and their differing importance for classification. The related concepts and the construction method of the classification pattern tree are also described in this paper. Then, the characteristics of the stellar spectrum are mapped to the classification pattern tree, which is traversed both top-down and bottom-up to extract the classification rules. Meanwhile, the concept of pattern capability is introduced to adjust the number of classification rules and improve the construction efficiency of the classification pattern tree. Finally, the SDSS (Sloan Digital Sky Survey) stellar spectral data provided by the National Astronomical Observatory are used to verify the accuracy of the method. The results show that a higher classification accuracy was achieved.
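
    The frequency-ordered tree construction described above can be sketched in the spirit of an FP-tree: attributes are sorted by their global frequency in the database before insertion, so frequent attributes share prefixes near the root. The record format and the min_support cutoff below are illustrative assumptions, not details taken from the paper.

        from collections import defaultdict

        class Node:
            def __init__(self):
                self.children = {}   # attribute -> Node
                self.count = 0       # records passing through this node

        def build_pattern_tree(records, min_support=2):
            # Count global attribute frequencies, mirroring how the paper orders
            # stellar spectral attributes by how often they occur in the database.
            freq = defaultdict(int)
            for rec in records:
                for attr in rec:
                    freq[attr] += 1
            root = Node()
            for rec in records:
                attrs = sorted((a for a in rec if freq[a] >= min_support),
                               key=lambda a: -freq[a])
                node = root
                for attr in attrs:
                    node = node.children.setdefault(attr, Node())
                    node.count += 1
            return root

        root = build_pattern_tree([{"A", "B"}, {"A", "C"}, {"A", "B", "D"}])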

  3. Automatic classification of spectral units in the Aristarchus plateau

    NASA Astrophysics Data System (ADS)

    Erard, S.; Le Mouelic, S.; Langevin, Y.

    1999-09-01

    A reduction scheme has recently been proposed for the NIR images of Clementine (Le Mouelic et al., JGR 1999). This reduction has been used to build an integrated UV-Vis-NIR image cube of the Aristarchus region, from which compositional and maturity variations can be studied (Pinet et al., LPSC 1999). We will present an analysis of this image cube, providing a classification in spectral types and spectral units. The image cube is processed with G-mode analysis using three different data sets. Normalized spectra provide a classification based mainly on spectral slope variations (i.e., maturity and volcanic glasses). This analysis discriminates between craters plus ejecta, mare basalts, and DMD. Olivine-rich areas and the Aristarchus central peak are also recognized. Continuum-removed spectra provide a classification more related to compositional variations, which correctly identifies olivine- and pyroxene-rich areas (in Aristarchus, Krieger, Schiaparelli, etc.). A third analysis uses spectral parameters related to maturity and Fe composition (reflectance, 1 μm band depth, and spectral slope) rather than intensities. It provides the most spatially consistent picture, but fails to detect Vallis Schroeteri and the DMDs. A supplementary unit, younger and rich in pyroxene, is found on the Aristarchus south rim. In conclusion, G-mode analysis can discriminate between different spectral types already identified with more classical methods (PCA, linear mixing, etc.). No previous assumption is made about the data structure, such as the number and nature of endmembers or a linear relationship between input variables. The variability of the spectral types is intrinsically accounted for, so that the level of analysis is always restricted to meaningful limits. A complete classification should integrate several analyses based on different sets of parameters. G-mode is therefore a powerful tool to perform a first-look analysis of spectral imaging data. This research has been partly funded by the French

  4. Automatic age and gender classification using supervised appearance model

    NASA Astrophysics Data System (ADS)

    Bukar, Ali Maina; Ugail, Hassan; Connah, David

    2016-11-01

    Age and gender classification are two important problems that recently gained popularity in the research community, due to their wide range of applications. Research has shown that both age and gender information are encoded in the face shape and texture, hence the active appearance model (AAM), a statistical model that captures shape and texture variations, has been one of the most widely used feature extraction techniques for the aforementioned problems. However, AAM suffers from some drawbacks, especially when used for classification. This is primarily because principal component analysis (PCA), which is at the core of the model, works in an unsupervised manner, i.e., PCA dimensionality reduction does not take into account how the predictor variables relate to the response (class labels). Rather, it explores only the underlying structure of the predictor variables, thus, it is no surprise if PCA discards valuable parts of the data that represent discriminatory features. Toward this end, we propose a supervised appearance model (sAM) that improves on AAM by replacing PCA with partial least-squares regression. This feature extraction technique is then used for the problems of age and gender classification. Our experiments show that sAM has better predictive power than the conventional AAM.

  5. Automatic body flexibility classification using laser Doppler flowmeter

    NASA Astrophysics Data System (ADS)

    Lien, I.-Chan; Li, Yung-Hui; Bau, Jian-Guo

    2015-10-01

    Body flexibility is an important indicator of whether an individual is healthy. Traditionally, a protractor must be prepared and the subject must perform a predefined set of actions; the measurement takes place while the subject performs the required action, which is clumsy and inconvenient. In this paper, we propose a statistical learning model based on random forests. The proposed system classifies body flexibility from laser Doppler flowmetry (LDF) signals analyzed in the frequency domain. Random forests were chosen for their efficiency (fast classification), their interpretable structure, and their ability to filter out irrelevant features. In addition, random forests help prevent over-fitting and yield a model that is more robust to noise. In our experiment, we use the chirp Z-transform (CZT) to transform an LDF signal into its energy values in five frequency bands. Combining the power of the random forest algorithm and frequency band analysis, a maximum recognition rate of 66% is achieved. Compared to the traditional flexibility measuring process, the proposed system shortens the long and tedious stages of measurement to a simple, fast, predefined activity set. The major contributions of our work include (1) a novel body flexibility classification scheme using a non-invasive biomedical sensor; (2) a designed protocol that is easy to conduct and practice; and (3) a high-precision classification scheme that combines the power of spectrum analysis and machine learning algorithms.
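
    A minimal sketch of this pipeline: each LDF signal is reduced to its energies in five frequency bands, which feed a random forest. The band edges and sampling rate are hypothetical, and an ordinary FFT stands in for the paper's chirp Z-transform.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        BANDS = [(0.01, 0.02), (0.02, 0.06), (0.06, 0.15),
                 (0.15, 0.4), (0.4, 1.6)]   # Hz; hypothetical band edges

        def band_energies(signal, fs):
            # Spectral energy in each band, computed from the FFT power spectrum.
            power = np.abs(np.fft.rfft(signal)) ** 2
            freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
            return [power[(freqs >= lo) & (freqs < hi)].sum() for lo, hi in BANDS]

        rng = np.random.default_rng(1)
        X = np.array([band_energies(rng.standard_normal(4096), fs=32)
                      for _ in range(60)])      # stand-in LDF recordings
        y = rng.integers(0, 3, 60)              # hypothetical flexibility grades
        clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)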

  6. Automatic evaluation of tracheoesophageal substitute voice: sustained vowel versus standard text.

    PubMed

    Bocklet, Tobias; Toy, Hikmet; Nöth, Elmar; Schuster, Maria; Eysholdt, Ulrich; Rosanowski, Frank; Gottwald, Frank; Haderlein, Tino

    2009-01-01

    The Hoarseness Diagram, a program for voice quality analysis used in German-speaking countries, was compared with an automatic speech recognition system with a module for prosodic analysis. The latter computed prosodic features on the basis of a text recording. We examined whether voice analysis of sustained vowels and text analysis correlate in tracheoesophageal speakers. Test speakers were 24 male laryngectomees with tracheoesophageal substitute speech, aged 60.6 ± 8.9 years. Each person read the German version of the text 'The North Wind and the Sun'. Additionally, five sustained vowels were recorded from each patient. The fundamental frequency (F0) detected by both programs was compared for all vowels, and the correlation between the measures obtained by the Hoarseness Diagram and the features from the prosody module was computed. Both programs have problems determining the F0 of highly pathologic voices. Parameters like jitter, shimmer, F0, and irregularity as computed by the Hoarseness Diagram from vowels show correlations of about -0.8 with prosodic features obtained from the text recordings. Voice properties can be reliably evaluated on the basis of both vowel and text recordings. Text analysis, however, also offers possibilities for the automatic evaluation of running speech, since it realistically represents everyday speech.

  7. Implementation of Automatic Process of Edge Rotation Diagnostic System on J-TEXT Tokamak

    NASA Astrophysics Data System (ADS)

    Zhang, Zepin; Cheng, Zhifeng; Luo, Jian; Wang, Zhijiang; Zhang, Xiaolong; Hou, Saiying; Cheng, Cheng

    2014-08-01

    A spectral diagnostic control system (SDCS) was developed to automate the edge rotation diagnostic system on the J-TEXT tokamak. The SDCS contains a control module, a data operation module, a data analysis module, and a data upload module. The core of the system is newly developed software called “Spectra Assist”, which completes the whole process by coupling all related subroutines and servers. The results of data correction and calculated rotation are presented. In daily J-TEXT discharges, the SDCS has proven stable and highly efficient in completing the process of data acquisition, processing, and results output.

  8. Segmenting articular cartilage automatically using a voxel classification approach.

    PubMed

    Folkesson, Jenny; Dam, Erik B; Olsen, Ole F; Pettersen, Paola C; Christiansen, Claus

    2007-01-01

    We present a fully automatic method for articular cartilage segmentation from magnetic resonance imaging (MRI), which we use as the foundation of a quantitative cartilage assessment. We evaluate our method by comparison with manual segmentations by a radiologist and by examining the interscan reproducibility of the volume and area estimates. Training and evaluation of the method are performed on a data set of 139 scans of knees ranging in status from healthy to severely osteoarthritic. This is, to our knowledge, the only fully automatic cartilage segmentation method that has good agreement with manual segmentations, an interscan reproducibility as good as that of a human expert, and the ability to separate healthy from osteoarthritic populations. While high-field scanners offer high-quality imaging from which articular cartilage has been evaluated extensively using manual and automated image analysis techniques, low-field scanners produce lower-quality images at a fraction of the cost of their high-field counterparts. For low-field MRI there is no well-established accuracy validation for quantitative cartilage estimates, but we show that differences between healthy and osteoarthritic populations are statistically significant using our cartilage volume and surface area estimates, which suggests that low-field MRI analysis can become a useful, affordable tool in clinical studies.

  9. An examination of the potential applications of automatic classification techniques to Georgia management problems

    NASA Technical Reports Server (NTRS)

    Rado, B. Q.

    1975-01-01

    Automatic classification techniques are described in relation to future information and natural resource planning systems, with emphasis on their application to Georgia resource management problems. The concept, design, and purpose of Georgia's statewide Resource Assessment Program are reviewed, along with participation in a workshop at the Earth Resources Laboratory. Potential areas of application discussed include agriculture, forestry, water resources, environmental planning, and geology.

  10. An automatic system to detect and extract texts in medical images for de-identification

    NASA Astrophysics Data System (ADS)

    Zhu, Yingxuan; Singh, P. D.; Siddiqui, Khan; Gillam, Michael

    2010-03-01

    Recently, there has been an increasing need to share medical images for research purposes. In order to respect and preserve patient privacy, most medical images are de-identified of protected health information (PHI) before research sharing. Since manual de-identification is time-consuming and tedious, an automatic de-identification system is necessary and helpful for removing text from medical images. Many papers have been written about algorithms for text detection and extraction; however, little of this work has been applied to the de-identification of medical images. Since a de-identification system is designed for end users, it should be effective, accurate, and fast. This paper proposes an automatic system to detect and extract text from medical images for de-identification purposes, while keeping the anatomic structures intact. First, because text has a marked contrast with the background, a region-variance-based algorithm is used to detect the text regions. In post-processing, geometric constraints are applied to the detected text regions to eliminate over-segmentation, e.g., lines and anatomic structures. After that, a region-based level set method is used to extract text from the detected text regions. A GUI for the prototype application of the text detection and extraction system was implemented, which shows that our method can detect most of the text in the images. Experimental results validate that our method can detect and extract text in medical images with a 99% recall rate. Future research on this system includes algorithm improvement, performance evaluation, and computation optimization.
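
    A minimal sketch of the first two steps, under the stated assumption that text has high local contrast: local variance is computed from sliding-window moments, and a simple geometric constraint discards elongated components such as lines. The window size, variance threshold, and aspect-ratio limit are hypothetical, and the level-set extraction step is not reproduced.

        import numpy as np
        from scipy import ndimage

        def text_region_mask(img, win=15, var_thresh=500.0, max_aspect=20):
            # Local variance via E[x^2] - E[x]^2 over a sliding window;
            # text regions have high contrast, hence high local variance.
            img = img.astype(float)
            mean = ndimage.uniform_filter(img, size=win)
            mean_sq = ndimage.uniform_filter(img ** 2, size=win)
            mask = (mean_sq - mean ** 2) > var_thresh
            # Geometric constraint: drop very elongated components
            # (e.g., lines and anatomic structures).
            labels, n = ndimage.label(mask)
            for i in range(1, n + 1):
                ys, xs = np.nonzero(labels == i)
                h, w = ys.ptp() + 1, xs.ptp() + 1
                if max(h, w) / min(h, w) > max_aspect:
                    mask[labels == i] = False
            return mask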

  11. Using a MaxEnt Classifier for the Automatic Content Scoring of Free-Text Responses

    NASA Astrophysics Data System (ADS)

    Sukkarieh, Jana Z.

    2011-03-01

    Criticisms of multiple-choice item assessments in the USA have prompted researchers and organizations to move towards constructed-response (free-text) items. Constructed-response (CR) items pose many challenges to the education community, one of which is that they are expensive to score by humans. At the same time, there has been a widespread movement towards computer-based assessment, and assessment organizations are competing to develop automatic content scoring engines for such item types, a task we view as textual entailment. This paper describes how MaxEnt modeling is used to help solve the task. MaxEnt has been used in many natural language tasks, but this is the first application of the MaxEnt approach to textual entailment and automatic content scoring.
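
    Scikit-learn's LogisticRegression is a maximum-entropy classifier, so a minimal content-scoring sketch can be built from a TF-IDF pipeline. The toy answers and 0/1 credit labels below are invented for illustration; the entailment-specific features of the actual engine are not modeled.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        # Hypothetical free-text answers with credit labels (1 = credit, 0 = none).
        answers = ["plants use sunlight to make food",
                   "the sun is hot",
                   "photosynthesis converts light energy into chemical energy",
                   "i do not know"]
        scores = [1, 0, 1, 0]

        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit(answers, scores)
        print(model.predict(["light energy is converted into food by plants"]))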

  12. Semi-automatic indexing of PostScript files using Medical Text Indexer in medical education.

    PubMed

    Mollah, Shamim Ara; Cimino, Christopher

    2007-10-11

    At Albert Einstein College of Medicine, a large part of the online lecture material consists of PostScript files. As the collection grows, it becomes essential to create a digital library providing easy, full-text-indexed access to relevant sections of the lecture material; to create this index it is necessary to extract all the text from the document files that constitute the originals of the lectures. In this study we present a semi-automatic indexing method that uses a robust technique for extracting text from PostScript files and the National Library of Medicine's Medical Text Indexer (MTI) program for indexing the text. This model can be applied to indexing at other medical schools.

  13. Automatic classification and speaker identification of African elephant (Loxodonta africana) vocalizations

    NASA Astrophysics Data System (ADS)

    Clemins, Patrick J.; Johnson, Michael T.; Leong, Kirsten M.; Savage, Anne

    2005-02-01

    A hidden Markov model (HMM) system is presented for automatically classifying African elephant vocalizations. The development of the system is motivated by successful models from human speech analysis and recognition. Classification features include frequency-shifted Mel-frequency cepstral coefficients (MFCCs) and log energy, spectrally motivated features which are commonly used in human speech processing. Experiments, including vocalization type classification and speaker identification, are performed on vocalizations collected from captive elephants in a naturalistic environment. The system classified vocalizations with accuracies of 94.3% and 82.5% in the type classification and speaker identification experiments, respectively. Classification accuracy, statistical significance tests on the model parameters, and qualitative analysis support the effectiveness and robustness of this approach for vocalization analysis in nonhuman species.
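
    A hedged sketch of this kind of pipeline, assuming the third-party librosa and hmmlearn packages: one Gaussian HMM is trained per vocalization type on MFCC frames, and a new call is assigned to the model with the highest log-likelihood. The paper's frequency-shifted MFCC variant for low-pitched calls is not reproduced; the file paths and state count are hypothetical.

        import numpy as np
        import librosa
        from hmmlearn import hmm

        def mfcc_frames(path):
            # MFCC feature matrix, one row per analysis frame.
            y, sr = librosa.load(path, sr=None)
            return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T

        def train_models(calls_by_type, n_states=5):
            # calls_by_type: {"rumble": [paths], "trumpet": [paths], ...}
            models = {}
            for label, paths in calls_by_type.items():
                feats = [mfcc_frames(p) for p in paths]
                X, lengths = np.vstack(feats), [len(f) for f in feats]
                models[label] = hmm.GaussianHMM(n_components=n_states,
                                                n_iter=50).fit(X, lengths)
            return models

        def classify(models, path):
            f = mfcc_frames(path)
            return max(models, key=lambda label: models[label].score(f))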

  14. Automatic classification of atherosclerotic plaques imaged with intravascular OCT

    PubMed Central

    Rico-Jimenez, Jose J.; Campos-Delgado, Daniel U.; Villiger, Martin; Otsuka, Kenichiro; Bouma, Brett E.; Jo, Javier A.

    2016-01-01

    Intravascular optical coherence tomography (IV-OCT) allows evaluation of atherosclerotic plaques; however, plaque characterization is performed by visual assessment and requires a trained expert for interpretation of the large data sets. Here, we present a novel computational method for automated IV-OCT plaque characterization. This method is based on modeling each A-line of an IV-OCT data set as a linear combination of a number of depth profiles. After estimating these depth profiles by means of an alternating least-squares optimization strategy, they are automatically classified into predefined tissue types based on their morphological characteristics. The performance of our proposed method was evaluated on IV-OCT scans of cadaveric human coronary arteries and corresponding tissue histopathology. Our results suggest that this methodology allows automated identification of fibrotic and lipid-containing plaques. Moreover, this novel computational method has the potential to enable high-throughput atherosclerotic plaque characterization. PMID:27867716

  15. Automatic classification of endogenous landslide seismicity using the Random Forest supervised classifier

    NASA Astrophysics Data System (ADS)

    Provost, F.; Hibert, C.; Malet, J.-P.

    2017-01-01

    The deformation of slow-moving landslides developed in clays induces endogenous seismicity consisting mostly of low-magnitude events (ML < 1). Long seismic records and complete catalogs are needed to identify the types of seismic sources and understand their mechanisms. Manual classification of long records is time-consuming and may be highly subjective. We propose an automatic classification method based on the computation of 71 seismic attributes and the use of a supervised classifier. No attribute was selected a priori, in order to create a generic multi-class classification method applicable to many landslide contexts. The method can be applied directly to the output of a simple detector. We developed the approach on the eight-sensor seismic network of the Super-Sauze clay-rich landslide (South French Alps) for the detection of four types of seismic sources. The automatic algorithm achieves 93% sensitivity in comparison with a manually interpreted catalog taken as the reference.

  16. Automatic classification of DMSA scans using an artificial neural network

    NASA Astrophysics Data System (ADS)

    Wright, J. W.; Duguid, R.; Mckiddie, F.; Staff, R. T.

    2014-04-01

    DMSA imaging is carried out in nuclear medicine to assess the level of functional renal tissue in patients. This study investigated the use of an artificial neural network to perform diagnostic classification of these scans. Using the radiological report as the gold standard, the network was trained to classify DMSA scans as positive or negative for defects using a representative sample of 257 previously reported images. The trained network was then independently tested using a further 193 scans and achieved a binary classification accuracy of 95.9%. The performance of the network was compared with three qualified expert observers who were asked to grade each scan in the 193-image testing set on a six-point defect scale, from ‘definitely normal’ to ‘definitely abnormal’. A receiver operating characteristic analysis comparison between a consensus operator, generated from the scores of the three expert observers, and the network revealed a statistically significant increase (α < 0.05) in performance between the network and operators. A further result from this work was that, when suitably optimized, a negative predictive value of 100% for renal defects was achieved by the network, while still identifying 93% of the negative cases in the dataset. These results are encouraging for the application of such a network as a screening tool or quality assurance assistant in clinical practice.
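
    The 100% negative-predictive-value operating point mentioned above corresponds to choosing a decision threshold that produces no false negatives; since NPV = TN / (TN + FN), NPV reaches 1 exactly when FN = 0. A minimal sketch, with hypothetical network outputs:

        import numpy as np

        def threshold_for_full_npv(probs, labels):
            # Flag every scan scoring at least as high as the lowest-scoring
            # truly abnormal scan, so no abnormal scan is called normal.
            t = probs[labels == 1].min()
            flagged = probs >= t
            specificity = ((~flagged) & (labels == 0)).sum() / (labels == 0).sum()
            return t, specificity

        probs = np.array([0.90, 0.80, 0.30, 0.20, 0.05])   # hypothetical outputs
        labels = np.array([1, 1, 1, 0, 0])                 # 1 = defect present
        print(threshold_for_full_npv(probs, labels))       # (0.3, 1.0)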

  17. Automatic Fault Characterization via Abnormality-Enhanced Classification

    SciTech Connect

    Bronevetsky, G; Laguna, I; de Supinski, B R

    2010-12-20

    Enterprise and high-performance computing systems are growing extremely large and complex, employing hundreds to hundreds of thousands of processors and software/hardware stacks built by many people across many organizations. As the growing scale of these machines increases the frequency of faults, system complexity makes these faults difficult to detect and to diagnose. Current system management techniques, which focus primarily on efficient data access and query mechanisms, require system administrators to examine the behavior of various system services manually. Growing system complexity is making this manual process unmanageable: administrators require more effective management tools that can detect faults and help to identify their root causes. System administrators need timely notification when a fault is manifested that includes the type of fault, the time period in which it occurred and the processor on which it originated. Statistical modeling approaches can accurately characterize system behavior. However, the complex effects of system faults make these tools difficult to apply effectively. This paper investigates the application of classification and clustering algorithms to fault detection and characterization. We show experimentally that naively applying these methods achieves poor accuracy. Further, we design novel techniques that combine classification algorithms with information on the abnormality of application behavior to improve detection and characterization accuracy. Our experiments demonstrate that these techniques can detect and characterize faults with 65% accuracy, compared to just 5% accuracy for naive approaches.

  18. Automatic classification of DMSA scans using an artificial neural network.

    PubMed

    Wright, J W; Duguid, R; McKiddie, F; Staff, R T

    2014-04-07

    DMSA imaging is carried out in nuclear medicine to assess the level of functional renal tissue in patients. This study investigated the use of an artificial neural network to perform diagnostic classification of these scans. Using the radiological report as the gold standard, the network was trained to classify DMSA scans as positive or negative for defects using a representative sample of 257 previously reported images. The trained network was then independently tested using a further 193 scans and achieved a binary classification accuracy of 95.9%. The performance of the network was compared with three qualified expert observers who were asked to grade each scan in the 193-image testing set on a six-point defect scale, from 'definitely normal' to 'definitely abnormal'. A receiver operating characteristic analysis comparison between a consensus operator, generated from the scores of the three expert observers, and the network revealed a statistically significant increase (α < 0.05) in performance between the network and operators. A further result from this work was that, when suitably optimized, a negative predictive value of 100% for renal defects was achieved by the network, while still identifying 93% of the negative cases in the dataset. These results are encouraging for the application of such a network as a screening tool or quality assurance assistant in clinical practice.

  19. Feature ranking and rank aggregation for automatic sleep stage classification: a comparative study.

    PubMed

    Najdi, Shirin; Gharbali, Ali Abdollahi; Fonseca, José Manuel

    2017-08-18

    Nowadays, sleep quality is one of the most important measures of a healthy life, especially considering the large number of sleep-related disorders. Identifying sleep stages using polysomnographic (PSG) signals is the traditional way of assessing sleep quality. However, the manual process of sleep stage classification is time-consuming, subjective, and costly. Therefore, in order to improve the accuracy and efficiency of sleep stage classification, researchers have been trying to develop automatic classification algorithms. Automatic sleep stage classification mainly consists of three steps: pre-processing, feature extraction, and classification. Since classification accuracy is deeply affected by the extracted features, a poor feature vector will adversely affect the classifier and eventually lead to low classification accuracy. Therefore, special attention should be given to the feature extraction and selection process. In this paper the performance of seven feature selection methods, as well as two feature rank aggregation methods, was compared. Pz-Oz EEG, horizontal EOG, and submental chin EMG recordings of 22 healthy males and females were used. A comprehensive set of 49 features was extracted from these recordings. The extracted features are among the most common and effective features used in sleep stage classification, drawn from temporal, spectral, entropy-based, and nonlinear categories. The feature selection methods were evaluated and compared using three criteria: classification accuracy, stability, and similarity. Simulation results show that MRMR-MID achieves the highest classification performance while the Fisher method provides the most stable ranking. In our simulations, the performance of the aggregation methods was average, although they are known to generate more stable results and better accuracy. The Borda and RRA rank aggregation methods could not significantly outperform the conventional feature ranking methods. Among
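
    A minimal sketch of feature ranking and Borda aggregation in scikit-learn terms: two rankers (mutual information and the ANOVA F-score, standing in for the paper's seven methods) each award Borda points by rank, and the points are summed. The data shapes are hypothetical.

        import numpy as np
        from sklearn.feature_selection import f_classif, mutual_info_classif

        def borda_aggregate(X, y):
            # Each ranker awards n_features points to its top feature,
            # n_features - 1 to the next, and so on; points are summed.
            scores = [mutual_info_classif(X, y), f_classif(X, y)[0]]
            n = X.shape[1]
            points = np.zeros(n)
            for s in scores:
                order = np.argsort(-s)               # best feature first
                points[order] += np.arange(n, 0, -1)
            return np.argsort(-points)               # aggregated ranking

        rng = np.random.default_rng(0)
        X, y = rng.random((30, 10)), rng.integers(0, 2, 30)
        print(borda_aggregate(X, y))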

  20. Text Categorization Based on K-Nearest Neighbor Approach for Web Site Classification.

    ERIC Educational Resources Information Center

    Kwon, Oh-Woog; Lee, Jong-Hyeok

    2003-01-01

    Discusses text categorization and Web site classification and proposes a three-step classification system that includes the use of Web pages linked with the home page. Highlights include the k-nearest neighbor (k-NN) approach; improving performance with a feature selection method and a term weighting scheme using HTML tags; and similarity…

  1. Text Categorization Based on K-Nearest Neighbor Approach for Web Site Classification.

    ERIC Educational Resources Information Center

    Kwon, Oh-Woog; Lee, Jong-Hyeok

    2003-01-01

    Discusses text categorization and Web site classification and proposes a three-step classification system that includes the use of Web pages linked with the home page. Highlights include the k-nearest neighbor (k-NN) approach; improving performance with a feature selection method and a term weighting scheme using HTML tags; and similarity…

  2. Ipsilateral coordination features for automatic classification of Parkinson's disease

    NASA Astrophysics Data System (ADS)

    Sarmiento, Fernanda; Atehortúa, Angélica; Martínez, Fabio; Romero, Eduardo

    2015-12-01

    A reliable diagnosis of Parkinson's disease relies on the objective evaluation of different motor subsystems. Discovering specific motor patterns associated with the disease is fundamental for the development of unbiased assessments that facilitate disease characterization, independently of the particular examiner. This paper proposes a new objective screening of patients with Parkinson's disease, an approach that optimally combines ipsilateral global descriptors. These ipsilateral gait features are simple upper-lower limb relationships in frequency and relative phase spaces. These low-level characteristics feed a simple SVM classifier with a polynomial kernel function. The strategy was assessed in a binary classification task, normal versus Parkinson's, under a leave-one-out scheme in a population of 16 Parkinson's patients and 7 healthy control subjects. Results showed an accuracy of 94.6% using relative phase spaces and 82.1% with simple frequency relations.
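
    A minimal sketch of the evaluation protocol described above, with hypothetical gait features: a polynomial-kernel SVM scored by leave-one-out cross-validation over 16 patients and 7 controls.

        import numpy as np
        from sklearn.model_selection import LeaveOneOut, cross_val_score
        from sklearn.svm import SVC

        rng = np.random.default_rng(0)
        X = rng.random((23, 4))                 # hypothetical frequency/phase features
        y = np.r_[np.ones(16), np.zeros(7)]     # 16 patients, 7 healthy controls

        acc = cross_val_score(SVC(kernel="poly", degree=3), X, y,
                              cv=LeaveOneOut()).mean()
        print(f"leave-one-out accuracy: {acc:.3f}")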

  3. Assessing the impact of graphical quality on automatic text recognition in digital maps

    NASA Astrophysics Data System (ADS)

    Chiang, Yao-Yi; Leyk, Stefan; Honarvar Nazari, Narges; Moghaddam, Sima; Tan, Tian Xiang

    2016-08-01

    Converting geographic features (e.g., place names) in map images into a vector format is the first step for incorporating cartographic information into a geographic information system (GIS). With advances in computational power and algorithm design, map processing systems have improved considerably over the last decade. However, fundamental map processing techniques such as color image segmentation, (map) layer separation, and object recognition are sensitive to minor variations in the graphical properties of the input image (e.g., scanning resolution). As a result, most map processing results will not meet user expectations unless the user "properly" scans the map of interest, pre-processes the map image (e.g., choosing whether to use compression), and trains the processing system accordingly. These issues could slow the further advancement of map processing techniques, as unsuccessful attempts create a discouraged user community and less sophisticated tools come to be perceived as more viable solutions. Thus, it is important to understand what kinds of maps are suitable for automatic map processing and what types of results and process-related errors can be expected. In this paper, we shed light on these questions by using a typical map processing task, text recognition, to discuss a number of map instances that vary in suitability for automatic processing. We also present an extensive experiment on a diverse set of scanned historical maps to provide measures of the baseline performance of a standard text recognition tool under varying map conditions (graphical quality) and text representations (which can vary even within the same map sheet). Our experimental results help the user understand what to expect when a fully or semi-automatic map processing system is used to process a scanned map with certain (varying) graphical properties and complexities in map content.

  4. Automatic Entity Recognition and Typing from Massive Text Corpora: A Phrase and Network Mining Approach.

    PubMed

    Ren, Xiang; El-Kishky, Ahmed; Wang, Chi; Han, Jiawei

    2015-08-01

    In today's computerized and information-based society, we are flooded with vast amounts of text data, ranging from news articles, scientific publications, and product reviews to a wide range of textual information from social media. To unlock the value of these unstructured text data from various domains, it is of great importance to gain an understanding of entities and their relationships. In this tutorial, we introduce data-driven methods to recognize typed entities of interest in massive, domain-specific text corpora. These methods can automatically identify token spans as entity mentions in documents and label their types (e.g., people, product, food) in a scalable way. We demonstrate on real datasets, including news articles and tweets, how these typed entities aid in knowledge discovery and management.

  5. Can prosody aid the automatic classification of dialog acts in conversational speech?

    PubMed

    Shriberg, E; Bates, R; Stolcke, A; Taylor, P; Jurafsky, D; Ries, K; Coccaro, N; Martin, R; Meteer, M; van Ess-Dykema, C

    1998-01-01

    Identifying whether an utterance is a statement, question, greeting, and so forth is integral to effective automatic understanding of natural dialog. Little is known, however, about how such dialog acts (DAs) can be automatically classified in truly natural conversation. This study asks whether current approaches, which use mainly word information, could be improved by adding prosodic information. The study is based on more than 1000 conversations from the Switchboard corpus. DAs were hand-annotated, and prosodic features (duration, pause, F0, energy, and speaking rate) were automatically extracted for each DA. In training, decision trees based on these features were inferred; the trees were then applied to unseen test data to evaluate performance. Performance was evaluated for the prosody models alone and after combining the prosody models with word information, either from true words or from the output of an automatic speech recognizer. For an overall classification task, as well as three subtasks, prosody made significant contributions to classification. Feature-specific analyses further revealed that although canonical features (such as F0 for questions) were important, less obvious features could compensate if canonical features were removed. Finally, in each task, integrating the prosodic model with a DA-specific statistical language model improved performance over that of the language model alone, especially in the case of recognized words. The results suggest that DAs are redundantly marked in natural conversation, and that a variety of automatically extractable prosodic features could aid dialog processing in speech applications.
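
    A minimal sketch of the modeling step, with invented prosodic feature values: a decision tree trained on duration, pause, F0, energy, and speaking rate to predict the dialog act.

        from sklearn.tree import DecisionTreeClassifier

        # Each row: [duration_s, pause_s, mean_F0_Hz, energy_dB, speaking_rate].
        X = [[1.2, 0.1, 210, 62, 4.1],
             [0.4, 0.0, 180, 58, 5.0],
             [2.3, 0.4, 150, 60, 3.2],
             [0.3, 0.0, 230, 55, 4.8]]
        y = ["question", "statement", "statement", "question"]  # toy DA labels

        tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
        print(tree.predict([[0.5, 0.0, 225, 57, 4.9]]))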

  6. Automatic classification of retinal vessels into arteries and veins

    NASA Astrophysics Data System (ADS)

    Niemeijer, Meindert; van Ginneken, Bram; Abràmoff, Michael D.

    2009-02-01

    Separating the retinal vascular tree into arteries and veins is important for quantifying vessel changes that preferentially affect either the veins or the arteries. For example, the ratio of arterial to venous diameter, the retinal a/v ratio, is well established to be predictive of stroke and other cardiovascular events in adults, as well as of the staging of retinopathy of prematurity in premature infants. This work presents a supervised, automatic method that can determine whether a vessel is an artery or a vein based on intensity and derivative information. After thinning of the vessel segmentation, vessel crossing and bifurcation points are removed, leaving a set of vessel segments containing centerline pixels. A set of features is extracted from each centerline pixel, and using these each pixel is assigned a soft label indicating the likelihood that it is part of a vein. As all centerline pixels in a connected segment should be of the same type, we average the soft labels and assign this average label to each centerline pixel in the segment. We train and test the algorithm using the data (40 color fundus photographs) from the DRIVE database with an enhanced reference standard, in which a fellowship-trained retinal specialist (MDA) labeled all vessels for which it was possible to visually determine whether they were veins or arteries. After applying the proposed method to the 20 images of the DRIVE test set, we obtained an area under the receiver operating characteristic (ROC) curve of 0.88 for correctly assigning centerline pixels to either the vein or artery class.
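
    The segment-level smoothing above reduces to averaging the soft labels within each connected segment; a minimal numpy sketch, with hypothetical per-pixel labels and segment ids:

        import numpy as np

        def smooth_segment_labels(soft_labels, segment_ids):
            # All centerline pixels in one connected vessel segment share a
            # type, so each pixel gets its segment's mean soft label.
            soft_labels = np.asarray(soft_labels, dtype=float)
            segment_ids = np.asarray(segment_ids)
            out = np.empty_like(soft_labels)
            for seg in np.unique(segment_ids):
                m = segment_ids == seg
                out[m] = soft_labels[m].mean()
            return out

        labels = [0.9, 0.7, 0.2, 0.4]
        segments = [1, 1, 2, 2]
        print(smooth_segment_labels(labels, segments))   # [0.8 0.8 0.3 0.3]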

  7. Automatic classification of atypical lymphoid B cells using digital blood image processing.

    PubMed

    Alférez, S; Merino, A; Mujica, L E; Ruiz, M; Bigorra, L; Rodellar, J

    2014-08-01

    There are automated systems for digital peripheral blood (PB) cell analysis, but they operate most effectively on nonpathological blood samples. The objective of this work was to design a methodology to improve the automatic classification of abnormal lymphoid cells. We analyzed 340 digital images of individual lymphoid cells from PB films obtained with the CellaVision DM96: 150 chronic lymphocytic leukemia (CLL) cells, 100 hairy cell leukemia (HCL) cells, and 90 normal lymphocytes (N). We implemented the watershed transformation to segment the nucleus, the cytoplasm, and the peripheral cell region. We extracted 44 features, and then Fuzzy C-Means (FCM) clustering was applied in two steps for lymphocyte classification. The images were automatically clustered into three groups, one of them containing 98% of the HCL cells. The set of remaining cells was clustered again using FCM and texture features. The two new groups contained 83.3% of the N cells and 71.3% of the CLL cells, respectively. The approach was able to automatically classify three types of lymphoid cells with high precision. The addition of more descriptors and other classification techniques will allow extending the classification to other classes of atypical lymphoid cells.
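
    A minimal numpy implementation of the Fuzzy C-Means step (the 44 morphological features and texture descriptors are not reproduced; X is any feature matrix):

        import numpy as np

        def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
            # Alternate between updating cluster centers and fuzzy memberships
            # u; each row of u sums to 1 over the c clusters.
            rng = np.random.default_rng(seed)
            u = rng.random((len(X), c))
            u /= u.sum(axis=1, keepdims=True)
            for _ in range(n_iter):
                w = u ** m
                centers = (w.T @ X) / w.sum(axis=0)[:, None]
                d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
                u = 1.0 / (d + 1e-12) ** (2 / (m - 1))
                u /= u.sum(axis=1, keepdims=True)
            return centers, u

        X = np.random.default_rng(1).random((340, 44))  # stand-in feature matrix
        centers, memberships = fuzzy_c_means(X, c=3)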

  8. Automatic Survey-invariant Classification of Variable Stars

    NASA Astrophysics Data System (ADS)

    Benavente, Patricio; Protopapas, Pavlos; Pichara, Karim

    2017-08-01

    Machine learning techniques have been successfully used to classify variable stars on widely studied astronomical surveys. These data sets have been available to astronomers long enough, thus allowing them to perform deep analysis over several variable sources and generating useful catalogs with identified variable stars. The products of these studies are labeled data that enable supervised learning models to be trained successfully. However, when these models are blindly applied to data from new sky surveys, their performance drops significantly. Furthermore, unlabeled data become available at a much higher rate than their labeled counterpart, since labeling is a manual and time-consuming effort. Domain adaptation techniques aim to learn from a domain where labeled data are available, the source domain, and through some adaptation perform well on a different domain, the target domain. We propose a full probabilistic model that represents the joint distribution of features from two surveys, as well as a probabilistic transformation of the features from one survey to the other. This allows us to transfer labeled data to a study where they are not available and to effectively run a variable star classification model in a new survey. Our model represents the features of each domain as a Gaussian mixture and models the transformation as a translation, rotation, and scaling of each separate component. We perform tests using three different variability catalogs, EROS, MACHO, and HiTS, presenting differences among them, such as the number of observations per star, cadence, observational time, and optical bands observed, among others.

  9. Automatic Classification on Biomedical Prognosis of Invasive Breast Cancer

    PubMed

    S, Sountharrajan; M, Karthiga; E, Suganya; C, Rajan

    2017-09-27

    Breast cancer is one of the most appalling diseases among middle-aged women and a leading cause of cancer death in women throughout the world. Earlier prognosis and prevention reduce the likelihood of death. The proposed system combines various data mining techniques with real-time input data from a biosensor device to determine the rate of disease development. A surface acoustic wave (SAW) biosensor enables label-free, cost-effective, and direct detection of the HER-2/neu cancer biomarker. The output from the biosensor is fed into the proposed system as input, along with data collected from the Wisconsin dataset. The complete dataset is processed using data mining classification algorithms. The accuracy of the proposed model is improved by ranking attributes with the Ranker algorithm. The proposed model achieves an accuracy of 79.25% with an SVM classifier and an ROC area of 0.754, which is better than existing systems. The results can be used in designing the appropriate drug, thereby improving patient survivability.

  10. Automatically Detecting Failures in Natural Language Processing Tools for Online Community Text

    PubMed Central

    Hartzler, Andrea L; Huh, Jina; McDonald, David W; Pratt, Wanda

    2015-01-01

    Background The prevalence and value of patient-generated health text are increasing, but processing such text remains problematic. Although existing biomedical natural language processing (NLP) tools are appealing, most were developed to process clinician- or researcher-generated text, such as clinical notes or journal articles. In addition to being constructed for different types of text, other challenges of using existing NLP tools include constantly changing technologies, source vocabularies, and characteristics of text. These continuously evolving challenges warrant the need for low-cost systematic assessment. However, the primarily accepted evaluation method in NLP, manual annotation, requires tremendous effort and time. Objective The primary objective of this study is to explore an alternative approach: using low-cost, automated methods to detect failures (e.g., incorrect boundaries, missed terms, mismapped concepts) when processing patient-generated text with existing biomedical NLP tools. We first characterize common failures that NLP tools can make in processing online community text. We then demonstrate the feasibility of our automated approach in detecting these common failures using one of the most popular biomedical NLP tools, MetaMap. Methods Using 9657 posts from an online cancer community, we explored our automated failure detection approach in two steps: (1) to characterize the failure types, we first manually reviewed MetaMap's commonly occurring failures, grouped the inaccurate mappings into failure types, and then identified causes of the failures through iterative rounds of manual review using open coding; and (2) to automatically detect these failure types, we then explored combinations of existing NLP techniques and dictionary-based matching for each failure cause. Finally, we manually evaluated the automatically detected failures. Results From our manual review, we characterized three types of failure: (1) boundary failures, (2) missed

  11. Automatic identification of ROI in figure images toward improving hybrid (text and image) biomedical document retrieval

    NASA Astrophysics Data System (ADS)

    You, Daekeun; Antani, Sameer; Demner-Fushman, Dina; Rahman, Md Mahmudur; Govindaraju, Venu; Thoma, George R.

    2011-01-01

    Biomedical images are often referenced for clinical decision support (CDS), educational purposes, and research. They appear in specialized databases or in biomedical publications and are not meaningfully retrievable using primarily text-based retrieval systems. The task of automatically finding the images in an article that are most useful for determining relevance to a clinical situation is quite challenging. One approach is to automatically annotate images extracted from scientific publications with respect to their usefulness for CDS. As an important step toward achieving this goal, we proposed figure image analysis for localizing pointers (arrows, symbols) to extract regions of interest (ROIs) that can then be used to obtain meaningful local image content. Content-based image retrieval (CBIR) techniques can then associate local image ROIs with biomedical concepts identified in figure captions for improved hybrid (text and image) retrieval of biomedical articles. In this work we present methods that make our previous Markov random field (MRF)-based approach for pointer recognition and ROI extraction more robust. These include the use of Active Shape Models (ASMs) to overcome problems in recognizing distorted pointer shapes and a region segmentation method for ROI extraction. We measure the performance of our methods on two criteria: (i) effectiveness in recognizing pointers in images, and (ii) improved document retrieval through the use of extracted ROIs. Evaluation on three test sets shows 87% accuracy on the first criterion. Further, the quality of document retrieval using local visual features and text is shown to be better than using visual features alone.

  12. Automatic Evaluation of Voice Quality Using Text-Based Laryngograph Measurements and Prosodic Analysis

    PubMed Central

    Haderlein, Tino; Schwemmle, Cornelia; Döllinger, Michael; Matoušek, Václav; Ptok, Martin; Nöth, Elmar

    2015-01-01

    Due to low intra- and interrater reliability, perceptual voice evaluation should be supported by objective, automatic methods. In this study, text-based, computer-aided prosodic analysis and measurements of connected speech were combined in order to model perceptual evaluation of the German Roughness-Breathiness-Hoarseness (RBH) scheme. 58 connected speech samples (43 women and 15 men; 48.7 ± 17.8 years) containing the German version of the text “The North Wind and the Sun” were evaluated perceptually by 19 speech and voice therapy students according to the RBH scale. For the human-machine correlation, Support Vector Regression with measurements of the vocal fold cycle irregularities (CFx) and the closed phases of vocal fold vibration (CQx) of the Laryngograph and 33 features from a prosodic analysis module were used to model the listeners' ratings. The best human-machine results for roughness were obtained from a combination of six prosodic features and CFx (r = 0.71, ρ = 0.57). These correlations were approximately the same as the interrater agreement among human raters (r = 0.65, ρ = 0.61). CQx was one of the substantial features of the hoarseness model. For hoarseness and breathiness, the human-machine agreement was substantially lower. Nevertheless, the automatic analysis method can serve as the basis for a meaningful objective support for perceptual analysis. PMID:26136813

  13. Validating Automatically Generated Students' Conceptual Models from Free-text Answers at the Level of Concepts

    NASA Astrophysics Data System (ADS)

    Pérez-Marín, Diana; Pascual-Nieto, Ismael; Rodríguez, Pilar; Anguiano, Eloy; Alfonseca, Enrique

    2008-11-01

    Students' conceptual models can be defined as networks of interconnected concepts, in which a confidence value (CV) is estimated for each concept. This CV indicates how confident the system is that the student knows the concept, according to how the student has used it in the free-text answers provided to an automatic free-text scoring system. In a previous work, a preliminary validation was done based on a global comparison between the score achieved by each student in the final exam and the score associated with his or her model (calculated as the average of the CVs of the concepts); a statistically significant Pearson correlation of 0.50 (p = 0.01) was reached. To complete those results, in this paper the level of granularity is lowered to each particular concept: the correspondence between the human estimation of how well each concept of the conceptual model is known and the computer estimation is calculated. A mean quadratic error of 0.08 between the two values was attained, which validates the automatically generated students' conceptual models at the concept level of granularity.

  14. Toward Routine Automatic Pathway Discovery from On-line Scientific Text Abstracts.

    PubMed

    Ng; Wong

    1999-01-01

    We are entering a new era of research in which the latest scientific discoveries are often first reported online and are readily accessible by scientists worldwide. This rapid electronic dissemination of research breakthroughs has greatly accelerated the current pace of genomics and proteomics research. The race to discover a gene or a drug has become increasingly dependent on how quickly a scientist can scan through the voluminous amount of information available online to construct the relevant picture (such as protein-protein interaction pathways) as it takes shape amongst the rapidly expanding pool of globally accessible biological data (e.g., GENBANK) and scientific literature (e.g., MEDLINE). We describe a prototype system for automatic pathway discovery from online text abstracts, combining technologies that (1) retrieve research abstracts from online sources, (2) extract relevant information from the free texts, and (3) present the extracted information graphically and intuitively. Our work demonstrates that this framework allows us to routinely scan online scientific literature for automatic discovery of knowledge, giving modern scientists the necessary competitive edge in managing the information explosion of this electronic age.

  15. Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining

    PubMed Central

    2010-01-01

    Background Previously, we developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text, based on a number of publicly available databases, and tested it on an annotated corpus. To achieve an acceptable recall and precision we used a number of automatic and semi-automatic processing steps together with disambiguation rules. However, it remained to be investigated what impact an extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text. ChemSpider is a chemical database that has undergone extensive manual curation aimed at establishing valid chemical name-to-structure relationships. Results We acquired the component of ChemSpider containing only manually curated names and synonyms. Rule-based term filtering, semi-automatic manual curation, and disambiguation rules were applied. We tested the dictionary from ChemSpider on an annotated corpus and compared the results with those for the Chemlist dictionary. The ChemSpider dictionary of ca. 80 k names was only a third to a quarter of the size of Chemlist, at around 300 k. The ChemSpider dictionary had a precision of 0.43 and a recall of 0.19 before the application of filtering and disambiguation, and a precision of 0.87 and a recall of 0.19 afterwards. The Chemlist dictionary had a precision of 0.20 and a recall of 0.47 before the application of filtering and disambiguation, and a precision of 0.67 and a recall of 0.40 afterwards. Conclusions We conclude the following: (1) The ChemSpider dictionary achieved the best precision, but the Chemlist dictionary had a higher recall and the best F-score; (2) Rule-based filtering and disambiguation are necessary to achieve a high precision for both the automatically generated and the manually curated dictionary. ChemSpider is available as a web service at http://www.chemspider.com/ and the Chemlist dictionary is freely available as an XML file.
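
    The reported figures follow from the usual precision/recall definitions over (document, term) matches; a schematic computation with hypothetical term sets:

        # Hypothetical gold-standard annotations and dictionary hits
        gold = {("doc1", "aspirin"), ("doc1", "ibuprofen"), ("doc2", "ethanol")}
        found = {("doc1", "aspirin"), ("doc2", "ethanol"), ("doc2", "lead")}

        tp = len(gold & found)                            # true positives
        precision = tp / len(found)                       # 2/3
        recall = tp / len(gold)                           # 2/3
        f_score = 2 * precision * recall / (precision + recall)
        print(f"P={precision:.2f} R={recall:.2f} F={f_score:.2f}")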

  16. Material classification and automatic content enrichment of images using supervised learning and knowledge bases

    NASA Astrophysics Data System (ADS)

    Mallepudi, Sri Abhishikth; Calix, Ricardo A.; Knapp, Gerald M.

    2011-02-01

    In recent years there has been a rapid increase in the size of video and image databases. Effective searching and retrieval of images from these databases is a significant current research area. In particular, there is a growing interest in query capabilities based on semantic image features such as objects, locations, and materials, known as content-based image retrieval. This study investigated mechanisms for identifying materials present in an image. These capabilities provide additional information that affects conditional probabilities about images (e.g. objects made of steel are more likely to be buildings), and they are useful in Building Information Modeling (BIM) and in automatic enrichment of images. Image-to-text (I2T) methodologies enrich an image by generating text descriptions based on image analysis. In this work, a learning model is trained to detect certain materials in images. To train the model, an image dataset was constructed containing single-material images of bricks, cloth, grass, sand, stones, and wood. For generalization purposes, an additional set of 50 images containing multiple materials (some not used in training) was constructed. Two different supervised learning classification models were investigated: a single multi-class SVM classifier, and multiple binary SVM classifiers (one per material). Image features included Gabor filter parameters for texture and color histogram data for RGB components. All classification accuracy scores using the SVM-based method were above 85%. The second model gathered more information from the images, since it could assign multiple classes to an image. A framework for the I2T methodology is presented.
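
    A minimal sketch contrasting the two model designs, with synthetic stand-ins for the Gabor/colour-histogram features (the six material classes follow the abstract; everything else is illustrative):

        import numpy as np
        from sklearn.multiclass import OneVsRestClassifier
        from sklearn.preprocessing import MultiLabelBinarizer
        from sklearn.svm import SVC

        rng = np.random.default_rng(1)
        X = rng.normal(size=(120, 40))        # texture + colour features (synthetic)
        y = rng.integers(0, 6, size=120)      # 6 materials, one label per training image

        # Design 1: a single multi-class SVM -- exactly one material per image
        clf_multi = SVC(kernel="rbf").fit(X, y)

        # Design 2: one binary SVM per material -- an image may receive several labels
        Y = MultiLabelBinarizer().fit_transform([[label] for label in y])
        clf_binary = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, Y)

        print(clf_multi.predict(X[:3]))       # one class index per image
        print(clf_binary.predict(X[:3]))      # a 0/1 flag per material per image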

  17. Automatic classification and accurate size measurement of blank mask defects

    NASA Astrophysics Data System (ADS)

    Bhamidipati, Samir; Paninjath, Sankaranarayanan; Pereira, Mark; Buck, Peter

    2015-07-01

    Blank mask defect review is complicated by the sheer variety and complexity of the defects encountered. The variety arises due to factors such as defect nature, size, shape and composition, and the optical phenomena occurring around the defect. This paper focuses on preliminary characterization results, in terms of classification and size estimation, obtained by the Calibre MDPAutoClassify tool on a variety of mask blank defects. It primarily highlights the challenges faced in achieving these results with reference to the variety of defects observed on blank mask substrates and the underlying complexities which make accurate defect size measurement an important and challenging task.

  18. Automatic classification of MR scans in Alzheimer’s disease

    PubMed Central

    Klöppel, Stefan; Stonnington, Cynthia M.; Chu, Carlton; Draganski, Bogdan; Scahill, Rachael I.; Rohrer, Jonathan D.; Fox, Nick C.; Jack, Clifford R.; Ashburner, John; Frackowiak, Richard S. J.

    2008-01-01

    To be diagnostically useful, structural MRI must reliably distinguish Alzheimer’s disease (AD) from normal aging in individual scans. Recent advances in statistical learning theory have led to the application of support vector machines to MRI for detection of a variety of disease states. The aims of this study were to assess how successfully support vector machines assigned individual diagnoses and to determine whether data-sets combined from multiple scanners and different centres could be used to obtain effective classification of scans. We used linear support vector machines to classify the grey matter segment of T1-weighted MR scans from pathologically proven AD patients and cognitively normal elderly individuals obtained from two centres with different scanning equipment. Because the clinical diagnosis of mild AD is difficult, we also tested the ability of support vector machines to differentiate control scans from patients without post-mortem confirmation. Finally, we sought to use these methods to differentiate scans of patients with AD from those with frontotemporal lobar degeneration. Up to 96% of pathologically verified AD patients were correctly classified using whole brain images. Data from different centres were successfully combined, achieving results comparable to the separate analyses. Importantly, data from one centre could be used to train a support vector machine to accurately differentiate AD and normal ageing scans obtained from another centre with different subjects and different scanner equipment. Patients with mild, clinically probable AD and age/sex-matched controls were correctly separated in 89% of cases, which is compatible with published diagnosis rates in the best clinical centres. This method correctly assigned 89% of patients with post-mortem confirmed diagnosis of either AD or frontotemporal lobar degeneration to their respective group. Our study leads to three conclusions: Firstly, support vector machines

  19. An Automatic Multidocument Text Summarization Approach Based on Naïve Bayesian Classifier Using Timestamp Strategy

    PubMed Central

    Ramanujam, Nedunchelian; Kaliappan, Manivannan

    2016-01-01

    Nowadays, automatic multidocument text summarization systems can successfully retrieve summary sentences from input documents, but they still suffer from limitations such as inaccurate extraction of essential sentences, low coverage, poor coherence among sentences, and redundancy. This paper introduces a new timestamp approach combined with a Naïve Bayesian classification approach for multidocument text summarization. The timestamp gives the summary an ordered structure, which yields a more coherent summary and helps extract the more relevant information from the multiple documents. A scoring strategy is also used to calculate scores for words in order to obtain word frequencies. Linguistic quality is estimated in terms of readability and comprehensibility. In order to show the efficiency of the proposed method, this paper presents a comparison between the proposed method and the existing MEAD algorithm. The timestamp procedure is also applied to the MEAD algorithm and the results are compared with the proposed method. The results show that the proposed method requires less time than the existing MEAD algorithm to execute the summarization process. Moreover, the proposed method achieves better precision, recall, and F-score than the existing clustering with lexical chaining approach. PMID:27034971
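
    A toy sketch of the timestamp idea: sentences are first selected by relevance score, then ordered chronologically (and by position within a document) rather than by score, so the summary reads coherently. The data and the 0.5 threshold are illustrative:

        from datetime import date

        # (document timestamp, position in document, sentence, relevance score) -- toy data
        candidates = [
            (date(2016, 1, 5), 2, "Company B then acquired Company C.", 0.91),
            (date(2016, 1, 3), 0, "Company A announced a merger.", 0.88),
            (date(2016, 1, 3), 4, "Analysts had expected the deal.", 0.35),
        ]

        # Select the highest-scoring sentences ...
        selected = [c for c in candidates if c[3] >= 0.5]
        # ... then order them by timestamp and position for a coherent summary
        summary = [text for _, _, text, _ in sorted(selected, key=lambda c: (c[0], c[1]))]
        print(" ".join(summary))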

  20. Groupwise conditional random forests for automatic shape classification and contour quality assessment in radiotherapy planning.

    PubMed

    McIntosh, Chris; Svistoun, Igor; Purdie, Thomas G

    2013-06-01

    Radiation therapy is used to treat cancer patients around the world. High quality treatment plans maximally radiate the targets while minimally radiating healthy organs at risk. In order to judge plan quality and safety, segmentations of the targets and organs at risk are created, and the amount of radiation that will be delivered to each structure is estimated prior to treatment. If the targets or organs at risk are mislabelled, or the segmentations are of poor quality, the safety of the radiation doses will be erroneously reviewed and an unsafe plan could proceed. We propose a technique to automatically label groups of segmentations of different structures from a radiation therapy plan for the joint purposes of providing quality assurance and data mining. Given one or more segmentations and an associated image we seek to assign medically meaningful labels to each segmentation and report the confidence of that label. Our method uses random forests to learn joint distributions over the training features, and then exploits a set of learned potential group configurations to build a conditional random field (CRF) that ensures the assignment of labels is consistent across the group of segmentations. The CRF is then solved via a constrained assignment problem. We validate our method on 1574 plans, consisting of 17,579 segmentations, demonstrating an overall classification accuracy of 91.58%. Our results also demonstrate the stability of RF with respect to tree depth and the number of splitting variables in large data sets.

  1. Challenges for automatically extracting molecular interactions from full-text articles

    PubMed Central

    McIntosh, Tara; Curran, James R

    2009-01-01

    Background The increasing availability of full-text biomedical articles will allow more biomedical knowledge to be extracted automatically with greater reliability. However, most Information Retrieval (IR) and Extraction (IE) tools currently process only abstracts. The lack of corpora has limited the development of tools that are capable of exploiting the knowledge in full-text articles. As a result, there has been little investigation into the advantages of full-text document structure, and the challenges developers will face in processing full-text articles. Results We manually annotated passages from full-text articles that describe interactions summarised in a Molecular Interaction Map (MIM). Our corpus tracks the process of identifying facts to form the MIM summaries and captures any factual dependencies that must be resolved to extract the fact completely. For example, a fact in the results section may require a synonym defined in the introduction. The passages are also annotated with negated and coreference expressions that must be resolved. We describe the guidelines for identifying relevant passages and possible dependencies. The corpus includes 2162 sentences from 78 full-text articles. Our corpus analysis demonstrates the necessity of full-text processing; identifies the article sections where interactions are most commonly stated; and quantifies the proportion of interaction statements requiring coherent dependencies. Further, it allows us to report on the relative importance of identifying synonyms and resolving negated expressions. We also experiment with an oracle sentence retrieval system using the corpus as a gold-standard evaluation set. Conclusion We introduce the MIM corpus, a unique resource that maps interaction facts in a MIM to annotated passages within full-text articles. It is an invaluable case study providing guidance to developers of biomedical IR and IE systems, and can be used as a gold-standard evaluation set for full-text IR tasks.

  2. Challenges for automatically extracting molecular interactions from full-text articles.

    PubMed

    McIntosh, Tara; Curran, James R

    2009-09-24

    The increasing availability of full-text biomedical articles will allow more biomedical knowledge to be extracted automatically with greater reliability. However, most Information Retrieval (IR) and Extraction (IE) tools currently process only abstracts. The lack of corpora has limited the development of tools that are capable of exploiting the knowledge in full-text articles. As a result, there has been little investigation into the advantages of full-text document structure, and the challenges developers will face in processing full-text articles. We manually annotated passages from full-text articles that describe interactions summarised in a Molecular Interaction Map (MIM). Our corpus tracks the process of identifying facts to form the MIM summaries and captures any factual dependencies that must be resolved to extract the fact completely. For example, a fact in the results section may require a synonym defined in the introduction. The passages are also annotated with negated and coreference expressions that must be resolved. We describe the guidelines for identifying relevant passages and possible dependencies. The corpus includes 2162 sentences from 78 full-text articles. Our corpus analysis demonstrates the necessity of full-text processing; identifies the article sections where interactions are most commonly stated; and quantifies the proportion of interaction statements requiring coherent dependencies. Further, it allows us to report on the relative importance of identifying synonyms and resolving negated expressions. We also experiment with an oracle sentence retrieval system using the corpus as a gold-standard evaluation set. We introduce the MIM corpus, a unique resource that maps interaction facts in a MIM to annotated passages within full-text articles. It is an invaluable case study providing guidance to developers of biomedical IR and IE systems, and can be used as a gold-standard evaluation set for full-text IR tasks.

  3. Semi-automatic image personalization tool for variable text insertion and replacement

    NASA Astrophysics Data System (ADS)

    Ding, Hengzhou; Bala, Raja; Fan, Zhigang; Eschbach, Reiner; Bouman, Charles A.; Allebach, Jan P.

    2010-02-01

    Image personalization is a widely used technique in personalized marketing,1 in which a vendor attempts to promote new products or retain customers by sending marketing collateral that is tailored to the customers' demographics, needs, and interests. With current solutions of which we are aware, such as XMPie,2 DirectSmile,3 and AlphaPicture,4 image templates need to be created manually by graphic designers in order to produce this tailored marketing collateral, involving complex grid manipulation and detailed geometric adjustments. In fact, image template design is highly manual, skill-demanding, and costly; it is essentially the bottleneck for image personalization. We present a semi-automatic image personalization tool for designing image templates. Two scenarios are considered: text insertion and text replacement, with the text replacement option not offered in current solutions. The graphical user interface (GUI) of the tool is described in detail. Unlike current solutions, the tool renders the text in 3-D, which allows easy adjustment of the text. In particular, the tool has been implemented in Java, which allows flexible deployment and eliminates the need for any special software or know-how on the part of the end user.

  4. AutoFACT: An Automatic Functional Annotation and Classification Tool

    PubMed Central

    Koski, Liisa B; Gray, Michael W; Lang, B Franz; Burger, Gertraud

    2005-01-01

    Background Assignment of function to new molecular sequence data is an essential step in genomics projects. The usual process involves similarity searches of a given sequence against one or more databases, an arduous process for large datasets. Results We present AutoFACT, a fully automated and customizable annotation tool that assigns biologically informative functions to a sequence. Key features of this tool are that it (1) analyzes nucleotide and protein sequence data; (2) determines the most informative functional description by combining multiple BLAST reports from several user-selected databases; (3) assigns putative metabolic pathways, functional classes, enzyme classes, GeneOntology terms and locus names; and (4) generates output in HTML, text and GFF formats for the user's convenience. We have compared AutoFACT to four well-established annotation pipelines. The error rate of functional annotation is estimated to be only between 1% and 2%. Comparison of AutoFACT to the traditional top-BLAST-hit annotation method shows that our procedure increases the number of functionally informative annotations by approximately 50%. Conclusion AutoFACT will serve as a useful annotation tool for smaller sequencing groups lacking dedicated bioinformatics staff. It is implemented in PERL and runs on LINUX/UNIX platforms. AutoFACT is available at . PMID:15960857

  5. Generalizability and Comparison of Automatic Clinical Text De-Identification Methods and Resources

    PubMed Central

    Ferrández, Óscar; South, Brett R.; Shen, Shuying; Friedlin, F. Jeff; Samore, Matthew H.; Meystre, Stéphane M.

    2012-01-01

    In this paper, we present an evaluation of the hybrid best-of-breed automated VHA (Veterans Health Administration) clinical text de-identification system, nicknamed BoB, developed within the VHA Consortium for Healthcare Informatics Research. We also evaluate two available machine learning-based text de-identification systems: MIST and HIDE. Two different clinical corpora were used for this evaluation: a manually annotated VHA corpus, and the 2006 i2b2 de-identification challenge corpus. These experiments focus on the generalizability and portability of the classification models across different document sources. BoB demonstrated good recall (92.6%), satisfactorily prioritizing patient privacy, and also achieved competitive precision (83.6%) for preserving subsequent document interpretability. MIST and HIDE reached very competitive results, in most cases with high precision (92.6% and 93.6%), although recall was sometimes lower than desired for the most sensitive PHI categories. PMID:23304289

  6. Improving the text classification using clustering and a novel HMM to reduce the dimensionality.

    PubMed

    Seara Vieira, A; Borrajo, L; Iglesias, E L

    2016-11-01

    In text classification problems, the representation of a document has a strong impact on the performance of learning systems. The high dimensionality of the classical structured representations can lead to burdensome computations due to the great size of real-world data. Consequently, there is a need to reduce the quantity of handled information to improve the classification process. In this paper, we propose a method to reduce the dimensionality of a classical text representation based on a clustering technique to group documents and a previously developed Hidden Markov Model to represent them. We have applied tests with the k-NN and SVM classifiers on the OHSUMED and TREC benchmark text corpora using the proposed dimensionality reduction technique. The experimental results obtained are very satisfactory compared to commonly used techniques like InfoGain, and the statistical tests performed demonstrate the suitability of the proposed technique for the preprocessing step in a text classification task.
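
    The paper's HMM-based document representation is not reproduced here; as a rough stand-in, the sketch below reduces a high-dimensional bag-of-words representation to per-cluster distances before training a k-NN classifier, which conveys the clustering-for-dimensionality-reduction idea on a toy corpus:

        from sklearn.cluster import KMeans
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.neighbors import KNeighborsClassifier

        docs = ["heart failure treatment", "cardiac surgery outcome",
                "stock market crash", "market trading volume"]
        labels = [0, 0, 1, 1]                      # toy corpus and labels

        vec = TfidfVectorizer()
        X = vec.fit_transform(docs)                # high-dimensional term space
        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
        X_red = km.transform(X)                    # distance to each cluster centroid

        knn = KNeighborsClassifier(n_neighbors=1).fit(X_red, labels)
        print(knn.predict(km.transform(vec.transform(["heart surgery"]))))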

  7. An ensemble system for automatic sleep stage classification using single channel EEG signal.

    PubMed

    Koley, B; Dey, D

    2012-12-01

    The present work aims at automatic identification of various sleep stages (sleep stages 1 and 2, slow wave sleep (sleep stages 3 and 4), REM sleep, and wakefulness) from a single-channel EEG signal. Automatic scoring of sleep stages was performed with the help of pattern recognition techniques involving feature extraction, selection, and finally classification. A total of 39 features from the time domain, the frequency domain, and non-linear analysis were extracted. After feature extraction, an SVM-based recursive feature elimination (RFE) technique was used to find the optimum feature subset that provides significant classification performance with a reduced number of features for the five different sleep stages. Finally, for classification, binary SVMs were combined with a one-against-all (OAA) strategy. Careful extraction and selection of the optimum feature subset helped to reduce the classification error to 8.9% for the training dataset, validated by the k-fold cross-validation (CV) technique, and 10.61% for an independent testing dataset. Agreement of the estimated sleep stages with those obtained by expert scoring was 0.877 for the training dataset and 0.8572 for the independent testing dataset. The proposed ensemble SVM-based method could be used as an efficient and cost-effective method for sleep staging, with the advantage of reducing the stress and burden imposed on subjects.
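
    A compact sketch of the two key steps (SVM-driven recursive feature elimination, then a one-against-all ensemble of binary SVMs), using synthetic features and labels in place of the study's EEG-derived data:

        import numpy as np
        from sklearn.feature_selection import RFE
        from sklearn.model_selection import cross_val_score
        from sklearn.multiclass import OneVsRestClassifier
        from sklearn.svm import SVC, LinearSVC

        rng = np.random.default_rng(2)
        X = rng.normal(size=(300, 39))        # 39 time/frequency/non-linear features (synthetic)
        y = rng.integers(0, 5, size=300)      # 5 sleep stages (synthetic labels)

        # SVM-based recursive feature elimination down to a reduced subset
        selector = RFE(LinearSVC(dual=False), n_features_to_select=15).fit(X, y)
        X_sel = selector.transform(X)

        # One-against-all combination of binary SVMs, scored by k-fold cross-validation
        ova = OneVsRestClassifier(SVC(kernel="rbf"))
        print(cross_val_score(ova, X_sel, y, cv=5).mean())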

  8. Automatic detection and classification of sleep stages by multichannel EEG signal modeling.

    PubMed

    Zhovna, Inna; Shallom, Ilan D

    2008-01-01

    In this paper a novel method for automatic detection and classification of sleep stages using multichannel electroencephalography (EEG) is presented. Understanding the sleep mechanism is vital for the diagnosis and treatment of sleep disorders, and the EEG is one of the most important tools for studying and diagnosing them. Interpretation of EEG waveform activity is usually performed by visual analysis, a very difficult procedure. This research aims to ease the difficulties of the existing manual process of EEG interpretation by proposing an automatic sleep stage detection and classification system. The suggested method is based on a Multichannel Auto-Regressive (MAR) model; the multichannel analysis approach incorporates the cross-correlation information existing between different EEG signals. In the training phase, we used the vector quantization (VQ) algorithm Linde-Buzo-Gray (LBG) and defined the sleep stages by estimating a probability mass function (pmf) per sleep stage using the Generalized Log Likelihood Ratio (GLLR) distortion. The classification phase was performed using Kullback-Leibler (KL) divergence. The results of this research are promising, with a classification accuracy of 93.2%. The results encourage continuation of this research in the sleep field and in other biomedical signal applications.

  9. Acoustic detection and classification of Microchiroptera using machine learning: lessons learned from automatic speech recognition.

    PubMed

    Skowronski, Mark D; Harris, John G

    2006-03-01

    Current automatic acoustic detection and classification of microchiroptera utilize global features of individual calls (i.e., duration, bandwidth, frequency extrema), an approach that stems from expert knowledge of call sonograms. This approach parallels the acoustic phonetic paradigm of human automatic speech recognition (ASR), which relied on expert knowledge to account for variations in canonical linguistic units. ASR research eventually shifted from acoustic phonetics to machine learning, primarily because of the superior ability of machine learning to account for signal variation. To compare machine learning with conventional methods of detection and classification, nearly 3000 search-phase calls were hand labeled from recordings of five species: Pipistrellus bodenheimeri, Molossus molossus, Lasiurus borealis, L. cinereus semotus, and Tadarida brasiliensis. The hand labels were used to train two machine learning models: a Gaussian mixture model (GMM) for detection and classification and a hidden Markov model (HMM) for classification. The GMM detector produced 4% error compared to 32% error for a baseline broadband energy detector, while the GMM and HMM classifiers produced errors of 0.6 +/- 0.2% compared to 16.9 +/- 1.1% error for a baseline discriminant function analysis classifier. The experiments showed that machine learning algorithms produced errors an order of magnitude smaller than those for conventional methods.
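
    A minimal sketch of the GMM-based classification idea: fit one Gaussian mixture per species and assign a call to the model with the higher log-likelihood. The 8-dimensional feature vectors are synthetic stand-ins for acoustic features:

        import numpy as np
        from sklearn.mixture import GaussianMixture

        rng = np.random.default_rng(3)
        calls_a = rng.normal(loc=0.0, size=(200, 8))   # feature vectors, species A (synthetic)
        calls_b = rng.normal(loc=1.5, size=(200, 8))   # feature vectors, species B (synthetic)

        gmm_a = GaussianMixture(n_components=4, random_state=0).fit(calls_a)
        gmm_b = GaussianMixture(n_components=4, random_state=0).fit(calls_b)

        test = rng.normal(loc=1.5, size=(5, 8))        # unlabelled calls
        pred = np.where(gmm_a.score_samples(test) > gmm_b.score_samples(test), "A", "B")
        print(pred)                                    # expected: mostly "B"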

  10. Automatic and adaptive classification of electroencephalographic signals for brain computer interfaces.

    PubMed

    Rodríguez-Bermúdez, Germán; García-Laencina, Pedro J

    2012-11-01

    Extracting knowledge from electroencephalographic (EEG) signals has become an increasingly important research area in biomedical engineering. In addition to its clinical diagnostic purposes, in recent years there have been many efforts to develop brain computer interface (BCI) systems, which allow users to control external devices only by using their brain activity. Once the EEG signals have been acquired, it is necessary to use appropriate feature extraction and classification methods adapted to the user in order to improve the performance of the BCI system and, also, to make its design stage easier. This work introduces a novel fast adaptive BCI system for automatic feature extraction and classification of EEG signals. The proposed system efficiently combines several well-known feature extraction procedures and automatically chooses the most useful features for performing the classification task. Three different feature extraction techniques are applied: power spectral density, Hjorth parameters and autoregressive modelling. The most relevant features for linear discrimination are selected using a fast and robust wrapper methodology. The proposed method is evaluated using EEG signals from nine subjects during motor imagery tasks. Obtained experimental results show its advantages over the state-of-the-art methods, especially in terms of classification accuracy and computational cost.
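
    Of the three feature families mentioned, the Hjorth parameters are simple enough to sketch directly; a standard implementation over a synthetic single-channel signal:

        import numpy as np

        def hjorth(x):
            """Hjorth activity, mobility and complexity of a 1-D signal."""
            dx, ddx = np.diff(x), np.diff(x, n=2)
            var_x, var_dx, var_ddx = np.var(x), np.var(dx), np.var(ddx)
            activity = var_x
            mobility = np.sqrt(var_dx / var_x)
            complexity = np.sqrt(var_ddx / var_dx) / mobility
            return activity, mobility, complexity

        rng = np.random.default_rng(4)
        eeg = rng.normal(size=1000)        # one EEG channel (synthetic)
        print(hjorth(eeg))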

  11. Large-scale automatic extraction of side effects associated with targeted anticancer drugs from full-text oncological articles.

    PubMed

    Xu, Rong; Wang, QuanQiu

    2015-06-01

    Targeted anticancer drugs such as imatinib, trastuzumab and erlotinib dramatically improved treatment outcomes in cancer patients; however, these innovative agents are often associated with unexpected side effects. The pathophysiological mechanisms underlying these side effects are not well understood. The availability of a comprehensive knowledge base of side effects associated with targeted anticancer drugs has the potential to illuminate complex pathways underlying toxicities induced by these innovative drugs. While side effect association knowledge for targeted drugs exists in multiple heterogeneous data sources, published full-text oncological articles represent an important source of pivotal, investigational, and even failed trials in a variety of patient populations. In this study, we present an automatic process to extract targeted anticancer drug-associated side effects (drug-SE pairs) from a large number of high profile full-text oncological articles. We downloaded 13,855 full-text articles from the Journal of Clinical Oncology (JCO) published between 1983 and 2013. We developed text classification, relationship extraction, signal filtering, and signal prioritization algorithms to extract drug-SE pairs from the downloaded articles. We extracted a total of 26,264 drug-SE pairs with an average precision of 0.405, a recall of 0.899, and an F1 score of 0.465. We show that side effect knowledge from JCO articles is largely complementary to that from the US Food and Drug Administration (FDA) drug labels. Through integrative correlation analysis, we show that targeted drug-associated side effects positively correlate with their gene targets and disease indications. In conclusion, this unique database that we built from a large number of high-profile oncological articles could facilitate the development of computational models to understand toxic effects associated with targeted anticancer drugs.

  12. Introduction to Subject Indexing; A Programmed Text. Volume One: Subject Analysis and Practical Classification.

    ERIC Educational Resources Information Center

    Brown, Alan George

    This programed text presents the basic principles and practices of subject indexing--limited to the area of precoordinate indexing. This first of two volumes deals with the subject analysis of documents, primarily at the level of summarization, and the basic elements of translation into classification schemes. The text includes regular self-tests…

  13. An Automatic User-Adapted Physical Activity Classification Method Using Smartphones.

    PubMed

    Li, Pengfei; Wang, Yu; Tian, Yu; Zhou, Tian-Shu; Li, Jing-Song

    2017-03-01

    In recent years, an increasing number of people have become concerned about their health. Most chronic diseases are related to lifestyle, and daily activity records can be used as an important indicator of health. Specifically, using advanced technology to automatically monitor actual activities can effectively prevent and manage chronic diseases. The data used in this paper were obtained from acceleration sensors and gyroscopes integrated in smartphones. We designed an efficient Adaboost-Stump classifier running on a smartphone to classify five common activities (cycling, running, sitting, standing, and walking) and achieved a satisfactory classification accuracy of 98%. We also designed an online learning method in which the classification model is continuously trained with actual data; the parameters of the model then become increasingly fitted to the specific user, which allows the classification accuracy to reach 95% under different use environments. In addition, this paper utilized the OpenCL framework to parallelize the program, which enhances computing efficiency approximately ninefold.
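
    "Adaboost-Stump" is boosting over depth-1 decision trees; a minimal sketch with synthetic sensor features, where scikit-learn's AdaBoostClassifier stands in for the authors' smartphone implementation:

        import numpy as np
        from sklearn.ensemble import AdaBoostClassifier
        from sklearn.tree import DecisionTreeClassifier

        rng = np.random.default_rng(5)
        X = rng.normal(size=(500, 12))     # accelerometer/gyroscope features (synthetic)
        y = rng.integers(0, 5, size=500)   # cycling, running, sitting, standing, walking

        # Boosted decision stumps (max_depth=1); the keyword is base_estimator
        # in scikit-learn versions before 1.2
        clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                                 n_estimators=100, random_state=0).fit(X, y)
        print(clf.predict(X[:5]))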

  14. Automatic Classification of Protein Structure Using the Maximum Contact Map Overlap Metric

    SciTech Connect

    Andonov, Rumen; Djidjev, Hristo Nikolov; Klau, Gunnar W.; Le Boudic-Jamin, Mathilde; Wohlers, Inken

    2015-10-09

    In this paper, we propose a new distance measure for comparing two protein structures based on their contact map representations. We show that our novel measure, which we refer to as the maximum contact map overlap (max-CMO) metric, satisfies all properties of a metric on the space of protein representations. Having a metric in that space allows one to avoid pairwise comparisons on the entire database and, thus, to significantly accelerate exploring the protein space compared to non-metric spaces. We show on a gold standard superfamily classification benchmark set of 6759 proteins that our exact k-nearest neighbor (k-NN) scheme classifies up to 224 out of 236 queries correctly, and on a larger, extended version of the benchmark with 60,850 additional structures, up to 1361 out of 1369 queries. Our k-NN classification thus provides a promising approach for the automatic classification of protein structures based on flexible contact map overlap alignments.
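
    Once pairwise distances under such a metric are available, classification reduces to k-NN over a precomputed distance matrix; a schematic sketch with a hypothetical distance matrix in place of real max-CMO values:

        import numpy as np
        from sklearn.neighbors import KNeighborsClassifier

        rng = np.random.default_rng(6)
        D = rng.random((6, 6))             # hypothetical pairwise distances, 6 structures
        D = (D + D.T) / 2                  # symmetrize
        np.fill_diagonal(D, 0.0)           # zero self-distance
        labels = [0, 0, 0, 1, 1, 1]        # superfamily labels (toy)

        knn = KNeighborsClassifier(n_neighbors=3, metric="precomputed").fit(D, labels)
        print(knn.predict(D[:2]))          # classify queries from their distance rows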

  15. Deep Learning-Based Large-Scale Automatic Satellite Crosswalk Classification

    NASA Astrophysics Data System (ADS)

    Berriel, Rodrigo F.; Lopes, Andre Teixeira; de Souza, Alberto F.; Oliveira-Santos, Thiago

    2017-09-01

    High-resolution satellite imagery has been increasingly used in remote sensing classification problems, one of the main factors being the availability of this kind of data. Even so, very little effort has been devoted to the zebra crossing classification problem. In this letter, crowdsourcing systems are exploited in order to enable the automatic acquisition and annotation of a large-scale satellite imagery database for crosswalk-related tasks. This dataset is then used to train deep-learning-based models to accurately classify satellite images that do or do not contain zebra crossings. A novel dataset with more than 240,000 images from 3 continents, 9 countries, and more than 20 cities was used in the experiments. Experimental results showed that freely available crowdsourcing data can be used to train robust models that perform crosswalk classification on a global scale with high accuracy (97.11%).

  16. Automatic pathology classification using a single feature machine learning method - support vector machines

    NASA Astrophysics Data System (ADS)

    Yepes-Calderon, Fernando; Pedregosa, Fabian; Thirion, Bertrand; Wang, Yalin; Lepore, Natasha

    2014-03-01

    Magnetic Resonance Imaging (MRI) has been gaining popularity in the clinic in recent years as a safe in-vivo imaging technique. As a result, large troves of data are being gathered and stored daily that may be used as clinical training sets in hospitals. While numerous machine learning (ML) algorithms have been implemented for Alzheimer's disease classification, their outputs are usually difficult to interpret in the clinical setting. Here, we propose a simple method of rapid diagnostic classification for the clinic using Support Vector Machines (SVM)1 and easy to obtain geometrical measurements that, together with a cortical and sub-cortical brain parcellation, create a robust framework capable of automatic diagnosis with high accuracy. On a significantly large imaging dataset consisting of over 800 subjects taken from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, classification-success indexes of up to 99.2% are reached with a single measurement.

  17. Performance Analysis of Distributed Applications using Automatic Classification of Communication Inefficiencies

    SciTech Connect

    Vetter, J.

    1999-11-01

    We present a technique for performance analysis that helps users understand the communication behavior of their message passing applications. Our method automatically classifies individual communication operations and it reveals the cause of communication inefficiencies in the application. This classification allows the developer to focus quickly on the culprits of truly inefficient behavior, rather than manually foraging through massive amounts of performance data. Specifically, we trace the message operations of MPI applications and then classify each individual communication event using decision tree classification, a supervised learning technique. We train our decision tree using microbenchmarks that demonstrate both efficient and inefficient communication. Since our technique adapts to the target system's configuration through these microbenchmarks, we can simultaneously automate the performance analysis process and improve classification accuracy. Our experiments on four applications demonstrate that our technique can improve the accuracy of performance analysis, and dramatically reduce the amount of data that users must encounter.

  18. Automatic classification of laser-induced breakdown spectroscopy (LIBS) data of protein biomarker solutions.

    PubMed

    Pokrajac, David; Lazarevic, Aleksandar; Kecman, Vojislav; Marcano, Aristides; Markushin, Yuri; Vance, Tia; Reljin, Natasa; McDaniel, Samantha; Melikechi, Noureddine

    2014-01-01

    We perform multi-class classification of laser-induced breakdown spectroscopy data of four commercial samples of proteins diluted in phosphate-buffered saline solution at different concentrations: bovine serum albumin, osteopontin, leptin, and insulin-like growth factor II. We achieve this by using principal component analysis as a method for dimensionality reduction. In addition, we apply several different classification algorithms (K-nearest neighbor, classification and regression trees, neural networks, support vector machines, adaptive local hyperplane, and linear discriminant classifiers) to perform the multi-class classification. We achieve classification accuracies above 98% by using the linear classifier with 21-31 principal components. We obtain the best detection performance for neural networks, support vector machines, and adaptive local hyperplanes over a range of numbers of principal components, with no significant differences in performance except for the linear classifier. With the optimal number of principal components, even a simple K-nearest-neighbor classifier still provided acceptable results. Our proposed approach demonstrates that highly accurate automatic classification of complex protein samples from laser-induced breakdown spectroscopy data can be achieved using principal component analysis with a sufficiently large number of extracted features, followed by a wrapper technique to determine the optimal number of principal components.
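
    A minimal sketch of the PCA-then-classify pipeline on synthetic spectra (the linear classifier here is LDA; the component count and data shapes are illustrative, not the paper's):

        import numpy as np
        from sklearn.decomposition import PCA
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
        from sklearn.model_selection import cross_val_score
        from sklearn.pipeline import make_pipeline

        rng = np.random.default_rng(8)
        X = rng.normal(size=(160, 2048))   # LIBS spectra: 160 shots x 2048 channels (synthetic)
        y = rng.integers(0, 4, size=160)   # four protein classes

        # PCA for dimensionality reduction, then a linear discriminant classifier
        clf = make_pipeline(PCA(n_components=25), LinearDiscriminantAnalysis())
        print(cross_val_score(clf, X, y, cv=5).mean())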

  19. Methodology for the Evaluation of the Algorithms for Text Line Segmentation Based on Extended Binary Classification

    NASA Astrophysics Data System (ADS)

    Brodic, D.

    2011-01-01

    Text line segmentation is a key element of the optical character recognition process; hence, testing of text line segmentation algorithms has substantial relevance. Previously proposed testing methods deal mainly with a text database used as a template, both for testing and for the evaluation of the text segmentation algorithm. In this manuscript, a methodology for the evaluation of text segmentation algorithms based on extended binary classification is proposed. It is built on various multiline text samples linked with text segmentation, whose results are distributed according to a binary classification. The final result is obtained by comparative analysis of the cross-linked data. Its suitability for different types of scripts is its main advantage.

  20. Quantitative analysis of lunar crater's landscape: automatic detection, classification and geological applications

    NASA Astrophysics Data System (ADS)

    Li, Ke; Chen, Jianping; He, Shujun; Zhang, Mingchao

    2013-04-01

    Lunar craters are the most important geological tectonic features on the moon; they are among the most studied subjects in the analysis of the lunar surface, since they provide the relative age of the surface unit and further information about lunar geology. Quantitative analysis of lunar crater landscapes is an important approach to lunar geological unit dating, which plays a key role in understanding and reconstructing lunar geological evolution. In this paper, a new approach to automatic crater detection and classification is proposed, based on quantitative analysis of crater landscapes using digital terrain models (DTMs) of different spatial resolutions. The approach includes the following key points: 1) a new crater detection method that uses profile similarity parameters as the distinguishing marks, overcoming the high error rate of earlier DTM-based crater detection algorithms; 2) craters are sorted by the morphological characteristics of their profiles, a quantitative classification method that overcomes the subjectivity of earlier descriptive classification methods. In order to verify the usefulness of the proposed method, the pre-selected landing area of China's Chang'e-III lunar satellite, Sinus Iridum, is chosen as the experimental zone. DTMs with different resolutions from the Chang'e-I Laser Altimeter, the Chang'e-I Stereoscopic Camera and the Lunar Orbiter Laser Altimeter (LOLA) are used for crater detection and classification. Dating results for each geological unit are obtained using the crater size-frequency distribution (CSFD) method. Comparing with earlier dating and manual classification data, we found that the results obtained by our method are strongly consistent with the former results. By combining automatic crater detection and classification, this paper provides a quantitative approach which can analyze the lunar crater's landscape and get geological

  1. Motif-Based Text Mining of Microbial Metagenome Redundancy Profiling Data for Disease Classification.

    PubMed

    Wang, Yin; Li, Rudong; Zhou, Yuhua; Ling, Zongxin; Guo, Xiaokui; Xie, Lu; Liu, Lei

    2016-01-01

    Text data of 16S rRNA are informative for the classification of microbiota-associated diseases. However, the raw text data need to be systematically processed so that features for classification can be defined and extracted; moreover, the high-dimensional feature spaces generated by the text data pose an additional difficulty. Here we present a Phylogenetic Tree-Based Motif Finding algorithm (PMF) to analyze 16S rRNA text data. By integrating phylogenetic rules and other statistical indexes for classification, we can effectively reduce the dimension of the large feature spaces generated by the text datasets. Using the retrieved motifs in combination with common classification methods, we can discriminate different samples of both pneumonia and dental caries better than other existing methods. We extend the phylogenetic approaches to perform supervised learning on microbiota text data to discriminate the pathological states of pneumonia and dental caries. The results show that PMF may enhance the efficiency and reliability of analyzing high-dimensional text data.

  2. Cue-based assertion classification for Swedish clinical text--developing a lexicon for pyConTextSwe.

    PubMed

    Velupillai, Sumithra; Skeppstedt, Maria; Kvist, Maria; Mowery, Danielle; Chapman, Brian E; Dalianis, Hercules; Chapman, Wendy W

    2014-07-01

    The ability of a cue-based system to accurately assert whether a disorder is affirmed, negated, or uncertain is dependent, in part, on its cue lexicon. In this paper, we continue our study of porting an assertion system (pyConTextNLP) from English to Swedish (pyConTextSwe) by creating an optimized assertion lexicon for clinical Swedish. We integrated cues from four external lexicons, along with generated inflections and combinations. We used subsets of a clinical corpus in Swedish. We applied four assertion classes (definite existence, probable existence, probable negated existence and definite negated existence) and two binary classes (existence yes/no and uncertainty yes/no) to pyConTextSwe. We compared pyConTextSwe's performance with and without the added cues on a development set, and improved the lexicon further after an error analysis. On a separate evaluation set, we calculated the system's final performance. Following integration steps, we added 454 cues to pyConTextSwe. The optimized lexicon developed after an error analysis resulted in statistically significant improvements on the development set (83% F-score, overall). The system's final F-scores on an evaluation set were 81% (overall). For the individual assertion classes, F-score results were 88% (definite existence), 81% (probable existence), 55% (probable negated existence), and 63% (definite negated existence). For the binary classifications existence yes/no and uncertainty yes/no, final system performance was 97%/87% and 78%/86% F-score, respectively. We have successfully ported pyConTextNLP to Swedish (pyConTextSwe). We have created an extensive and useful assertion lexicon for Swedish clinical text, which could form a valuable resource for similar studies, and which is publicly available. Copyright © 2014 The Authors. Published by Elsevier B.V. All rights reserved.
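
    A toy illustration of cue-based assertion classification (an illustrative mini-lexicon and a deliberately simplified matching scheme, not the pyConTextSwe lexicon or algorithm):

        # Illustrative cue lexicon mapping cues to assertion classes
        cues = {
            "definite negated existence": ["no evidence of", "denies"],
            "probable existence": ["suspected", "probable"],
        }

        def assert_status(sentence, disorder):
            s = sentence.lower()
            if disorder not in s:
                return None
            for label, words in cues.items():
                if any(cue in s for cue in words):
                    return label
            return "definite existence"          # default when no cue fires

        print(assert_status("No evidence of pneumonia on chest X-ray.", "pneumonia"))
        print(assert_status("Suspected pneumonia, started antibiotics.", "pneumonia"))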

  3. Towards Automatic Classification of Exoplanet-Transit-Like Signals: A Case Study on Kepler Mission Data

    NASA Astrophysics Data System (ADS)

    Valizadegan, Hamed; Martin, Rodney; McCauliff, Sean D.; Jenkins, Jon Michael; Catanzarite, Joseph; Oza, Nikunj C.

    2015-08-01

    Building new catalogues of planetary candidates, astrophysical false alarms, and non-transiting phenomena is a challenging task that currently requires a reviewing team of astrophysicists and astronomers. These scientists need to examine more than 100 diagnostic metrics and associated graphics for each candidate exoplanet-transit-like signal in order to classify it into one of the three classes. Considering that the NASA Explorer Program's TESS mission and ESA's PLATO mission will survey even larger areas of space, the classification of their transit-like signals will be even more time-consuming for human agents and a bottleneck to constructing the new catalogues in a timely manner. This encourages building automatic classification tools that can quickly and reliably classify the new signal data from these missions. The standard tool for building automatic classification systems is supervised machine learning, which requires a large set of highly accurate labeled examples in order to build an effective classifier. This requirement cannot be easily met for classifying transit-like signals because not only are existing labeled signals very limited, but the current labels may not be reliable (because the labeling process is a subjective task). Our experiments with using different supervised classifiers to categorize transit-like signals verify that the labeled signals are not rich enough to give the classifier enough power to generalize well beyond the observed cases (e.g. to unseen or test signals). That motivated us to utilize a new category of learning techniques, so-called semi-supervised learning, which combines the label information from the costly labeled signals with distribution information from the cheaply available unlabeled signals in order to construct more effective classifiers. Our study on the Kepler Mission data shows that semi-supervised learning can significantly improve the result of multiple base classifiers (e.g. Support Vector Machines, Ada
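
    A minimal sketch of the semi-supervised idea using self-training (one common semi-supervised scheme; the study's own method may differ), where unlabeled examples are marked with -1 and the model gradually labels them itself:

        import numpy as np
        from sklearn.semi_supervised import SelfTrainingClassifier
        from sklearn.svm import SVC

        rng = np.random.default_rng(7)
        X = rng.normal(size=(400, 10))      # diagnostic metrics per signal (synthetic)
        y_true = (X[:, 0] > 0).astype(int)  # hidden ground-truth classes
        y = y_true.copy()
        y[50:] = -1                         # only 50 signals carry labels; -1 = unlabeled

        model = SelfTrainingClassifier(SVC(probability=True, random_state=0)).fit(X, y)
        print((model.predict(X) == y_true).mean())   # accuracy against hidden labels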

  4. Comparison of chi-squared automatic interaction detection classification trees vs TNM classification for patients with head and neck squamous cell carcinoma.

    PubMed

    Avilés-Jurado, F Xavier; Terra, Ximena; Figuerola, Enric; Quer, Miquel; León, Xavier

    2012-03-01

    To compare chi-squared automatic interaction detection (CHAID) classification trees vs the seventh edition of the TNM classification for patients with head and neck squamous cell carcinoma and to assess whether CHAID classification trees might improve results obtained with the TNM classification. Patient disease was classified according to CHAID classification trees and the TNM classification, and the results were compared. Academic research. A total of 3373 patients with carcinoma of the oral cavity, oropharynx, hypopharynx, or larynx. The 2 classification methods were evaluated objectively, measuring intrastage homogeneity (hazard consistency), interstage heterogeneity (hazard discrimination), and disease stage distribution among patients (balance). In addition, to assess agreement between CHAID classification trees and the TNM classification, we calculated the κ statistic, weighted linearly and quadratically. Objective evaluation of the quality of the classification methods indicated that CHAID classification trees performed better than the TNM classification in terms of hazard consistency (2.51 for CHAID and 3.01 for TNM) and hazard discrimination (70.9% for CHAID and 52.7% for TNM) but not balance (-31.7% for CHAID and -15.5% for TNM). Analysis of concordance between the classification methods showed that the quadratic κ statistic was 0.77 (95% CI, 0.76-0.78) and the linear κ statistic was 0.59 (95% CI, 0.57-0.60) (P < .001 for both). CHAID classification trees performed better than the TNM classification and offer potential inclusion of new prognostic factors.
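
    The agreement figures can be reproduced with a weighted kappa computation; a schematic example with hypothetical stage assignments (stages coded 0-3) from the two systems:

        from sklearn.metrics import cohen_kappa_score

        tnm   = [0, 1, 1, 2, 3, 2, 0, 3]   # TNM stage per patient (hypothetical)
        chaid = [0, 1, 2, 2, 3, 1, 0, 3]   # CHAID-tree stage per patient (hypothetical)

        print(cohen_kappa_score(tnm, chaid, weights="linear"))
        print(cohen_kappa_score(tnm, chaid, weights="quadratic"))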

  5. Automatic extraction of property norm-like data from large text corpora.

    PubMed

    Kelly, Colin; Devereux, Barry; Korhonen, Anna

    2014-01-01

    Traditional methods for deriving property-based representations of concepts from text have focused on either extracting only a subset of possible relation types, such as hyponymy/hypernymy (e.g., car is-a vehicle) or meronymy/metonymy (e.g., car has wheels), or unspecified relations (e.g., car--petrol). We propose a system for the challenging task of automatic, large-scale acquisition of unconstrained, human-like property norms from large text corpora, and discuss the theoretical implications of such a system. We employ syntactic, semantic, and encyclopedic information to guide our extraction, yielding concept-relation-feature triples (e.g., car be fast, car require petrol, car cause pollution), which approximate property-based conceptual representations. Our novel method extracts candidate triples from parsed corpora (Wikipedia and the British National Corpus) using syntactically and grammatically motivated rules, then reweights triples with a linear combination of their frequency and four statistical metrics. We assess our system output in three ways: lexical comparison with norms derived from human-generated property norm data, direct evaluation by four human judges, and a semantic distance comparison with both WordNet similarity data and human-judged concept similarity ratings. Our system offers a viable and performant method of plausible triple extraction: Our lexical comparison shows comparable performance to the current state-of-the-art, while subsequent evaluations exhibit the human-like character of our generated properties.

  6. Use of LANDSAT data for automatic classification and area estimation of sugarcane plantation in Sao Paulo state, Brazil

    NASA Technical Reports Server (NTRS)

    Dejesusparada, N. (Principal Investigator); Mendonca, F. J.

    1980-01-01

    Ten segments of size 20 x 10 km were aerially photographed and used as training areas for automatic classification. The study area was covered by four LANDSAT paths: 235, 236, 237, and 238. The percentages of overall correct classification for these paths range from 79.56 percent for path 238 to 95.59 percent for path 237.

  7. Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning.

    PubMed

    Stowell, Dan; Plumbley, Mark D

    2014-01-01

    Automatic species classification of birds from their sound is a computational tool of increasing importance in ecology, conservation monitoring and vocal communication studies. To make classification useful in practice, it is crucial to improve its accuracy while ensuring that it can run at big data scales. Many approaches use acoustic measures based on spectrogram-type data, such as the Mel-frequency cepstral coefficient (MFCC) features which represent a manually-designed summary of spectral information. However, recent work in machine learning has demonstrated that features learnt automatically from data can often outperform manually-designed feature transforms. Feature learning can be performed at large scale and "unsupervised", meaning it requires no manual data labelling, yet it can improve performance on "supervised" tasks such as classification. In this work we introduce a technique for feature learning from large volumes of bird sound recordings, inspired by techniques that have proven useful in other domains. We experimentally compare twelve different feature representations derived from the Mel spectrum (of which six use this technique), using four large and diverse databases of bird vocalisations, classified using a random forest classifier. We demonstrate that in our classification tasks, MFCCs can often lead to worse performance than the raw Mel spectral data from which they are derived. Conversely, we demonstrate that unsupervised feature learning provides a substantial boost over MFCCs and Mel spectra without adding computational complexity after the model has been trained. The boost is particularly notable for single-label classification tasks at large scale. The spectro-temporal activations learned through our procedure resemble spectro-temporal receptive fields calculated from avian primary auditory forebrain. However, for one of our datasets, which contains substantial audio data but few annotations, increased performance is not discernible.

  8. Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning

    PubMed Central

    Plumbley, Mark D.

    2014-01-01

    Automatic species classification of birds from their sound is a computational tool of increasing importance in ecology, conservation monitoring and vocal communication studies. To make classification useful in practice, it is crucial to improve its accuracy while ensuring that it can run at big data scales. Many approaches use acoustic measures based on spectrogram-type data, such as the Mel-frequency cepstral coefficient (MFCC) features which represent a manually-designed summary of spectral information. However, recent work in machine learning has demonstrated that features learnt automatically from data can often outperform manually-designed feature transforms. Feature learning can be performed at large scale and “unsupervised”, meaning it requires no manual data labelling, yet it can improve performance on “supervised” tasks such as classification. In this work we introduce a technique for feature learning from large volumes of bird sound recordings, inspired by techniques that have proven useful in other domains. We experimentally compare twelve different feature representations derived from the Mel spectrum (of which six use this technique), using four large and diverse databases of bird vocalisations, classified using a random forest classifier. We demonstrate that in our classification tasks, MFCCs can often lead to worse performance than the raw Mel spectral data from which they are derived. Conversely, we demonstrate that unsupervised feature learning provides a substantial boost over MFCCs and Mel spectra without adding computational complexity after the model has been trained. The boost is particularly notable for single-label classification tasks at large scale. The spectro-temporal activations learned through our procedure resemble spectro-temporal receptive fields calculated from avian primary auditory forebrain. However, for one of our datasets, which contains substantial audio data but few annotations, increased performance is not

  9. [Automatic classification method of star spectra data based on manifold fuzzy twin support vector machine].

    PubMed

    Liu, Zhong-bao; Gao, Yan-yun; Wang, Jian-zhen

    2015-01-01

    Support vector machines (SVM), with good learning ability and generalization, are widely used in star spectra data classification. However, when the scale of the data becomes larger, the shortcomings of SVM appear: the amount of computation is quite large and the classification speed is too slow. In order to solve these problems, the twin support vector machine (TWSVM) was proposed by Jayadeva. The advantage of TWSVM is that its time cost is reduced to 1/4 of that of SVM. However, all the methods mentioned above focus only on the global characteristics and neglect the local characteristics. In view of this, an automatic classification method for star spectra data based on a manifold fuzzy twin support vector machine (MF-TSVM) is proposed in this paper. In MF-TSVM, manifold-based discriminant analysis (MDA) is used to obtain the global and local characteristics of the input data, and fuzzy membership is introduced to reduce the influence of noise and singular data on the classification results. Comparative experiments with current classification methods, such as C-SVM and KNN, on the SDSS star spectra datasets verify the effectiveness of the proposed method.

  10. Automatic Crack Detection and Classification Method for Subway Tunnel Safety Monitoring

    PubMed Central

    Zhang, Wenyu; Zhang, Zhenjiang; Qi, Dapeng; Liu, Yun

    2014-01-01

    Cracks are an important indicator reflecting the safety status of infrastructures. This paper presents an automatic crack detection and classification methodology for subway tunnel safety monitoring. With the application of high-speed complementary metal-oxide-semiconductor (CMOS) industrial cameras, the tunnel surface can be captured and stored in digital images. In the next step, the local dark regions with potential crack defects are segmented from the original gray-scale images by utilizing morphological image processing techniques and thresholding operations. In the feature extraction process, we present a distance histogram based shape descriptor that effectively describes the spatial shape difference between cracks and other irrelevant objects. Along with other features, the classification results successfully remove over 90% of misidentified objects. Also, compared with the original gray-scale images, over 90% of the crack length is preserved in the final output binary images. The proposed approach was tested on the safety monitoring for Beijing Subway Line 1. The experimental results revealed the rules of parameter settings and also proved that the proposed approach is effective and efficient for automatic crack detection and classification. PMID:25325337
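
    A rough sketch of the segmentation front end (thresholding dark regions, then morphological clean-up and connected-component labelling); the image here is synthetic and the 3-sigma threshold is an illustrative choice, not the paper's:

        import numpy as np
        from scipy import ndimage

        rng = np.random.default_rng(9)
        img = rng.normal(loc=200, scale=5, size=(64, 64))   # bright tunnel surface (synthetic)
        img[30:33, 5:60] -= 80                              # a dark crack-like streak

        dark = img < img.mean() - 3 * img.std()             # threshold local dark regions
        clean = ndimage.binary_opening(dark, np.ones((2, 2)))  # remove speckle noise
        labels, n = ndimage.label(clean)                    # connected candidate regions
        sizes = ndimage.sum(clean, labels, range(1, n + 1))
        print(n, sizes)   # candidates to be filtered by shape descriptors (e.g. elongation)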

  11. Automatic Classification of Volcanic Earthquakes Using Multi-Station Waveforms and Dynamic Neural Networks

    NASA Astrophysics Data System (ADS)

    Bruton, C. P.; West, M. E.

    2013-12-01

    Earthquakes and seismicity have long been used to monitor volcanoes. In addition to the time, location, and magnitude of an earthquake, the characteristics of the waveform itself are important. For example, low-frequency or hybrid type events could be generated by magma rising toward the surface. A rockfall event could indicate a growing lava dome. Classification of earthquake waveforms is thus a useful tool in volcano monitoring. A procedure to perform such classification automatically could flag certain event types immediately, instead of waiting for a human analyst's review. Inspired by speech recognition techniques, we have developed a procedure to classify earthquake waveforms using artificial neural networks. A neural network can be "trained" with an existing set of input and desired output data; in this case, we use a set of earthquake waveforms (input) that has been classified by a human analyst (desired output). After training the neural network, new waveforms can be classified automatically as they are presented. Our procedure uses waveforms from multiple stations, making it robust to seismic network changes and outages. The use of a dynamic time-delay neural network allows waveforms to be presented without precise alignment in time, and thus could be applied to continuous data or to seismic events without clear start and end times. We have evaluated several different training algorithms and neural network structures to determine their effects on classification performance. We apply this procedure to earthquakes recorded at Mount Spurr and Katmai in Alaska, and Uturuncu Volcano in Bolivia.
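
    A minimal sketch of a time-delay-style network is given below, assuming PyTorch; a 1-D convolution plays the role of the time-delay stage, so waveforms need not be precisely aligned. The station count, class count, and layer sizes are illustrative guesses, not the authors' architecture.

```python
import torch
import torch.nn as nn

class TDNN(nn.Module):
    def __init__(self, n_stations=3, n_classes=4):
        super().__init__()
        # The convolution scans the waveform with shared weights, giving a
        # degree of invariance to where the event sits inside the window.
        self.features = nn.Sequential(
            nn.Conv1d(n_stations, 16, kernel_size=9, stride=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=9, stride=2), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),      # pool over time: length-invariant
        )
        self.classify = nn.Linear(32, n_classes)

    def forward(self, x):                 # x: (batch, stations, samples)
        return self.classify(self.features(x).squeeze(-1))

model = TDNN()
waveforms = torch.randn(8, 3, 2048)       # 8 events, 3 stations each
print(model(waveforms).shape)             # (8, 4) class scores
```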

  12. Automatically Detecting Medications and the Reason for their Prescription in Clinical Narrative Text Documents

    PubMed Central

    Meystre, Stéphane M.; Thibault, Julien; Shen, Shuying; Hurdle, John F.; South, Brett R.

    2011-01-01

    A substantial proportion of the information about the medications a patient is taking is mentioned only in narrative text in the electronic health record. Automated information extraction can make this information accessible for decision support, research, or any other automated processing. In the context of the "i2b2 medication extraction challenge," we have developed a new NLP application called Textractor to automatically extract medications and details about them (e.g., dosage, frequency, reason for their prescription). This application and its evaluation with part of the reference standard for this "challenge" are presented here, along with an analysis of the development of this reference standard. During this evaluation, Textractor reached a system-level overall F1-measure, the reference metric for this challenge, of about 77% for exact matches. The best performance was measured with medication routes (F1-measure 86.4%), and the worst with prescription reasons (F1-measure 29%). These results are consistent with the agreement observed between human annotators when developing the reference standard, and with other published research. PMID:20841823

  13. Automatic brain caudate nuclei segmentation and classification in diagnostic of Attention-Deficit/Hyperactivity Disorder.

    PubMed

    Igual, Laura; Soliva, Joan Carles; Escalera, Sergio; Gimeno, Roger; Vilarroya, Oscar; Radeva, Petia

    2012-12-01

    We present a fully automatic diagnostic imaging test for Attention-Deficit/Hyperactivity Disorder diagnosis assistance, based on previously reported evidence of caudate nucleus volumetric abnormalities. The proposed method consists of several steps: a new automatic method for external and internal segmentation of the caudate based on machine learning methodologies, and the definition of a set of new volume-relation features, 3D Dissociated Dipoles, used for caudate representation and classification. We separately validate these contributions using real data from a pediatric population, demonstrating precise internal caudate segmentation and the discriminative power of the diagnostic test, with significant performance improvements in comparison to other state-of-the-art methods. Copyright © 2012 Elsevier Ltd. All rights reserved.

  14. A semi-automatic traffic sign detection, classification, and positioning system

    NASA Astrophysics Data System (ADS)

    Creusen, I. M.; Hazelhoff, L.; de With, P. H. N.

    2012-01-01

    The availability of large-scale databases containing street-level panoramic images offers the possibility to perform semi-automatic surveying of real-world objects such as traffic signs. These inventories can be performed significantly more efficiently than using conventional methods. Governmental agencies are interested in these inventories for maintenance and safety reasons. This paper introduces a complete semi-automatic traffic sign inventory system. The system consists of several components. First, a detection algorithm locates the 2D position of the traffic signs in the panoramic images. Second, a classification algorithm is used to identify the traffic sign. Third, the 3D position of the traffic sign is calculated using the GPS position of the photographs. Finally, the results are listed in a table for quick inspection and are also visualized in a web browser.

  15. Evaluation of formant-like features on an automatic vowel classification task

    NASA Astrophysics Data System (ADS)

    de Wet, Febe; Weber, Katrin; Boves, Louis; Cranen, Bert; Bengio, Samy; Bourlard, Hervé

    2004-09-01

    Numerous attempts have been made to find low-dimensional, formant-related representations of speech signals that are suitable for automatic speech recognition. However, it is often not known how these features behave in comparison with true formants. The purpose of this study was to compare two sets of automatically extracted formant-like features, i.e., robust formants and HMM2 features, to hand-labeled formants. The robust formant features were derived by means of the split Levinson algorithm, while the HMM2 features correspond to the frequency segmentation of speech signals obtained by two-dimensional hidden Markov models. Mel-frequency cepstral coefficients (MFCCs) were also included in the investigation as an example of state-of-the-art automatic speech recognition features. The feature sets were compared in terms of their performance on a vowel classification task. The speech data and hand-labeled formants that were used in this study are a subset of the American English vowels database presented in Hillenbrand et al. [J. Acoust. Soc. Am. 97, 3099-3111 (1995)]. Classification performance was measured on the original, clean data and in noisy acoustic conditions. When using clean data, the classification performance of the formant-like features compared very well to the performance of the hand-labeled formants in a gender-dependent experiment, but was inferior to the hand-labeled formants in a gender-independent experiment. The results obtained in noisy acoustic conditions indicated that the formant-like features used in this study are not inherently noise robust. For clean and noisy data, as well as for the gender-dependent and gender-independent experiments, the MFCCs achieved results equal or superior to those of the formant features, but at the price of a much higher feature dimensionality.

  16. In-Vivo Automatic Nuclear Cataract Detection and Classification in an Animal Model by Ultrasounds.

    PubMed

    Caixinha, Miguel; Amaro, Joao; Santos, Mario; Perdigao, Fernando; Gomes, Marco; Santos, Jaime

    2016-11-01

    The aim was to detect nuclear cataract in vivo at an early stage and to automatically classify its severity degree, based on the ultrasound technique and machine learning. A 20-MHz ophthalmic ultrasound probe with a focal length of 8.9 mm and an active diameter of 3 mm was used. Twenty-seven features in the time and frequency domains were extracted for cataract detection and classification with support vector machine (SVM), Bayes, multilayer perceptron, and random forest classifiers. Fifty rats were used: 14 as the control group and 36 as the study group. An animal model for nuclear cataract was developed. Twelve rats with incipient, 13 with moderate, and 11 with severe cataract were obtained. The hardness of the nucleus and the cortex regions was objectively measured in 12 rats using the NanoTest. Velocity, attenuation, and frequency downshift significantly increased with cataract formation. The SVM classifier showed the highest performance for the automatic classification of cataract severity, with a precision, sensitivity, and specificity of 99.7% (relative absolute error of 0.4%). A statistically significant difference was found for the hardness of the different cataract degrees (P = 0.016). The nucleus showed a higher hardness increase with cataract formation (P = 0.049). A moderate-to-good correlation between the features and the nucleus hardness was found in 23 out of the 27 features. The developed methodology made it possible to detect nuclear cataract in vivo at early stages, to automatically classify its severity degree, and to estimate its hardness. Based on this work, a medical prototype will be developed for early cataract detection, classification, and hardness estimation.

  17. Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families.

    PubMed

    Andrade, M A; Valencia, A

    1998-01-01

    Annotation of the biological function of different protein sequences is a time-consuming process currently performed by human experts. Genome analysis tools encounter great difficulty in performing this task. Database curators, developers of genome analysis tools and biologists in general could benefit from access to tools able to suggest functional annotations and facilitate access to functional information. We present here the first prototype of a system for the automatic annotation of protein function. The system is triggered by collections of abstracts related to a given protein, and it is able to extract biological information directly from scientific literature, i.e. MEDLINE abstracts. Relevant keywords are selected by their relative accumulation in comparison with a domain-specific background distribution. Simultaneously, the most representative sentences and MEDLINE abstracts are selected and presented to the end-user. Evolutionary information is considered as a predominant characteristic in the domain of protein function. Our system consequently extracts domain-specific information from the analysis of a set of protein families. The system has been tested with different protein families, of which three examples are discussed in detail here: 'ataxia-telangiectasia associated protein', 'ran GTPase' and 'carbonic anhydrase'. We found generally good correlation between the amount of information provided to the system and the quality of the annotations. Finally, the current limitations and future developments of the system are discussed. The current system can be considered as a prototype system. As such, it can be accessed as a server at http://columba.ebi.ac.uk:8765/andrade/abx. The system accepts text related to the protein or proteins to be evaluated (optimally, the result of a MEDLINE search by keyword) and the results are returned in the form of Web pages for keywords, sentences and abstracts. Web pages containing full information on the examples mentioned in the text
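
    The keyword selection step described above (relative accumulation against a domain-specific background distribution) can be sketched as follows; the scoring formula and the toy corpora are illustrative assumptions, not the authors' exact statistic.

```python
from collections import Counter

def keywords(family_texts, background_texts, top=5):
    fg = Counter(w for t in family_texts for w in t.lower().split())
    bg = Counter(w for t in background_texts for w in t.lower().split())
    n_fg, n_bg = sum(fg.values()), sum(bg.values())
    # Score each word by its relative frequency in the family texts against
    # the background distribution (add-one smoothing avoids division by zero).
    score = {w: (fg[w] / n_fg) / ((bg[w] + 1) / (n_bg + len(bg))) for w in fg}
    return sorted(score, key=score.get, reverse=True)[:top]

family = ["ran gtpase binds gtp and regulates transport",
          "the gtpase cycle of ran controls nuclear transport"]
background = ["the protein binds the membrane", "the enzyme regulates the cycle"]
print(keywords(family, background))
```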

  18. Structural families in loops of homologous proteins: automatic classification, modelling and application to antibodies.

    PubMed

    Martin, A C; Thornton, J M

    1996-11-15

    Loop regions of polypeptide in homologous proteins may be classified into structural families. A method is described by which this classification may be performed automatically and "key residue" templates, which may be responsible for the loop adopting a given conformation, are defined. The technique has been applied to the hypervariable loops of antibodies and the results are compared with the previous definition of canonical classes. We have extended these definitions and provide complete sets of structurally determining residues (SDRs) for the observed clusters including the first set of key residues for seven-residue CDR-H3 loops.

  19. Analysis of steranes and triterpanes in geolipid extracts by automatic classification of mass spectra

    NASA Technical Reports Server (NTRS)

    Wardroper, A. M. K.; Brooks, P. W.; Humberston, M. J.; Maxwell, J. R.

    1977-01-01

    A computer method is described for the automatic classification of triterpanes and steranes into gross structural type from their mass spectral characteristics. The method has been applied to the spectra obtained by gas-chromatographic/mass-spectrometric analysis of two mixtures of standards and of hydrocarbon fractions isolated from Green River and Messel oil shales. Almost all of the steranes and triterpanes identified previously in both shales were classified, in addition to a number of new components. The results indicate that classification of such alkanes is possible with a laboratory computer system. The method has application to diagenesis and maturation studies as well as to oil/oil and oil/source rock correlations in which rapid screening of large numbers of samples is required.

  20. Detection and classification of football players with automatic generation of models

    NASA Astrophysics Data System (ADS)

    Gómez, Jorge R.; Jaraba, Elias Herrero; Montañés, Miguel Angel; Contreras, Francisco Martínez; Uruñuela, Carlos Orrite

    2010-01-01

    We focus on the automatic detection and classification of players in a football match. Our approach is not based on any a priori knowledge of the outfits, but on the assumption that the two main uniforms detected correspond to the two football teams. The algorithm is designed to operate in real time once it has been trained, and is able to detect partially occluded players and to update the colors of the kits to cope with gradual illumination changes over time. Our method, evaluated on real sequences, gave better detection and classification results than those obtained by a system using a manual selection of samples to compute a Gaussian mixture model.

  1. A Novel Feature Selection Technique for Text Classification Using Naïve Bayes

    PubMed Central

    Dey Sarkar, Subhajit; Goswami, Saptarsi; Agarwal, Aman; Aktar, Javed

    2014-01-01

    With the proliferation of unstructured data, text classification or text categorization has found many applications in topic classification, sentiment analysis, authorship identification, spam detection, and so on. There are many classification algorithms available. Naïve Bayes remains one of the oldest and most popular classifiers. On one hand, the implementation of naïve Bayes is simple; on the other hand, it also requires a smaller amount of training data. From the literature review, however, naïve Bayes is found to perform poorly compared to other classifiers in text classification, which makes it unusable in spite of the simplicity and intuitiveness of the model. In this paper, we propose a two-step feature selection method based on univariate feature selection followed by feature clustering, where we use the univariate feature selection method to reduce the search space and then apply clustering to select relatively independent feature sets. We demonstrate the effectiveness of our method with a thorough evaluation and comparison over 13 datasets. The performance improvement thus achieved makes naïve Bayes comparable or superior to other classifiers. The proposed algorithm is also shown to outperform other traditional methods, such as greedy-search-based wrappers or CFS. PMID:27433512
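
    A rough sketch of the two-step idea, assuming scikit-learn: univariate chi-squared selection shrinks the search space, and clustering over the surviving features picks relatively independent representatives for naive Bayes. The toy corpus, the parameter values, and the choice of agglomerative clustering are illustrative assumptions, not the paper's method.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.cluster import AgglomerativeClustering
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap pills buy now", "meeting agenda attached", "buy cheap meds",
        "project deadline meeting", "win cash now", "agenda for the project"]
y = np.array([1, 0, 1, 0, 1, 0])                  # toy labels: 1 = spam, 0 = ham

X = CountVectorizer().fit_transform(docs)

# Step 1: univariate selection reduces the search space.
Xs = SelectKBest(chi2, k=8).fit_transform(X, y).toarray()

# Step 2: cluster the surviving features and keep one representative per
# cluster, yielding a relatively independent feature set for naive Bayes.
clusters = AgglomerativeClustering(n_clusters=4).fit_predict(Xs.T)
reps = [np.where(clusters == c)[0][0] for c in range(4)]
print(MultinomialNB().fit(Xs[:, reps], y).score(Xs[:, reps], y))
```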

  2. Towards the Automatic Classification of Avian Flight Calls for Bioacoustic Monitoring

    PubMed Central

    Bello, Juan Pablo; Farnsworth, Andrew; Robbins, Matt; Keen, Sara; Klinck, Holger; Kelling, Steve

    2016-01-01

    Automatic classification of animal vocalizations has great potential to enhance the monitoring of species movements and behaviors. This is particularly true for monitoring nocturnal bird migration, where automated classification of migrants’ flight calls could yield new biological insights and conservation applications for birds that vocalize during migration. In this paper we investigate the automatic classification of bird species from flight calls, and in particular the relationship between two different problem formulations commonly found in the literature: classifying a short clip containing one of a fixed set of known species (N-class problem) and the continuous monitoring problem, the latter of which is relevant to migration monitoring. We implemented a state-of-the-art audio classification model based on unsupervised feature learning and evaluated it on three novel datasets, one for studying the N-class problem including over 5000 flight calls from 43 different species, and two realistic datasets for studying the monitoring scenario comprising hundreds of thousands of audio clips that were compiled by means of remote acoustic sensors deployed in the field during two migration seasons. We show that the model achieves high accuracy when classifying a clip to one of N known species, even for a large number of species. In contrast, the model does not perform as well in the continuous monitoring case. Through a detailed error analysis (that included full expert review of false positives and negatives) we show the model is confounded by varying background noise conditions and previously unseen vocalizations. We also show that the model needs to be parameterized and benchmarked differently for the continuous monitoring scenario. Finally, we show that despite the reduced performance, given the right conditions the model can still characterize the migration pattern of a specific species. The paper concludes with directions for future research. PMID:27880836

  3. Applying active learning to assertion classification of concepts in clinical text.

    PubMed

    Chen, Yukun; Mani, Subramani; Xu, Hua

    2012-04-01

    Supervised machine learning methods for clinical natural language processing (NLP) research require a large number of annotated samples, which are very expensive to build because of the involvement of physicians. Active learning, an approach that actively samples from a large pool, provides an alternative solution. Its major goal in classification is to reduce the annotation effort while maintaining the quality of the predictive model. However, few studies have investigated its use in clinical NLP. This paper reports an application of active learning to a clinical text classification task: determining the assertion status of clinical concepts. The annotated corpus for the assertion classification task in the 2010 i2b2/VA Clinical NLP Challenge was used in this study. We implemented several existing and newly developed active learning algorithms and assessed their use. The outcome is reported as the global ALC score, based on the area under the average learning curve of the AUC (area under the curve) score. Results showed that when the same number of annotated samples was used, active learning strategies could generate better classification models (best ALC = 0.7715) than the passive learning method of random sampling (ALC = 0.7411). Moreover, to achieve the same classification performance, active learning strategies required fewer samples than the random sampling method. For example, to achieve an AUC of 0.79, the random sampling method used 32 samples, while our best active learning algorithm required only 12 samples, a reduction of 62.5% in manual annotation effort.
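
    A minimal uncertainty-sampling loop, one common active learning strategy (not necessarily among the exact algorithms evaluated in the paper), might look like this with scikit-learn on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
labeled = list(range(10))                 # small seed set of "annotated" samples
pool = [i for i in range(len(y)) if i not in labeled]

for _ in range(20):                       # each round = one annotation request
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    p = clf.predict_proba(X[pool])[:, 1]
    pick = pool[int(np.argmin(np.abs(p - 0.5)))]  # most uncertain sample
    labeled.append(pick)                  # the oracle (physician) labels it
    pool.remove(pick)

print(len(labeled), clf.score(X, y))
```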

  4. BI-RADS Density Classification From Areometric and Volumetric Automatic Breast Density Measurements.

    PubMed

    Østerås, Bjørn Helge; Martinsen, Anne Catrine T; Brandal, Siri Helene B; Chaudhry, Khalida Nasreen; Eben, Ellen; Haakenaasen, Unni; Falk, Ragnhild Sørum; Skaane, Per

    2016-04-01

    The aim of our study was to classify breast density using areometric and volumetric automatic measurements to best match Breast Imaging-Reporting and Data System (BI-RADS) density scores, and to determine which technique best agrees with BI-RADS. Second, this study aimed to provide a set of threshold values for areometric and volumetric density to estimate BI-RADS categories. We randomly selected 537 full-field digital mammography examinations from a population-based screening program. Five radiologists assessed breast density using BI-RADS with all views available. A commercial program calculated areometric and volumetric breast density automatically. We compared the automatically calculated density to all BI-RADS density thresholds using the area under the receiver operating characteristic curve, and used Youden's index to estimate thresholds in the automatic densities, with matching sensitivity and specificity. The 95% confidence intervals were estimated by bootstrapping. Areometric density correlated well with volumetric density (r² = 0.76, excluding two outliers). For the BI-RADS threshold between II and III, areometric and volumetric assessment showed about equal area under the curve (0.94 vs. 0.93). For the threshold between I and II, areometric assessment was better than volumetric assessment (0.91 vs. 0.86). For the threshold between III and IV, volumetric assessment was better than areometric assessment (0.97 vs. 0.92). Volumetric assessment is equal to or better than areometric assessment for the most clinically relevant thresholds (ie, between scattered fibroglandular and heterogeneously dense, and between heterogeneously dense and extremely dense breasts). Thresholds found in this study can be applied in daily practice to automatic measurements of density to estimate BI-RADS classification. Copyright © 2016 The Association of University Radiologists. Published by Elsevier Inc. All rights reserved.
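
    The threshold-estimation step (Youden's index on an ROC curve) is straightforward to reproduce; the sketch below uses scikit-learn on synthetic density values, with a hypothetical binary split standing in for one BI-RADS threshold.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy stand-ins: automated percent-density values and a binary BI-RADS split
# (e.g., categories I-II vs. III-IV) assigned by readers.
rng = np.random.default_rng(0)
density = np.concatenate([rng.normal(15, 5, 100), rng.normal(30, 8, 100)])
dense = np.array([0] * 100 + [1] * 100)

fpr, tpr, thr = roc_curve(dense, density)
j = np.argmax(tpr - fpr)                 # Youden's index: max(sens + spec - 1)
print(f"threshold = {thr[j]:.1f}%, sens = {tpr[j]:.2f}, spec = {1 - fpr[j]:.2f}")
```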

  5. Automatic classification of delphinids based on the representative frequencies of whistles.

    PubMed

    Lin, Tzu-Hao; Chou, Lien-Siang

    2015-08-01

    Classification of odontocete species remains a challenging task for passive acoustic monitoring. Classifiers that have been developed use spectral features extracted from echolocation clicks and whistle contours. Most of these contour-based classifiers require complete contours to reduce measurement errors. Therefore, overlapping contours and partially detected contours in an automatic detection algorithm may increase the bias for contour-based classifiers. In this study, classification was conducted on each recording section without extracting individual contours. The local-max detector was used to extract representative frequencies of delphinid whistles, and each section was divided into multiple non-overlapping fragments. Three acoustical parameters were measured from the distribution of representative frequencies in each fragment. By using the statistical features of the acoustical parameters and the percentage of overlapping whistles, a correct classification rate of 70.3% was reached for the recordings of seven species (Tursiops truncatus, Delphinus delphis, Delphinus capensis, Peponocephala electra, Grampus griseus, Stenella longirostris longirostris, and Stenella attenuata) archived in MobySound.org. In addition, the correct classification rate was not dramatically reduced in various simulated noise conditions. This algorithm can be employed in acoustic observatories to classify different delphinid species and facilitate future studies on the community ecology of odontocetes.

  6. Automatic classification of infant sleep based on instantaneous frequencies in a single-channel EEG signal.

    PubMed

    Čić, Maja; Šoda, Joško; Bonković, Mirjana

    2013-12-01

    This study presents a novel approach to electroencephalogram (EEG) signal quantification in which the empirical mode decomposition method, a time-frequency method designed for nonlinear and non-stationary signals, decomposes the EEG signal into intrinsic mode functions (IMFs) with corresponding frequency ranges that characterize the appropriate oscillatory modes embedded in the brain neural activity acquired using EEG. To calculate the instantaneous frequency of the IMFs, an algorithm was developed using the Generalized Zero Crossing method. From the resulting frequencies, two novel features were generated: the median instantaneous frequencies and the number of instantaneous frequency changes during a 30 s segment, for seven IMFs. Sleep stage classification for the daytime sleep of 20 healthy babies was performed using the Support Vector Machine classification algorithm. The results were evaluated using the cross-validation method, achieving approximately 90% accuracy, and on data from new examinees, achieving 80% average classification accuracy. The obtained results were higher than the human experts' agreement and were statistically significant, which positions the method, based on the proposed features, as an efficient procedure for automatic sleep stage classification. The uniqueness of this study arises from the newly proposed features of the time-frequency domain, which bind characteristics of the sleep signals to the oscillation modes of brain activity, reflecting the physical characteristics of sleep, and thus have the potential to highlight the congruency of twin pairs, with potential implications for the genetic determination of sleep.
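
    A simplified sketch of the feature idea, assuming the PyEMD (EMD-signal) package: decompose a segment into IMFs, then estimate each IMF's frequency from its zero crossings. The plain zero-crossing count below is a crude stand-in for the Generalized Zero Crossing method used in the paper.

```python
import numpy as np
from PyEMD import EMD   # assumption: the EMD-signal (PyEMD) package is installed

fs = 100.0                                    # sampling rate, Hz
t = np.arange(0, 30, 1 / fs)                  # one 30 s EEG-like segment
sig = np.sin(2 * np.pi * 1.5 * t) + 0.5 * np.sin(2 * np.pi * 9 * t)

imfs = EMD().emd(sig)                         # intrinsic mode functions

for k, imf in enumerate(imfs):
    # Crude estimate: every two zero crossings correspond to one cycle.
    crossings = np.count_nonzero(np.diff(np.signbit(imf)))
    print(f"IMF {k}: ~{crossings / 2 / 30:.2f} Hz")
```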

  7. Automatic Detection and Classification of Breast Tumors in Ultrasonic Images Using Texture and Morphological Features

    PubMed Central

    Su, Yanni; Wang, Yuanyuan; Jiao, Jing; Guo, Yi

    2011-01-01

    Due to the severe presence of speckle noise, poor image contrast, and irregular lesion shape, it is challenging to build a fully automatic detection and classification system for breast ultrasonic images. In this paper, a novel and effective computer-aided method, including generation of a region of interest (ROI), segmentation, and classification of breast tumors, is proposed without any manual intervention. By incorporating local features of texture and position, an ROI is first detected using a self-organizing map neural network. Then a modified Normalized Cut approach considering the weighted neighborhood gray values is proposed to partition the ROI into clusters and obtain the initial boundary. In addition, a regional-fitting active contour model is used to adjust the few inaccurate initial boundaries for the final segmentation. Finally, three texture and five morphologic features are extracted from each breast tumor, whereupon highly efficient Affinity Propagation clustering is used to perform the benign/malignant classification on an existing database without any training process. The proposed system is validated on 132 cases (67 benign and 65 malignant), with its performance compared to traditional methods such as level set segmentation, artificial neural network classifiers, and so forth. Experimental results show that the proposed system, which needs no training procedure or manual interference, performs best in the detection and classification of ultrasonic breast tumors, while having the lowest computational complexity. PMID:21892371

  8. Improving Renal Cell Carcinoma Classification by Automatic Region of Interest Selection

    PubMed Central

    Chaudry, Qaiser; Raza, S. Hussain; Sharma, Yachna; Young, Andrew N.; Wang, May D.

    2016-01-01

    In this paper, we present an improved automated system for classification of pathological image data of renal cell carcinoma. The task of analyzing tissue biopsies, generally performed manually by expert pathologists, is extremely challenging due to the variability in tissue morphology, the preparation of tissue specimens, and the image acquisition process. Due to the complexity of this task and the heterogeneity of patient tissue, the process suffers from inter-observer and intra-observer variability. In continuation of our previous work, which proposed a knowledge-based automated system, we observe that real-life clinical biopsy images, which contain necrotic regions and glands, significantly degrade the classification process. Following the pathologist's technique of focusing on a selected region of interest (ROI), we propose a simple ROI selection process that automatically rejects the glands and necrotic regions, thereby improving the classification accuracy. We were able to improve the classification accuracy from 90% to 95% on a significantly heterogeneous image dataset using our technique.

  9. Classification of Traffic Related Short Texts to Analyse Road Problems in Urban Areas

    NASA Astrophysics Data System (ADS)

    Saldana-Perez, A. M. M.; Moreno-Ibarra, M.; Tores-Ruiz, M.

    2017-09-01

    Volunteer Geographic Information (VGI) can be used to understand urban dynamics. In the classification of traffic-related short texts to analyze road problems in urban areas, VGI data analysis is performed over social media publications in order to classify traffic events in big cities that modify the movement of vehicles and people through the roads, such as car accidents, traffic jams, and closures. The classification of traffic events described in short texts is done by applying a supervised machine learning algorithm. In this approach, users are considered sensors that describe their surroundings and provide their geographic position on the social network. The posts are processed by a text mining pipeline and classified into five groups. Finally, the classified events are grouped in a data corpus and geo-visualized in the study area, to detect the places with the most vehicular problems.
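
    A minimal supervised pipeline for classifying such short texts into event groups, assuming scikit-learn; the posts, the group labels, and the TF-IDF + linear SVM choice are illustrative assumptions rather than the authors' exact setup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

posts = ["accident on main avenue two lanes blocked",
         "heavy traffic downtown this morning",
         "road closed for the marathon",
         "protest march blocking central square",
         "broken traffic light at 5th and pine"]
events = ["accident", "jam", "closure", "closure", "hazard"]  # toy labels

# TF-IDF over word unigrams/bigrams feeds a linear SVM classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(posts, events)
print(clf.predict(["two cars crashed near the bridge"]))
```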

  10. Convolutional Neural Networks for Biomedical Text Classification: Application in Indexing Biomedical Articles.

    PubMed

    Rios, Anthony; Kavuluru, Ramakanth

    2015-09-01

    Building high-accuracy text classifiers is an important task in biomedicine, given the wealth of information hidden in unstructured narratives such as research articles and clinical documents. Due to large feature spaces, discriminative approaches such as logistic regression and support vector machines with n-gram and semantic features (e.g., named entities) have traditionally been used for text classification, where additional performance gains are typically made through feature selection and ensemble approaches. In this paper, we demonstrate that a more direct approach using convolutional neural networks (CNNs) outperforms several traditional approaches in biomedical text classification, with the specific use case of assigning medical subject headings (MeSH terms) to biomedical articles. Trained annotators at the National Library of Medicine (NLM) assign on average 13 codes to each biomedical article, thus semantically indexing the scientific literature to support NLM's PubMed search system. Recent evidence suggests that effective automated efforts for MeSH term assignment start with binary classifiers for each term. In this paper, we use CNNs to build binary text classifiers and achieve an absolute improvement of over 3% in macro F-score over a set of selected hard-to-classify MeSH terms when compared with the best prior results on a public dataset. Additional experiments on 50 high-frequency terms in the dataset also show improvements with CNNs. Our results indicate the strong potential of CNNs in biomedical text classification tasks.
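
    A minimal Kim-style CNN binary classifier for one term, assuming PyTorch; the vocabulary size, filter widths, and other hyperparameters are illustrative guesses, not the paper's configuration:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Binary classifier: conceptually, one such model per MeSH term."""
    def __init__(self, vocab_size=5000, emb=50, n_filters=32, widths=(2, 3, 4)):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb, padding_idx=0)
        self.convs = nn.ModuleList([nn.Conv1d(emb, n_filters, w) for w in widths])
        self.out = nn.Linear(n_filters * len(widths), 1)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        x = self.emb(tokens).transpose(1, 2)      # (batch, emb, seq_len)
        # Convolve with several n-gram widths, then max-pool each over time.
        pooled = [c(x).relu().max(dim=2).values for c in self.convs]
        return self.out(torch.cat(pooled, dim=1))  # logit: term assigned or not

model = TextCNN()
batch = torch.randint(1, 5000, (4, 120))           # 4 token-id sequences
print(torch.sigmoid(model(batch)).shape)           # (4, 1) probabilities
```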

  11. Approach for Text Classification Based on the Similarity Measurement between Normal Cloud Models

    PubMed Central

    Dai, Jin; Liu, Xin

    2014-01-01

    The similarity between objects is a core research area of data mining. In order to reduce the interference of the uncertainty of natural language, a similarity measurement between normal cloud models is adopted for text classification research. On this basis, a novel text classifier based on cloud concept jumping up (CCJU-TC) is proposed. It can efficiently accomplish the conversion between qualitative concepts and quantitative data. Through the conversion from a text set to a text information table based on the VSM model, the qualitative text concepts extracted from the same category are jumped up into a whole-category concept. According to the cloud similarity between the test text and each category concept, the test text is assigned to the most similar category. A comparison among different text classifiers over different feature selection sets shows that not only does CCJU-TC adapt well to different text features, but its classification performance is also better than that of the traditional classifiers. PMID:24711737

  12. Texting

    ERIC Educational Resources Information Center

    Tilley, Carol L.

    2009-01-01

    With the increasing ranks of cell phone ownership comes an increase in text messaging, or texting. During 2008, more than 2.5 trillion text messages were sent worldwide--that's an average of more than 400 messages for every person on the planet. Although many of the messages teenagers text each day are perhaps nothing more than "how r u?" or "c u…

  13. A CMAC-based scheme for determining membership with classification of text strings.

    PubMed

    Ma, Heng; Tseng, Ying-Chih; Chen, Lu-I

    Membership determination for text strings has been an important procedure for analyzing textual data of tremendous volume, especially when time is a crucial factor. The Bloom filter has been a well-known approach for dealing with such a problem because of its succinct structure and simple determination procedure. As determination of membership with classification is becoming increasingly desirable, parallel Bloom filters are often implemented to facilitate the additional classification requirement. The parallel Bloom filters, however, tend to produce additional false-positive errors, since membership determination must be performed on each of the parallel layers. We propose a scheme based on CMAC, a neural network mapping, which requires only a single-layer calculation to simultaneously obtain information about both membership and classification. A hash function specifically designed for text strings is also proposed. The proposed scheme effectively reduces false-positive errors by converging the range of membership acceptance to the minimum for each class during the neural network mapping. Simulation results show that the proposed scheme commits significantly fewer errors than the benchmark, parallel Bloom filters, with limited and identical memory usage at different classification levels.
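
    For reference, the benchmark approach can be sketched as a plain Bloom filter with one filter per class; querying every layer is what compounds the false-positive rate that the CMAC scheme avoids. The sizes and hash choices below are arbitrary assumptions.

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _positions(self, s):
        # Derive k bit positions from salted SHA-256 digests of the string.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{s}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, s):
        for p in self._positions(s):
            self.bits[p] = 1

    def __contains__(self, s):
        # May yield false positives, never false negatives.
        return all(self.bits[p] for p in self._positions(s))

# Parallel-filter classification: one filter per class; a string is tested
# against every layer, so false positives accumulate across layers.
filters = {"spam": BloomFilter(), "ham": BloomFilter()}
filters["spam"].add("win a prize")
filters["ham"].add("quarterly report")
print([c for c, f in filters.items() if "win a prize" in f])
```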

  14. Automatic classification of acetowhite temporal patterns to identify precursor lesions of cervical cancer

    NASA Astrophysics Data System (ADS)

    Gutiérrez-Fragoso, K.; Acosta-Mesa, H. G.; Cruz-Ramírez, N.; Hernández-Jiménez, R.

    2013-12-01

    Cervical cancer has remained, until now, a serious public health problem in developing countries. The most common screening method is the Pap test, or cytology. When abnormalities are reported in the result, the patient is referred to a dysplasia clinic for colposcopy. During this test, a solution of acetic acid is applied, which produces a color change in the tissue known as the acetowhitening phenomenon. Guided by this reaction, a tissue sample is obtained, and its histological analysis establishes the final diagnosis. During the colposcopy test, digital images can be acquired to analyze the behavior of the acetowhitening reaction from a temporal perspective. In this way, we try to identify precursor lesions of cervical cancer through automatic classification of acetowhite temporal patterns. In this paper, we present a performance analysis of three classification methods: kNN, Naïve Bayes, and C4.5. The results showed that there is similarity between some acetowhite temporal patterns of normal and abnormal tissues. We therefore conclude that it is not sufficient to consider only the temporal dynamics of the acetowhitening reaction to establish a diagnosis by an automatic method; information from the cytologic, colposcopic, and histopathologic disciplines should be integrated as well.

  15. Comparative analysis of image classification methods for automatic diagnosis of ophthalmic images

    NASA Astrophysics Data System (ADS)

    Wang, Liming; Zhang, Kai; Liu, Xiyang; Long, Erping; Jiang, Jiewei; An, Yingying; Zhang, Jia; Liu, Zhenzhen; Lin, Zhuoling; Li, Xiaoyan; Chen, Jingjing; Cao, Qianzhong; Li, Jing; Wu, Xiaohang; Wang, Dongni; Li, Wangting; Lin, Haotian

    2017-01-01

    There are many image classification methods, but it remains unclear which are most helpful for analyzing and intelligently identifying ophthalmic images. We select representative slit-lamp images, which show the complexity of ocular images, as research material to compare image classification algorithms for diagnosing ophthalmic diseases. To facilitate this study, several feature extraction algorithms and classifiers are combined to automatically diagnose pediatric cataract on the same dataset, and their performance is compared using multiple criteria. This comparative study reveals the general characteristics of the existing methods for automatic identification of ophthalmic images and provides new insights into their strengths and shortcomings. The best-performing methods (local binary pattern + SVM, wavelet transformation + SVM) achieve an average accuracy of 87% and can be adopted in specific situations to aid doctors in preliminary disease screening. Furthermore, some methods requiring fewer computational resources and less time could be applied in remote places or on mobile devices to assist individuals in understanding the condition of their body. In addition, this work should help accelerate the development of innovative approaches and the application of these methods to assist doctors in diagnosing ophthalmic disease.
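
    One of the better-performing combinations, local binary patterns with an SVM, can be sketched as follows, assuming scikit-image and scikit-learn; the random images and labels are placeholders for real slit-lamp data:

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

def lbp_histogram(gray, P=8, R=1.0):
    # Uniform LBP codes summarize local texture; their histogram is the feature.
    codes = local_binary_pattern(gray, P, R, method="uniform")
    hist, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
    return hist

rng = np.random.default_rng(0)
images = rng.random((20, 64, 64))        # stand-ins for slit-lamp images
labels = rng.integers(0, 2, 20)          # e.g., cataract vs. normal

X = np.array([lbp_histogram(im) for im in images])
print(SVC(kernel="rbf").fit(X, labels).score(X, labels))
```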

  16. The application of pattern recognition in the automatic classification of microscopic rock images

    NASA Astrophysics Data System (ADS)

    Młynarczuk, Mariusz; Górszczyk, Andrzej; Ślipek, Bartłomiej

    2013-10-01

    The classification of rocks is an inherent part of modern geology. The manual identification of rock samples is a time-consuming process and, due to the subjective nature of human judgement, is burdened with risk. In the course of the study discussed in the present paper, the authors investigated the possibility of automating this process. During the study, nine different rock samples were used. Their digital images were obtained from thin sections with a polarizing microscope. These photographs were subsequently classified automatically by means of four pattern recognition methods: the nearest neighbor algorithm, the K-nearest neighbor algorithm, the nearest mode algorithm, and the method of optimal spherical neighborhoods. The effectiveness of these methods was tested in four different color spaces: RGB, CIELab, YIQ, and HSV. The results of the study show that automatic recognition of the discussed rock types is possible. The study also revealed that, if the CIELab color space and the nearest neighbor classification method are used, the rock samples in question are classified correctly, with a recognition level of 99.8%.

  17. Comparative analysis of image classification methods for automatic diagnosis of ophthalmic images

    PubMed Central

    Wang, Liming; Zhang, Kai; Liu, Xiyang; Long, Erping; Jiang, Jiewei; An, Yingying; Zhang, Jia; Liu, Zhenzhen; Lin, Zhuoling; Li, Xiaoyan; Chen, Jingjing; Cao, Qianzhong; Li, Jing; Wu, Xiaohang; Wang, Dongni; Li, Wangting; Lin, Haotian

    2017-01-01

    There are many image classification methods, but it remains unclear which are most helpful for analyzing and intelligently identifying ophthalmic images. We select representative slit-lamp images, which show the complexity of ocular images, as research material to compare image classification algorithms for diagnosing ophthalmic diseases. To facilitate this study, several feature extraction algorithms and classifiers are combined to automatically diagnose pediatric cataract on the same dataset, and their performance is compared using multiple criteria. This comparative study reveals the general characteristics of the existing methods for automatic identification of ophthalmic images and provides new insights into their strengths and shortcomings. The best-performing methods (local binary pattern + SVM, wavelet transformation + SVM) achieve an average accuracy of 87% and can be adopted in specific situations to aid doctors in preliminary disease screening. Furthermore, some methods requiring fewer computational resources and less time could be applied in remote places or on mobile devices to assist individuals in understanding the condition of their body. In addition, this work should help accelerate the development of innovative approaches and the application of these methods to assist doctors in diagnosing ophthalmic disease. PMID:28139688

  18. Automatic sleep stage classification based on EEG signals by using neural networks and wavelet packet coefficients.

    PubMed

    Ebrahimi, Farideh; Mikaeili, Mohammad; Estrada, Edson; Nazeran, Homer

    2008-01-01

    An alarming number of people worldwide suffer from sleep disorders. A number of biomedical signals, such as EEG, EMG, ECG and EOG, are used in sleep labs, among others, for the diagnosis and treatment of sleep-related disorders. The usual method for sleep stage classification is visual inspection by a sleep specialist, which is a very time-consuming and laborious exercise. Automatic sleep stage classification can facilitate this process. The definition of sleep stages and the sleep literature show that EEG signals are similar in Stage 1 of non-rapid eye movement (NREM) sleep and rapid eye movement (REM) sleep. Therefore, in this work an attempt was made to classify four sleep stages, consisting of Awake, Stage 1 + REM, Stage 2 and Slow Wave Stage, based on the EEG signal alone. Wavelet packet coefficients and artificial neural networks were deployed for this purpose. Seven all-night recordings from the PhysioNet database were used in the study. The results demonstrated that these four sleep stages could be automatically discriminated from each other with a specificity of 94.4 +/- 4.5%, a sensitivity of 84.2 +/- 3.9% and an accuracy of 93.0 +/- 4.0%.
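
    The feature-extraction step (wavelet packet decomposition of a 30 s EEG epoch into sub-band energies) might be sketched as follows, assuming the pywt package; the wavelet, depth, and random epoch are illustrative assumptions:

```python
import numpy as np
import pywt

fs = 100                                        # Hz; one 30 s EEG epoch
epoch = np.random.randn(30 * fs)                # stand-in for a real recording

wp = pywt.WaveletPacket(data=epoch, wavelet="db4", mode="symmetric", maxlevel=4)
nodes = wp.get_level(4, order="freq")           # leaf nodes, low to high frequency

# Normalized sub-band energies form the feature vector fed to the classifier.
energy = np.array([np.sum(n.data ** 2) for n in nodes])
features = energy / energy.sum()
print(features.round(3))
```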

  19. Automatic segmentation and classification of gestational sac based on mean sac diameter using medical ultrasound image

    NASA Astrophysics Data System (ADS)

    Khazendar, Shan; Farren, Jessica; Al-Assam, Hisham; Sayasneh, Ahmed; Du, Hongbo; Bourne, Tom; Jassim, Sabah A.

    2014-05-01

    Ultrasound is an effective multipurpose imaging modality that has been widely used for monitoring and diagnosing early pregnancy events. Technology developments coupled with wide public acceptance have made ultrasound an ideal tool for better understanding and diagnosing of early pregnancy. The first measurable signs of an early pregnancy are the geometric characteristics of the Gestational Sac (GS). Currently, the size of the GS is manually estimated from ultrasound images. The manual measurement involves multiple subjective decisions, in which dimensions are taken in three planes to establish what is known as the Mean Sac Diameter (MSD). The manual measurement results in inter- and intra-observer variations, which may lead to difficulties in diagnosis. This paper proposes a fully automated diagnosis solution to accurately identify miscarriage cases in the first trimester of pregnancy based on automatic quantification of the MSD. Our study shows a strong positive correlation between the manual and the automatic MSD estimations. Our experimental results, based on a dataset of 68 ultrasound images, illustrate the effectiveness of the proposed scheme in identifying early miscarriage cases, with classification accuracies comparable with those of domain experts using a K-nearest-neighbor classifier on automatically estimated MSDs.

  20. Automatic Training Sample Selection for a Multi-Evidence Based Crop Classification Approach

    NASA Astrophysics Data System (ADS)

    Chellasamy, M.; Ferre, P. A. Ty; Humlekrog Greve, M.

    2014-09-01

    An approach that uses the available agricultural parcel information to automatically select training samples for crop classification is investigated. Previous research addressed the multi-evidence crop classification approach using an ensemble classifier. This first produced confidence measures using three Multi-Layer Perceptron (MLP) neural networks trained separately with spectral, texture and vegetation indices; classification labels were then assigned based on Endorsement Theory. The present study proposes an approach to feed this ensemble classifier with automatically selected training samples. The available vector data representing crop boundaries with corresponding crop codes are used as a source of training samples. These vector data are created by farmers to support subsidy claims and are, therefore, prone to errors such as mislabeling of crop codes and boundary digitization errors. The proposed approach is named ECRA (Ensemble-based Cluster Refinement Approach). ECRA first automatically removes mislabeled samples and then selects the refined training samples in an iterative training-reclassification scheme. Mislabel removal is based on the expectation that mislabels in each class will be far from the cluster centroid. However, this must be a soft constraint, especially when working with a hypothesis space that does not contain a good approximation of the target classes. Difficulty in finding a good approximation often exists either due to less informative data or due to a large hypothesis space. Thus, this approach uses the spectral, texture and indices domains in an ensemble framework to iteratively remove the mislabeled pixels from the crop clusters declared by the farmers. Once the clusters are refined, the selected border samples are used for final learning and the unknown samples are classified using the multi-evidence approach. The study is implemented with WorldView-2 multispectral imagery acquired for a study area containing 10 crop classes. The proposed

  1. Automatic generation of 3D motifs for classification of protein binding sites

    PubMed Central

    Nebel, Jean-Christophe; Herzyk, Pawel; Gilbert, David R

    2007-01-01

    Background Since many of the new protein structures delivered by high-throughput processes do not have any known function, there is a need for structure-based prediction of protein function. Protein 3D structures can be clustered according to their fold or secondary structures to produce classes of some functional significance. A recent alternative has been to detect specific 3D motifs which are often associated with active sites. Unfortunately, there are very few known 3D motifs, which are usually the result of a manual process, compared to the number of sequential motifs already known. In this paper, we report a method to automatically generate 3D motifs of protein structure binding sites based on consensus atom positions and evaluate it on a set of adenine-based ligands. Results Our new approach was validated by automatically generating 3D patterns for the main adenine-based ligands, i.e. AMP, ADP and ATP. Out of the 18 detected patterns, only one, the ADP4 pattern, is not associated with well-defined structural patterns. Moreover, most of the patterns could be classified as binding-site 3D motifs. Literature research revealed that the ADP4 pattern actually corresponds to structural features which show complex evolutionary links between ligases and transferases. Therefore, all of the generated patterns prove to be meaningful. Each pattern was used to query all PDB proteins which bind either purine-based or guanine-based ligands, in order to evaluate the classification and annotation properties of the pattern. Overall, our 3D patterns matched 31% of proteins with adenine-based ligands and 95.5% of them were classified correctly. Conclusion A new metric has been introduced allowing the classification of proteins according to the similarity of the atomic environment of binding sites, and a methodology has been developed to automatically produce 3D patterns from that classification. A study of proteins binding adenine-based ligands showed that these 3D patterns are not

  2. A method for verifying a vector-based text classification system.

    PubMed

    Lu, Chris J; Humphrey, Susanne M; Browne, Allen C

    2008-11-06

    Journal Descriptor Indexing (JDI) is a vector-based text classification system developed at NLM (National Library of Medicine), originally in Lisp and now as a Java tool. Consequently, a testing suite was developed to verify the training set data and results of the JDI tool. A methodology was developed and implemented to compare two sets of JD vectors, resulting in a single index (from 0 to 1) measuring their similarity. This methodology is fast, effective, and accurate.
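
    The abstract does not spell out the comparison formula; one plausible way to reduce two sets of JD vectors to a single 0-to-1 similarity index is a mean cosine similarity, sketched below with hypothetical vectors:

```python
import numpy as np

def vector_set_similarity(A, B):
    """Mean cosine similarity between corresponding vectors, mapped to [0, 1]."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    cos = np.sum(A * B, axis=1) / (np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1))
    return float((cos.mean() + 1) / 2)   # shift [-1, 1] onto [0, 1]

old = [[0.2, 0.8, 0.1], [0.9, 0.0, 0.3]]        # e.g., Lisp-era JD vectors
new = [[0.21, 0.79, 0.12], [0.88, 0.02, 0.31]]  # Java-port JD vectors
print(vector_set_similarity(old, new))
```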

  3. AUTOMATIC UNSUPERVISED CLASSIFICATION OF ALL SLOAN DIGITAL SKY SURVEY DATA RELEASE 7 GALAXY SPECTRA

    SciTech Connect

    Almeida, J. Sanchez; Aguerri, J. A. L.; Munoz-Tunon, C.; De Vicente, A.

    2010-05-01

    Using the k-means cluster analysis algorithm, we carry out an unsupervised classification of all galaxy spectra in the seventh and final Sloan Digital Sky Survey data release (SDSS/DR7). Except for the shift to rest-frame wavelengths and the normalization to the g-band flux, no manipulation is applied to the original spectra. The algorithm guarantees that galaxies with similar spectra belong to the same class. We find that 99% of the galaxies can be assigned to only 17 major classes, with 11 additional minor classes including the remaining 1%. The classification is not unique since many galaxies appear in between classes; however, our rendering of the algorithm overcomes this weakness with a tool to identify borderline galaxies. Each class is characterized by a template spectrum, which is the average of all the spectra of the galaxies in the class. These low-noise template spectra vary smoothly and continuously along a sequence labeled from 0 to 27, from the reddest class to the bluest class. Our Automatic Spectroscopic K-means-based (ASK) classification separates the galaxies by color, with classes characteristic of the red sequence, the blue cloud, and the green valley. When red sequence galaxies and green valley galaxies present emission lines, these are characteristic of active galactic nucleus activity. Blue galaxy classes have emission lines corresponding to star formation regions. We find the expected correlation between spectroscopic class and Hubble type, but this relationship exhibits a high intrinsic scatter. Several potential uses of the ASK classification are identified and sketched, including fast determination of physical properties by interpolation, classes as templates in redshift determinations, and target selection in follow-up works (we find classes of Seyfert galaxies, green valley galaxies, as well as a significant number of outliers). The ASK classification is publicly accessible through various Web sites.
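
    The core of the procedure (k-means over normalized rest-frame spectra, with cluster centers serving as template spectra) is easy to sketch with scikit-learn; the random "spectra" and the mean-flux normalization below are stand-ins for the SDSS data and the g-band normalization:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
spectra = rng.random((1000, 300))       # stand-in for rest-frame galaxy spectra

# Normalize each spectrum; the paper normalizes to the g-band flux, and
# dividing by the mean flux is used here as a rough analogue.
spectra /= spectra.mean(axis=1, keepdims=True)

km = KMeans(n_clusters=17, n_init=10, random_state=0).fit(spectra)
templates = km.cluster_centers_          # low-noise class template spectra
print(np.bincount(km.labels_))           # class populations
```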

  4. Automatic ship classification system for inverse synthetic aperture radar (ISAR) imagery

    NASA Astrophysics Data System (ADS)

    Menon, Murali M.

    1995-04-01

    The U.S. Navy has been interested in applying neural network processing architectures to automatically determine the naval class of ships from an inverse synthetic aperture radar (ISAR) on board an airborne surveillance platform. Currently an operator identifies the target based on an ISAR display. The emergence of the littoral warfare scenario, coupled with the addition of multiple sensors on the platform, threatens to impair the ability of the operator to identify and track targets in a timely manner. Thus, on-board automation is quickly becoming a necessity. Over the past four years the Opto-Radar System Group at MIT Lincoln Laboratory has developed and fielded a neural-network-based automatic ship classification (ASC) system for ISAR imagery. This system utilizes imagery from the APS-137 ISAR. Previous related work on ASC systems processed either simulated or real ISAR imagery under highly controlled conditions. The focus of this work was to develop a ship classification system capable of providing real-time identification from imagery acquired during an actual mission. The ship classification system described in this report uses both neural network and conventional processing techniques to determine the naval class of a ship from a range-Doppler (ISAR) image. The 'learning' capability of the neural network classifier allows a single naval class to be distributed across many categories, such that a degree of invariance to ship motion is developed. The ASC system was evaluated on a 30-ship-class database that had also been used for an operational readiness evaluation of ISAR crews. The results of the evaluation indicate that the ASC system has a performance level comparable to ISAR operators and typically provides a significant improvement in throughput.

  5. Automatic classification of volcanic earthquakes using multi-station waveforms and dynamic neural networks

    NASA Astrophysics Data System (ADS)

    Bruton, Christopher Patrick

    Earthquakes and seismicity have long been used to monitor volcanoes. In addition to the time, location, and magnitude of an earthquake, the characteristics of the waveform itself are important. For example, low-frequency or hybrid type events could be generated by magma rising toward the surface. A rockfall event could indicate a growing lava dome. Classification of earthquake waveforms is thus a useful tool in volcano monitoring. A procedure to perform such classification automatically could flag certain event types immediately, instead of waiting for a human analyst's review. Inspired by speech recognition techniques, we have developed a procedure to classify earthquake waveforms using artificial neural networks. A neural network can be "trained" with an existing set of input and desired output data; in this case, we use a set of earthquake waveforms (input) that has been classified by a human analyst (desired output). After training the neural network, new sets of waveforms can be classified automatically as they are presented. Our procedure uses waveforms from multiple stations, making it robust to seismic network changes and outages. The use of a dynamic time-delay neural network allows waveforms to be presented without precise alignment in time, and thus could be applied to continuous data or to seismic events without clear start and end times. We have evaluated several different training algorithms and neural network structures to determine their effects on classification performance. We apply this procedure to earthquakes recorded at Mount Spurr and Katmai in Alaska, and Uturuncu Volcano in Bolivia. The procedure can successfully distinguish between slab and volcanic events at Uturuncu, between events from four different volcanoes in the Katmai region, and between volcano-tectonic and long-period events at Spurr. Average recall and overall accuracy were greater than 80% in all three cases.

  6. Using complex networks for text classification: Discriminating informative and imaginative documents

    NASA Astrophysics Data System (ADS)

    de Arruda, Henrique F.; Costa, Luciano da F.; Amancio, Diego R.

    2016-01-01

    Statistical methods have been widely employed in recent years to grasp many language properties. The application of such techniques has allowed improvements in several linguistic applications, such as machine translation and document classification. In the latter, many approaches have emphasised the semantic content of texts, as is the case of bag-of-words language models. These approaches have certainly yielded reasonable performance. However, some potential features, such as the structural organization of texts, have been used in only a few studies. In this context, we probe how features derived from textual structure analysis can be effectively employed in a classification task. More specifically, we performed a supervised classification aiming at discriminating informative from imaginative documents. Using a networked model that describes the local topological/dynamical properties of function words, we achieved an accuracy rate of up to 95%, which is much higher than that of similar networked approaches. A systematic analysis of feature relevance revealed that symmetry and accessibility measurements are among the most prominent network measurements. Our results suggest that these measurements could be used in related language applications, as they play a complementary role in characterising texts.
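
    As a simplified illustration of the networked model, the sketch below builds a word-adjacency graph with networkx and extracts a few standard topological measurements; the paper's specific function-word filtering and symmetry/accessibility measures are not reproduced here:

```python
import networkx as nx

def network_features(text):
    words = text.lower().split()
    G = nx.Graph()
    # Word-adjacency network: an edge links each pair of consecutive words.
    for a, b in zip(words, words[1:]):
        G.add_edge(a, b)
    avg_degree = sum(d for _, d in G.degree()) / G.number_of_nodes()
    return {"avg_degree": avg_degree,
            "clustering": nx.average_clustering(G),
            "betweenness": sum(nx.betweenness_centrality(G).values()) / len(G)}

print(network_features("the quick brown fox jumps over the lazy dog the fox"))
```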

  7. High-accuracy automatic classification of Parkinsonian tremor severity using machine learning method.

    PubMed

    Jeon, Hyoseon; Lee, Woongwoo; Park, Hyeyoung; Lee, Hong Ji; Kim, Sang Kyong; Kim, Han Byul; Jeon, Beomseok; Park, Kwang Suk

    2017-09-21

    Although clinical aspirations exist for new technology to accurately measure and diagnose Parkinsonian tremor, automatic scoring of tremor severity using machine learning approaches has not yet been employed. Objective: This study aims to maximize the scientific validity of automatic tremor-severity classification by using machine learning algorithms to score Parkinsonian tremor severity in the same manner as the Unified Parkinson's Disease Rating Scale (UPDRS) used to rate scores in real clinical practice. Approach: Eighty-five PD patients perform four tasks for severity assessment of their resting, resting with mental stress, postural, and intention tremors. The tremor signals are measured using a wristwatch-type wearable device with an accelerometer and gyroscope. Displacement and angle signals are obtained by integrating the acceleration and angular-velocity signals. Nineteen features are extracted from each of the four tremor signals. The optimal feature configuration is decided using a wrapper feature-selection algorithm or principal component analysis, and decision tree, support vector machine, discriminant analysis, and k-nearest neighbour algorithms are considered to develop an automatic scoring system for UPDRS prediction. The results are compared to UPDRS ratings assigned by two neurologists. Main results: The highest accuracies are 92.3%, 86.2%, 92.1%, and 89.2% for resting, resting with mental stress, postural, and intention tremors, respectively. The weighted Cohen's kappa values are 0.745, 0.635 and 0.633 for resting, resting with mental stress, and postural tremors (almost perfect agreement), and 0.570 for intention tremors (moderate). Significance: These results indicate the feasibility of the proposed system as a clinical decision tool for automatic scoring of Parkinsonian tremor severity. © 2017 Institute of Physics and Engineering in Medicine.
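
    One concrete preprocessing step mentioned above is obtaining displacement by integrating acceleration. A hedged sketch, with an assumed sampling rate and a synthetic signal standing in for the study's wearable data:

```python
# Hedged sketch of one preprocessing step: displacement from acceleration
# by double numerical integration. Sampling rate and signal are placeholders.
import numpy as np
from scipy.integrate import cumulative_trapezoid

fs = 100.0                                   # assumed sampling rate (Hz)
t = np.arange(0, 10, 1 / fs)
accel = np.sin(2 * np.pi * 5 * t)            # placeholder tremor acceleration

velocity = cumulative_trapezoid(accel, t, initial=0.0)
displacement = cumulative_trapezoid(velocity, t, initial=0.0)
# In practice, detrending/high-pass filtering is needed to limit drift.
```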

  8. Automatic Training Site Selection for Agricultural Crop Classification: a Case Study on Karacabey Plain, Turkey

    NASA Astrophysics Data System (ADS)

    Ozdarici Ok, A.; Akyurek, Z.

    2011-09-01

    This study applies a traditional supervised classification method to an optical image of agricultural crops in a novel way: the training samples are selected automatically. Panchromatic (1m) and multispectral (4m) Kompsat-2 images (July 2008) of Karacabey Plain (~100km2), located in the Marmara region, are used to evaluate the proposed approach. Owing to its rich, loamy soils combined with reasonable weather conditions, the Karacabey Plain is one of the most valuable agricultural regions of Turkey. Analyses start with applying an image fusion algorithm to the panchromatic and multispectral images. As a result of this process, a 1m spatial-resolution colour image is produced. In the next step, the four-band fused (1m) image and the multispectral (4m) image are orthorectified. Next, the fused image (1m) is segmented using a popular segmentation method, Mean-Shift. Mean-Shift is originally a method based on kernel density estimation, and it shifts each pixel to the mode of its cluster. In the segmentation procedure, three parameters must be defined: (i) spatial domain (hs), (ii) range domain (hr), and (iii) minimum region (MR). In this study, in total, 176 parameter combinations (hs, hr, and MR) are tested on a small part of the area (~10km2) to find an optimal segmentation result, and a final parameter combination (hs=18, hr=20, and MR=1000) is determined after evaluating multiple goodness measures. The final segmentation output is then utilized in the classification framework. The classification operation is applied to the four-band multispectral image (4m) to minimize the mixed-pixel effect. Before the image classification, each segment is overlaid with the bands of the fused image, and several descriptive statistics of each segment are computed for each band. To select the potential homogeneous regions that are eligible for the selection of training samples, a user-defined threshold is applied. After finding those potential regions, the

  9. Localizing text in scene images by boundary clustering, stroke segmentation, and string fragment classification.

    PubMed

    Yi, Chucai; Tian, Yingli

    2012-09-01

    In this paper, we propose a novel framework to extract text regions from scene images with complex backgrounds and multiple text appearances. This framework consists of three main steps: boundary clustering (BC), stroke segmentation, and string fragment classification. In BC, we propose a new bigram-color-uniformity-based method to model both text and attachment surface, and cluster edge pixels based on color pairs and spatial positions into boundary layers. Then, stroke segmentation is performed at each boundary layer by color assignment to extract character candidates. We propose two algorithms that combine the structural analysis of text strokes with color assignment and filter out background interference. Further, we design a robust string fragment classifier using Gabor-based text features obtained from feature maps of gradient, stroke distribution, and stroke width. The proposed text localization framework is evaluated on scene images, born-digital images, broadcast video images, and images of handheld objects captured by blind persons. Experimental results on the respective datasets demonstrate that the framework outperforms state-of-the-art localization algorithms.

  10. Automatic classification of registered clinical trials towards the Global Burden of Diseases taxonomy of diseases and injuries.

    PubMed

    Atal, Ignacio; Zeitoun, Jean-David; Névéol, Aurélie; Ravaud, Philippe; Porcher, Raphaël; Trinquart, Ludovic

    2016-09-22

    Clinical trial registries may allow for producing a global mapping of health research. However, health conditions are not described with standardized taxonomies in registries. Previous work analyzed clinical trial registries to improve the retrieval of relevant clinical trials for patients. However, no previous work has classified clinical trials across diseases using a standardized taxonomy allowing a comparison between global health research and global burden across diseases. We developed a knowledge-based classifier of health conditions studied in registered clinical trials towards categories of diseases and injuries from the Global Burden of Diseases (GBD) 2010 study. The classifier relies on the UMLS® knowledge source (Unified Medical Language System®) and on heuristic algorithms for parsing data. It maps trial records to a 28-class grouping of the GBD categories by automatically extracting UMLS concepts from text fields and by projecting concepts between medical terminologies. The classifier allows deriving pathways between the clinical trial record and candidate GBD categories using natural language processing and links between knowledge sources, and selects the relevant GBD classification based on rules of prioritization across the pathways found. We compared automatic and manual classifications for an external test set of 2,763 trials. We automatically classified 109,603 interventional trials registered before February 2014 at WHO ICTRP. In the external test set, the classifier identified the exact GBD categories for 78 % of the trials. It had very good performance for most of the 28 categories, especially "Neoplasms" (sensitivity 97.4 %, specificity 97.5 %). The sensitivity was moderate for trials not relevant to any GBD category (53 %) and low for trials of injuries (16 %). For the 109,603 trials registered at WHO ICTRP, the classifier did not assign any GBD category to 20.5 % of trials while the most common GBD categories were "Neoplasms" (22.8

  11. Towards the Development of a Mobile Phonopneumogram: Automatic Breath-Phase Classification Using Smartphones.

    PubMed

    Reyes, Bersain A; Reljin, Natasa; Kong, Youngsun; Nam, Yunyoung; Ha, Sangho; Chon, Ki H

    2016-09-01

    Correct labeling of breath phases is useful in the automatic analysis of respiratory sounds, where airflow or volume signals are commonly used as a temporal reference. However, such signals are not always available. The development of smartphone-based respiratory sound analysis systems has received increased attention. In this study, we propose an optical approach that takes advantage of a smartphone's camera and provides a chest movement signal useful for classifying the breath phases while simultaneously recording tracheal sounds. Spirometer and smartphone-based signals were acquired from N = 13 healthy volunteers breathing at different frequencies, airflow and volume levels. We found that the smartphone-acquired chest movement signal was highly correlated with the reference volume (ρ = 0.960 ± 0.025, mean ± SD). A simple linear regression on the chest signal was used to label the breath phases according to the slope between consecutive onsets, and 100% accuracy was found for the classification of the analyzed breath phases. We found that the proposed classification scheme can correctly classify breath phases in more challenging breathing patterns, such as those that include non-breath events like swallowing, talking, and coughing, and alternating or irregular breathing. These results show the feasibility of developing a portable and inexpensive phonopneumogram for the analysis of respiratory sounds based on smartphones.

  12. Performance analysis of distributed applications using automatic classification of communication inefficiencies

    DOEpatents

    Vetter, Jeffrey S.

    2005-02-01

    The method and system described herein present a technique for performance analysis that helps users understand the communication behavior of their message-passing applications. The method and system may automatically classify individual communication operations and reveal the causes of communication inefficiencies in the application. This classification allows the developer to quickly focus on the culprits of truly inefficient behavior, rather than manually foraging through massive amounts of performance data. Specifically, the method and system trace the message operations of Message Passing Interface (MPI) applications and then classify each individual communication event using a supervised learning technique: decision tree classification. The decision tree may be trained using microbenchmarks that demonstrate both efficient and inefficient communication. Since the method and system adapt to the target system's configuration through these microbenchmarks, they simultaneously automate the performance analysis process and improve classification accuracy. The method and system may improve the accuracy of performance analysis and dramatically reduce the amount of data that users must examine.
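
    A minimal sketch of the classification step the patent describes: a decision tree trained on per-event features, where the training rows stand in for microbenchmark runs. The feature names, values, and class labels are illustrative, not taken from the patent.

```python
# Minimal sketch: decision tree trained on per-event communication features.
from sklearn.tree import DecisionTreeClassifier

# Each row: [message_size_bytes, wait_time_us, transfer_time_us] (hypothetical)
X_train = [[1024, 5, 40], [1024, 900, 42], [65536, 10, 700], [65536, 1200, 710]]
y_train = ["efficient", "late_sender", "efficient", "late_sender"]  # microbenchmark labels

tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print(tree.predict([[2048, 1000, 45]]))  # classify a traced MPI event
```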

  13. Automatic classification of background EEG activity in healthy and sick neonates

    NASA Astrophysics Data System (ADS)

    Löfhede, Johan; Thordstein, Magnus; Löfgren, Nils; Flisberg, Anders; Rosa-Zurera, Manuel; Kjellmer, Ingemar; Lindecrantz, Kaj

    2010-02-01

    The overall aim of our research is to develop methods for a monitoring system to be used at neonatal intensive care units. When monitoring a baby, a range of different types of background activity needs to be considered. In this work, we have developed a scheme for automatic classification of background EEG activity in newborn babies. EEG from six full-term babies who were displaying a burst suppression pattern while suffering from the after-effects of asphyxia during birth was included along with EEG from 20 full-term healthy newborn babies. The signals from the healthy babies were divided into four behavioural states: active awake, quiet awake, active sleep and quiet sleep. By using a number of features extracted from the EEG together with Fisher's linear discriminant classifier we have managed to achieve 100% correct classification when separating burst suppression EEG from all four healthy EEG types and 93% true positive classification when separating quiet sleep from the other types. The other three sleep stages could not be classified. When the pathological burst suppression pattern was detected, the analysis was taken one step further and the signal was segmented into burst and suppression, allowing clinically relevant parameters such as suppression length and burst suppression ratio to be calculated. The segmentation of the burst suppression EEG works well, with a probability of error around 4%.

  14. Fidelity of Automatic Speech Processing for Adult and Child Talker Classifications

    PubMed Central

    2016-01-01

    Automatic speech processing (ASP) has recently been applied to very large datasets of naturalistically collected, daylong recordings of child speech via an audio recorder worn by young children. The system developed by the LENA Research Foundation analyzes children's speech for research and clinical purposes, with special focus on identifying and tagging family speech dynamics and the at-home acoustic environment from the auditory perspective of the child. A primary issue for researchers, clinicians, and families using the Language ENvironment Analysis (LENA) system is to what degree the segment labels are valid. This classification study evaluates the performance of the computer ASP output against 23 trained human judges who made about 53,000 classification judgements on segments tagged by the LENA ASP. Results indicate performance consistent with modern ASP systems, such as those using HMM methods, with acoustic characteristics of fundamental frequency and segment duration most important for both human and machine classifications. The results are likely to be important for interpreting and improving ASP output. PMID:27529813

  15. Automatic detection and classification of artifacts in single-channel EEG.

    PubMed

    Olund, Thomas; Duun-Henriksen, Jonas; Kjaer, Troels W; Sorensen, Helge B D

    2014-01-01

    Ambulatory EEG monitoring can provide medical doctors with important diagnostic information without hospitalizing the patient. These recordings are, however, more exposed to noise and artifacts than clinically recorded EEG. An automatic artifact detection and classification algorithm for single-channel EEG is proposed to help identify these artifacts. Features are extracted from the EEG signal and its wavelet subbands. Subsequently, a selection algorithm is applied in order to identify the best discriminating features. A non-linear support vector machine is used to discriminate among different artifact classes using the selected features. Single-channel (Fp1-F7) EEG recordings were obtained from experiments with 12 healthy subjects performing artifact-inducing movements. The dataset was used to construct and validate the model. Both subject-specific and generic implementations are investigated. The detection algorithm yields an average sensitivity and specificity above 95% for both the subject-specific and generic models. The classification algorithm shows a mean accuracy of 78% and 64% for the subject-specific and generic models, respectively. The classification model was additionally validated on a reference dataset, with similar results.
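
    A minimal sketch of the described pipeline, assuming windowed single-channel EEG: wavelet-subband variance features (via PyWavelets) feeding a non-linear SVM. The wavelet, window length, and labels are assumptions for illustration.

```python
# Hedged sketch: wavelet subband statistics -> non-linear SVM.
import numpy as np
import pywt
from sklearn.svm import SVC

def subband_features(window, wavelet="db4", level=4):
    # Log-variance of each wavelet decomposition level as a feature.
    coeffs = pywt.wavedec(window, wavelet, level=level)
    return np.array([np.log(np.var(c) + 1e-12) for c in coeffs])

rng = np.random.default_rng(0)
windows = rng.standard_normal((100, 512))   # placeholder EEG windows
labels = rng.integers(0, 2, 100)            # artifact / clean (placeholder)

X = np.vstack([subband_features(w) for w in windows])
clf = SVC(kernel="rbf").fit(X, labels)
```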

  16. Deep feature learning for automatic tissue classification of coronary artery using optical coherence tomography

    PubMed Central

    Abdolmanafi, Atefeh; Duong, Luc; Dahdah, Nagib; Cheriet, Farida

    2017-01-01

    Kawasaki disease (KD) is an acute childhood disease complicated by coronary artery aneurysms, intima thickening, thrombi, stenosis, lamellar calcifications, and disappearance of the media border. Automatic classification of the coronary artery layers (intima, media, and scar features) is important for analyzing optical coherence tomography (OCT) images recorded in pediatric patients. OCT is an intracoronary imaging modality using near-infrared light that has recently been used to image the inner coronary artery tissues of pediatric patients, providing high spatial resolution (ranging from 10 to 20 μm). This study aims to develop a robust and fully automated tissue classification method by using convolutional neural networks (CNNs) as the feature extractor and comparing the predictions of three state-of-the-art classifiers: CNN, random forest (RF), and support vector machine (SVM). The results show the robustness of the CNN as feature extractor and the random forest as classifier, with a classification rate of up to 96%, especially in characterizing the second layer of coronary arteries (media), a very thin layer that is challenging to recognize and distinguish from other tissues. PMID:28271012
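
    The CNN-as-feature-extractor plus random-forest design can be sketched as below, using an ImageNet-pretrained backbone as a stand-in for the authors' network and random tensors in place of OCT patches.

```python
# Minimal sketch: pretrained CNN features -> random forest classifier.
import torch
from torchvision import models
from sklearn.ensemble import RandomForestClassifier

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # drop the classification head
backbone.eval()

with torch.no_grad():
    images = torch.rand(8, 3, 224, 224)        # placeholder OCT patches
    feats = backbone(images).numpy()           # 512-d deep features

labels = [0, 1, 0, 1, 0, 1, 0, 1]              # placeholder tissue labels
rf = RandomForestClassifier(n_estimators=100).fit(feats, labels)
```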

  17. Performance evaluation of statistical and neural network classifiers for automatic land use/cover classification

    NASA Astrophysics Data System (ADS)

    Pettit, Elaine J.; Bailey, Robert R.; Bowden, Ronald L.; Ashley, Ross C.

    1993-04-01

    A comparative analysis of the performance of five classifier types combined with four feature extraction techniques is presented for the automatic recognition of land use/cover categories from aerial imagery through texture analysis. The classification accuracies of the linear, Bayes quadratic, k-nearest neighbor, Parzen, and backpropagation-trained multi-layer perceptron classifiers are evaluated in combination with the following texture measures: spatial gray-level co-occurrence matrix, Laws, Liu-Jernigan, and Fourier domain rings and wedges. Examples of four land use/cover classes -- urban, fields, trees, and water -- are manually delineated from commercial aerial survey panchromatic images per the U.S. Geological Survey Land Use/Land Cover Classification System. Through leave-one-scene-out sampling, each classifier type is trained and tested using feature vectors generated by each feature extraction technique. The mean classification error and an 80% confidence interval for each combination of classifier and feature extraction method are determined. Error overlap is analyzed to assess improvement of performance through fusing the results from two or more classifier-feature set combinations. The significance of this work lies both in the results of the comparative analysis and in its adherence to formal experimental methodology. We anticipate that these results will be applicable to a wide variety of image recognition problems where texture is a principal discriminant, including medical screening, remote sensing, and materials identification.

  18. Deep feature learning for automatic tissue classification of coronary artery using optical coherence tomography.

    PubMed

    Abdolmanafi, Atefeh; Duong, Luc; Dahdah, Nagib; Cheriet, Farida

    2017-02-01

    Kawasaki disease (KD) is an acute childhood disease complicated by coronary artery aneurysms, intima thickening, thrombi, stenosis, lamellar calcifications, and disappearance of the media border. Automatic classification of the coronary artery layers (intima, media, and scar features) is important for analyzing optical coherence tomography (OCT) images recorded in pediatric patients. OCT is an intracoronary imaging modality using near-infrared light that has recently been used to image the inner coronary artery tissues of pediatric patients, providing high spatial resolution (ranging from 10 to 20 μm). This study aims to develop a robust and fully automated tissue classification method by using convolutional neural networks (CNNs) as the feature extractor and comparing the predictions of three state-of-the-art classifiers: CNN, random forest (RF), and support vector machine (SVM). The results show the robustness of the CNN as feature extractor and the random forest as classifier, with a classification rate of up to 96%, especially in characterizing the second layer of coronary arteries (media), a very thin layer that is challenging to recognize and distinguish from other tissues.

  19. Automatic classification of background EEG activity in healthy and sick neonates.

    PubMed

    Löfhede, Johan; Thordstein, Magnus; Löfgren, Nils; Flisberg, Anders; Rosa-Zurera, Manuel; Kjellmer, Ingemar; Lindecrantz, Kaj

    2010-02-01

    The overall aim of our research is to develop methods for a monitoring system to be used at neonatal intensive care units. When monitoring a baby, a range of different types of background activity needs to be considered. In this work, we have developed a scheme for automatic classification of background EEG activity in newborn babies. EEG from six full-term babies who were displaying a burst suppression pattern while suffering from the after-effects of asphyxia during birth was included along with EEG from 20 full-term healthy newborn babies. The signals from the healthy babies were divided into four behavioural states: active awake, quiet awake, active sleep and quiet sleep. By using a number of features extracted from the EEG together with Fisher's linear discriminant classifier we have managed to achieve 100% correct classification when separating burst suppression EEG from all four healthy EEG types and 93% true positive classification when separating quiet sleep from the other types. The other three sleep stages could not be classified. When the pathological burst suppression pattern was detected, the analysis was taken one step further and the signal was segmented into burst and suppression, allowing clinically relevant parameters such as suppression length and burst suppression ratio to be calculated. The segmentation of the burst suppression EEG works well, with a probability of error around 4%.

  20. Automatic Classification of Protein Structure Using the Maximum Contact Map Overlap Metric

    DOE PAGES

    Andonov, Rumen; Djidjev, Hristo Nikolov; Klau, Gunnar W.; ...

    2015-10-09

    In this paper, we propose a new distance measure for comparing two protein structures based on their contact map representations. We show that our novel measure, which we refer to as the maximum contact map overlap (max-CMO) metric, satisfies all properties of a metric on the space of protein representations. Having a metric in that space allows one to avoid pairwise comparisons on the entire database and, thus, to significantly accelerate exploring the protein space compared to no-metric spaces. We show on a gold standard superfamily classification benchmark set of 6759 proteins that our exact k-nearest neighbor (k-NN) scheme classifies up to 224 out of 236 queries correctly, and on a larger, extended version of the benchmark with 60,850 additional structures, up to 1361 out of 1369 queries. Our k-NN classification thus provides a promising approach for the automatic classification of protein structures based on flexible contact map overlap alignments.

  1. Does expert knowledge improve automatic probabilistic classification of gait joint motion patterns in children with cerebral palsy?

    PubMed

    De Laet, Tinne; Papageorgiou, Eirini; Nieuwenhuys, Angela; Desloovere, Kaat

    2017-01-01

    This study aimed to improve the automatic probabilistic classification of joint motion gait patterns in children with cerebral palsy by using the expert knowledge available via a recently developed Delphi-consensus study. To this end, this study applied both Naïve Bayes and Logistic Regression classification with varying degrees of usage of the expert knowledge (expert-defined and discretized features). A database of 356 patients and 1719 gait trials was used to validate the classification performance of eleven joint motions. Two main hypotheses stated that: (1) Joint motion patterns in children with CP, obtained through a Delphi-consensus study, can be automatically classified following a probabilistic approach, with an accuracy similar to clinical expert classification, and (2) The inclusion of clinical expert knowledge in the selection of relevant gait features and the discretization of continuous features increases the performance of automatic probabilistic joint motion classification. This study provided objective evidence supporting the first hypothesis. Automatic probabilistic gait classification using the expert knowledge available from the Delphi-consensus study resulted in accuracy (91%) similar to that obtained with two expert raters (90%), and higher accuracy than that obtained with non-expert raters (78%). Regarding the second hypothesis, this study demonstrated that the use of more advanced machine learning techniques such as automatic feature selection and discretization instead of expert-defined and discretized features can result in slightly higher joint motion classification performance. However, the increase in performance is limited and does not outweigh the additional computational cost and the higher risk of loss of clinical interpretability, which threatens the clinical acceptance and applicability.

  2. Does expert knowledge improve automatic probabilistic classification of gait joint motion patterns in children with cerebral palsy?

    PubMed Central

    Papageorgiou, Eirini; Nieuwenhuys, Angela; Desloovere, Kaat

    2017-01-01

    Background: This study aimed to improve the automatic probabilistic classification of joint motion gait patterns in children with cerebral palsy by using the expert knowledge available via a recently developed Delphi-consensus study. To this end, this study applied both Naïve Bayes and Logistic Regression classification with varying degrees of usage of the expert knowledge (expert-defined and discretized features). A database of 356 patients and 1719 gait trials was used to validate the classification performance of eleven joint motions. Hypotheses: Two main hypotheses stated that: (1) Joint motion patterns in children with CP, obtained through a Delphi-consensus study, can be automatically classified following a probabilistic approach, with an accuracy similar to clinical expert classification, and (2) The inclusion of clinical expert knowledge in the selection of relevant gait features and the discretization of continuous features increases the performance of automatic probabilistic joint motion classification. Findings: This study provided objective evidence supporting the first hypothesis. Automatic probabilistic gait classification using the expert knowledge available from the Delphi-consensus study resulted in accuracy (91%) similar to that obtained with two expert raters (90%), and higher accuracy than that obtained with non-expert raters (78%). Regarding the second hypothesis, this study demonstrated that the use of more advanced machine learning techniques such as automatic feature selection and discretization instead of expert-defined and discretized features can result in slightly higher joint motion classification performance. However, the increase in performance is limited and does not outweigh the additional computational cost and the higher risk of loss of clinical interpretability, which threatens the clinical acceptance and applicability. PMID:28570616

  3. Markov model recognition and classification of DNA/protein sequences within large text databases.

    PubMed

    Wren, Jonathan D; Hildebrand, William H; Chandrasekaran, Sreedevi; Melcher, Ulrich

    2005-11-01

    Short sequence patterns frequently define regions of biological interest (binding sites, immune epitopes, primers, etc.), yet a large fraction of this information exists only within the scientific literature and is thus difficult to locate via conventional means (e.g. keyword queries or manual searches). We describe herein a system to accurately identify and classify sequence patterns from within large corpora using an n-gram Markov model (MM). As expected, on test sets we found that identification of sequences with limited alphabets and/or regular structures such as nucleic acids (non-ambiguous) and peptide abbreviations (3-letter) was highly accurate, whereas classification of symbolic (1-letter) peptide strings with more complex alphabets was more problematic. The MM was used to analyze two very large, sequence-containing corpora: over 7.75 million Medline abstracts and 9000 full-text articles from Journal of Virology. Performance was benchmarked by comparing the results with Journal of Virology entries in two existing manually curated databases: VirOligo and the HLA Ligand Database. Performance estimates were 98 ± 2% precision/84% recall for primer identification and classification and 67 ± 6% precision/85% recall for peptide epitopes. We also find a dramatic difference between the amounts of sequence-related data reported in abstracts versus full text. Our results suggest that automated extraction and classification of sequence elements is a promising, low-cost means of sequence database curation and annotation. MM routine and datasets are available upon request.
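
    A toy sketch of the underlying idea, scoring a string under a character bigram Markov model trained on known sequences; the training set, smoothing floor, and threshold policy are placeholders rather than the paper's n-gram configuration.

```python
# Hedged sketch: character bigram Markov model for sequence-likeness scoring.
from collections import defaultdict
import math

def train_bigram(seqs):
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    # Normalize counts into transition probabilities.
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

def log_likelihood(s, model, floor=1e-6):
    # Unseen transitions get a small floor probability.
    return sum(math.log(model.get(a, {}).get(b, floor)) for a, b in zip(s, s[1:]))

dna_model = train_bigram(["ACGTACGTTAGC", "GGCTAACGTTAG"])  # toy training set
print(log_likelihood("ACGTTAGC", dna_model))   # higher => more DNA-like
```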

  4. Enhanced information retrieval from narrative German-language clinical text documents using automated document classification.

    PubMed

    Spat, Stephan; Cadonna, Bruno; Rakovac, Ivo; Gütl, Christian; Leitner, Hubert; Stark, Günther; Beck, Peter

    2008-01-01

    The amount of narrative clinical text documents stored in Electronic Patient Records (EPR) of Hospital Information Systems is increasing. Physicians spend a lot of time finding relevant patient-related information for medical decision making in these clinical text documents. Thus, efficient and topical retrieval of relevant patient-related information is an important task in an EPR system. This paper describes the prototype of a medical information retrieval system (MIRS) for clinical text documents. The open-source information retrieval framework Apache Lucene has been used to implement the prototype of the MIRS. Additionally, a multi-label classification system based on the open-source data mining framework WEKA generates metadata from the clinical text document set. The metadata is used to influence the rank order of documents retrieved by physicians. Combining information retrieval and automated document classification offers an enhanced approach to let physicians, and in the near future patients, define their information needs for information stored in an EPR. The system has been designed as a J2EE Web application. First findings are based on a sample of 18,000 unstructured clinical text documents written in German.

  5. Notes for the improvement of the spatial and spectral data classification method. [automatic classification and mapping of earth resources satellite data

    NASA Technical Reports Server (NTRS)

    Dalton, C. C.

    1974-01-01

    This report examines the spatial and spectral clustering technique for the unsupervised automatic classification and mapping of earth resources satellite data, and makes a theoretical analysis of the decision rules and tests in order to suggest how the method might best be applied to other flight data, such as Skylab and Spacelab.

  6. Named entity recognition and classification in biomedical text using classifier ensemble.

    PubMed

    Saha, Sriparna; Ekbal, Asif; Sikdar, Utpal Kumar

    2015-01-01

    Named Entity Recognition and Classification (NERC) is an important task in information extraction for the biomedicine domain. Biomedical named entities include mentions of proteins, genes, DNA, RNA, etc., which, in general, have complex structures and are difficult to recognise. In this paper, we propose a Single Objective Optimisation-based classifier ensemble technique using the search capability of a Genetic Algorithm (GA) for NERC in biomedical texts. Here, the GA is used to quantify the amount of voting for each class in each classifier. We use diverse classification methods, such as Conditional Random Fields and Support Vector Machines, to build a number of models depending upon the various representations of the set of features and/or feature templates. The proposed technique is evaluated with two benchmark datasets, namely JNLPBA 2004 and GENETAG. Experiments yield overall F-measure values of 75.97% and 95.90%, respectively. Comparisons with existing systems show that our proposed system achieves state-of-the-art performance.

  7. Graph Theory-Based Brain Connectivity for Automatic Classification of Multiple Sclerosis Clinical Courses

    PubMed Central

    Kocevar, Gabriel; Stamile, Claudio; Hannoun, Salem; Cotton, François; Vukusic, Sandra; Durand-Dubief, Françoise; Sappey-Marinier, Dominique

    2016-01-01

    Purpose: In this work, we introduce a method to classify Multiple Sclerosis (MS) patients into four clinical profiles using structural connectivity information. For the first time, we try to solve this question in a fully automated way using a computer-based method. The main goal is to show how the combination of graph-derived metrics with machine learning techniques constitutes a powerful tool for a better characterization and classification of MS clinical profiles. Materials and Methods: Sixty-four MS patients [12 Clinically Isolated Syndrome (CIS), 24 Relapsing Remitting (RR), 24 Secondary Progressive (SP), and 17 Primary Progressive (PP)] along with 26 healthy controls (HC) underwent MR examination. T1 and diffusion tensor imaging (DTI) were used to obtain structural connectivity matrices for each subject. Global graph metrics, such as density and modularity, were estimated and compared between subjects' groups. These metrics were further used to classify patients using a tuned Support Vector Machine (SVM) combined with a Radial Basis Function (RBF) kernel. Results: When comparing MS patients to HC subjects, a greater assortativity, transitivity, and characteristic path length as well as a lower global efficiency were found. Using all graph metrics, the best F-Measures (91.8, 91.8, 75.6, and 70.6%) were obtained for binary (HC-CIS, CIS-RR, RR-PP) and multi-class (CIS-RR-SP) classification tasks, respectively. When using only one graph metric, the best F-Measures (83.6, 88.9, and 70.7%) were achieved for modularity with previous binary classification tasks. Conclusion: Based on a simple DTI acquisition associated with structural brain connectivity analysis, this automatic method allowed an accurate classification of different MS patients' clinical profiles. PMID:27826224
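
    The pipeline reduces each subject to a handful of global graph metrics and then classifies those feature vectors with an RBF-kernel SVM. A minimal sketch with random placeholder connectivity matrices:

```python
# Minimal sketch: global graph metrics from connectivity matrices -> RBF SVM.
import numpy as np
import networkx as nx
from sklearn.svm import SVC

def global_metrics(conn):
    g = nx.from_numpy_array(conn)
    return [nx.density(g), nx.transitivity(g),
            nx.degree_assortativity_coefficient(g)]

rng = np.random.default_rng(0)
mats = [(rng.random((30, 30)) > 0.7).astype(float) for _ in range(20)]
mats = [np.triu(m, 1) + np.triu(m, 1).T for m in mats]   # symmetrize, zero diagonal
labels = rng.integers(0, 2, 20)                          # e.g. CIS vs RR (placeholder)

X = np.array([global_metrics(m) for m in mats])
clf = SVC(kernel="rbf").fit(X, labels)
```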

  8. Automatic classification of prostate cancer Gleason scores from multiparametric magnetic resonance images

    PubMed Central

    Fehr, Duc; Veeraraghavan, Harini; Wibmer, Andreas; Gondo, Tatsuo; Matsumoto, Kazuhiro; Vargas, Herbert Alberto; Sala, Evis; Hricak, Hedvig; Deasy, Joseph O.

    2015-01-01

    Noninvasive, radiological image-based detection and stratification of Gleason patterns can impact clinical outcomes, treatment selection, and the determination of disease status at diagnosis without subjecting patients to surgical biopsies. We present machine learning-based automatic classification of prostate cancer aggressiveness by combining apparent diffusion coefficient (ADC) and T2-weighted (T2-w) MRI-based texture features. Our approach achieved reasonably accurate classification of Gleason scores (GS) 6(3+3) vs. ≥7 and 7(3+4) vs. 7(4+3) despite the presence of highly unbalanced samples by using two different sample augmentation techniques followed by feature selection-based classification. Our method distinguished between GS 6(3+3) and ≥7 cancers with 93% accuracy for cancers occurring in both peripheral (PZ) and transition (TZ) zones and 92% for cancers occurring in the PZ alone. Our approach distinguished the GS 7(3+4) from GS 7(4+3) with 92% accuracy for cancers occurring in both the PZ and TZ and with 93% for cancers occurring in the PZ alone. In comparison, a classifier using only the ADC mean achieved a top accuracy of 58% for distinguishing GS 6(3+3) vs. GS ≥7 for cancers occurring in PZ and TZ and 63% for cancers occurring in PZ alone. The same classifier achieved an accuracy of 59% for distinguishing GS 7(3+4) from GS 7(4+3) occurring in the PZ and TZ and 60% for cancers occurring in PZ alone. Separate analysis of the cancers occurring in TZ alone was not performed owing to the limited number of samples. Our results suggest that texture features derived from ADC and T2-w MRI together with sample augmentation can help to obtain reasonably accurate classification of Gleason patterns. PMID:26578786

  9. Automatic classification of endoscopic images for premalignant conditions of the esophagus

    NASA Astrophysics Data System (ADS)

    Boschetto, Davide; Gambaretto, Gloria; Grisan, Enrico

    2016-03-01

    Barrett's esophagus (BE) is a precancerous complication of gastroesophageal reflux disease in which the normal stratified squamous epithelium lining the esophagus is replaced by intestinal metaplastic columnar epithelium. Repeated endoscopies and multiple biopsies are often necessary to establish the presence of intestinal metaplasia. Narrow Band Imaging (NBI) is an imaging technique commonly used with endoscopies that enhances the contrast of the vascular pattern on the mucosa. We present a computer-based method for the automatic normal/metaplastic classification of endoscopic NBI images. Superpixel segmentation is used to identify and cluster pixels belonging to uniform regions. From each uniform clustered region of pixels, eight features maximizing the differences between normal and metaplastic epithelium are extracted for the classification step. For each superpixel, the mean intensities of the three color channels are first selected as features. Three additional features are the mean intensities of each superpixel after separately applying three different morphological filters (top-hat filtering, entropy filtering and range filtering) to the red-channel image. The last two features require the computation of the Grey-Level Co-Occurrence Matrix (GLCM) and reflect the contrast and homogeneity of each superpixel. The classification step is performed using an ensemble of 50 classification trees with a 10-fold cross-validation scheme, training the classifier at each step on a random 70% of the images and testing on the remaining 30% of the dataset. Sensitivity and specificity are 79.2% and 87.3%, respectively, with an overall accuracy of 83.9%.
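
    The two GLCM-derived features mentioned above (contrast and homogeneity) can be computed with scikit-image as sketched below; the patch is a random stand-in for a superpixel region.

```python
# Hedged sketch: GLCM contrast and homogeneity for an image patch.
import numpy as np
from skimage.feature import graycomatrix, graycoprops

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, (64, 64), dtype=np.uint8)  # stand-in superpixel

glcm = graycomatrix(patch, distances=[1], angles=[0], levels=256,
                    symmetric=True, normed=True)
contrast = graycoprops(glcm, "contrast")[0, 0]
homogeneity = graycoprops(glcm, "homogeneity")[0, 0]
print(contrast, homogeneity)
```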

  10. Automatic Building Detection based on Supervised Classification using High Resolution Google Earth Images

    NASA Astrophysics Data System (ADS)

    Ghaffarian, S.; Ghaffarian, S.

    2014-08-01

    This paper presents a novel approach to building detection that automates the training-area collection stage of supervised classification. The method is based on the fact that a 3D building structure should cast a shadow under suitable imaging conditions. Therefore, the methodology begins with detecting and masking out the shadow areas using the luminance component of the LAB color space, which indicates the lightness of the image, and a novel double-thresholding technique. Further, the training areas for supervised classification are selected by automatically determining a buffer zone on each building whose shadow is detected, using the shadow shape and the sun illumination direction. Thereafter, by calculating the statistics of each buffer zone collected from the building areas, the Improved Parallelepiped Supervised Classification is executed to detect the buildings. Standard-deviation thresholding is applied to the Parallelepiped classification method to improve its accuracy. Finally, simple morphological operations are conducted to remove noise and increase the accuracy of the results. The experiments were performed on a set of high-resolution Google Earth images. The performance of the proposed approach was assessed by comparing its results with reference data using well-known quality measurements (Precision, Recall and F1-score) to evaluate the pixel-based and object-based performances. Evaluation of the results illustrates that buildings detected from dense and suburban districts with diverse characteristics and color combinations using the proposed method have 88.4 % and 85.3 % overall pixel-based and object-based precision performances, respectively.

  11. Graph Theory-Based Brain Connectivity for Automatic Classification of Multiple Sclerosis Clinical Courses.

    PubMed

    Kocevar, Gabriel; Stamile, Claudio; Hannoun, Salem; Cotton, François; Vukusic, Sandra; Durand-Dubief, Françoise; Sappey-Marinier, Dominique

    2016-01-01

    Purpose: In this work, we introduce a method to classify Multiple Sclerosis (MS) patients into four clinical profiles using structural connectivity information. For the first time, we try to solve this question in a fully automated way using a computer-based method. The main goal is to show how the combination of graph-derived metrics with machine learning techniques constitutes a powerful tool for a better characterization and classification of MS clinical profiles. Materials and Methods: Sixty-four MS patients [12 Clinically Isolated Syndrome (CIS), 24 Relapsing Remitting (RR), 24 Secondary Progressive (SP), and 17 Primary Progressive (PP)] along with 26 healthy controls (HC) underwent MR examination. T1 and diffusion tensor imaging (DTI) were used to obtain structural connectivity matrices for each subject. Global graph metrics, such as density and modularity, were estimated and compared between subjects' groups. These metrics were further used to classify patients using a tuned Support Vector Machine (SVM) combined with a Radial Basis Function (RBF) kernel. Results: When comparing MS patients to HC subjects, a greater assortativity, transitivity, and characteristic path length as well as a lower global efficiency were found. Using all graph metrics, the best F-Measures (91.8, 91.8, 75.6, and 70.6%) were obtained for binary (HC-CIS, CIS-RR, RR-PP) and multi-class (CIS-RR-SP) classification tasks, respectively. When using only one graph metric, the best F-Measures (83.6, 88.9, and 70.7%) were achieved for modularity with previous binary classification tasks. Conclusion: Based on a simple DTI acquisition associated with structural brain connectivity analysis, this automatic method allowed an accurate classification of different MS patients' clinical profiles.

  12. Effective key parameter determination for an automatic approach to land cover classification based on multispectral remote sensing imagery.

    PubMed

    Wang, Yong; Jiang, Dong; Zhuang, Dafang; Huang, Yaohuan; Wang, Wei; Yu, Xinfang

    2013-01-01

    The classification of land cover based on satellite data is important for many areas of scientific research. Unfortunately, some traditional land cover classification methods (e.g., supervised classification) are very labor-intensive and subjective because of the required human involvement. Jiang et al. proposed a simple but robust method for land cover classification using a prior classification map and a current multispectral remote sensing image. This new method has proven to be a suitable classification method; however, its drawback is that it is semi-automatic, because the key parameters cannot be selected automatically. In this study, we propose an approach in which the two key parameters are chosen automatically. The proposed method consists primarily of the following three interdependent parts: the selection procedure for the pure-pixel training-sample dataset, the method to determine the key parameters, and the optimal combination model. The proposed approach employs overall accuracy with its Kappa coefficient (KC) and time consumption (TC, in seconds) to select the two key parameters automatically rather than by manual trial, which avoids subjective bias. A case study of Weichang District of Hebei Province, China, using Landsat-5/TM data of 2010 with 30 m spatial resolution and a prior classification map of 2005, recognised as relatively precise data, was conducted to test the performance of this method. The experimental results show that the key-parameter determination methodology, which uses the portfolio optimisation model, increases the degree of automation of Jiang et al.'s classification method and may have a wide scope of scientific application.

  13. Effective Key Parameter Determination for an Automatic Approach to Land Cover Classification Based on Multispectral Remote Sensing Imagery

    PubMed Central

    Wang, Yong; Jiang, Dong; Zhuang, Dafang; Huang, Yaohuan; Wang, Wei; Yu, Xinfang

    2013-01-01

    The classification of land cover based on satellite data is important for many areas of scientific research. Unfortunately, some traditional land cover classification methods (e.g., supervised classification) are very labor-intensive and subjective because of the required human involvement. Jiang et al. proposed a simple but robust method for land cover classification using a prior classification map and a current multispectral remote sensing image. This new method has proven to be a suitable classification method; however, its drawback is that it is semi-automatic, because the key parameters cannot be selected automatically. In this study, we propose an approach in which the two key parameters are chosen automatically. The proposed method consists primarily of the following three interdependent parts: the selection procedure for the pure-pixel training-sample dataset, the method to determine the key parameters, and the optimal combination model. The proposed approach employs overall accuracy with its Kappa coefficient (KC) and time consumption (TC, in seconds) to select the two key parameters automatically rather than by manual trial, which avoids subjective bias. A case study of Weichang District of Hebei Province, China, using Landsat-5/TM data of 2010 with 30 m spatial resolution and a prior classification map of 2005, recognised as relatively precise data, was conducted to test the performance of this method. The experimental results show that the key-parameter determination methodology, which uses the portfolio optimisation model, increases the degree of automation of Jiang et al.'s classification method and may have a wide scope of scientific application. PMID:24204582

  14. Relevance popularity: A term event model based feature selection scheme for text classification

    PubMed Central

    Yang, Fengqin; Wang, Han; Zhang, Libiao

    2017-01-01

    Feature selection is a practical approach for improving the performance of text classification methods by optimizing the feature subsets input to classifiers. In traditional feature selection methods such as information gain and chi-square, the number of documents that contain a particular term (i.e. the document frequency) is often used. However, the frequency with which a given term appears in each document has not been fully investigated, even though it is a promising feature for producing accurate classifications. In this paper, we propose a new feature selection scheme based on a term-event multinomial naive Bayes probabilistic model. According to the model assumptions, the matching score function, which is based on the prediction probability ratio, can be factorized. Finally, we derive a feature selection measurement for each term after replacing the inner parameters by their estimators. On a benchmark English text dataset (20 Newsgroups) and a Chinese text dataset (MPH-20), numerical experiments with two widely used text classifiers (naive Bayes and support vector machine) demonstrate that our method outperforms representative feature selection methods. PMID:28379986
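
    A hedged sketch of a probability-ratio style term score under a multinomial naive Bayes model; this illustrates the flavor of the measurement rather than reproducing the paper's exact derivation, and the tiny term-frequency matrix is illustrative.

```python
# Hedged sketch: rank terms by how strongly they shift the NB probability ratio.
import numpy as np

def term_scores(tf, y, alpha=1.0):
    """tf: (n_docs, n_terms) term-frequency matrix; y: binary labels."""
    pos, neg = tf[y == 1], tf[y == 0]
    # Laplace-smoothed class-conditional term probabilities.
    p_pos = (pos.sum(0) + alpha) / (pos.sum() + alpha * tf.shape[1])
    p_neg = (neg.sum(0) + alpha) / (neg.sum() + alpha * tf.shape[1])
    return np.abs(np.log(p_pos / p_neg))   # terms that shift the ratio most

tf = np.array([[3, 0, 1], [2, 0, 2], [0, 4, 1], [0, 3, 0]])
y = np.array([1, 1, 0, 0])
print(term_scores(tf, y))   # rank terms, keep the top-k as features
```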

  15. Relevance popularity: A term event model based feature selection scheme for text classification.

    PubMed

    Feng, Guozhong; An, Baiguo; Yang, Fengqin; Wang, Han; Zhang, Libiao

    2017-01-01

    Feature selection is a practical approach for improving the performance of text classification methods by optimizing the feature subsets input to classifiers. In traditional feature selection methods such as information gain and chi-square, the number of documents that contain a particular term (i.e. the document frequency) is often used. However, the frequency with which a given term appears in each document has not been fully investigated, even though it is a promising feature for producing accurate classifications. In this paper, we propose a new feature selection scheme based on a term-event multinomial naive Bayes probabilistic model. According to the model assumptions, the matching score function, which is based on the prediction probability ratio, can be factorized. Finally, we derive a feature selection measurement for each term after replacing the inner parameters by their estimators. On a benchmark English text dataset (20 Newsgroups) and a Chinese text dataset (MPH-20), numerical experiments with two widely used text classifiers (naive Bayes and support vector machine) demonstrate that our method outperforms representative feature selection methods.

  16. Why Text Segment Classification Based on Part of Speech Feature Selection

    NASA Astrophysics Data System (ADS)

    Nagy, Iulia; Tanaka, Katsuyuki; Ariki, Yasuo

    The aim of our research is to develop a scalable automatic why-question answering system for English based on a supervised method that uses part-of-speech analysis. The prior approach consisted in building a why-classifier using function words. This paper investigates the performance of combining supervised data mining methods with various feature selection strategies in order to obtain a more accurate why-classifier. Feature selection was performed a priori on the dataset to extract representative verbs and/or nouns and to avoid the curse of dimensionality. LogitBoost and SVM were used for the classification process. Three methods of extending the initial "function words only" approach to handle context-dependent features are proposed and experimentally evaluated on various datasets. The first considers function words and context-independent adverbs; the second incorporates selected lemmatized verbs; the third contains selected lemmatized verbs and nouns. Experiments on web-extracted datasets showed that all methods performed better than the baseline, with slightly more reliable results for the third one.

  17. Automatic sleep stage classification of single-channel EEG by using complex-valued convolutional neural network.

    PubMed

    Zhang, Junming; Wu, Yan

    2017-02-21

    Many systems have been developed for automatic sleep stage classification, but nearly all of them are based on handcrafted features. Because the feature space is large, feature selection is usually required. Moreover, designing handcrafted features is a difficult and time-consuming task, because it requires the domain knowledge of experienced experts, and results vary with the set of features chosen to identify sleep stages. Additionally, features that we are unaware of may exist, and these features may be important for sleep stage classification. Therefore, a new sleep stage classification system, based on a complex-valued convolutional neural network (CCNN), is proposed in this study. Unlike existing sleep stage methods, our method can automatically extract features from raw electroencephalography data and then classify the sleep stage based on the learned features. Additionally, we prove that the decision boundaries for the real and imaginary parts of a complex-valued convolutional neuron intersect orthogonally. The classification performance of handcrafted features is compared with that of the features learned via the CCNN. Experimental results show that the proposed method is comparable to existing methods, and that the CCNN obtains better classification performance and considerably faster convergence than a real-valued convolutional neural network. The results also show that the proposed method is a useful decision-support tool for automatic sleep stage classification.
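
    The core building block, a complex-valued convolution, can be implemented with two real convolutions, since (a+bi)(w+vi) = (aw-bv) + (av+bw)i. A minimal PyTorch sketch of that layer (not the paper's full CCNN architecture):

```python
# Minimal sketch of a complex-valued 1-D convolution built from two real convs.
import torch
import torch.nn as nn

class ComplexConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, k):
        super().__init__()
        self.conv_r = nn.Conv1d(in_ch, out_ch, k)  # real part of the kernel
        self.conv_i = nn.Conv1d(in_ch, out_ch, k)  # imaginary part

    def forward(self, x_r, x_i):
        # (a+bi)*(w+vi) = (aw - bv) + (av + bw)i
        out_r = self.conv_r(x_r) - self.conv_i(x_i)
        out_i = self.conv_r(x_i) + self.conv_i(x_r)
        return out_r, out_i

layer = ComplexConv1d(1, 8, 16)
eeg = torch.randn(4, 1, 3000)                 # placeholder 30-s EEG epochs
real, imag = layer(eeg, torch.zeros_like(eeg))
```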

  18. Automatic Galaxy Classification via Machine Learning Techniques: Parallelized Rotation/Flipping INvariant Kohonen Maps (PINK)

    NASA Astrophysics Data System (ADS)

    Polsterer, K. L.; Gieseke, F.; Igel, C.

    2015-09-01

    In recent decades, all-sky surveys have created an enormous amount of data that is publicly available on the Internet. Crowd-sourcing projects such as Galaxy-Zoo and Radio-Galaxy-Zoo encouraged users from all over the world to conduct various classification tasks manually. Combining the pattern-recognition capabilities of thousands of volunteers enabled scientists to finish the data analysis within an acceptable time. For upcoming surveys with billions of sources, however, this approach is no longer feasible. In this work, we present an unsupervised method that can automatically process large amounts of galaxy data and generate a set of prototypes. The resulting model can be used both to visualize the given galaxy data and to classify so-far-unseen images.

  19. Automatic classification of skin lesions using color mathematical morphology-based texture descriptors

    NASA Astrophysics Data System (ADS)

    Gonzalez-Castro, Victor; Debayle, Johan; Wazaefi, Yanal; Rahim, Mehdi; Gaudy-Marqueste, Caroline; Grob, Jean-Jacques; Fertil, Bernard

    2015-04-01

    In this paper, an automatic method for classifying skin lesions from dermoscopic images is proposed. The method relies on color texture analysis based on both color mathematical morphology and Kohonen Self-Organizing Maps (SOM), and it does not require any prior segmentation step. More concretely, mathematical morphology is used to compute a local descriptor for each pixel of the image, while the SOM is used to cluster these descriptors and thus create the texture descriptor of the global image. Two approaches are proposed, depending on whether the pixel descriptor is computed using classical (i.e. spatially invariant) or adaptive (i.e. spatially variant) mathematical morphology by means of the Color Adaptive Neighborhoods (CANs) framework. Both approaches obtained similar areas under the ROC curve (AUC), 0.854 and 0.859 respectively, outperforming the AUC built upon dermatologists' predictions (0.792).
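
    A tiny sketch of the SOM step: per-pixel descriptors are quantized against a small map, and the histogram of best-matching units becomes the global texture signature. The descriptors, map size, and learning schedule are illustrative, and the neighborhood update of a full SOM is omitted for brevity.

```python
# Hedged sketch: quantize pixel descriptors with a tiny SOM-like map, then
# summarize the image as a histogram of best-matching units.
import numpy as np

rng = np.random.default_rng(0)
descriptors = rng.random((5000, 6))            # placeholder pixel descriptors
nodes = rng.random((16, 6))                    # 4x4 map, flattened

for t in range(2000):                          # online updates
    x = descriptors[rng.integers(len(descriptors))]
    w = np.argmin(((nodes - x) ** 2).sum(1))   # best-matching unit
    lr = 0.5 * (1 - t / 2000)                  # decaying learning rate
    nodes[w] += lr * (x - nodes[w])            # (neighborhood update omitted)

# Image signature: histogram of best-matching units over all pixels.
bmu = ((descriptors[:, None, :] - nodes) ** 2).sum(2).argmin(1)
signature = np.bincount(bmu, minlength=16) / len(bmu)
```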

  20. Automatic segmentation and classification of tendon nuclei from IHC stained images

    NASA Astrophysics Data System (ADS)

    Kuok, Chan-Pang; Wu, Po-Ting; Jou, I.-Ming; Su, Fong-Chin; Sun, Yung-Nien

    2015-12-01

    Immunohistochemical (IHC) staining is commonly used for detecting cells in microscopy and for analyzing many types of diseases, e.g. breast cancer. Dispersion problems often occur in cell staining, which affects the accuracy of automatic counting. In this paper, we introduce a new method to overcome this problem. Otsu's thresholding method is first applied to exclude the background, so that only cells with dispersed staining remain in the foreground; refinement is then applied using a local adaptive thresholding method, guided by the irregularity index of the segmented foreground shapes. The refined segmentation results are also compared with those obtained using Otsu's thresholding alone. Finally, cell classification based on the shape and color indices obtained from the segmentation result is applied to label each cell as a normal, abnormal, or suspected abnormal case.
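
    A minimal sketch of the two-stage thresholding idea, using scikit-image: a global Otsu pass to drop the background, then a local adaptive threshold for refinement. The image and block size are placeholders.

```python
# Hedged sketch: global Otsu threshold followed by local adaptive refinement.
import numpy as np
from skimage.filters import threshold_otsu, threshold_local

rng = np.random.default_rng(0)
image = rng.random((256, 256))                # placeholder grayscale IHC image

foreground = image > threshold_otsu(image)   # global pass: drop background
local_thr = threshold_local(image, block_size=35)
refined = foreground & (image > local_thr)   # local refinement within foreground
```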

  1. Automatic segmentation and classification of mycobacterium tuberculosis with conventional light microscopy

    NASA Astrophysics Data System (ADS)

    Xu, Chao; Zhou, Dongxiang; Zhai, Yongping; Liu, Yunhui

    2015-12-01

    This paper presents the automatic segmentation and classification of Mycobacterium tuberculosis in conventional light microscopy images. First, the candidate bacillus objects are segmented by a marker-based watershed transform. The markers are obtained by adaptive threshold segmentation based on an adaptive-scale Gaussian filter, whose scale is determined according to the color model of the bacillus objects. The candidate objects are then extracted in full after region merging and contamination elimination. Second, the shapes of the bacillus objects are characterized by Hu moments, compactness, eccentricity, and roughness, which are used to classify single, touching, and non-bacillus objects. We evaluated logistic regression, random forest, and intersection-kernel support vector machine classifiers for classifying the bacillus objects. Experimental results demonstrate that the proposed method yields high robustness and accuracy; the logistic regression classifier performs best, with an accuracy of 91.68%.
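
    Marker-based watershed segmentation can be sketched with scikit-image as below; the marker generation here is a simple distance-transform heuristic rather than the paper's adaptive Gaussian scheme.

```python
# Hedged sketch: marker-based watershed to split touching objects.
import numpy as np
from scipy import ndimage as ndi
from skimage.filters import threshold_otsu
from skimage.segmentation import watershed

rng = np.random.default_rng(0)
image = rng.random((128, 128))                # placeholder microscopy image
binary = image > threshold_otsu(image)

distance = ndi.distance_transform_edt(binary)
markers, _ = ndi.label(distance > 0.7 * distance.max())  # seed markers
segments = watershed(-distance, markers, mask=binary)    # split touching objects
```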

  2. Field demonstration of an instrument performing automatic classification of geologic surfaces.

    PubMed

    Bekker, Dmitriy L; Thompson, David R; Abbey, William J; Cabrol, Nathalie A; Francis, Raymond; Manatt, Ken S; Ortega, Kevin F; Wagstaff, Kiri L

    2014-06-01

    This work presents a method with which to automate simple aspects of geologic image analysis during space exploration. Automated image analysis on board the spacecraft can make operations more efficient by generating compressed maps of long traverses for summary downlink. It can also enable immediate automatic responses to science targets of opportunity, improving the quality of targeted measurements collected with each command cycle. In addition, automated analyses on Earth can process large image catalogs, such as the growing database of Mars surface images, permitting more timely and quantitative summaries that inform tactical mission operations. We present TextureCam, a new instrument that incorporates real-time image analysis to produce texture-sensitive classifications of geologic surfaces in mesoscale scenes. A series of tests at the Cima Volcanic Field in the Mojave Desert, California, demonstrated mesoscale surficial mapping at two distinct sites of geologic interest.

  3. Performance of an automatic arrhythmia classification algorithm: comparison to the ALTITUDE electrophysiologist panel adjudications.

    PubMed

    Mahajan, Deepa; Dong, Yanting; Saxon, Leslie A; Cha, Yong-Mei; Gilliam, Francis Roosevelt; Asirvatham, Samuel J; Cesario, David A; Jones, Paul W; Seth, Milan; Powell, Brian D

    2014-07-01

    Adjudication of thousands of implantable cardioverter defibrillator (ICD)-treated arrhythmia episodes is labor intensive and, as a result, is most often left undone. The objective of this study was to evaluate an automatic classification algorithm for adjudication of ICD-treated arrhythmia episodes. The algorithm is based on machine learning and was developed using 776 arrhythmia episodes. It was validated on 131 dual-chamber ICD shock episodes from 127 patients adjudicated by seven electrophysiologists (EPs). Episodes were classified by panel consensus as ventricular tachycardia/ventricular fibrillation (VT/VF) or non-VT/VF, with the resulting classifications used as the reference. Subsequently, each episode's electrogram (EGM) data were randomly assigned to three EPs without the atrial lead information and to three EPs with the atrial lead information. Those episodes were also classified by the automatic algorithm with and without atrial information. Agreement with the reference was compared between the three-EP consensus group and the algorithm. The overall agreement with the reference was similar between the three-EP consensus and the algorithm both with atrial EGM (94% vs 95%, P = 0.87) and without atrial EGM (90% vs 91%, P = 0.91). The odds of accurate adjudication, after adjusting for covariates, did not significantly differ between the algorithm and EP consensus (odds ratio 1.02, 95% confidence interval: 0.97-1.06). This algorithm performs at a level comparable to an EP panel in the adjudication of arrhythmia episodes treated by both dual- and single-chamber ICDs. It has the potential for automated analysis of clinical ICD episodes and for adjudication of EGMs for research studies and quality analyses. ©2014 Wiley Periodicals, Inc.

  4. Automatic Detection and Classification of Unsafe Events During Power Wheelchair Use

    PubMed Central

    Moghaddam, Athena K.; Yuen, Hiu Kim; Archambault, Philippe S.; Routhier, François; Michaud, François; Boissy, Patrick

    2014-01-01

    Using a powered wheelchair (PW) is a complex task requiring advanced perceptual and motor control skills. Unfortunately, PW incidents and accidents are not uncommon and their consequences can be serious. The objective of this paper is to develop technological tools that can be used to characterize a wheelchair user’s driving behavior under various settings. In the experiments conducted, PWs are outfitted with a datalogging platform that records, in real-time, the 3-D acceleration of the PW. Data collection was conducted over 35 different activities, designed to capture a spectrum of PW driving events performed at different speeds (collisions with fixed or moving objects, rolling on an inclined plane, and rolling across multiple types of obstacles). The data were processed using time-series analysis and data mining techniques to automatically detect and identify the different events. We compared the classification accuracy using four different types of time-series features: 1) time-delay embeddings; 2) time-domain characterization; 3) frequency-domain features; and 4) wavelet transforms. In the analysis, we compared the classification accuracy obtained when distinguishing between safe and unsafe events during each of the 35 different activities. For the purposes of this study, unsafe events were defined as activities containing collisions against objects at different speeds, and the remainder were defined as safe events. We were able to accurately detect 98% of unsafe events, with a low (12%) false positive rate, using only five examples of each activity. This proof-of-concept study shows that the proposed approach has the potential of capturing, based on limited input from embedded sensors, contextual information on PW use, and of automatically characterizing a user’s PW driving behavior. PMID:27170879
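
    Two of the four feature families compared (time-domain characterization and frequency-domain features) can be sketched as below for one window of 3-axis accelerometer data; the sampling rate, window statistics, and band edges are assumptions:

      import numpy as np

      def time_domain_features(win):
          """win: (n_samples, 3) acceleration window -> simple magnitude statistics."""
          mag = np.linalg.norm(win, axis=1)
          return np.array([mag.mean(), mag.std(), mag.max() - mag.min(),
                           np.abs(np.diff(mag)).mean()])        # mean jerk proxy

      def frequency_features(win, fs=100.0, bands=((0, 2), (2, 5), (5, 10), (10, 25))):
          """Band energies of the detrended magnitude spectrum."""
          mag = np.linalg.norm(win, axis=1)
          spec = np.abs(np.fft.rfft(mag - mag.mean())) ** 2
          freqs = np.fft.rfftfreq(len(mag), d=1.0 / fs)
          return np.array([spec[(freqs >= lo) & (freqs < hi)].sum() for lo, hi in bands])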

  5. A Topic-modeling Based Framework for Drug-drug Interaction Classification from Biomedical Text.

    PubMed

    Li, Dingcheng; Liu, Sijia; Rastegar-Mojarad, Majid; Wang, Yanshan; Chaudhary, Vipin; Therneau, Terry; Liu, Hongfang

    2016-01-01

    Classification of drug-drug interactions (DDI) from the medical literature is important for preventing medication-related errors. Most existing machine learning approaches are based on supervised learning. However, the dynamic nature of drug knowledge, combined with the enormous and rapidly growing body of biomedical literature, makes supervised DDI classification methods prone to overfitting their corpora, so they may not meet the needs of real-world applications. In this paper, we propose a relation classification framework based on topic modeling (RelTM) augmented with distant supervision for the task of DDI classification from biomedical text. The uniqueness of RelTM lies in its two-level sampling from both DDI and drug entities; through this design, RelTM takes both relation features and drug mention features into consideration. An efficient inference algorithm for the model using Gibbs sampling is also proposed. Compared to previous supervised models, our approach does not require human efforts such as annotation and labeling, which is an advantage in big data applications. Meanwhile, the distant supervision allows RelTM to incorporate the rich existing knowledge resources provided by domain experts. Experimental results on the 2013 DDI challenge corpus reach 48% in F1 score, showing the effectiveness of RelTM.
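
    RelTM itself is not reproduced here, but its Gibbs-sampling inference builds on the standard collapsed sampler for topic models. The sketch below implements the collapsed Gibbs update for vanilla LDA, the building block the paper extends with its two-level sampling over relations and drug mentions; the hyperparameters are illustrative:

      import numpy as np

      def lda_gibbs(docs, n_topics, n_vocab, iters=200, alpha=0.1, beta=0.01, seed=0):
          """docs: list of lists of word ids. Returns doc-topic and topic-word counts."""
          rng = np.random.default_rng(seed)
          ndk = np.zeros((len(docs), n_topics)) + alpha     # doc-topic counts (+prior)
          nkw = np.zeros((n_topics, n_vocab)) + beta        # topic-word counts (+prior)
          nk = nkw.sum(axis=1)
          z = [rng.integers(n_topics, size=len(d)) for d in docs]
          for d, doc in enumerate(docs):                    # initialize count tables
              for i, w in enumerate(doc):
                  ndk[d, z[d][i]] += 1; nkw[z[d][i], w] += 1; nk[z[d][i]] += 1
          for _ in range(iters):
              for d, doc in enumerate(docs):
                  for i, w in enumerate(doc):
                      k = z[d][i]                           # remove current assignment
                      ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                      p = ndk[d] * nkw[:, w] / nk           # collapsed conditional
                      k = rng.choice(n_topics, p=p / p.sum())
                      z[d][i] = k                           # resample and restore counts
                      ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
          return ndk, nkw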

  6. A Topic-modeling Based Framework for Drug-drug Interaction Classification from Biomedical Text

    PubMed Central

    Li, Dingcheng; Liu, Sijia; Rastegar-Mojarad, Majid; Wang, Yanshan; Chaudhary, Vipin; Therneau, Terry; Liu, Hongfang

    2016-01-01

    Classification of drug-drug interactions (DDI) from the medical literature is important for preventing medication-related errors. Most existing machine learning approaches are based on supervised learning. However, the dynamic nature of drug knowledge, combined with the enormous and rapidly growing body of biomedical literature, makes supervised DDI classification methods prone to overfitting their corpora, so they may not meet the needs of real-world applications. In this paper, we propose a relation classification framework based on topic modeling (RelTM) augmented with distant supervision for the task of DDI classification from biomedical text. The uniqueness of RelTM lies in its two-level sampling from both DDI and drug entities; through this design, RelTM takes both relation features and drug mention features into consideration. An efficient inference algorithm for the model using Gibbs sampling is also proposed. Compared to previous supervised models, our approach does not require human efforts such as annotation and labeling, which is an advantage in big data applications. Meanwhile, the distant supervision allows RelTM to incorporate the rich existing knowledge resources provided by domain experts. Experimental results on the 2013 DDI challenge corpus reach 48% in F1 score, showing the effectiveness of RelTM. PMID:28269875

  7. Automatic classification of unexploded ordnance applied to live sites for MetalMapper sensor

    NASA Astrophysics Data System (ADS)

    Sigman, John Brevard; O'Neill, Kevin; Barrowes, Benjamin; Wang, Yinlin; Shubitidze, Fridon

    2014-06-01

    This paper extends a previously-introduced method for automatic classification of Unexploded Ordnance (UXO) across several datasets from live sites. We used the MetalMapper sensor, from which extrinsic and intrinsic parameters are determined by the combined Differential Evolution (DE) and Ortho-Normalized Volume Magnetic Source (ONVMS) algorithms. The inversion provides spatial locations and intrinsic time-series total ONVMS principal eigenvalues. These are fit to a power-decay empirical model, reducing each anomaly to 3 coefficients (k, b, and g) describing polarizability decay. Anomaly target features are grouped using the unsupervised Weighted-Pair Group Method with Averaging (WPGMA) clustering algorithm. Central elements of each cluster are dug, and the results are used to train the next round of dig requests. A Naive Bayes classifier is used as the supervised learning algorithm, in which the product of each feature's independent probability density represents each class of UXO in the feature space. We request ground truths for anomalies in rounds until there are no more Targets of Interest (TOI) in consecutive requests. This fully automatic procedure requires no expert intervention, saving time and money. Naive Bayes outperformed previous efforts with Gaussian Mixture Models (GMM) in all cases.
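
    The dimensionality-reduction step can be sketched as a curve fit of each anomaly's eigenvalue decay followed by Gaussian Naive Bayes on the fitted triples. The k * t^(-b) * exp(-t/g) decay form below is a common empirical choice for such curves and is an assumption, as are the class labels:

      import numpy as np
      from scipy.optimize import curve_fit
      from sklearn.naive_bayes import GaussianNB

      def decay_model(t, k, b, g):
          """Assumed power-decay form for the total ONVMS eigenvalue curve."""
          return k * t ** (-b) * np.exp(-t / g)

      def decay_features(t, eig):
          """Fit (k, b, g) to one anomaly's decay; t must be positive time gates."""
          (k, b, g), _ = curve_fit(decay_model, t, eig,
                                   p0=(eig[0], 1.0, t[-1]), maxfev=5000)
          return np.array([k, b, g])

      # X = np.array([decay_features(t, eig) for eig in anomalies])
      # clf = GaussianNB().fit(X_train, y_train)   # classes: TOI vs clutter (assumed)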

  8. An automatic segmentation and classification framework for anti-nuclear antibody images

    PubMed Central

    2013-01-01

    Autoimmune disease is a disorder of the immune system in which lymphocytes over-react against the body's own tissues. An Anti-Nuclear Antibody (ANA) is an autoantibody produced by the immune system and directed against one's own tissues or cells, and it plays an important role in the diagnosis of autoimmune diseases. The Indirect ImmunoFluorescence (IIF) method with HEp-2 cells provides the major screening method to detect ANA for the diagnosis of autoimmune diseases. At present, fluorescence patterns are usually examined laboriously by experienced physicians manually inspecting the slides under a microscope, a process that suffers from inter-observer variability and limited reproducibility. Previous research provided only simple methods and criteria for cell segmentation and recognition; a fully automatic framework for the segmentation and recognition of HEp-2 cells had never been reported. This study proposes a method based on the watershed algorithm to automatically detect HEp-2 cells with different patterns. The experimental results show that the segmentation performance of the proposed method is satisfactory when evaluated with percent volume overlap (PVO: 89%). The classification performance of an SVM classifier built on features calculated from the segmented cells achieves an average accuracy of 96.90%, outperforming other methods presented in previous studies. The proposed method can be used to develop a computer-aided system to assist physicians in the diagnosis of autoimmune diseases. PMID:24565042

  9. Automatic identification and classification of muscle spasms in long-term EMG recordings.

    PubMed

    Winslow, Jeffrey; Martinez, Adriana; Thomas, Christine K

    2015-03-01

    Spinal cord injured (SCI) individuals may be afflicted by spasticity, a condition in which involuntary muscle spasms are common. EMG recordings can be analyzed to quantify this symptom of spasticity, but manual identification and classification of spasms are time consuming. Here, an algorithm was created to find and classify spasm events automatically within 24-h EMG recordings. The algorithm used expert rules and time-frequency techniques to classify spasm events as tonic, unit, or clonus spasms. A companion graphical user interface (GUI) program was also built to verify and correct the results of the automatic algorithm or manually defined events. Eight-channel EMG recordings were made from seven different SCI subjects. The algorithm was able to correctly identify an average (±SD) of 94.5 ± 3.6% of spasm events and correctly classify 91.6 ± 1.9% of spasm events, with an accuracy of 61.7 ± 16.2%. The accuracy improved to 85.5 ± 5.9% and the false positive rate decreased to 7.1 ± 7.3% when noise events between spasms were removed. On average, the algorithm was more than 11 times faster than manual analysis. Together, the algorithm and the GUI program provide a powerful tool for characterizing muscle spasms in 24-h EMG recordings, information which is important for clinical management of spasticity.

  10. Automatic classification of sulcal regions of the human brain cortex using pattern recognition

    NASA Astrophysics Data System (ADS)

    Behnke, Kirsten J.; Rettmann, Maryam E.; Pham, Dzung L.; Shen, Dinggang; Resnick, Susan M.; Davatzikos, Christos; Prince, Jerry L.

    2003-05-01

    Parcellation of the cortex has received a great deal of attention in magnetic resonance (MR) image analysis, but its usefulness has been limited by time-consuming algorithms that require manual labeling. An automatic labeling scheme is necessary to accurately and consistently parcellate a large number of brains. The large variation of cortical folding patterns makes automatic labeling a challenging problem, which cannot be solved by deformable atlas registration alone. In this work, an automated classification scheme that consists of a mix of both atlas-driven and data-driven methods is proposed to label the sulcal regions, which are defined as the gray matter regions of the cortical surface surrounding each sulcus. The premise for this algorithm is that sulcal regions can be classified according to the pattern of anatomical features (e.g., supramarginal gyrus, cuneus) associated with each region. Using a nearest-neighbor approach, a sulcal region is assigned the same class as the sulcus in the training data that has the nearest pattern of anatomical features. Using just one subject as training data, the algorithm correctly labeled 83% of the regions that make up the main sulci of the cortex.
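
    The classification step is one-nearest-neighbor matching over anatomical-feature patterns; the binary feature encoding and the labels below are simplified placeholders for the paper's training data:

      import numpy as np
      from sklearn.neighbors import KNeighborsClassifier

      # Rows: sulcal regions from the training brain. Columns: anatomical features
      # (e.g. adjacency to supramarginal gyrus, cuneus, ...), encoded 0/1 (assumed).
      X_train = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]])
      y_train = np.array(["central", "calcarine", "sylvian"])   # illustrative labels

      labeler = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
      print(labeler.predict(np.array([[1, 0, 1, 1]])))          # nearest pattern wins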

  11. Automatic Detection of Cervical Cancer Cells by a Two-Level Cascade Classification System

    PubMed Central

    Su, Jie; Xu, Xuan; He, Yongjun; Song, Jinming

    2016-01-01

    We proposed a method for automatic detection of cervical cancer cells in images captured from thin liquid based cytology slides. We selected 20,000 cells in images derived from 120 different thin liquid based cytology slides, which included 5000 epithelial cells (2500 normal, 2500 abnormal), as well as lymphoid cells, neutrophils, and junk cells. We first proposed 28 features, including 20 morphologic features and 8 texture features, based on the characteristics of each cell type. We then used a two-level cascade integration system of two classifiers to classify the cervical cells into normal and abnormal epithelial cells. The results showed that the recognition rates for abnormal cervical epithelial cells were 92.7% and 93.2% when the C4.5 classifier or the logistic regression (LR) classifier was used individually, while the recognition rate was significantly higher (95.642%) when our two-level cascade integrated classifier system was used. The false negative rate and false positive rate (both 1.44%) of the proposed automatic two-level cascade classification system are also much lower than those of traditional Pap smear review. PMID:27298758
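
    A two-level cascade in the spirit of the paper can be sketched as follows, with a decision tree standing in for C4.5 and a confidence cutoff (an assumption) deciding which cases fall through to logistic regression:

      import numpy as np
      from sklearn.tree import DecisionTreeClassifier
      from sklearn.linear_model import LogisticRegression

      class TwoLevelCascade:
          def __init__(self, cutoff=0.9):
              self.level1 = DecisionTreeClassifier(min_samples_leaf=5)
              self.level2 = LogisticRegression(max_iter=1000)
              self.cutoff = cutoff

          def fit(self, X, y):
              self.level1.fit(X, y)
              self.level2.fit(X, y)
              return self

          def predict(self, X):
              proba = self.level1.predict_proba(X)
              confident = proba.max(axis=1) >= self.cutoff
              out = self.level1.classes_[proba.argmax(axis=1)]
              if (~confident).any():               # defer unsure cases to level 2
                  out[~confident] = self.level2.predict(X[~confident])
              return out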

  12. Automatic Classification of a Taxon-Rich Community Recorded in the Wild

    PubMed Central

    Potamitis, Ilyas

    2014-01-01

    There is a rich literature on automatic species identification of specific target taxa for various vocalizing animals. Research is usually restricted to particular species – in most cases a single one. Only very recently has the number of monitored species started to increase for certain habitats involving birds. Automatic acoustic monitoring has not yet been proven generic enough to scale to taxa and habitats beyond the ones described in the original research. Although attracting much attention, the acoustic monitoring procedure is neither well established nor universally adopted as a biodiversity monitoring tool. Recently, the multi-instance multi-label framework was introduced for bird vocalizations to face the obstacle of simultaneously vocalizing birds of different species. We build on this framework to integrate novel, image-based heterogeneous features designed to capture different aspects of the spectrum. We applied our approach to a taxon-rich habitat that included 78 birds, 8 insect species and 1 amphibian. This dataset constituted the Multi-label Bird Species Classification Challenge-NIPS 2013, where the proposed approach achieved an average accuracy of 91.25% on unseen data. PMID:24826989

  13. Automatic Detection of Cervical Cancer Cells by a Two-Level Cascade Classification System.

    PubMed

    Su, Jie; Xu, Xuan; He, Yongjun; Song, Jinming

    2016-01-01

    We proposed a method for automatic detection of cervical cancer cells in images captured from thin liquid based cytology slides. We selected 20,000 cells in images derived from 120 different thin liquid based cytology slides, which included 5000 epithelial cells (2500 normal, 2500 abnormal), as well as lymphoid cells, neutrophils, and junk cells. We first proposed 28 features, including 20 morphologic features and 8 texture features, based on the characteristics of each cell type. We then used a two-level cascade integration system of two classifiers to classify the cervical cells into normal and abnormal epithelial cells. The results showed that the recognition rates for abnormal cervical epithelial cells were 92.7% and 93.2% when the C4.5 classifier or the logistic regression (LR) classifier was used individually, while the recognition rate was significantly higher (95.642%) when our two-level cascade integrated classifier system was used. The false negative rate and false positive rate (both 1.44%) of the proposed automatic two-level cascade classification system are also much lower than those of traditional Pap smear review.

  14. Scaling up the evaluation of psychotherapy: evaluating motivational interviewing fidelity via statistical text classification

    PubMed Central

    2014-01-01

    Background: Behavioral interventions such as psychotherapy are leading, evidence-based practices for a variety of problems (e.g., substance abuse), but the evaluation of provider fidelity to behavioral interventions is limited by the need for human judgment. The current study evaluated the accuracy of statistical text classification in replicating human-based judgments of provider fidelity in one specific psychotherapy—motivational interviewing (MI). Method: Participants (n = 148) came from five previously conducted randomized trials and were either primary care patients at a safety-net hospital or university students. To be eligible for the original studies, participants met criteria for either problematic drug or alcohol use. All participants received a type of brief motivational interview, an evidence-based intervention for alcohol and substance use disorders. The Motivational Interviewing Skills Code, a standard measure of MI provider fidelity based on human ratings, was used to evaluate all therapy sessions. A text classification approach called a labeled topic model was used to learn associations between human-based fidelity ratings and MI session transcripts; it was then used to generate codes for new sessions. The primary comparison was the accuracy of model-based codes against human-based codes. Results: Receiver operating characteristic (ROC) analyses of model-based codes showed reasonably strong sensitivity and specificity with those from human raters (range of area under the ROC curve (AUC) scores: 0.62-0.81; average AUC: 0.72). Agreement with human raters was evaluated based on talk turns as well as code tallies for an entire session. Generated codes had higher reliability with human codes for session tallies and also varied strongly by individual code. Conclusion: To scale up the evaluation of behavioral interventions, technological solutions will be required. The current study demonstrated preliminary, encouraging findings regarding the utility of statistical text classification in evaluating provider fidelity to behavioral interventions.

  15. Scaling up the evaluation of psychotherapy: evaluating motivational interviewing fidelity via statistical text classification.

    PubMed

    Atkins, David C; Steyvers, Mark; Imel, Zac E; Smyth, Padhraic

    2014-04-24

    Behavioral interventions such as psychotherapy are leading, evidence-based practices for a variety of problems (e.g., substance abuse), but the evaluation of provider fidelity to behavioral interventions is limited by the need for human judgment. The current study evaluated the accuracy of statistical text classification in replicating human-based judgments of provider fidelity in one specific psychotherapy--motivational interviewing (MI). Participants (n = 148) came from five previously conducted randomized trials and were either primary care patients at a safety-net hospital or university students. To be eligible for the original studies, participants met criteria for either problematic drug or alcohol use. All participants received a type of brief motivational interview, an evidence-based intervention for alcohol and substance use disorders. The Motivational Interviewing Skills Code, a standard measure of MI provider fidelity based on human ratings, was used to evaluate all therapy sessions. A text classification approach called a labeled topic model was used to learn associations between human-based fidelity ratings and MI session transcripts; it was then used to generate codes for new sessions. The primary comparison was the accuracy of model-based codes against human-based codes. Receiver operating characteristic (ROC) analyses of model-based codes showed reasonably strong sensitivity and specificity with those from human raters (range of area under the ROC curve (AUC) scores: 0.62-0.81; average AUC: 0.72). Agreement with human raters was evaluated based on talk turns as well as code tallies for an entire session. Generated codes had higher reliability with human codes for session tallies and also varied strongly by individual code. To scale up the evaluation of behavioral interventions, technological solutions will be required. The current study demonstrated preliminary, encouraging findings regarding the utility of statistical text classification in evaluating provider fidelity to behavioral interventions.

  16. Automatic extraction of biological information from scientific text: protein-protein interactions.

    PubMed

    Blaschke, C; Andrade, M A; Ouzounis, C; Valencia, A

    1999-01-01

    We describe the basic design of a system for automatic detection of protein-protein interactions extracted from scientific abstracts. By restricting the problem domain and imposing a number of strong assumptions which include pre-specified protein names and a limited set of verbs that represent actions, we show that it is possible to perform accurate information extraction. The performance of the system is evaluated with different cases of real-world interaction networks, including the Drosophila cell cycle control. The results obtained computationally are in good agreement with current biological knowledge and demonstrate the feasibility of developing a fully automated system able to describe networks of protein interactions with sufficient accuracy.
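
    The restricted-domain idea reduces to a small rule engine: a sentence yields a candidate interaction when two known protein names flank one of the allowed action verbs. The name and verb lists below are placeholders, not the authors' lexicon:

      PROTEINS = {"cdc2", "cyclin b", "wee1", "cdc25"}          # pre-specified names
      ACTION_VERBS = {"activates", "inhibits", "phosphorylates", "binds"}

      def extract_interactions(sentence):
          """Return (protein, verb, protein) triples found around each action verb."""
          tokens = sentence.lower().split()
          hits = []
          for i, tok in enumerate(tokens):
              if tok in ACTION_VERBS:
                  left = [p for p in PROTEINS if p in " ".join(tokens[:i])]
                  right = [p for p in PROTEINS if p in " ".join(tokens[i + 1:])]
                  hits += [(a, tok, b) for a in left for b in right]
          return hits

      print(extract_interactions("Wee1 phosphorylates Cdc2 and inhibits its activity."))
      # [('wee1', 'phosphorylates', 'cdc2')]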

  17. Application of the AutoClass Automatic Bayesian Classification System to HMI Solar Images

    NASA Astrophysics Data System (ADS)

    Parker, D. G.; Beck, J. G.; Ulrich, R. K.

    2011-12-01

    When applied to a sample set of observed data, the Bayesian automatic classification system known as AutoClass finds a set of class definitions based on specified attributes of the data, such as magnetic field and intensity, without human supervision. These class definitions can then be applied to new data sets to automatically identify in them the classes found in the sample set. AutoClass can be applied to solar magnetic and intensity images to identify surface features associated with different values of magnetic and intensity fields in a consistent manner, without the need for human judgment. AutoClass has been applied to Mt. Wilson magnetograms and intensitygrams to identify solar surface features associated with variations in total solar irradiance (TSI) and, using those identifications, to improve modeling of TSI variations over time (Ulrich et al., 2010). Here, we apply AutoClass to observables derived from the high-resolution 4096 x 4096 HMI magnetic, intensity continuum, line width, and line depth images to identify solar surface regions that may be associated with variations in TSI and other solar irradiance measurements. To prevent small instrument artifacts from interfering with class identification, we apply a flat-field correction and a rotationally shifted temporal average to the HMI images prior to processing with AutoClass. This pre-processing also allows an investigation of the sensitivity of AutoClass to instrumental artifacts. The ability to automatically categorize surface features in the HMI images holds out the promise of consistent, relatively quick and manageable analysis of the large quantity of data available in these highly resolved images, and of the use of that analysis to enhance understanding of the physical processes at work in solar surface features and their implications for the solar-terrestrial environment. Reference: Ulrich, R.K., Parker, D., Bertello, L., and Boyden, J. 2010, Solar Phys., 261, 11.

  18. Scene text detection via extremal region based double threshold convolutional network classification

    PubMed Central

    Zhu, Wei; Lou, Jing; Chen, Longtao; Xia, Qingyuan

    2017-01-01

    In this paper, we present a robust text detection approach for natural images based on a region-proposal mechanism. A powerful low-level detector, saliency-enhanced MSER, extends the widely used MSER by incorporating saliency detection methods, which ensures a high recall rate. Given a natural image, character candidates are extracted from three channels of a perception-based, illumination-invariant color space by the saliency-enhanced MSER algorithm. A discriminative convolutional neural network (CNN) is jointly trained with multi-level information, including pixel-level and character-level information, as the character candidate classifier. Each image patch is classified as strong text, weak text, or non-text by double-threshold filtering instead of conventional one-step classification, leveraging the confidence scores obtained via the CNN. To further prune non-text regions, we develop a recursive neighborhood search algorithm to track credible texts from the weak text set. Finally, characters are grouped into text lines using heuristic features such as spatial location, size, color, and stroke width. We compare our approach with several state-of-the-art methods, and experiments show that our method achieves competitive performance on the public datasets ICDAR 2011 and ICDAR 2013. PMID:28820891
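
    The double-threshold step can be sketched independently of the CNN: given per-patch confidence scores from a trained network, patches split into strong, weak, and non-text sets, and weak patches survive only if a neighborhood search connects them to a strong one. The two thresholds and the adjacency structure are assumptions:

      import numpy as np

      T_HIGH, T_LOW = 0.8, 0.4                     # illustrative thresholds

      def double_threshold(scores):
          """Split patch confidence scores into strong-text and weak-text masks."""
          scores = np.asarray(scores)
          strong = scores >= T_HIGH
          weak = (scores >= T_LOW) & ~strong
          return strong, weak

      def rescue_weak(strong, weak, neighbors):
          """Keep weak patches reachable from a strong patch (neighborhood search)."""
          keep = strong.copy()
          stack = list(np.flatnonzero(strong))
          while stack:
              i = stack.pop()
              for j in neighbors[i]:               # neighbors: list of index lists
                  if weak[j] and not keep[j]:
                      keep[j] = True
                      stack.append(j)
          return keep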

  19. Scene text detection via extremal region based double threshold convolutional network classification.

    PubMed

    Zhu, Wei; Lou, Jing; Chen, Longtao; Xia, Qingyuan; Ren, Mingwu

    2017-01-01

    In this paper, we present a robust text detection approach for natural images based on a region-proposal mechanism. A powerful low-level detector, saliency-enhanced MSER, extends the widely used MSER by incorporating saliency detection methods, which ensures a high recall rate. Given a natural image, character candidates are extracted from three channels of a perception-based, illumination-invariant color space by the saliency-enhanced MSER algorithm. A discriminative convolutional neural network (CNN) is jointly trained with multi-level information, including pixel-level and character-level information, as the character candidate classifier. Each image patch is classified as strong text, weak text, or non-text by double-threshold filtering instead of conventional one-step classification, leveraging the confidence scores obtained via the CNN. To further prune non-text regions, we develop a recursive neighborhood search algorithm to track credible texts from the weak text set. Finally, characters are grouped into text lines using heuristic features such as spatial location, size, color, and stroke width. We compare our approach with several state-of-the-art methods, and experiments show that our method achieves competitive performance on the public datasets ICDAR 2011 and ICDAR 2013.

  20. Automatic retinal vessel classification using a Least Square-Support Vector Machine in VAMPIRE.

    PubMed

    Relan, D; MacGillivray, T; Ballerini, L; Trucco, E

    2014-01-01

    It is important to classify retinal blood vessels into arterioles and venules for computerised analysis of the vasculature and to aid discovery of disease biomarkers. For instance, zone B is the standardised region of a retinal image utilised for the measurement of the arteriole-to-venule width ratio (AVR), a parameter indicative of microvascular health and systemic disease. We introduce a Least Square-Support Vector Machine (LS-SVM) classifier, for the first time to the best of our knowledge, to automatically label arterioles and venules. We use only 4 image features and consider vessels inside zone B (802 vessels from 70 fundus camera images) and in an extended zone (1,207 vessels, 70 fundus camera images). We achieve an accuracy of 94.88% and 93.96% in zone B and the extended zone, respectively, with a training set of 10 images and a testing set of 60 images. With a smaller training set of only 5 images and the same testing set, we achieve accuracies of 94.16% and 93.95%, respectively. This experiment was repeated five times, randomly choosing the 10 and 5 training images each time; mean classification accuracies were close to the results above. We conclude that the performance of our system is very promising and outperforms most recently reported systems. Our approach requires smaller training data sets than others but still yields a similar or higher classification rate.

  1. Automatic modulation classification of digital modulations in presence of HF noise

    NASA Astrophysics Data System (ADS)

    Alharbi, Hazza; Mobien, Shoaib; Alshebeili, Saleh; Alturki, Fahd

    2012-12-01

    Designing an automatic modulation classifier (AMC) for the high frequency (HF) band is a research challenge, due to the recent observation that the noise distribution in the HF band changes over time. Existing AMCs are often designed for one type of noise distribution, e.g., additive white Gaussian noise, so their performance is severely compromised in the presence of HF noise. Therefore, an AMC capable of mitigating the time-varying nature of HF noise is required. This article presents a robust, feature-based AMC method for the classification of FSK, PSK, OQPSK, QAM, and amplitude-phase shift keying modulations in the presence of HF noise. The extracted features are insensitive to symbol synchronization and to carrier frequency and phase offsets. The proposed AMC method is simple to implement, as it uses a decision-tree approach with pre-computed thresholds for signal classification, and it is capable of classifying both the type and order of modulation in Gaussian and non-Gaussian environments.
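
    The decision-tree structure can be sketched as below; the two features (normalized envelope variance and phase-increment variance) and their thresholds are placeholders chosen for illustration, not the article's trained features:

      import numpy as np

      def classify(iq, amp_thresh=0.1, phase_thresh=1.0):
          """Toy threshold tree over complex baseband samples iq."""
          amp = np.abs(iq)
          amp_var = np.var(amp / amp.mean())             # envelope variation
          dphi = np.diff(np.unwrap(np.angle(iq)))
          phase_var = np.var(dphi)                       # phase-increment variation
          if amp_var > amp_thresh:
              return "QAM/APSK family"
          return "FSK family" if phase_var > phase_thresh else "PSK family"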

  2. Automatic ICD-10 multi-class classification of cause of death from plaintext autopsy reports through expert-driven feature selection

    PubMed Central

    Mujtaba, Ghulam; Shuib, Liyana; Raj, Ram Gopal; Rajandram, Retnagowri; Shaikh, Khairunisa; Al-Garadi, Mohammed Ali

    2017-01-01

    Objectives: Widespread implementation of electronic databases has improved the accessibility of plaintext clinical information for supplementary use. Numerous machine learning techniques, such as supervised machine learning approaches or ontology-based approaches, have been employed to obtain useful information from plaintext clinical data. This study proposes an automatic multi-class classification system to predict accident-related causes of death from plaintext autopsy reports through expert-driven feature selection with supervised automatic text classification decision models. Methods: Accident-related autopsy reports were obtained from one of the largest hospitals in Kuala Lumpur. These reports belong to nine different accident-related causes of death. A master feature vector was prepared by extracting features from the collected autopsy reports using unigrams with lexical categorization. This master feature vector was used to detect cause of death [according to the International Classification of Diseases, version 10 (ICD-10)] through five automated feature selection schemes, the proposed expert-driven approach, five feature subset sizes, and five machine learning classifiers. Model performance was evaluated using precisionM, recallM, F-measureM, accuracy, and area under the ROC curve. Four baselines were used to compare the results with the proposed system. Results: Random forest and J48 decision models parameterized using expert-driven feature selection yielded the highest evaluation measures (approaching 85% to 90% for most metrics) with a feature subset size of 30. The proposed system also showed an approximately 14% to 16% improvement in overall accuracy compared with the existing techniques and the four baselines. Conclusion: The proposed system is feasible and practical to use for automatic classification of ICD-10-related cause of death from autopsy reports. The proposed system assists pathologists to accurately and rapidly determine the underlying cause of death based on autopsy findings.
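
    A hedged sketch of the supervised pipeline, with an automated chi-squared selector standing in for the expert-driven feature selection (which was manual in the paper) and the feature subset size of 30 taken from the abstract:

      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.feature_selection import SelectKBest, chi2
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.pipeline import Pipeline

      pipeline = Pipeline([
          ("unigrams", CountVectorizer(lowercase=True, stop_words="english")),
          ("select", SelectKBest(chi2, k=30)),           # feature subset size 30
          ("forest", RandomForestClassifier(n_estimators=200, random_state=0)),
      ])
      # pipeline.fit(report_texts, icd10_labels)         # nine accident-related classes
      # predicted = pipeline.predict(new_reports)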

  3. Automatic ICD-10 multi-class classification of cause of death from plaintext autopsy reports through expert-driven feature selection.

    PubMed

    Mujtaba, Ghulam; Shuib, Liyana; Raj, Ram Gopal; Rajandram, Retnagowri; Shaikh, Khairunisa; Al-Garadi, Mohammed Ali

    2017-01-01

    Widespread implementation of electronic databases has improved the accessibility of plaintext clinical information for supplementary use. Numerous machine learning techniques, such as supervised machine learning approaches or ontology-based approaches, have been employed to obtain useful information from plaintext clinical data. This study proposes an automatic multi-class classification system to predict accident-related causes of death from plaintext autopsy reports through expert-driven feature selection with supervised automatic text classification decision models. Accident-related autopsy reports were obtained from one of the largest hospitals in Kuala Lumpur. These reports belong to nine different accident-related causes of death. A master feature vector was prepared by extracting features from the collected autopsy reports using unigrams with lexical categorization. This master feature vector was used to detect cause of death [according to the International Classification of Diseases, version 10 (ICD-10)] through five automated feature selection schemes, the proposed expert-driven approach, five feature subset sizes, and five machine learning classifiers. Model performance was evaluated using precisionM, recallM, F-measureM, accuracy, and area under the ROC curve. Four baselines were used to compare the results with the proposed system. Random forest and J48 decision models parameterized using expert-driven feature selection yielded the highest evaluation measures (approaching 85% to 90% for most metrics) with a feature subset size of 30. The proposed system also showed an approximately 14% to 16% improvement in overall accuracy compared with the existing techniques and the four baselines. The proposed system is feasible and practical to use for automatic classification of ICD-10-related cause of death from autopsy reports. The proposed system assists pathologists to accurately and rapidly determine the underlying cause of death based on autopsy findings.

  4. SYRIAC: The systematic review information automated collection system a data warehouse for facilitating automated biomedical text classification.

    PubMed

    Yang, Jianji J; Cohen, Aaron M; McDonagh, Marian S

    2008-11-06

    Automatic document classification can be valuable in increasing the efficiency of updating systematic reviews (SR). For the machine learning process to work well, it is critical to create and maintain high-quality training datasets consisting of expert SR inclusion/exclusion decisions. This task can be laborious, especially when the number of topics is large and the source data format is inconsistent. To approach this problem, we built an automated system to streamline the required steps, from initial notification of updates in the source annotation files to loading the data warehouse, along with a web interface to monitor the status of each topic. In our current collection of 26 SR topics, we were able to standardize almost all of the relevance judgments and recovered PMIDs for over 80% of all articles. Of those PMIDs, over 99% were correct in a manual random-sample study. Our system performs an essential function in creating training and evaluation datasets for SR text mining research.

  5. Automatic identification and classification of noun argument structures in biomedical literature.

    PubMed

    Ozyurt, Ibrahim Burak

    2012-01-01

    The accelerating growth of the biomedical literature makes keeping up with recent advances challenging for researchers, making automatic extraction and discovery of knowledge from this vast literature a necessity. Building such systems requires automatic detection of the lexico-semantic event structures in sentences of biomedical texts, which are governed by the syntactic and semantic constraints of human languages. These event structures are centered around predicates, yet most semantic role labeling (SRL) approaches focus only on the arguments of verb predicates and neglect argument-taking nouns, which also convey information in a sentence. In this article, a noun argument structure (NAS) annotated corpus named BioNom and an SRL system to identify and classify these structures are introduced. A genetic algorithm-based feature selection (GAFS) method is also introduced, and global inference is applied to significantly improve the performance of the NAS Bio SRL system.

  6. Automatic classification of epilepsy types using ontology-based and genetics-based machine learning.

    PubMed

    Kassahun, Yohannes; Perrone, Roberta; De Momi, Elena; Berghöfer, Elmar; Tassi, Laura; Canevini, Maria Paola; Spreafico, Roberto; Ferrigno, Giancarlo; Kirchner, Frank

    2014-06-01

    In the presurgical analysis for drug-resistant focal epilepsies, definition of the epileptogenic zone, the cortical area where ictal discharges originate, is usually carried out using clinical, electrophysiological and neuroimaging data analysis. Clinical evaluation is based on the visual detection of symptoms during epileptic seizures. This work aims at developing a fully automatic classifier of epilepsy types and their localization using ictal symptoms and machine learning methods. We present the results achieved by two machine learning methods. The first is an ontology-based classification that can directly incorporate human knowledge, while the second is a genetics-based data mining algorithm that learns or extracts the domain knowledge from medical data in implicit form. The developed methods are tested on a clinical dataset of 129 patients. The performance of the methods is measured against the performance of seven clinicians, whose level of expertise is high or very high, in classifying two epilepsy types: temporal lobe epilepsy and extra-temporal lobe epilepsy. When comparing the performance of the algorithms with that of a single clinician (one of the seven), the algorithms show a slightly better performance on three test sets generated randomly from 99 of the 129 patients. The accuracies obtained by the two methods and the clinician are as follows: first test set, 65.6% and 75% for the methods and 56.3% for the clinician; second test set, 66.7% and 76.2% for the methods and 61.9% for the clinician; and third test set, 77.8% for both the methods and the clinician. When compared with the performance of the whole population of clinicians on the remaining 30 patients, where the patients were selected by the clinicians themselves, the mean accuracy of the methods (60%) is slightly worse than the mean accuracy of the clinicians (61.6%). Results show that the methods perform at the level of the clinicians.

  7. Support Vector Machine Model for Automatic Detection and Classification of Seismic Events

    NASA Astrophysics Data System (ADS)

    Barros, Vesna; Barros, Lucas

    2016-04-01

    The automated processing of multiple seismic signals to detect, localize and classify seismic events is a central tool in both natural hazards monitoring and nuclear treaty verification. However, false and missed detections caused by station noise and incorrect classification of arrivals are still an issue, and events are often unclassified or poorly classified. Machine learning techniques can therefore be used in automatic processing to classify the huge database of seismic recordings and provide more confidence in the final output. Applied in the context of the International Monitoring System (IMS) - a global sensor network developed for the Comprehensive Nuclear-Test-Ban Treaty (CTBT) - we propose a fully automatic method for seismic event detection and classification based on a supervised pattern recognition technique called the Support Vector Machine (SVM). According to Kortström et al. (2015), the advantages of using an SVM are its ability to handle a large number of features and its effectiveness in high-dimensional spaces. Our objective is to detect seismic events from one IMS seismic station located in an area of high seismicity and mining activity and to classify them as earthquakes or quarry blasts. We expect to create a flexible and easily adjustable SVM method that can be applied to different regions and datasets. Taken a step further, accurate results for seismic stations could lead to a modification of the model and its parameters to make it applicable to other waveform technologies used to monitor nuclear explosions, such as infrasound and hydroacoustic waveforms. As an authorized user, we have direct access to all IMS data and bulletins through a secure signatory account. A set of significant seismic waveforms containing different types of events (e.g. earthquakes, quarry blasts) and noise is being analysed to train the model and learn the typical signal pattern of these events. Moreover, comparing the performance of the support…

  8. Multi-spectral brain tissue segmentation using automatically trained k-Nearest-Neighbor classification.

    PubMed

    Vrooman, Henri A; Cocosco, Chris A; van der Lijn, Fedde; Stokking, Rik; Ikram, M Arfan; Vernooij, Meike W; Breteler, Monique M B; Niessen, Wiro J

    2007-08-01

    Conventional k-Nearest-Neighbor (kNN) classification, which has been successfully applied to classify brain tissue in MR data, requires training on manually labeled subjects. This manual labeling is a laborious and time-consuming procedure. In this work, a new fully automated brain tissue classification procedure is presented, in which kNN training is automated. This is achieved by non-rigidly registering the MR data with a tissue probability atlas to automatically select training samples, followed by a post-processing step to keep the most reliable samples. The accuracy of the new method was compared to rigid registration-based training and to conventional kNN-based segmentation using training on manually labeled subjects for segmenting gray matter (GM), white matter (WM) and cerebrospinal fluid (CSF) in 12 data sets. Furthermore, for all classification methods, the performance was assessed when varying the free parameters. Finally, the robustness of the fully automated procedure was evaluated on 59 subjects. The automated training method using non-rigid registration with a tissue probability atlas was significantly more accurate than rigid registration. For both automated training using non-rigid registration and for the manually trained kNN classifier, the difference with the manual labeling by observers was not significantly larger than inter-observer variability for all tissue types. From the robustness study, it was clear that, given an appropriate brain atlas and optimal parameters, our new fully automated, non-rigid registration-based method gives accurate and robust segmentation results. A similarity index was used for comparison with manually trained kNN. The similarity indices were 0.93, 0.92 and 0.92, for CSF, GM and WM, respectively. It can be concluded that our fully automated method using non-rigid registration may replace manual segmentation, and thus that automated brain tissue segmentation without laborious manual training is feasible.

  9. Automatically Generating Reading Comprehension Look-Back Strategy: Questions from Expository Texts

    DTIC Science & Technology

    2008-05-14

    …strategy is taught via engaging the student in meaningful text for an independent goal, such as understanding a social studies text or a science text… Additionally, it is the transition from reading narrative text to expository text that creates new comprehension problems for students…

  10. Unsupervised method for automatic construction of a disease dictionary from a large free text collection.

    PubMed

    Xu, Rong; Supekar, Kaustubh; Morgan, Alex; Das, Amar; Garber, Alan

    2008-11-06

    Concept specific lexicons (e.g. diseases, drugs, anatomy) are a critical source of background knowledge for many medical language-processing systems. However, the rapid pace of biomedical research and the lack of constraints on usage ensure that such dictionaries are incomplete. Focusing on disease terminology, we have developed an automated, unsupervised, iterative pattern learning approach for constructing a comprehensive medical dictionary of disease terms from randomized clinical trial (RCT) abstracts, and we compared different ranking methods for automatically extracting contextual patterns and concept terms. When used to identify disease concepts from 100 randomly chosen, manually annotated clinical abstracts, our disease dictionary shows significant performance improvement (F1 increased by 35-88%) over available, manually created disease terminologies.
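
    One bootstrapping iteration of the pattern-learning loop can be sketched as below; the raw-frequency pattern scoring is a simplification of the ranking methods the paper compares, and the context-window size is an assumption:

      import re
      from collections import Counter

      def find_patterns(abstracts, seed_terms, window=3):
          """Count left-context patterns around known disease terms."""
          counts = Counter()
          for text in abstracts:
              low = text.lower()
              for seed in seed_terms:
                  for m in re.finditer(re.escape(seed), low):
                      left = low[:m.start()].split()[-window:]
                      counts[" ".join(left) + " <TERM>"] += 1
          return counts

      def extract_terms(abstracts, pattern):
          """Apply one learned pattern to harvest new candidate disease terms."""
          prefix = pattern.replace(" <TERM>", "")
          terms = set()
          for text in abstracts:
              for m in re.finditer(re.escape(prefix) + r"\s+(\w[\w-]*)", text.lower()):
                  terms.add(m.group(1))
          return terms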

  11. Unsupervised Method for Automatic Construction of a Disease Dictionary from a Large Free Text Collection

    PubMed Central

    Xu, Rong; Supekar, Kaustubh; Morgan, Alex; Das, Amar; Garber, Alan

    2008-01-01

    Concept specific lexicons (e.g. diseases, drugs, anatomy) are a critical source of background knowledge for many medical language-processing systems. However, the rapid pace of biomedical research and the lack of constraints on usage ensure that such dictionaries are incomplete. Focusing on disease terminology, we have developed an automated, unsupervised, iterative pattern learning approach for constructing a comprehensive medical dictionary of disease terms from randomized clinical trial (RCT) abstracts, and we compared different ranking methods for automatically extracting contextual patterns and concept terms. When used to identify disease concepts from 100 randomly chosen, manually annotated clinical abstracts, our disease dictionary shows significant performance improvement (F1 increased by 35–88%) over available, manually created disease terminologies. PMID:18999169

  12. Classification of fatty and dense breast parenchyma: comparison of automatic volumetric density measurement and radiologists' classification and their inter-observer variation.

    PubMed

    Østerås, Bjørn Helge; Martinsen, Anne Catrine T; Brandal, Siri Helene B; Chaudhry, Khalida Nasreen; Eben, Ellen; Haakenaasen, Unni; Falk, Ragnhild Sørum; Skaane, Per

    2016-10-01

    Automatically calculated breast density is a promising alternative to subjective BI-RADS density assessment, but such software needs a cutoff value for density classification. The aims were to determine the volumetric density threshold that classifies fatty and dense breasts with the highest accuracy compared to average BI-RADS density assessment, and to analyze the radiologists' inter-observer variation. A total of 537 full-field digital mammography examinations were randomly selected from a population-based screening program. Five radiologists assessed density using the BI-RADS density scale, where BI-RADS I-II were classified as fatty and III-IV as dense. A commercially available software package (Quantra) calculated volumetric breast density. We calculated the cutoff (threshold) values of volumetric density that yielded the highest accuracy compared to the median and to individual radiologists' classifications. Inter-observer variation was analyzed using the kappa statistic. The threshold that best matched the median radiologists' classification was 10%, which resulted in 87% accuracy. Thresholds that best matched individual radiologists' classifications ranged from 8% to 15%. A total of 191 (35.6%) cases were scored as dense by at least one radiologist and as fatty by at least one other. Fourteen (2.6%) cases were scored unanimously by the radiologists yet differently by the automatic assessment. The agreement (kappa) between the readers' median classification and individual radiologists ranged from 0.624 to 0.902, and the agreement between the median classification and Quantra was 0.731. The optimal volumetric threshold of 10% using automatic assessment would classify breast parenchyma as fatty or dense with substantial accuracy and consistency compared to radiologists' BI-RADS categorization, which suffers from high inter-observer variation. © The Foundation Acta Radiologica 2016.
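
    The threshold search itself is a simple sweep; the sketch below scores candidate cutoffs against the radiologists' median fatty/dense labels and reports agreement with Cohen's kappa. The candidate grid is an assumption:

      import numpy as np
      from sklearn.metrics import accuracy_score, cohen_kappa_score

      def best_threshold(volumetric_density, median_dense_labels,
                         candidates=np.arange(0.05, 0.21, 0.01)):
          """volumetric_density: fractions per exam; median_dense_labels: booleans."""
          accs = [accuracy_score(median_dense_labels, volumetric_density >= t)
                  for t in candidates]
          t_best = candidates[int(np.argmax(accs))]
          kappa = cohen_kappa_score(median_dense_labels, volumetric_density >= t_best)
          return t_best, max(accs), kappa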

  13. Automatic classification of transiently evoked otoacoustic emissions using an artificial neural network.

    PubMed

    Buller, G; Lutman, M E

    1998-08-01

    The increasing use of transiently evoked otoacoustic emissions (TEOAE) in large neonatal hearing screening programmes makes a standardized method of response classification desirable. Until now, methods have been either subjective or based on arbitrary response characteristics. This study takes an expert-system approach to standardize the subjective judgements of an experienced scorer. The method comprises three stages. First, it transforms TEOAEs from waveforms in the time domain into a simplified parameter set. Second, the parameter set is classified by an artificial neural network that has been taught on a large database of TEOAE waveforms and corresponding expert scores. Third, additional fuzzy logic rules automatically detect probable artefacts in the waveforms and synchronized spontaneous emission components. In this way, the knowledge of the experienced scorer is encapsulated in the expert-system software and can thereafter be accessed by non-experts. Teaching and evaluation of the neural network were based on TEOAEs from a database totalling 2190 neonatal hearing screening tests. The database was divided into learning and test groups with 820 and 1370 waveforms, respectively. From each recorded waveform a set of 12 parameters was calculated, representing static and dynamic signal properties. The artificial network was taught with parameter sets from the learning group only. Reproduction of the human scorer's classification by the neural net in the learning group showed a sensitivity for detecting screen fails of 99.3% (299 of 301 failed results on subjective scoring) and a specificity for detecting screen passes of 81.1% (421 of 519 pass results). To quantify the post hoc performance of the net (generalization), the test group was then presented to the network input. Sensitivity was 99.4% (474 of 477) and specificity was 87.3% (780 of 893). To check the efficiency of the classification method, a second learning group was selected out of the…

  14. The Role of Domain Knowledge in Automating Medical Text Report Classification

    PubMed Central

    Wilcox, Adam B.; Hripcsak, George

    2003-01-01

    Objective: To analyze the effect of expert knowledge on the inductive learning process in creating classifiers for medical text reports. Design: The authors converted medical text reports to a structured form through natural language processing. They then inductively created classifiers for medical text reports using varying degrees and types of expert knowledge and different inductive learning algorithms. The authors measured the performance of the different classifiers as well as the costs of inducing classifiers and acquiring expert knowledge. Measurements: The measurements used were classifier performance, training-set size efficiency, and classifier creation cost. Results: Expert knowledge was shown to be the most significant factor affecting inductive learning performance, outweighing differences in learning algorithms. The use of expert knowledge can affect comparisons between learning algorithms. This expert knowledge may be obtained and represented separately as knowledge about the clinical task or about the data representation used. The benefit of the expert knowledge exceeds that of inductive learning itself, and it costs less to obtain. Conclusion: For medical text report classification, expert knowledge acquisition is more significant to performance and more cost-effective to obtain than knowledge discovery. Building classifiers should therefore focus more on acquiring knowledge from experts than on trying to learn this knowledge inductively. PMID:12668687

  15. Experimenting with Automatic Text-to-Diagram Conversion: A Novel Teaching Aid for the Blind People

    ERIC Educational Resources Information Center

    Mukherjee, Anirban; Garain, Utpal; Biswas, Arindam

    2014-01-01

    Diagram-describing texts are an integral part of science and engineering subjects, including geometry, physics, and engineering drawing. To understand such text, one first tries to draw or perceive the underlying diagram. For blind students to perceive them, such diagrams need to be rendered in some non-visual, accessible form like tactile…

  16. Improvement in accuracy of defect size measurement by automatic defect classification

    NASA Astrophysics Data System (ADS)

    Samir, Bhamidipati; Pereira, Mark; Paninjath, Sankaranarayanan; Jeon, Chan-Uk; Chung, Dong-Hoon; Yoon, Gi-Sung; Jung, Hong-Yul

    2015-10-01

    The blank mask defect review process involves detailed analysis of defects observed across a substrate's multiple preparation stages, such as cleaning and resist-coating. Detailed knowledge of these defects plays an important role in the eventual yield obtained from the blank. Defect knowledge predominantly comprises details such as the number of defects observed and their accurate sizes. Mask usability assessment at the start of the preparation process is crudely based on the number of defects, while defect size gives an idea of eventual wafer defect printability. Furthermore, monitoring defect characteristics, specifically size and shape, aids in obtaining process-related information such as cleaning or coating process efficiencies. The blank mask defect review process is largely manual in nature. However, the large number of defects observed for the latest technology nodes, with their shrinking half-pitch sizes, and the associated amount of information together make the process increasingly inefficient in terms of review time, accuracy, and consistency. Additional tools such as CDSEM may be required to further aid the review process, increasing costs. Calibre® MDPAutoClassify™ provides an automated software alternative: a powerful analysis tool for fast, accurate, consistent, and automatic classification of blank defects. Elaborate post-processing algorithms are applied to defect images generated by inspection machines to extract and report significant defect information, such as defect size, affecting defect printability and mask usability. The algorithm's capabilities are challenged by the variety and complexity of the defects encountered, in terms of defect nature, size, shape, and composition, and by the optical phenomena occurring around the defect [1]. This paper focuses on the results of the evaluation of the Calibre® MDPAutoClassify™ product. The main objective of this evaluation is to assess the capability of…

  17. Automatic classification of small bowel mucosa alterations in celiac disease for confocal laser endomicroscopy

    NASA Astrophysics Data System (ADS)

    Boschetto, Davide; Di Claudio, Gianluca; Mirzaei, Hadis; Leong, Rupert; Grisan, Enrico

    2016-03-01

    Celiac disease (CD) is an immune-mediated enteropathy triggered by exposure to gluten and similar proteins, affecting genetically susceptible persons and increasing their risk of different complications. Small bowel mucosal damage due to CD involves various degrees of endoscopically relevant lesions, which are not easily recognized: their overall sensitivity and positive predictive values are poor even when zoom-endoscopy is used. Confocal Laser Endomicroscopy (CLE) allows skilled and trained experts to qualitatively evaluate mucosal alterations such as a decrease in goblet cell density, the presence of villous atrophy, or crypt hypertrophy. We present a method for automatically classifying CLE images into three different classes: normal regions, villous atrophy, and crypt hypertrophy. This classification is performed after a feature selection process, in which four features are extracted from each image through the application of homomorphic filtering and border identification with Canny and Sobel operators. Three different classifiers have been tested on a dataset of 67 different images labeled by experts in three classes (normal, VA and CH): a linear approach, a Naive-Bayes quadratic approach, and a standard quadratic analysis, all validated with a ten-fold cross validation. Linear classification achieves 82.09% accuracy (class accuracies: 90.32% for normal villi, 82.35% for VA and 68.42% for CH; sensitivity: 0.68, specificity: 1.00), Naive Bayes analysis returns 83.58% accuracy (90.32% for normal villi, 70.59% for VA and 84.21% for CH; sensitivity: 0.84, specificity: 0.92), while the quadratic analysis achieves a final accuracy of 94.03% (96.77% accuracy for normal villi, 94.12% for VA and 89.47% for CH; sensitivity: 0.89, specificity: 0.98).
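
    The evaluation protocol described above (a few per-image features, three lightweight classifiers, ten-fold cross-validation) is easy to reproduce in outline. Below is a minimal sketch using scikit-learn; the feature matrix X (67 images x 4 features) and the labels y are random placeholders, since the paper's homomorphic-filtering and edge-based features are not reproduced here.

    ```python
    # Sketch: compare linear, Naive Bayes, and quadratic classifiers with
    # 10-fold cross-validation, mirroring the protocol described above.
    # X and y stand in for the paper's 4 features per CLE image and the
    # expert labels (0 = normal, 1 = villous atrophy, 2 = crypt hypertrophy).
    import numpy as np
    from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                               QuadraticDiscriminantAnalysis)
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(67, 4))        # placeholder feature vectors
    y = rng.integers(0, 3, size=67)     # placeholder class labels

    for name, clf in [("linear", LinearDiscriminantAnalysis()),
                      ("naive Bayes", GaussianNB()),
                      ("quadratic", QuadraticDiscriminantAnalysis())]:
        scores = cross_val_score(clf, X, y, cv=10)
        print(f"{name}: mean accuracy {scores.mean():.3f}")
    ```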

  18. Automatic target classification of slow moving ground targets using space-time adaptive processing

    NASA Astrophysics Data System (ADS)

    Malas, John Alexander

    2002-04-01

    Air-to-ground surveillance radar technologies are increasingly being used by theater commanders to detect, track, and identify ground moving targets. New radar automatic target recognition (ATR) technologies are being developed to aid the pilot in assessing the ground combat picture. Most air-to-ground surveillance radars use Doppler filtering techniques to separate target returns from ground clutter. Unfortunately, Doppler filtering techniques fall short on performance when target geometry and ground vehicle speed result in low line-of-sight velocities. New clutter filtering techniques compatible with emerging advancements in wideband radar operation are needed to support surveillance modes of radar operation when targets enter this low-velocity regime. In this context, space-time adaptive processing (STAP), in conjunction with other algorithms, offers a class of signal processing that provides improved target detection, tracking, and classification in the presence of interference through the adaptive nulling of both ground clutter and/or jamming. Of particular interest is the ability of the radar to filter and process the complex target signature data needed to generate high range resolution (HRR) signature profiles of ground targets. A new approach is proposed which allows air-to-ground target classification of slow-moving vehicles in clutter. A wideband STAP approach for clutter suppression is developed which preserves the amplitude integrity of returns from multiple range bins, consistent with the HRR ATR approach. The wideband STAP processor utilizes narrowband STAP principles to generate a series of adaptive sub-band filters. Each sub-band filter output is used to construct the complete filtered response of the ground target. The performance of this new approach is demonstrated and quantified through the implementation of a one-dimensional template-based minimum mean squared error classifier. Successful minimum velocity identification is defined in terms of

  19. Generating Automated Text Complexity Classifications That Are Aligned with Targeted Text Complexity Standards. Research Report. ETS RR-10-28

    ERIC Educational Resources Information Center

    Sheehan, Kathleen M.; Kostin, Irene; Futagi, Yoko; Flor, Michael

    2010-01-01

    The Common Core Standards call for students to be exposed to a much greater level of text complexity than has been the norm in schools for the past 40 years. Textbook publishers, teachers, and assessment developers are being asked to refocus materials and methods to ensure that students are challenged to read texts at steadily increasing…

  20. Semi-automatic software increases CT measurement accuracy but not response classification of colorectal liver metastases after chemotherapy.

    PubMed

    van Kessel, Charlotte S; van Leeuwen, Maarten S; Witteveen, Petronella O; Kwee, Thomas C; Verkooijen, Helena M; van Hillegersberg, Richard

    2012-10-01

    This study evaluates intra- and interobserver variability of automatic diameter and volume measurements of colorectal liver metastases (CRLM) before and after chemotherapy and its influence on response classification. Pre- and post-chemotherapy CT scans of 33 patients with 138 CRLM were evaluated. Two observers measured all metastases three times on pre- and post-chemotherapy CT scans, using three different techniques: manual diameter (MD), automatic diameter (AD) and automatic volume (AV). RECIST 1.0 criteria were used to define response classification. For each technique, we assessed intra- and interobserver reliability by determining the intraclass correlation coefficient (α-level 0.05). Intra-observer agreement was estimated by the variance coefficient (%). For inter-observer agreement, the relative measurement error (%) was calculated using Bland-Altman analysis. In addition, we compared agreement in response classification by calculating kappa scores (κ) and estimating proportions of discordance between methods (%). Intra-observer variability was 6.05%, 4.28% and 12.72% for MD, AD and AV, respectively. Inter-observer variability was 4.23%, 2.02% and 14.86% for MD, AD and AV, respectively. Chemotherapy marginally affected these estimates. Agreement in response classification did not improve using AD or AV (MD κ=0.653, AD κ=0.548, AV κ=0.548) and substantial discordance between observers was observed with all three methods (MD 17.8%, AD 22.2%, AV 22.2%). Semi-automatic software allows repeatable and reproducible diameter and volume measurements of CRLM, but does not reduce variability in response classification. Copyright © 2011 Elsevier Ireland Ltd. All rights reserved.
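
    The agreement statistics reported above are standard and quick to compute. The sketch below, with invented observer ratings rather than the study's data, shows Cohen's kappa for response-classification agreement and the corresponding discordance proportion.

    ```python
    # Sketch: response-classification agreement between two observers, using
    # Cohen's kappa as above. The RECIST categories here are hypothetical
    # (CR/PR/SD/PD = complete/partial response, stable/progressive disease).
    from sklearn.metrics import cohen_kappa_score

    obs1 = ["PR", "PR", "SD", "PD", "SD", "PR", "SD", "SD", "PD", "PR"]
    obs2 = ["PR", "SD", "SD", "PD", "SD", "PR", "PR", "SD", "PD", "PR"]

    kappa = cohen_kappa_score(obs1, obs2)
    discordance = sum(a != b for a, b in zip(obs1, obs2)) / len(obs1)
    print(f"kappa = {kappa:.3f}, discordance = {discordance:.1%}")
    ```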

  1. Automatic classification of endogenous seismic sources within a landslide body using random forest algorithm

    NASA Astrophysics Data System (ADS)

    Provost, Floriane; Hibert, Clément; Malet, Jean-Philippe; Stumpf, André; Doubre, Cécile

    2016-04-01

    Different studies have shown the presence of microseismic activity in soft-rock landslides. The seismic signals exhibit significantly different features in the time and frequency domains, which allow their classification and interpretation. Most of the classes can be associated with different mechanisms of deformation occurring within and at the surface (e.g. rockfall, slide-quake, fissure opening, fluid circulation). However, some signals remain not fully understood, and some classes contain too few examples to permit any interpretation. To move toward a more complete interpretation of the links between the dynamics of soft-rock landslides and the physical processes controlling their behaviour, a complete catalog of the endogenous seismicity is needed. We propose a multi-class detection method based on the random forests algorithm to automatically classify the sources of seismic signals. Random forests is a supervised machine learning technique based on the computation of a large number of decision trees. The multiple decision trees are constructed from training sets including each of the target classes. In the case of seismic signals, these attributes may encompass spectral features but also waveform characteristics, multi-station observations and other relevant information. The random forest classifier is used because it provides state-of-the-art performance when compared with other machine learning techniques (e.g. SVM, Neural Networks) and requires no fine tuning. Furthermore, it is relatively fast, robust, easy to parallelize, and inherently suitable for multi-class problems. In this work, we present the first results of the classification method applied to the seismicity recorded at the Super-Sauze landslide between 2013 and 2015. We selected a dozen seismic signal features that precisely characterize the spectral content of the signals (e.g. central frequency, spectrum width, energy in several frequency bands, spectrogram shape, spectrum local and global maxima
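
    Once per-event features are computed, the classifier itself is a few lines with scikit-learn. The sketch below uses a handful of crude spectral descriptors and randomly generated placeholder events; the study's actual attribute set and catalog are much richer.

    ```python
    # Sketch: multi-class random forest of the kind described above, trained
    # on per-event spectral features.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def spectral_features(signal, fs):
        """Crude per-event features: spectral centroid, width, band energies."""
        spec = np.abs(np.fft.rfft(signal)) ** 2
        freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
        p = spec / spec.sum()
        fc = (freqs * p).sum()                       # central frequency
        bw = np.sqrt(((freqs - fc) ** 2 * p).sum())  # spectrum width
        bands = [p[(freqs >= lo) & (freqs < hi)].sum()
                 for lo, hi in [(1, 10), (10, 50), (50, 100)]]
        return [fc, bw, *bands]

    rng = np.random.default_rng(1)
    fs = 250.0
    # Placeholder events and labels (e.g. 0=rockfall, 1=slide-quake, 2=fissure).
    events = [rng.normal(size=1000) for _ in range(60)]
    labels = rng.integers(0, 3, size=60)

    X = np.array([spectral_features(e, fs) for e in events])
    clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, labels)
    print(clf.predict(X[:5]))
    ```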

  2. Automatic classification of atherosclerotic plaques imaged with intravascular OCT (Conference Presentation)

    NASA Astrophysics Data System (ADS)

    Rico-Jimenez, Jose D.; Campos-Delgado, Daniel U.; Villiger, Martin; Bouma, Brett; Jo, Javier A.

    2016-03-01

    A novel computational method for plaque tissue characterization based on Intravascular Optical Coherence Tomography (IV-OCT) is presented. IV-OCT is becoming a powerful tool for the clinical evaluation of atherosclerotic plaques; however, it requires a trained expert for visual assessment and interpretation of the imaged plaques. Moreover, due to the inherent effect of speckle and the scattering attenuation of the optical scheme, the direct interpretation of OCT images is limited. To overcome these difficulties, we propose to automatically identify the A-line profiles of the most significant plaque types (normal, fibrotic, or lipid-rich) and their respective abundance by using a probabilistic framework and blind alternating least squares to achieve the optimal decomposition. In this context, we present preliminary results of this novel probabilistic classification tool for intravascular OCT, which relies on two steps. First, the B-scan is pre-processed to remove catheter artifacts, segment the lumen, select the region of interest (ROI), flatten the tissue surface, and reduce the speckle effect with a spatial entropy filter. Next, the resulting image is decomposed and its A-lines are classified by an automated strategy based on alternating-least-squares optimization. Our early results are encouraging and suggest that the proposed methodology can identify normal tissue, fibrotic plaques, and lipid-rich plaques in IV-OCT images.

  3. A Hessian-based methodology for automatic surface crack detection and classification from pavement images

    NASA Astrophysics Data System (ADS)

    Ghanta, Sindhu; Shahini Shamsabadi, Salar; Dy, Jennifer; Wang, Ming; Birken, Ralf

    2015-04-01

    Around 3 trillion vehicle miles are traveled annually on US transportation systems alone. In addition to improving road traffic safety, maintaining the road infrastructure in a sound condition promotes a more productive and competitive economy. Due to the significant amounts of financial and human resources required to detect surface cracks by visual inspection, detection of these surface defects is often delayed, resulting in deferred maintenance operations. This paper introduces an automatic system for the acquisition, detection, classification, and evaluation of pavement surface cracks by unsupervised analysis of images collected from a camera mounted on the rear of a moving vehicle. A Hessian-based multi-scale filter is utilized to detect ridges in these images at various scales. Post-processing on the extracted features produces statistics of the length, width, and area covered by cracks, which are crucial for roadway agencies to assess pavement quality. This process has been applied to three sets of roads with different pavement conditions in the city of Brockton, MA. A manually labeled ground truth dataset is made available to evaluate this algorithm, and results rendered more than 90% segmentation accuracy, demonstrating the feasibility of employing this approach at a larger scale.
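
    Hessian-based multi-scale ridge filters are available off the shelf; the sketch below uses scikit-image's Frangi filter as a stand-in for the paper's filter on a synthetic crack image, then thresholds the response and reports simple crack statistics.

    ```python
    # Sketch: Hessian-based multi-scale ridge detection on a synthetic
    # stand-in for a grayscale pavement image (a dark, curved crack on a
    # noisy background), followed by thresholding and crack statistics.
    import numpy as np
    from skimage.filters import frangi, threshold_otsu
    from skimage.measure import label, regionprops

    rng = np.random.default_rng(9)
    img = rng.normal(0.7, 0.05, size=(200, 200))
    rr = np.arange(200)
    cc = (100 + 30 * np.sin(rr / 25.0)).astype(int)
    img[rr, cc] -= 0.4                             # dark ridge = crack
    img[rr, np.clip(cc + 1, 0, 199)] -= 0.3

    ridges = frangi(img, sigmas=range(1, 8, 2), black_ridges=True)
    mask = ridges > threshold_otsu(ridges)         # binary crack map
    props = regionprops(label(mask))
    total_area = sum(r.area for r in props)        # pixels covered by cracks
    total_len = sum(r.major_axis_length for r in props)  # rough crack length
    print(f"crack pixels: {total_area}, approx. length: {total_len:.0f} px")
    ```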

  4. A robust automatic birdsong phrase classification: A template-based approach.

    PubMed

    Kaewtip, Kantapon; Alwan, Abeer; O'Reilly, Colm; Taylor, Charles E

    2016-11-01

    Automatic phrase detection systems for bird sounds are useful in several applications, as they reduce the need for manual annotation. However, bird phrase detection is challenging due to limited training data and background noise. Limited data occur because of limited recordings or the existence of rare phrases. Background noise interference occurs because of the intrinsic nature of the recording environment, such as wind or other animals. This paper presents a different approach to birdsong phrase classification using template-based techniques suitable even for limited training data and noisy environments. The algorithm utilizes dynamic time-warping (DTW) and prominent (high-energy) time-frequency regions of training spectrograms to derive templates. The performance of the proposed algorithm is compared with traditional DTW and hidden Markov model (HMM) methods under several training and test conditions. DTW works well when the data are limited, while HMMs do better when more data are available, yet both suffer when the background noise is severe. The proposed algorithm outperforms DTW and HMMs in most training and testing conditions, usually by a high margin when the background noise level is high. The innovation of this work is that the proposed algorithm is robust to both limited training data and background noise.
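
    At the heart of a template-based classifier is the DTW distance between a query and each stored template. A minimal NumPy version, with random stand-ins for the paper's spectrogram-derived templates:

    ```python
    # Sketch: DTW-based template classification; a phrase is assigned to the
    # template with the smallest DTW distance. Feature sequences here are
    # placeholders, not real spectrogram templates.
    import numpy as np

    def dtw_distance(a, b):
        """DTW distance between two feature sequences (frames x dims)."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    rng = np.random.default_rng(2)
    templates = {"phrase_A": rng.normal(size=(40, 12)),
                 "phrase_B": rng.normal(size=(55, 12))}
    query = rng.normal(size=(47, 12))
    best = min(templates, key=lambda k: dtw_distance(query, templates[k]))
    print("classified as", best)
    ```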

  5. Progress toward automatic classification of human brown adipose tissue using biomedical imaging

    NASA Astrophysics Data System (ADS)

    Gifford, Aliya; Towse, Theodore F.; Walker, Ronald C.; Avison, Malcom J.; Welch, E. B.

    2015-03-01

    Brown adipose tissue (BAT) is a small but significant tissue, which may play an important role in obesity and the pathogenesis of metabolic syndrome. Interest in studying BAT in adult humans is increasing, but in order to quantify BAT volume in a single measurement or to detect changes in BAT over the time course of a longitudinal experiment, BAT needs to first be reliably differentiated from surrounding tissue. Although the uptake of the radiotracer 18F-Fluorodeoxyglucose (18F-FDG) in adipose tissue on positron emission tomography (PET) scans following cold exposure is accepted as an indication of BAT, it is not a definitive indicator, and to date there exists no standardized method for segmenting BAT. Consequently, there is a strong need for robust automatic classification of BAT based on properties measured with biomedical imaging. In this study we begin the process of developing an automated segmentation method based on properties obtained from fat-water MRI and PET-CT scans acquired on ten healthy adult subjects.

  6. Development of a rapid method for the automatic classification of biological agents' fluorescence spectral signatures

    NASA Astrophysics Data System (ADS)

    Carestia, Mariachiara; Pizzoferrato, Roberto; Gelfusa, Michela; Cenciarelli, Orlando; Ludovici, Gian Marco; Gabriele, Jessica; Malizia, Andrea; Murari, Andrea; Vega, Jesus; Gaudio, Pasquale

    2015-11-01

    Biosecurity and biosafety are key concerns of modern society. Although nanomaterials are improving the capacities of point detectors, standoff detection still appears to be an open issue. Laser-induced fluorescence of biological agents (BAs) has proved to be one of the most promising optical techniques to achieve early standoff detection, but its strengths and weaknesses are still to be fully investigated. In particular, different BAs tend to have similar fluorescence spectra due to the ubiquity of biological endogenous fluorophores producing a signal in the UV range, making data analysis extremely challenging. The Universal Multi Event Locator (UMEL), a general method based on support vector regression, is commonly used to identify characteristic structures in arrays of data. In the first part of this work, we investigate the fluorescence emission spectra of different BA simulants and apply UMEL for their automatic classification. In the second part, we develop a strategy for applying UMEL to the discrimination of the spectra of different BA simulants. Through this strategy, it has been possible to discriminate between these simulants despite the high similarity of their fluorescence spectra. These preliminary results support the use of SVR methods to classify BAs' spectral signatures.

  7. Analysis of individual classification of lameness using automatic measurement of back posture in dairy cattle.

    PubMed

    Viazzi, S; Bahr, C; Schlageter-Tello, A; Van Hertem, T; Romanini, C E B; Pluk, A; Halachmi, I; Lokhorst, C; Berckmans, D

    2013-01-01

    Currently, diagnosis of lameness at an early stage in dairy cows relies on visual observation by the farmer, which is time consuming and often omitted. Many studies have tried to develop automatic cow lameness detection systems. However, those studies apply thresholds to the whole population to detect whether or not an individual cow is lame. Therefore, the objective of this study was to develop and test an individualized version of the body movement pattern score, which uses back posture to classify lameness into 3 classes, and to compare both the population and the individual approach under farm conditions. In a data set of 223 videos from 90 cows, 76% of cows were correctly classified, with an 83% true positive rate and 22% false positive rate when using the population approach. A new data set, containing 105 videos of 8 cows that had moved through all 3 lameness classes, was used for an ANOVA on the 3 different classes, showing that body movement pattern scores differed significantly among cows. Moreover, the classification accuracy and the true positive rate increased by 10 percentage points up to 91%, and the false positive rate decreased by 4 percentage points down to 6% when based on an individual threshold compared with a population threshold. Copyright © 2013 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.

  8. Automatic classification of minimally invasive instruments based on endoscopic image sequences

    NASA Astrophysics Data System (ADS)

    Speidel, Stefanie; Benzko, Julia; Krappe, Sebastian; Sudra, Gunther; Azad, Pedram; Müller-Stich, Beat Peter; Gutt, Carsten; Dillmann, Rüdiger

    2009-02-01

    Minimally invasive surgery is nowadays a frequently applied technique and can be regarded as a major breakthrough in surgery. The surgeon has to adopt special operating techniques and deal with difficulties like complex hand-eye coordination and restricted mobility. To alleviate these constraints, we propose to enhance the surgeon's capabilities by providing context-aware assistance using augmented reality techniques. To analyze the current situation for context-aware assistance, we need intraoperatively gained sensor data and a model of the intervention. A situation consists of information about the performed activity, the used instruments, the surgical objects, and the anatomical structures, and defines the state of an intervention for a given moment in time. The endoscopic images provide a rich source of information which can be used for an image-based analysis. Different visual cues are observed in order to perform an image-based analysis with the objective of gaining as much information as possible about the current situation. An important visual cue is the automatic recognition of the instruments which appear in the scene. In this paper we present the classification of minimally invasive instruments using endoscopic images. The instruments are not modified by markers. The system segments the instruments in the current image and recognizes the instrument type based on three-dimensional instrument models.

  9. Automatic supervised classification of multi-temporal images using the expectation-maximization algorithm

    NASA Astrophysics Data System (ADS)

    Chi, Junhwa; Kim, Hyun-cheol

    2017-04-01

    The impact of nonstationary phenomena is a challenging problem for analyzing multi-temporal remote sensing data. Spectral signatures are subject to change over time due to natural (e.g. seasonal phenology or environmental conditions) and disruptive impacts. For example, the same class can show quite different spectral signatures in two temporal remote sensing images. The phenomenon of evolving spectral features is referred to as spectral drift in the remote sensing community, or dataset shift in the machine learning community. Under the effect of spectral drift, we need to address the problem that the distributions of the training and testing sets are different, which is more difficult than single-image classification. That is, a supervised model may not be capable of explaining the testing set. In this study, we utilize the expectation-maximization algorithm to classify multi-temporal sea ice images acquired by optical remote sensing sensors. The proposed technique allows the classifier's parameters, obtained by supervised learning on a specific image, to be updated automatically on the basis of the distribution of a new image to be classified.
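
    One concrete way to realize this update, sketched below under simplifying assumptions: per-class Gaussians estimated on the labeled source image seed a mixture model, and EM re-fits the mixture, unsupervised, to the pixels of the new image. Keeping the semantic label of the seeding class for each component assumes the drift is moderate.

    ```python
    # Sketch: EM-based adaptation to spectral drift. Class-conditional
    # Gaussians from a labeled source image initialize a GaussianMixture that
    # EM then re-fits to the pixels of a new, unlabeled image. Component k is
    # assumed to keep the label of the class that initialized it.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(3)
    n_classes, n_bands = 3, 4
    X_train = rng.normal(size=(300, n_bands))         # labeled source pixels
    y_train = rng.integers(0, n_classes, size=300)    # (placeholders)
    X_new = rng.normal(loc=0.3, size=(500, n_bands))  # drifted target pixels

    means0 = np.array([X_train[y_train == k].mean(axis=0)
                       for k in range(n_classes)])
    weights0 = np.bincount(y_train) / len(y_train)

    gmm = GaussianMixture(n_components=n_classes, means_init=means0,
                          weights_init=weights0, random_state=0)
    labels_new = gmm.fit_predict(X_new)  # EM updates parameters on the new image
    print(np.bincount(labels_new))
    ```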

  10. Automatic Classification of the Vestibulo-Ocular Reflex Nystagmus: Integration of Data Clustering and System Identification.

    PubMed

    Ranjbaran, Mina; Smith, Heather L H; Galiana, Henrietta L

    2016-04-01

    The vestibulo-ocular reflex (VOR) plays an important role in our daily activities by enabling us to fixate on objects during head movements. Modeling and identification of the VOR improves our insight into the system's behavior and improves the diagnosis of various disorders. However, the switching nature of eye movements (nystagmus), including the VOR, makes dynamic analysis challenging. The first step in such analysis is to segment data into its subsystem responses (here, slow and fast segment intervals). Misclassification of segments results in biased analysis of the system of interest. Here, we develop a novel three-step algorithm to classify VOR data into slow and fast intervals automatically. The proposed algorithm is initialized using a K-means clustering method. The initial classification is then refined using system identification approaches and prediction error statistics. The performance of the algorithm is evaluated on simulated and experimental data. The new algorithm's performance is shown to be much improved over previous methods, in terms of higher specificity.
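
    The K-means initialization step is straightforward; the sketch below clusters a synthetic eye-velocity trace into slow- and fast-phase candidates on the magnitude of velocity, leaving out the identification-based refinement of the later steps.

    ```python
    # Sketch: the clustering initialization described above. Samples are
    # split into slow- and fast-phase candidates with K-means on |velocity|;
    # the paper's system-identification refinement is not reproduced.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(4)
    t = np.arange(2000) / 1000.0
    eye_vel = 10 * np.sin(2 * np.pi * 0.5 * t)   # slow-phase-like motion
    eye_vel[::200] += 150                        # sparse fast-phase spikes

    feats = np.abs(eye_vel).reshape(-1, 1)       # fast phases are much larger
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(feats)
    fast_cluster = np.argmax(km.cluster_centers_.ravel())
    is_fast = km.labels_ == fast_cluster
    print(f"fast-phase samples: {is_fast.sum()} of {len(t)}")
    ```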

  11. Automatic detection and classification of EOL-concrete and resulting recovered products by hyperspectral imaging

    NASA Astrophysics Data System (ADS)

    Palmieri, Roberta; Bonifazi, Giuseppe; Serranti, Silvia

    2014-05-01

    The recovery of materials from Demolition Waste (DW) represents one of the main targets of the recycling industry, and its characterization is important in order to set up efficient sorting and/or quality control systems. End-Of-Life (EOL) concrete materials identification is necessary to maximize DW conversion into useful secondary raw materials, so it is fundamental to develop strategies for the implementation of an automatic recognition system for the recovered products. In this paper, the HyperSpectral Imaging (HSI) technique was applied in order to determine DW composition. Hyperspectral images were acquired by a laboratory device equipped with an HSI sensing device working in the near-infrared range (1000-1700 nm): a NIR Spectral Camera™ embedding an ImSpector™ N17E (SPECIM Ltd, Finland). Acquired spectral data were analyzed with the PLS_Toolbox (Version 7.5, Eigenvector Research, Inc.) under the Matlab® environment (Version 7.11.1, The Mathworks, Inc.), applying different chemometric methods: Principal Component Analysis (PCA) for an exploratory data approach and Partial Least Squares-Discriminant Analysis (PLS-DA) to build classification models. Results showed that it is possible to recognize DW materials, distinguishing recycled aggregates from contaminants (e.g. bricks, gypsum, plastics, wood, foam, etc.). The developed procedure is cheap, fast and non-destructive: it could be used to make some steps of the recycling process more efficient and less expensive.
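
    The same PCA-then-PLS-DA workflow can be prototyped in Python as well. A common way to implement PLS-DA, assumed in the sketch below, is to run PLS regression against one-hot class indicators and take the arg-max of the predicted scores; the spectra and class labels are random placeholders.

    ```python
    # Sketch: exploratory PCA followed by PLS-DA classification, as in the
    # chemometric workflow above, on placeholder NIR spectra.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(5)
    n_pixels, n_bands, n_classes = 200, 240, 3  # e.g. aggregate/brick/gypsum
    X = rng.normal(size=(n_pixels, n_bands))    # placeholder spectra
    y = rng.integers(0, n_classes, size=n_pixels)

    scores = PCA(n_components=3).fit_transform(X)   # exploratory step

    # PLS-DA: one-hot labels, PLS regression, classify by largest score.
    Y = np.eye(n_classes)[y]
    pls = PLSRegression(n_components=10).fit(X, Y)
    y_pred = pls.predict(X).argmax(axis=1)
    print(f"training accuracy: {(y_pred == y).mean():.2%}")
    ```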

  12. Automatic classification and quantification of cell adhesion locations on the endothelium

    PubMed Central

    Wei, Jie; Cai, Bin; Zhang, Lin; Fu, Bingmei M.

    2015-01-01

    To target tumor hematogenous metastasis and to understand how leukocytes cross the microvessel wall to perform immune functions, it is necessary to elucidate the adhesion locations and transmigration pathways of tumor cells and leukocytes on/across the endothelial cells forming the microvessel wall. We developed an algorithm to classify and quantify cell adhesion locations from microphotographs taken during experiments on tumor cell/leukocyte adhesion in individual microvessels. The first step is to identify the microvessel by a novel gravity-field dynamic programming procedure. Next, an anisotropic image smoothing suppresses noise without unduly mitigating crucial visual features. After an adaptive thresholding process further tackles uneven lighting conditions during the imaging process, a series of local mathematical morphological operators and eigenanalysis identify tumor cells or leukocytes. Finally, a novel double component labeling procedure categorizes the cell adhesion locations. This algorithm has generated consistently encouraging performance on microphotographs obtained from in vivo experiments for tumor cell and leukocyte adhesion locations on the endothelium forming the microvessel wall. Compared with human experts, this algorithm used 1/500-1/200 of the time and avoided errors due to human subjectivity. Our automatic classification and quantification method provides a reliable and cost-efficient approach for biomedical image processing. PMID:25549777

  13. Automatic type classification and speaker identification of african elephant (Loxodonta africana) vocalizations

    NASA Astrophysics Data System (ADS)

    Clemins, Patrick J.; Johnson, Michael T.

    2003-04-01

    This paper presents a system for automatically classifying African elephant vocalizations based on systems used for human speech recognition and speaker identification. The experiments are performed on vocalizations collected from captive elephants in a naturalistic environment. Features used for classification include Mel-Frequency Cepstral Coefficients (MFCCs) and log energy which are the most common features used in human speech processing. Since African elephants use lower frequencies than humans in their vocalizations, the MFCCs are computed using a shifted Mel-Frequency filter bank to emphasize the infrasound range of the frequency spectrum. In addition to these features, the use of less traditional features such as those based on fundamental frequency and the phase of the frequency spectrum is also considered. A Hidden Markov Model with Gaussian mixture state probabilities is used to model each type of vocalization. Vocalizations are classified based on type, speaker and estrous cycle. Experiments on continuous call type recognition, which can classify multiple vocalizations in the same utterance, are also performed. The long-term goal of this research is to develop a universal analysis framework and robust feature set for animal vocalizations that can be applied to many species.

  14. Statistical Comparison of Classifiers Applied to the Interferential Tear Film Lipid Layer Automatic Classification

    PubMed Central

    Remeseiro, B.; Penas, M.; Mosquera, A.; Novo, J.; Penedo, M. G.; Yebra-Pimentel, E.

    2012-01-01

    The tear film lipid layer is heterogeneous among the population. Its classification depends on its thickness and can be done using the interference pattern categories proposed by Guillon. The interference phenomena can be characterised as a colour texture pattern, which can be automatically classified into one of these categories. From a photograph of the eye, a region of interest is detected and its low-level features are extracted, generating a feature vector that describes it, to be finally classified into one of the target categories. This paper presents an exhaustive study of the problem at hand, using different texture analysis methods in three colour spaces and different machine learning algorithms. All these methods and classifiers have been tested on a dataset composed of 105 images from healthy subjects and the results have been statistically analysed. As a result, the manual process done by experts can be automated with the benefits of being faster and unaffected by subjective factors, with maximum accuracy over 95%. PMID:22567040

  15. Automatic Classification of Normal and Cancer Lung CT Images Using Multiscale AM-FM Features.

    PubMed

    Magdy, Eman; Zayed, Nourhan; Fakhr, Mahmoud

    2015-01-01

    Computer-aided diagnostic (CAD) systems provide fast and reliable diagnosis for medical images. In this paper, a CAD system is proposed to analyze and automatically segment the lungs and classify each lung as normal or cancerous. Using a lung CT dataset from 70 different patients, Wiener filtering is first applied to the original CT images as a preprocessing step. Secondly, histogram analysis is combined with thresholding and morphological operations to segment the lung regions and extract each lung separately. Thirdly, the Amplitude-Modulation Frequency-Modulation (AM-FM) method is used to extract features for the ROIs. The significant AM-FM features are then selected using Partial Least Squares Regression (PLSR) for the classification step. Finally, K-nearest neighbour (KNN), support vector machine (SVM), naïve Bayes, and linear classifiers are used with the selected AM-FM features. The performance of each classifier is evaluated in terms of accuracy, sensitivity, and specificity. The results indicate that the proposed CAD system succeeded in differentiating between normal and cancerous lungs, achieving 95% accuracy with the linear classifier.

  16. Automatic Classification of Normal and Cancer Lung CT Images Using Multiscale AM-FM Features

    PubMed Central

    Magdy, Eman; Zayed, Nourhan; Fakhr, Mahmoud

    2015-01-01

    Computer-aided diagnostic (CAD) systems provide fast and reliable diagnosis for medical images. In this paper, a CAD system is proposed to analyze and automatically segment the lungs and classify each lung as normal or cancerous. Using a lung CT dataset from 70 different patients, Wiener filtering is first applied to the original CT images as a preprocessing step. Secondly, histogram analysis is combined with thresholding and morphological operations to segment the lung regions and extract each lung separately. Thirdly, the Amplitude-Modulation Frequency-Modulation (AM-FM) method is used to extract features for the ROIs. The significant AM-FM features are then selected using Partial Least Squares Regression (PLSR) for the classification step. Finally, K-nearest neighbour (KNN), support vector machine (SVM), naïve Bayes, and linear classifiers are used with the selected AM-FM features. The performance of each classifier is evaluated in terms of accuracy, sensitivity, and specificity. The results indicate that the proposed CAD system succeeded in differentiating between normal and cancerous lungs, achieving 95% accuracy with the linear classifier. PMID:26451137
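
    The tail of this pipeline (PLSR-based feature selection followed by a simple classifier) can be sketched as below, with random placeholders for the AM-FM features; ranking features by absolute PLS loading weights is one common selection heuristic, assumed here rather than taken from the paper.

    ```python
    # Sketch: PLSR-based feature selection feeding a KNN classifier, in the
    # spirit of the pipeline above. X stands in for AM-FM features per lung.
    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(10)
    X = rng.normal(size=(70, 50))        # placeholder AM-FM features
    y = rng.integers(0, 2, size=70)      # 0 = normal, 1 = cancer

    pls = PLSRegression(n_components=2).fit(X, y.astype(float))
    weights = np.abs(pls.x_weights_).sum(axis=1)
    top = np.argsort(weights)[-10:]      # keep the 10 highest-weight features

    knn = KNeighborsClassifier(n_neighbors=5).fit(X[:, top], y)
    print(f"training accuracy: {knn.score(X[:, top], y):.2%}")
    ```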

  17. Automatic screening and classification of diabetic retinopathy and maculopathy using fuzzy image processing.

    PubMed

    Rahim, Sarni Suhaila; Palade, Vasile; Shuttleworth, James; Jayne, Chrisina

    2016-12-01

    Digital retinal imaging is a challenging screening method for which effective, robust and cost-effective approaches are still to be developed. Regular screening for diabetic retinopathy and diabetic maculopathy is necessary in order to identify the group at risk of visual impairment. This paper presents a novel automatic detection system for diabetic retinopathy and maculopathy in eye fundus images, employing fuzzy image processing techniques. The paper first introduces the existing systems for diabetic retinopathy screening, with an emphasis on maculopathy detection methods. The proposed medical decision support system consists of four parts, namely: image acquisition, image preprocessing including localisation of four retinal structures, feature extraction, and the classification of diabetic retinopathy and maculopathy. A combination of fuzzy image processing techniques, the Circular Hough Transform and several feature extraction methods is implemented in the proposed system. The paper also presents a novel technique for macula region localisation in order to detect the maculopathy. In addition to the proposed detection system, the paper highlights a novel online dataset and presents the dataset collection process, the expert diagnosis process, and the advantages of our online database compared to other public eye fundus image databases for diabetic retinopathy purposes.

  18. Automatic Classification of Staphylococci by Principal-Component Analysis and a Gradient Method

    PubMed Central

    Hill, L. R.; Silvestri, L. G.; Ihm, P.; Farchi, G.; Lanciani, P.

    1965-01-01

    Hill, L. R. (Università Statale, Milano, Italy), L. G. Silvestri, P. Ihm, G. Farchi, and P. Lanciani. Automatic classification of staphylococci by principal-component analysis and a gradient method. J. Bacteriol. 89:1393–1401. 1965.—Forty-nine strains from the species Staphylococcus aureus, S. saprophyticus, S. lactis, S. afermentans, and S. roseus were submitted to different taxometric analyses; clustering was performed by single linkage, by the unweighted pair group method, and by principal-component analysis followed by a gradient method. Results were substantially the same with all methods. All S. aureus clustered together, sharply separated from S. roseus and S. afermentans; S. lactis and S. saprophyticus fell between, with the latter nearer to S. aureus. The main purpose of this study was to introduce a new taxometric technique, based on principal-component analysis followed by a gradient method, and to compare it with some other methods in current use. Advantages of the new method are complete automation and therefore greater objectivity, execution of the clustering in a space of reduced dimensions in which different characters have different weights, easy recognition of taxonomically important characters, and opportunity for representing clusters in three-dimensional models; the principal disadvantage is the need for large computer facilities. PMID:14293013

  19. Semi-Automatic Grading of Students' Answers Written in Free Text

    ERIC Educational Resources Information Center

    Escudeiro, Nuno; Escudeiro, Paula; Cruz, Augusto

    2011-01-01

    The correct grading of free text answers to exam questions during an assessment process is time consuming and subject to fluctuations in the application of evaluation criteria, particularly when the number of answers is high (in the hundreds). In consequence of these fluctuations, inherent to human nature, and largely determined by emotional…

  20. Improved chemical text mining of patents with infinite dictionaries and automatic spelling correction.

    PubMed

    Sayle, Roger; Xie, Paul Hongxing; Muresan, Sorel

    2012-01-23

    The text mining of patents of pharmaceutical interest poses a number of unique challenges not encountered in other fields of text mining. Unlike fields such as bioinformatics, where the number of terms of interest is enumerable and essentially static, systematic chemical nomenclature can describe an infinite number of molecules. Hence, the dictionary- and ontology-based techniques that are commonly used for gene names, diseases, species, etc., have limited utility when searching for novel therapeutic compounds in patents. Additionally, the length and the composition of IUPAC-like names make them more susceptible to typographic problems: OCR failures, human spelling errors, and hyphenation and line-breaking issues. This work describes a novel technique, called CaffeineFix, designed to efficiently identify chemical names in free text, even in the presence of typographical errors. Corrected chemical names are generated as input for name-to-structure software. This forms a preprocessing pass, independent of the name-to-structure software used, and is shown to greatly improve the results of chemical text mining in our study.

  1. The Automatic Assessment of Free Text Answers Using a Modified BLEU Algorithm

    ERIC Educational Resources Information Center

    Noorbehbahani, F.; Kardan, A. A.

    2011-01-01

    e-Learning plays an undoubtedly important role in today's education and assessment is one of the most essential parts of any instruction-based learning process. Assessment is a common way to evaluate a student's knowledge regarding the concepts related to learning objectives. In this paper, a new method for assessing the free text answers of…

  2. BROWSER: An Automatic Indexing On-Line Text Retrieval System. Annual Progress Report.

    ERIC Educational Resources Information Center

    Williams, J. H., Jr.

    The development and testing of the Browsing On-line With Selective Retrieval (BROWSER) text retrieval system allowing a natural language query statement and providing on-line browsing capabilities through an IBM 2260 display terminal is described. The prototype system contains data bases of 25,000 German language patent abstracts, 9,000 English…

  3. Automatic classification for mammogram backgrounds based on BI-RADS complexity definition and on a multi-content analysis framework

    NASA Astrophysics Data System (ADS)

    Wu, Jie; Besnehard, Quentin; Marchessoux, Cédric

    2011-03-01

    Clinical studies for the validation of new medical imaging devices require hundreds of images. An important step in creating and tuning the study protocol is the classification of images into "difficult" and "easy" cases. This consists of classifying each image based on features such as the complexity of the background and the visibility of the disease (lesions). An automatic medical background classification tool for mammograms would therefore help in such clinical studies. This classification tool is based on a multi-content analysis (MCA) framework first developed to recognize the image content of computer screenshots. With the implementation of new texture features and a defined breast density scale, the MCA framework is able to automatically classify digital mammograms with satisfying accuracy. The BI-RADS (Breast Imaging Reporting and Data System) density scale, which standardizes mammography reporting terminology and assessment and recommendation categories, is used for grouping the mammograms. Selected features are input into a decision tree classification scheme in the MCA framework, the so-called "weak classifier" (any classifier with a global error rate below 50%). With the AdaBoost iteration algorithm, these "weak classifiers" are combined into a "strong classifier" (a classifier with a low global error rate) for classifying one category. The classification results for one "strong classifier" show good accuracy with high true-positive rates. For the four categories the results are: TP=90.38%, TN=67.88%, FP=32.12% and FN=9.62%.
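
    The weak-to-strong construction described here maps directly onto a standard boosting implementation. A minimal sketch with scikit-learn, using shallow decision trees as weak classifiers and placeholder texture features; note the weak-learner argument is named estimator in recent scikit-learn releases (base_estimator in older ones).

    ```python
    # Sketch: AdaBoost combining weak decision-tree classifiers (error rate
    # below 50%) into one strong classifier, as described above. Features and
    # density labels are placeholders for the paper's texture features and
    # BI-RADS categories.
    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(6)
    X = rng.normal(size=(400, 20))       # placeholder texture features
    y = rng.integers(0, 2, size=400)     # 1 = target density category

    strong = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=2),  # weak learner
        n_estimators=100,
    )
    strong.fit(X, y)
    print(f"training accuracy: {strong.score(X, y):.2%}")
    ```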

  4. A Grid Service for Automatic Land Cover Classification Using Hyperspectral Images

    NASA Astrophysics Data System (ADS)

    Jasso, H.; Shin, P.; Fountain, T.; Pennington, D.; Ding, L.; Cotofana, N.

    2004-12-01

    Hyperspectral images are collected using Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) optical sensors [1]. A total of 224 contiguous channels are measured across the spectral range, from 400 to 2500 nanometers. We present a system for the automatic classification of land cover using hyperspectral images, and propose an architecture for deploying the system in a grid environment that harnesses distributed file storage and CPU resources for the task. Originally, we ran the following data mining algorithms on a 300x300 image of a section of the Sevilleta National Wildlife Refuge in New Mexico [2]: Maximum Likelihood, Naive Bayes Classifier, Minimum Distance, and Support Vector Machine (SVM). For this, ground truth for 673 pixels was manually collected according to eight possible land covers: river, riparian, agriculture, arid upland, semi-arid upland, barren, pavement, or clouds. The classification accuracies for these algorithms were 96.4%, 90.9%, 88.4%, and 77.6%, respectively [3]. In this study, we noticed that the slope between adjacent frequencies produces specific patterns across the whole spectrum, giving a good indication of the pixel's land cover type. Wavelet analysis makes these global patterns explicit by breaking down the signal into variable-sized windows, where long time windows capture low-frequency information and short time windows capture high-frequency information. High-frequency information translates to information among close neighbors, while low-frequency information reflects the overall trend of the features. We pre-processed the data using different families of wavelets, resulting in an increase in the performance of the Naive Bayesian Classifier and SVM to 94.2% and 90.1%, respectively. Classification accuracy with SVM was further increased to 97.1% by modifying the mechanism by which multi-class classification is achieved using basic two-class SVMs. The original winner-take-all SVM scheme was replaced with a one-against-one scheme, in which k(k-1
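
    The wavelet preprocessing step is easy to prototype with PyWavelets, and scikit-learn's SVC already trains one-against-one classifiers internally for multi-class problems. A sketch with placeholder spectra standing in for the AVIRIS pixels and ground-truth labels:

    ```python
    # Sketch: wavelet decomposition of each per-pixel spectrum, followed by
    # a multi-class SVM (one-against-one internally in SVC).
    import numpy as np
    import pywt
    from sklearn.svm import SVC

    def wavelet_features(spectrum, wavelet="db4", level=3):
        """Concatenate approximation and detail coefficients of the spectrum."""
        coeffs = pywt.wavedec(spectrum, wavelet, level=level)
        return np.concatenate(coeffs)

    rng = np.random.default_rng(7)
    spectra = rng.normal(size=(673, 224))    # placeholder AVIRIS pixels
    labels = rng.integers(0, 8, size=673)    # 8 land-cover classes

    X = np.array([wavelet_features(s) for s in spectra])
    clf = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, labels)
    print(f"training accuracy: {clf.score(X, labels):.2%}")
    ```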

  5. Classification of building infrastructure and automatic building footprint delineation using airborne laser swath mapping data

    NASA Astrophysics Data System (ADS)

    Caceres, Jhon

    image analysis for obtaining an initial classification, an automatic approach for delineating accurate building footprints is presented. The physical fact that laser pulses that happen to strike building edges can produce very different first- and last-return elevations has long been recognized. However, in older-generation ALSM systems (<50 kHz pulse rates) such points were too few and far between to delineate building footprints precisely. Furthermore, without the robust separation of nearby trees and vegetation from the buildings, simply extracting ALSM shots where the elevation of the first return was much higher than the elevation of the last return was not a reliable means of identifying building footprints. However, with the advent of ALSM systems with pulse rates in excess of 100 kHz, and by using spin-image-based segmentation, it is now possible to extract building edges from the point cloud. A refined classification resulting from incorporating "on-edge" information is developed for obtaining quadrangular footprints. The footprint fitting process involves line generalization, least-squares-based clustering, and dominant-point finding for segmenting individual building edges. In addition, an algorithm for fitting complex footprints using the segmented edges and data inside footprints is also proposed.

  6. Automatic segmentation of MR brain images of preterm infants using supervised classification.

    PubMed

    Moeskops, Pim; Benders, Manon J N L; Chiţ, Sabina M; Kersbergen, Karina J; Groenendaal, Floris; de Vries, Linda S; Viergever, Max A; Išgum, Ivana

    2015-09-01

    Preterm birth is often associated with impaired brain development. The state and expected progression of preterm brain development can be evaluated using quantitative assessment of MR images. Such measurements require accurate segmentation of different tissue types in those images. This paper presents an algorithm for the automatic segmentation of unmyelinated white matter (WM), cortical grey matter (GM), and cerebrospinal fluid in the extracerebral space (CSF). The algorithm uses supervised voxel classification in three subsequent stages. In the first stage, voxels that can easily be assigned to one of the three tissue types are labelled. In the second stage, dedicated analysis of the remaining voxels is performed. The first and the second stages both use two-class classification for each tissue type separately. Possible inconsistencies that could result from these tissue-specific segmentation stages are resolved in the third stage, which performs multi-class classification. A set of T1- and T2-weighted images was analysed, but the optimised system performs automatic segmentation using a T2-weighted image only. We have investigated the performance of the algorithm when using training data randomly selected from completely annotated images as well as when using training data from only partially annotated images. The method was evaluated on images of preterm infants acquired at 30 and 40 weeks postmenstrual age (PMA). When the method was trained using random selection from the completely annotated images, the average Dice coefficients were 0.95 for WM, 0.81 for GM, and 0.89 for CSF on an independent set of images acquired at 30 weeks PMA. When the method was trained using only the partially annotated images, the average Dice coefficients were 0.95 for WM, 0.78 for GM and 0.87 for CSF for the images acquired at 30 weeks PMA, and 0.92 for WM, 0.80 for GM and 0.85 for CSF for the images acquired at 40 weeks PMA. Even though the segmentations obtained using training data

  7. EnvMine: a text-mining system for the automatic extraction of contextual information.

    PubMed

    Tamames, Javier; de Lorenzo, Victor

    2010-06-01

    For ecological studies, it is crucial to count on adequate descriptions of the environments and samples being studied. Such a description must be done in terms of their physicochemical characteristics, allowing a direct comparison between different environments that would be difficult to do otherwise. Also, the characterization must include the precise geographical location, to make possible the study of geographical distributions and biogeographical patterns. Currently, there is no schema for annotating these environmental features, and these data have to be extracted from textual sources (published articles). So far, this had to be performed by manual inspection of the corresponding documents. To facilitate this task, we have developed EnvMine, a set of text-mining tools devoted to retrieving contextual information (physicochemical variables and geographical locations) from textual sources of any kind. EnvMine is capable of retrieving the physicochemical variables cited in the text by means of the accurate identification of their associated units of measurement. In this task, the system achieves a recall (percentage of items retrieved) of 92% with less than 1% error. Also, a Bayesian classifier was tested for distinguishing parts of the text describing environmental characteristics from others dealing with, for instance, experimental settings. Regarding the identification of geographical locations, the system takes advantage of existing databases such as GeoNames to achieve 86% recall with 92% precision. The identification of a location also includes the determination of its exact coordinates (latitude and longitude), thus allowing the calculation of distances between the individual locations. EnvMine is a very efficient method for extracting contextual information from different text sources, like published articles or web pages. This tool can help in determining the precise location and physicochemical variables of sampling sites, thus facilitating the performance

  8. EnvMine: A text-mining system for the automatic extraction of contextual information

    PubMed Central

    2010-01-01

    Background For ecological studies, it is crucial to count on adequate descriptions of the environments and samples being studied. Such a description must be done in terms of their physicochemical characteristics, allowing a direct comparison between different environments that would be difficult to do otherwise. Also, the characterization must include the precise geographical location, to make possible the study of geographical distributions and biogeographical patterns. Currently, there is no schema for annotating these environmental features, and these data have to be extracted from textual sources (published articles). So far, this had to be performed by manual inspection of the corresponding documents. To facilitate this task, we have developed EnvMine, a set of text-mining tools devoted to retrieving contextual information (physicochemical variables and geographical locations) from textual sources of any kind. Results EnvMine is capable of retrieving the physicochemical variables cited in the text by means of the accurate identification of their associated units of measurement. In this task, the system achieves a recall (percentage of items retrieved) of 92% with less than 1% error. Also, a Bayesian classifier was tested for distinguishing parts of the text describing environmental characteristics from others dealing with, for instance, experimental settings. Regarding the identification of geographical locations, the system takes advantage of existing databases such as GeoNames to achieve 86% recall with 92% precision. The identification of a location also includes the determination of its exact coordinates (latitude and longitude), thus allowing the calculation of distances between the individual locations. Conclusion EnvMine is a very efficient method for extracting contextual information from different text sources, like published articles or web pages. This tool can help in determining the precise location and physicochemical variables of sampling sites, thus
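
    The unit-anchored extraction idea at the core of EnvMine can be illustrated with a toy regular expression; the pattern list below is a tiny illustrative subset, nowhere near the tool's real coverage.

    ```python
    # Sketch: extracting physicochemical variables by recognizing the units
    # of measurement attached to numbers, in the spirit of the approach above.
    import re

    UNIT = r"(°C|mM|µM|g/L|m|km|%)"
    PATTERN = re.compile(rf"(\d+(?:\.\d+)?)\s*{UNIT}")

    text = ("Samples were taken at 35 m depth; temperature was 12.5 °C "
            "and salinity 3.2 %.")
    for value, unit in PATTERN.findall(text):
        print(value, unit)
    ```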

  9. FigSum: automatically generating structured text summaries for figures in biomedical literature.

    PubMed

    Agarwal, Shashank; Yu, Hong

    2009-11-14

    Figures are frequently used in biomedical articles to support research findings; however, they are often difficult to comprehend based on their legends alone and information from the full-text articles is required to fully understand them. Previously, we found that the information associated with a single figure is distributed throughout the full-text article the figure appears in. Here, we develop and evaluate a figure summarization system - FigSum, which aggregates this scattered information to improve figure comprehension. For each figure in an article, FigSum generates a structured text summary comprising one sentence from each of the four rhetorical categories - Introduction, Methods, Results and Discussion (IMRaD). The IMRaD category of sentences is predicted by an automated machine learning classifier. Our evaluation shows that FigSum captures 53% of the sentences in the gold standard summaries annotated by biomedical scientists and achieves an average ROUGE-1 score of 0.70, which is higher than a baseline system.
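
    ROUGE-1, the metric reported above, is simply unigram recall against a gold summary and fits in a few lines; real evaluations average over many documents and may stem or stopword-filter first.

    ```python
    # Sketch: ROUGE-1 as unigram recall of a system summary against a gold
    # summary. The two example summaries are invented.
    from collections import Counter

    def rouge_1(system: str, gold: str) -> float:
        sys_counts = Counter(system.lower().split())
        gold_counts = Counter(gold.lower().split())
        overlap = sum(min(c, sys_counts[w]) for w, c in gold_counts.items())
        return overlap / sum(gold_counts.values())

    gold = "the system aggregates one sentence from each IMRaD section"
    system = "one sentence from each IMRaD section is aggregated"
    print(f"ROUGE-1 recall: {rouge_1(system, gold):.2f}")
    ```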

  10. Automatic detection of patients with invasive fungal disease from free-text computed tomography (CT) scans.

    PubMed

    Martinez, David; Ananda-Rajah, Michelle R; Suominen, Hanna; Slavin, Monica A; Thursky, Karin A; Cavedon, Lawrence

    2015-02-01

    Invasive fungal diseases (IFDs) are associated with considerable health and economic costs. Surveillance of the more diagnostically challenging invasive fungal diseases, specifically of the sino-pulmonary system, is not feasible for many hospitals because case finding is a costly and labour intensive exercise. We developed text classifiers for detecting such IFDs from free-text radiology (CT) reports, using machine-learning techniques. We obtained free-text reports of CT scans performed over a specific hospitalisation period (2003-2011), for 264 IFD and 289 control patients from three tertiary hospitals. We analysed IFD evidence at patient, report, and sentence levels. Three infectious disease experts annotated the reports of 73 IFD-positive patients for language suggestive of IFD at sentence level, and graded the sentences as to whether they suggested or excluded the presence of IFD. Reliable agreement between annotators was obtained and this was used as training data for our classifiers. We tested a variety of Machine Learning (ML), rule based, and hybrid systems, with feature types including bags of words, bags of phrases, and bags of concepts, as well as report-level structured features. Evaluation was carried out over a robust framework with separate Development and Held-Out datasets. The best systems (using Support Vector Machines) achieved very high recall at report- and patient-levels over unseen data: 95% and 100% respectively. Precision at report-level over held-out data was 71%; however, most of the associated false-positive reports (53%) belonged to patients who had a previous positive report appropriately flagged by the classifier, reducing negative impact in practice. Our machine learning application holds the potential for developing systematic IFD surveillance systems for hospital populations. Copyright © 2014 Elsevier Inc. All rights reserved.
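
    A minimal bag-of-words SVM in the spirit of the system described, with two invented example reports; a real classifier would be trained on the expert-annotated corpus and tuned for the high recall the surveillance task demands.

    ```python
    # Sketch: TF-IDF bags of words and short phrases feeding a linear SVM,
    # as in the best-performing configuration above. The reports and label
    # are fabricated examples, not study data.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    reports = [
        "nodular opacity with surrounding halo sign in right upper lobe",
        "lungs are clear, no focal consolidation or pleural effusion",
    ]
    labels = [1, 0]   # 1 = language suggestive of invasive fungal disease

    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),  # words and two-word phrases
        LinearSVC(class_weight="balanced"),   # counteracts class imbalance
    )
    clf.fit(reports, labels)
    print(clf.predict(["halo sign noted adjacent to cavitary lesion"]))
    ```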

  11. Does semi-automatic bone-fragment segmentation improve the reproducibility of the Letournel acetabular fracture classification?

    PubMed

    Boudissa, M; Orfeuvre, B; Chabanas, M; Tonetti, J

    2017-09-01

    The Letournel classification of acetabular fracture shows poor reproducibility in inexperienced observers, despite the introduction of 3D imaging. We therefore developed a method of semi-automatic segmentation based on CT data. The present prospective study aimed to assess: (1) whether semi-automatic bone-fragment segmentation increased the rate of correct classification; (2) if so, in which fracture types; and (3) feasibility using the open-source itksnap 3.0 software package, without incurring extra cost for users. The hypothesis was that semi-automatic segmentation of acetabular fractures significantly increases the rate of correct classification by orthopedic surgery residents. Twelve orthopedic surgery residents classified 23 acetabular fractures. Six used conventional 3D reconstructions provided by the center's radiology department (conventional group) and 6 others used reconstructions obtained by semi-automatic segmentation using the open-source itksnap 3.0 software package (segmentation group). Bone fragments were identified by specific colors. Correct classification rates were compared between groups with the chi-squared test. Assessment was repeated 2 weeks later to determine intra-observer reproducibility. Correct classification rates were significantly higher in the "segmentation" group: 114/138 (83%) versus 71/138 (52%); P<0.0001. The difference was greater for simple (36/36 (100%) versus 17/36 (47%); P<0.0001) than complex fractures (79/102 (77%) versus 54/102 (53%); P=0.0004). Mean segmentation time per fracture was 27±3 min [range, 21-35 min]. The segmentation group showed excellent intra-observer correlation coefficients, overall (ICC=0.88), and for simple (ICC=0.92) and complex fractures (ICC=0.84). Semi-automatic segmentation, identifying the various bone fragments, was effective in increasing the rate of correct acetabular fracture classification on the Letournel system by orthopedic surgery residents. It may be considered for routine use in education and training. Level of evidence: III.

  12. Regional Image Features Model for Automatic Classification between Normal and Glaucoma in Fundus and Scanning Laser Ophthalmoscopy (SLO) Images.

    PubMed

    Haleem, Muhammad Salman; Han, Liangxiu; Hemert, Jano van; Fleming, Alan; Pasquale, Louis R; Silva, Paolo S; Song, Brian J; Aiello, Lloyd Paul

    2016-06-01

    Glaucoma is one of the leading causes of blindness worldwide. There is no cure for glaucoma, but detection at its earliest stage and subsequent treatment can help patients avoid blindness. Currently, optic disc and retinal imaging facilitates glaucoma detection, but this method requires manual post-imaging modifications that are time-consuming and subjective, relying on image assessment by human observers. Therefore, it is necessary to automate this process. In this work, we first propose a novel computer-aided approach for automatic glaucoma detection based on a Regional Image Features Model (RIFM), which can automatically classify normal and glaucoma images on the basis of regional information. Different from existing methods, our approach can extract both geometric properties (e.g. morphometric properties) and non-geometric properties (e.g. pixel appearance/intensity values, texture) from images and significantly increase classification performance. Our proposed approach consists of three major contributions: automatic localisation of the optic disc, automatic segmentation of the disc, and classification between normal and glaucoma based on geometric and non-geometric properties of different regions of an image. We have compared our method with existing approaches and tested it on both fundus and scanning laser ophthalmoscopy (SLO) images. The experimental results show that our proposed approach outperforms state-of-the-art approaches using either geometric or non-geometric properties. The overall glaucoma classification accuracy for fundus images is 94.4%, and the accuracy of detection of suspected glaucoma in SLO images is 93.9%.

  14. Automatic extraction of reference genes from literature in plants based on text mining.

    PubMed

    He, Lin; Shen, Gengyu; Li, Fei; Huang, Shuiqing

    2015-01-01

    Real-Time Quantitative Polymerase Chain Reaction (qRT-PCR) is widely used in biological research. Selecting a stable reference gene is key to the validity of a qRT-PCR experiment. However, selecting an appropriate reference gene usually requires strict biological experiments for verification, making the selection process costly. The scientific literature has accumulated many results on the selection of reference genes. Therefore, mining reference genes for specific experimental environments from the literature can provide quite reliable reference genes for similar qRT-PCR experiments, with the advantages of reliability, economy and efficiency. This paper proposes an auxiliary method for discovering reference genes from the literature that integrates machine learning, natural language processing and text mining approaches. Validity tests showed that the new method achieves better precision and recall on the extraction of reference genes and their environments.

  15. Automatic classification of RDoC positive valence severity with a neural network.

    PubMed

    Clark, Cheryl; Wellner, Ben; Davis, Rachel; Aberdeen, John; Hirschman, Lynette

    2017-07-08

    Our objective was to develop a machine learning-based system to determine the severity of Positive Valence symptoms for a patient, based on information included in their initial psychiatric evaluation. Severity was rated by experts on an ordinal scale of 0-3 as follows: 0 (absent = no symptoms), 1 (mild = modest significance), 2 (moderate = requires treatment), 3 (severe = causes substantial impairment). We treated the task of assigning Positive Valence severity as a text classification problem. During development, we experimented with regularized multinomial logistic regression classifiers, gradient boosted trees, and feedforward, fully-connected neural networks. We found both regularization and feature selection via mutual information to be very important in preventing models from overfitting the data. Our best configuration was a neural network with three fully connected hidden layers with rectified linear unit activations. Our best performing system achieved a score of 77.86%. The evaluation metric is an inverse normalization of the Mean Absolute Error, presented as a percentage between 0 and 100, where 100 means the highest performance. Error analysis showed that 90% of the system errors involved neighboring severity categories. Machine learning text classification techniques with feature selection can be trained to recognize broad differences in Positive Valence symptom severity with a modest amount of training data (in this case 600 documents, 167 of which were unannotated). An increase in the amount of annotated data can increase the accuracy of symptom severity classification by several percentage points. Additional features and/or a larger training corpus may further improve accuracy. Copyright © 2017. Published by Elsevier Inc.
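
    A minimal sketch of the configuration described above, assuming scikit-learn and hypothetical placeholder notes: TF-IDF features are filtered by mutual information before a small three-hidden-layer ReLU network. Severity is treated here as plain multiclass, ignoring its ordinal structure, and the real system's features and layer sizes are not reproduced.

      from sklearn.pipeline import Pipeline
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.feature_selection import SelectKBest, mutual_info_classif
      from sklearn.neural_network import MLPClassifier

      # Hypothetical evaluation notes with expert severity ratings 0-3.
      notes = ["patient denies cravings and reports stable mood and intact daily function",
               "mild gambling urges noted but overall functioning largely preserved",
               "moderate compulsive spending now requires a structured treatment plan",
               "severe substance craving causing substantial impairment at work and home"]
      severity = [0, 1, 2, 3]

      model = Pipeline([
          ("tfidf", TfidfVectorizer()),
          # Mutual-information feature selection guards against overfitting.
          ("select", SelectKBest(mutual_info_classif, k=10)),
          # Three fully connected hidden layers with ReLU activations;
          # alpha adds L2 regularization.
          ("mlp", MLPClassifier(hidden_layer_sizes=(64, 64, 64),
                                activation="relu", alpha=1e-3, max_iter=500)),
      ])
      model.fit(notes, severity)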

  16. Text mining electronic hospital records to automatically classify admissions against disease: Measuring the impact of linking data sources.

    PubMed

    Kocbek, Simon; Cavedon, Lawrence; Martinez, David; Bain, Christopher; Manus, Chris Mac; Haffari, Gholamreza; Zukerman, Ingrid; Verspoor, Karin

    2016-12-01

    Text and data mining play an important role in obtaining insights from Health and Hospital Information Systems. This paper presents a text mining system for detecting admissions marked as positive for several diseases: Lung Cancer, Breast Cancer, Colon Cancer, Secondary Malignant Neoplasm of Respiratory and Digestive Organs, Multiple Myeloma and Malignant Plasma Cell Neoplasms, Pneumonia, and Pulmonary Embolism. We specifically examine the effect of linking multiple data sources on text classification performance. Support Vector Machine classifiers are built for eight data source combinations, and evaluated using the metrics of Precision, Recall and F-Score. Sub-sampling techniques are used to address unbalanced datasets of medical records. We use radiology reports as an initial data source and add other sources, such as pathology reports and patient and hospital admission data, in order to assess the research question regarding the value of multiple data sources. Statistical significance is measured using the Wilcoxon signed-rank test. A second set of experiments explores aspects of the system in greater depth, focusing on Lung Cancer. We explore the impact of feature selection; analyse the learning curve; examine the effect of restricting admissions to only those containing reports from all data sources; and examine the impact of reducing the sub-sampling. These experiments provide a better understanding of how best to apply text classification in the context of imbalanced data of variable completeness. Patient and hospital admission data contribute valuable information for detecting most of the diseases, significantly improving performance when added to radiology reports alone or to the combination of radiology and pathology reports. Overall, linking data sources significantly improved classification performance for all the diseases examined. However, there is no single approach that suits all scenarios; the choice of the
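
    The data-linking idea above can be sketched with scikit-learn's ColumnTransformer, giving each linked source its own text vectorizer before a single SVM. The admission rows below are hypothetical, and the paper's sub-sampling of the majority class would be applied to the rows before fitting.

      import pandas as pd
      from sklearn.compose import ColumnTransformer
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.pipeline import Pipeline
      from sklearn.svm import LinearSVC

      # Hypothetical linked records: one row per admission.
      df = pd.DataFrame({
          "radiology": ["spiculated mass in the left upper lobe",
                        "no acute cardiopulmonary abnormality"],
          "pathology": ["biopsy consistent with adenocarcinoma",
                        "benign reactive changes only"],
          "admission": ["age 71 ward oncology emergency yes",
                        "age 34 ward general emergency no"],
          "lung_cancer": [1, 0],
      })

      # One vectorizer per data source; their features are concatenated,
      # so adding or removing a source is a one-line change.
      features = ColumnTransformer([
          ("rad", TfidfVectorizer(), "radiology"),
          ("path", TfidfVectorizer(), "pathology"),
          ("adm", TfidfVectorizer(), "admission"),
      ])
      clf = Pipeline([("features", features), ("svm", LinearSVC())])
      clf.fit(df.drop(columns="lung_cancer"), df["lung_cancer"])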

  17. AuDis: an automatic CRF-enhanced disease normalization in biomedical text.

    PubMed

    Lee, Hsin-Chun; Hsu, Yi-Yu; Kao, Hung-Yu

    2016-01-01

    Diseases play central roles in many areas of biomedical research and healthcare. Consequently, aggregating disease knowledge and treatment research reports becomes an extremely critical issue, especially in rapidly growing knowledge bases (e.g. PubMed). We therefore developed a system, AuDis, for disease mention recognition and normalization in biomedical texts. Our system utilizes a second-order conditional random fields model. To optimize the results, we customize several post-processing steps, including abbreviation resolution, consistency improvement and stopword filtering. In the official evaluation of the CDR task at BioCreative V, AuDis obtained the best performance (F-score of 86.46%) among 40 runs (16 unique teams) on disease normalization in the DNER subtask. These results suggest that AuDis is a high-performance system for disease recognition and normalization from the biomedical literature. Database URL: http://ikmlab.csie.ncku.edu.tw/CDR2015/AuDis.html.
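
    The disease mention recognition step can be sketched with the third-party sklearn-crfsuite package. Note this is a first-order linear-chain CRF, not the second-order model used by AuDis; neighbouring-token features stand in (imperfectly) for the higher order, and the sentence and tags are hypothetical.

      import sklearn_crfsuite

      def token_features(sent, i):
          w = sent[i]
          return {
              "lower": w.lower(), "is_title": w.istitle(), "suffix3": w[-3:],
              # Context features approximate higher-order dependencies.
              "prev": sent[i - 1].lower() if i > 0 else "<bos>",
              "next": sent[i + 1].lower() if i < len(sent) - 1 else "<eos>",
          }

      sents = [["Crohn", "disease", "was", "ruled", "out", "."]]
      tags = [["B-Disease", "I-Disease", "O", "O", "O", "O"]]  # BIO labels
      X = [[token_features(s, i) for i in range(len(s))] for s in sents]

      crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                                 max_iterations=100)
      crf.fit(X, tags)
      print(crf.predict(X))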

  18. An automatic system to identify heart disease risk factors in clinical texts over time.

    PubMed

    Chen, Qingcai; Li, Haodi; Tang, Buzhou; Wang, Xiaolong; Liu, Xin; Liu, Zengjian; Liu, Shu; Wang, Weida; Deng, Qiwen; Zhu, Suisong; Chen, Yangxin; Wang, Jingfeng

    2015-12-01

    Despite recent progress in prediction and prevention, heart disease remains a leading cause of death. One preliminary step in heart disease prediction and prevention is risk factor identification. Many studies have been proposed to identify risk factors associated with heart disease; however, none have attempted to identify all risk factors. In 2014, the National Center of Informatics for Integrating Biology and the Bedside (i2b2) issued a clinical natural language processing (NLP) challenge that included a track (track 2) for identifying heart disease risk factors in clinical texts over time. This track aimed to identify medically relevant information related to heart disease risk and track its progression over sets of longitudinal patient medical records. Identification of tags and attributes associated with disease presence and progression, risk factors, and medications in the patient's medical history was required. Our participation led to the development of a hybrid pipeline system based on both machine learning-based and rule-based approaches. Evaluation using the challenge corpus revealed that our system achieved an F1-score of 92.68%, making it the top-ranked system (without additional annotations) of the 2014 i2b2 clinical NLP challenge.
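
    A toy sketch of the hybrid idea: high-precision hand-written rules capture explicit risk-factor mentions, and a machine-learned model (left untrained here) backs them up on implicit phrasing. The rule patterns and tag names are illustrative, not those of the actual system.

      import re

      # Hand-crafted, high-precision patterns (illustrative only).
      RULES = {
          "smoker": re.compile(r"\b(?:current|former)?\s*smoker\b", re.I),
          "diabetes": re.compile(r"\b(?:diabetes|hba1c)\b", re.I),
      }

      def rule_tags(text):
          return {tag for tag, rx in RULES.items() if rx.search(text)}

      def hybrid_tags(text, ml_model=None):
          tags = rule_tags(text)
          if ml_model is not None:  # ML back-off for implicit mentions
              tags |= set(ml_model.predict([text])[0])
          return tags

      print(hybrid_tags("Patient is a former smoker with diabetes."))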

  19. AuDis: an automatic CRF-enhanced disease normalization in biomedical text

    PubMed Central

    Lee, Hsin-Chun; Hsu, Yi-Yu; Kao, Hung-Yu

    2016-01-01

    Diseases play central roles in many areas of biomedical research and healthcare. Consequently, aggregating disease knowledge and treatment research reports becomes an extremely critical issue, especially in rapidly growing knowledge bases (e.g. PubMed). We therefore developed a system, AuDis, for disease mention recognition and normalization in biomedical texts. Our system utilizes a second-order conditional random fields model. To optimize the results, we customize several post-processing steps, including abbreviation resolution, consistency improvement and stopword filtering. In the official evaluation of the CDR task at BioCreative V, AuDis obtained the best performance (F-score of 86.46%) among 40 runs (16 unique teams) on disease normalization in the DNER subtask. These results suggest that AuDis is a high-performance system for disease recognition and normalization from the biomedical literature. Database URL: http://ikmlab.csie.ncku.edu.tw/CDR2015/AuDis.html PMID:27278815

  20. Automatism

    PubMed Central

    McCaldon, R. J.

    1964-01-01

    Individuals can carry out complex activity while in a state of impaired consciousness, a condition termed “automatism”. Consciousness must be considered from both an organic and a psychological aspect, because impairment of consciousness may occur in both ways. Automatism may be classified as normal (hypnosis), organic (temporal lobe epilepsy), psychogenic (dissociative fugue) or feigned. Often painstaking clinical investigation is necessary to clarify the diagnosis. There is legal precedent for assuming that all crimes must embody both consciousness and will. Jurists are loath to apply this principle without reservation, as this would necessitate acquittal and release of potentially dangerous individuals. However, with the sole exception of the defence of insanity, there is at present no legislation to prohibit release without further investigation of anyone acquitted of a crime on the grounds of “automatism”. PMID:14199824

  1. Classification, Characterization, and Automatic Detection of Volcanic Explosion Complexity using Infrasound

    NASA Astrophysics Data System (ADS)

    Fee, D.; Matoza, R. S.; Lopez, T. M.; Ruiz, M. C.; Gee, K.; Neilsen, T.

    2014-12-01

    Infrasound signals from volcanoes represent the acceleration of the atmosphere during an eruption and have traditionally been classified into two end members: 1) "explosions" consisting primarily of a high amplitude bi-polar pressure pulse that lasts a few to tens of seconds, and 2) "tremor" or "jetting" consisting of sustained, broadband infrasound lasting for minutes to hours. However, as our knowledge and recordings of volcanic eruptions have increased, significant infrasound signal diversity has been found. Here we focus on identifying and characterizing trends in volcano infrasound data to help better understand eruption processes. We explore infrasound signal metrics that may be used to quantitatively compare, classify, and identify explosive eruptive styles by systematic analysis of the data. We analyze infrasound data from short-to-medium duration explosive events recorded during recent infrasound deployments at Sakurajima Volcano, Japan; Karymsky Volcano, Kamchatka; and Tungurahua Volcano, Ecuador. Preliminary results demonstrate that a great variety of explosion styles and flow behaviors from these volcanoes can produce relatively similar bulk acoustic waveform properties, such as peak pressure and event duration, indicating that accurate classification of physical eruptive styles requires more advanced field studies, waveform analyses, and modeling. Next we evaluate the spectral and temporal properties of longer-duration tremor and jetting signals from large eruptions at Tungurahua Volcano; Redoubt Volcano, Alaska; Augustine Volcano, Alaska; and Nabro Volcano, Eritrea, in an effort to identify distinguishing infrasound features relatable to eruption features. We find that unique transient signals (such as repeated shocks) within sustained infrasound signals can provide critical information on the volcanic jet flow and exhibit a distinct acoustic signature to facilitate automatic detection. Automated detection and characterization of infrasound associated

  2. A Method for Automatic and Objective Scoring of Bradykinesia Using Orientation Sensors and Classification Algorithms.

    PubMed

    Martinez-Manzanera, O; Roosma, E; Beudel, M; Borgemeester, R W K; van Laar, T; Maurits, N M

    2016-05-01

    Correct assessment of bradykinesia is a key element in the diagnosis and monitoring of Parkinson's disease. Its evaluation is based on a careful assessment of symptoms, and it is quantified using rating scales, of which the Movement Disorders Society-Sponsored Revision of the Unified Parkinson's Disease Rating Scale (MDS-UPDRS) is the gold standard. Despite their importance, the bradykinesia-related items show low agreement between different evaluators. In this study, we designed a practical tool that provides an objective quantification of bradykinesia and evaluates all characteristics described in the MDS-UPDRS. Twenty-five patients with Parkinson's disease performed three of the five bradykinesia-related items of the MDS-UPDRS. Their movements were assessed by four evaluators and were recorded with a nine degrees-of-freedom sensor. Sensor fusion was employed to obtain a 3-D representation of movements. Based on the resulting signals, a set of features related to the characteristics described in the MDS-UPDRS was defined. Feature selection methods were employed to determine the most important features for quantifying bradykinesia. The selected features were used to train support vector machine classifiers to obtain an automatic score of each patient's movements. The best results were obtained when seven features were included in the classifiers. The classification errors for finger tapping, diadochokinesis and toe tapping were 15-16.5%, 9.3-9.8%, and 18.2-20.2% smaller than the average inter-rater scoring error, respectively. The introduction of objective scoring in the assessment of bradykinesia might eliminate intra-rater inconsistencies and inter-rater disagreements, and might improve the monitoring of movement disorders.

  3. Automatic approach to solve the morphological galaxy classification problem using the sparse representation technique and dictionary learning

    NASA Astrophysics Data System (ADS)

    Diaz-Hernandez, R.; Ortiz-Esquivel, A.; Peregrina-Barreto, H.; Altamirano-Robles, L.; Gonzalez-Bernal, J.

    2016-06-01

    The observation of celestial objects in the sky is a practice that helps astronomers understand how the Universe is structured. However, due to the large number of objects observed with modern telescopes, analysing them by hand is a difficult task. An important part of galaxy research is morphological structure classification based on the Hubble sequence. In this research, we present an approach to solve the morphological galaxy classification problem automatically by using the sparse representation technique and dictionary learning with K-SVD. For the tests in this work, we use a database of galaxies extracted from the Principal Galaxy Catalog (PGC) and the APM Equatorial Catalogue of Galaxies, obtaining a total of 2403 useful galaxies. In order to represent each galaxy frame, we propose to calculate a set of 20 features, such as Hu's invariant moments, galaxy nucleus eccentricity, Gabor galaxy ratio and some other features commonly used in galaxy classification. A feature relevance analysis was performed using Relief-f to determine the best features for classification tests with 2, 3, 4, 5, 6 and 7 galaxy classes, building signal vectors of different lengths from the most important features. For the classification task, we use a 20-run random cross-validation technique to evaluate classification accuracy with all signal sets, achieving a score of 82.27% for 2 galaxy classes and 44.27% for 7 galaxy classes.

  4. Continuous automatic classification of seismic signals of volcanic origin at Mt. Merapi, Java, Indonesia

    NASA Astrophysics Data System (ADS)

    Ohrnberger, Matthias

    2001-07-01

    Merapi volcano is one of the most active and dangerous volcanoes on Earth. Located in the central part of Java island (Indonesia), even a moderate eruption of Merapi poses a high risk to the densely populated surrounding area. Due to the close relationship between volcanic unrest and the occurrence of seismic events at Mt. Merapi, monitoring Merapi's seismicity plays an important role in recognizing major changes in volcanic activity. An automatic seismic event detection and classification system, capable of characterizing the actual seismic activity in near real-time, is an important tool that allows the scientists in charge to make immediate decisions during a volcanic crisis. In order to accomplish the task of detecting and classifying volcano-seismic signals automatically in continuous data streams, a pattern recognition approach has been used. It is based on the method of hidden Markov models (HMM), a technique which has proven to provide high recognition rates at high confidence levels in classification tasks of similar complexity (e.g. speech recognition). Any pattern recognition system relies on an appropriate representation of the input data in order to allow a reasonable class decision by means of a mathematical test function. Based on experience from seismological observatory practice, a parametrization scheme for the seismic waveform data is derived using robust seismological analysis techniques. The wavefield parameters are summarized into a real-valued feature vector per time step, and the time series of these feature vectors builds the basis for the HMM-based classification system. In order to make use of discrete hidden Markov model (DHMM) techniques, the feature vectors are further processed by applying a de-correlating and prewhitening transformation and additional vector quantization. The seismic wavefield is finally represented as a discrete symbol sequence with a finite alphabet. This sequence is subject to a maximum likelihood test
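
    A compressed sketch of HMM-based event classification with the third-party hmmlearn package: one HMM is fitted per event class, and an unseen sequence is assigned to the class whose model scores it highest. For brevity, Gaussian emission HMMs on continuous feature vectors replace the discrete HMMs over vector-quantized symbols described above, and the class names and data are synthetic.

      import numpy as np
      from hmmlearn import hmm

      def train_class_hmm(sequences, n_states=4):
          X = np.vstack(sequences)               # stacked feature vectors
          lengths = [len(s) for s in sequences]  # one entry per event
          model = hmm.GaussianHMM(n_components=n_states,
                                  covariance_type="diag", n_iter=50)
          model.fit(X, lengths)
          return model

      rng = np.random.default_rng(0)
      train = {"VTA": [rng.normal(0, 1, (60, 6)) for _ in range(5)],
               "MP": [rng.normal(2, 1, (60, 6)) for _ in range(5)]}
      models = {c: train_class_hmm(seqs) for c, seqs in train.items()}

      event = rng.normal(2, 1, (60, 6))          # unseen feature sequence
      print(max(models, key=lambda c: models[c].score(event)))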

  5. Automatic semantic classification of scientific literature according to the hallmarks of cancer.

    PubMed

    Baker, Simon; Silins, Ilona; Guo, Yufan; Ali, Imran; Högberg, Johan; Stenius, Ulla; Korhonen, Anna

    2016-02-01

    The hallmarks of cancer have become highly influential in cancer research. They reduce the complexity of cancer into 10 principles (e.g. resisting cell death and sustaining proliferative signaling) that explain the biological capabilities acquired during the development of human tumors. Since new research depends crucially on existing knowledge, technology for semantic classification of scientific literature according to the hallmarks of cancer could greatly support literature review, knowledge discovery and applications in cancer research. We present the first step toward the development of such technology. We introduce a corpus of 1499 PubMed abstracts annotated according to the scientific evidence they provide for the 10 currently known hallmarks of cancer. We use this corpus to train a system that classifies PubMed literature according to the hallmarks. The system uses supervised machine learning and rich features largely based on biomedical text mining. We report good performance in both intrinsic and extrinsic evaluations, demonstrating both the accuracy of the methodology and its potential in supporting practical cancer research. We discuss how this approach could be developed and applied further in the future. The corpus of hallmark-annotated PubMed abstracts and the software for classification are available at: http://www.cl.cam.ac.uk/∼sb895/HoC.html. simon.baker@cl.cam.ac.uk. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
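
    Hallmark annotation is naturally multi-label (an abstract can support several hallmarks), which can be sketched in scikit-learn with a one-vs-rest wrapper; the two toy abstracts and labels below are placeholders, and the real system's rich text-mining features are not reproduced.

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.multiclass import OneVsRestClassifier
      from sklearn.pipeline import Pipeline
      from sklearn.preprocessing import MultiLabelBinarizer
      from sklearn.svm import LinearSVC

      abstracts = ["tumor cells evade apoptosis via bcl-2 overexpression",
                   "vegf signalling drives sustained angiogenesis in the lesion"]
      hallmarks = [["resisting cell death"], ["inducing angiogenesis"]]

      mlb = MultiLabelBinarizer()
      Y = mlb.fit_transform(hallmarks)  # binary indicator matrix

      clf = Pipeline([("tfidf", TfidfVectorizer()),
                      ("ovr", OneVsRestClassifier(LinearSVC()))])
      clf.fit(abstracts, Y)
      print(mlb.inverse_transform(clf.predict(abstracts)))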

  6. AUTOMATIC CLASSIFICATION OF QUESTION TURNS IN SPONTANEOUS SPEECH USING LEXICAL AND PROSODIC EVIDENCE.

    PubMed

    Ananthakrishnan, Sankaranarayanan; Ghosh, Prasanta; Narayanan, Shrikanth

    2008-01-01

    The ability to identify speech acts reliably is desirable in any spoken language system that interacts with humans. Minimally, such a system should be capable of distinguishing between question-bearing turns and other types of utterances. However, this is a non-trivial task, since spontaneous speech tends to have incomplete syntactic, and even ungrammatical, structure and is characterized by disfluencies, repairs and other non-linguistic vocalizations that make simple rule-based pattern learning difficult. In this paper, we present a system for identifying question-bearing turns in spontaneous multi-party speech (ICSI Meeting Corpus) using lexical and prosodic evidence. On a balanced test set, our system achieves an accuracy of 71.9% for the binary question vs. non-question classification task. Further, we investigate the robustness of our proposed technique to uncertainty in the lexical feature stream (e.g. caused by speech recognition errors). Our experiments indicate that classification accuracy of the proposed method is robust to errors in the text stream, dropping only about 0.8% for every 10% increase in word error rate (WER).

  7. Automatic Detection, Segmentation and Classification of Retinal Horizontal Neurons in Large-scale 3D Confocal Imagery

    SciTech Connect

    Karakaya, Mahmut; Kerekes, Ryan A; Gleason, Shaun Scott; Martins, Rodrigo; Dyer, Michael

    2011-01-01

    Automatic analysis of neuronal structure from wide-field-of-view 3D image stacks of retinal neurons is essential for statistically characterizing neuronal abnormalities that may be causally related to neural malfunctions or may be early indicators for a variety of neuropathies. In this paper, we study classification of neuron fields in large-scale 3D confocal image stacks, a challenging neurobiological problem because of the low spatial resolution imagery and presence of intertwined dendrites from different neurons. We present a fully automated, four-step processing approach for neuron classification with respect to the morphological structure of their dendrites. In our approach, we first localize each individual soma in the image by using morphological operators and active contours. By using each soma position as a seed point, we automatically determine an appropriate threshold to segment dendrites of each neuron. We then use skeletonization and network analysis to generate the morphological structures of segmented dendrites, and shape-based features are extracted from network representations of each neuron to characterize the neuron. Based on qualitative results and quantitative comparisons, we show that we are able to automatically compute relevant features that clearly distinguish between normal and abnormal cases for postnatal day 6 (P6) horizontal neurons.

  8. Natural Language Processing Based Instrument for Classification of Free Text Medical Records

    PubMed Central

    2016-01-01

    According to the Ministry of Labor, Health and Social Affairs of Georgia, a new health management system is to be introduced in the near future. In this context, the problem arises of structuring and classifying documents containing the full history of medical services provided. The present work introduces an instrument for the classification of medical records written in Georgian; it is the first attempt at such classification of Georgian-language medical records. In total, 24,855 examination records were studied. The documents were classified into three main groups (ultrasonography, endoscopy, and X-ray) and 13 subgroups using two well-known methods: Support Vector Machine (SVM) and K-Nearest Neighbor (KNN). The results obtained demonstrated that both machine learning methods performed successfully, with SVM performing slightly better. In the process of classification, a "shrink" method based on feature selection was introduced and applied. At the first stage of classification, the results of the "shrink" case were better; however, at the second stage of classification into subclasses, 23% of all documents could not be linked to a single definite subclass (liver or biliary system) due to common features characterizing these subclasses. The overall results of the study were successful. PMID:27668260

  9. Effectiveness of Global Features for Automatic Medical Image Classification and Retrieval – the experiences of OHSU at ImageCLEFmed

    PubMed Central

    Kalpathy-Cramer, Jayashree; Hersh, William

    2008-01-01

    In 2006 and 2007, Oregon Health & Science University (OHSU) participated in the automatic image annotation task for medical images at ImageCLEF, an annual international benchmarking event that is part of the Cross Language Evaluation Forum (CLEF). The goal of the automatic annotation task was to classify 1000 test images based on the Image Retrieval in Medical Applications (IRMA) code, given a set of 10,000 training images. There were 116 distinct classes in 2006 and 2007. We evaluated the efficacy of a variety of primarily global features for this classification task, including features based on histograms, gray level correlation matrices and the gist technique. A multitude of classifiers, including k-nearest neighbors, two-level neural networks, support vector machines, and maximum likelihood classifiers, were evaluated. Our official error rate for the 1000 test images was 26% in 2006, using the flat classification structure. The error count in 2007 was 67.8, using the hierarchical classification error computation based on the IRMA code. Confusion matrices as well as clustering experiments were used to identify visually similar classes. The use of the IRMA code did not help us in the classification task, as the semantic hierarchy of the IRMA classes did not correspond well with the hierarchy based on clustering of image features that we used. Our most frequent misclassification errors were along the view axis. Subsequent experiments based on a two-stage classification system decreased our error rate to 19.8% for the 2006 dataset and our error count to 55.4 for the 2007 data. PMID:19884953

  10. Automatic Classification of High Resolution Satellite Imagery - a Case Study for Urban Areas in the Kingdom of Saudi Arabia

    NASA Astrophysics Data System (ADS)

    Maas, A.; Alrajhi, M.; Alobeid, A.; Heipke, C.

    2017-05-01

    Updating topographic geospatial databases is often performed based on current remotely sensed images. To automatically extract the object information (labels) from the images, supervised classifiers are employed. Decisions to be taken in this process concern the definition of the classes to be recognised, the features describing each class, and the training data necessary in the learning part of classification. With a view to large-scale topographic databases for fast-developing urban areas in the Kingdom of Saudi Arabia, we conducted a case study investigating the following two questions: (a) which set of features is best suited for the classification?; (b) what is the added value of height information, e.g. derived from stereo imagery? Using stereoscopic GeoEye and Ikonos satellite data, we investigate these two questions based on our research on label-tolerant classification using logistic regression and partly incorrect training data. We show that between five and ten features can be recommended to obtain a stable solution, that height information consistently yields an improved overall classification accuracy of about 5%, and that label noise can be successfully modelled and thus only marginally influences the classification results.

  11. Automatic classification of artifactual ICA-components for artifact removal in EEG signals.

    PubMed

    Winkler, Irene; Haufe, Stefan; Tangermann, Michael

    2011-08-02

    Artifacts contained in EEG recordings hamper both the visual interpretation by experts and the algorithmic processing and analysis (e.g. for Brain-Computer Interfaces (BCI) or for Mental State Monitoring). While hand-optimized selection of source components derived from Independent Component Analysis (ICA) to clean EEG data is widespread, the field could greatly profit from automated solutions based on machine learning methods. Existing ICA-based removal strategies depend on explicit recordings of an individual's artifacts or have not been shown to reliably identify muscle artifacts. We propose an automatic method for the classification of general artifactual source components. They are estimated by TDSEP, an ICA method that takes temporal correlations into account. The linear classifier is based on an optimized feature subset determined by a Linear Programming Machine (LPM). The subset is composed of features from the frequency, spatial and temporal domains. A subject-independent classifier was trained on 640 TDSEP components (reaction time (RT) study, n = 12) that were hand-labeled by experts as artifactual or brain sources, and tested on 1080 new components of RT data from the same study. Generalization was tested on new data from two studies (auditory Event Related Potential (ERP) paradigm, n = 18; motor imagery BCI paradigm, n = 80) that used data with different channel setups and from new subjects. Based on six features only, the optimized linear classifier performed on a level with the inter-expert disagreement (<10% Mean Squared Error (MSE)) on the RT data. On data of the auditory ERP study, the same pre-calculated classifier generalized well and achieved 15% MSE. On data of the motor imagery paradigm, we demonstrate that the discriminant information used for BCI is preserved when removing up to 60% of the most artifactual source components. We propose a universal and efficient classifier of ICA components for the subject-independent removal of

  12. Automatic MRI segmentation of para-pharyngeal fat pads using interactive visual feature space analysis for classification.

    PubMed

    Shahid, Muhammad Laiq Ur Rahman; Chitiboi, Teodora; Ivanovska, Tetyana; Molchanov, Vladimir; Völzke, Henry; Linsen, Lars

    2017-02-14

    Obstructive sleep apnea (OSA) is a public health problem. Detailed analysis of the para-pharyngeal fat pads can help us to understand the pathogenesis of OSA and may guide intervention for this sleeping disorder. A reliable and automatic para-pharyngeal fat pad segmentation technique plays a vital role in investigating larger databases to identify the anatomic risk factors for OSA. Our research aims to develop a context-based automatic segmentation algorithm to delineate the fat pads from magnetic resonance images in a population-based study. Our segmentation pipeline involves texture analysis, connected component analysis, object-based image analysis, and supervised classification using an interactive visual analysis tool to segregate fat pads from other structures automatically. We developed a fully automatic segmentation technique that does not need any user interaction to extract fat pads. Our algorithm is fast enough to be applied to population-based epidemiological studies that provide a large amount of data. We evaluated our approach qualitatively on thirty datasets and quantitatively against the ground truths of ten datasets, resulting in an average of approximately 78% detected volume fraction and a 79% Dice coefficient, which is within the range of the inter-observer variation of manual segmentation results. The suggested method produces sufficiently accurate results and has the potential to be applied to large datasets to understand the pathogenesis of the OSA syndrome.

  13. The Contribution of the Vaccine Adverse Event Text Mining System to the Classification of Possible Guillain-Barré Syndrome Reports

    PubMed Central

    Botsis, T.; Woo, E. J.; Ball, R.

    2013-01-01

    Background We previously demonstrated that a general purpose text mining system, the Vaccine adverse event Text Mining (VaeTM) system, could be used to automatically classify reports of anaphylaxis for post-marketing safety surveillance of vaccines. Objective To evaluate the ability of VaeTM to classify reports to the Vaccine Adverse Event Reporting System (VAERS) of possible Guillain-Barré Syndrome (GBS). Methods We used VaeTM to extract the key diagnostic features from the text of reports in VAERS. Then, we applied the Brighton Collaboration (BC) case definition for GBS, and an information retrieval strategy (i.e. the vector space model) to quantify the specific information that is included in the key features extracted by VaeTM and compared it with the encoded information that is already stored in VAERS as Medical Dictionary for Regulatory Activities (MedDRA) Preferred Terms (PTs). We also evaluated the contribution of the primary (diagnosis and cause of death) and secondary (second level diagnosis and symptoms) diagnostic VaeTM-based features to the total VaeTM-based information. Results MedDRA captured more information and better supported the classification of reports for GBS than VaeTM (AUC: 0.904 vs. 0.777); the lower performance of VaeTM is likely due to the lack of extraction by VaeTM of specific laboratory results that are included in the BC criteria for GBS. On the other hand, the VaeTM-based classification exhibited greater specificity than the MedDRA-based approach (94.96% vs. 87.65%). Most of the VaeTM-based information was contained in the secondary diagnostic features. Conclusion For GBS, clinical signs and symptoms alone are not sufficient to match MedDRA coding for purposes of case classification, but are preferred if specificity is the priority. PMID:23650490

  14. The contribution of the vaccine adverse event text mining system to the classification of possible Guillain-Barré syndrome reports.

    PubMed

    Botsis, T; Woo, E J; Ball, R

    2013-01-01

    We previously demonstrated that a general purpose text mining system, the Vaccine adverse event Text Mining (VaeTM) system, could be used to automatically classify reports of anaphylaxis for post-marketing safety surveillance of vaccines. To evaluate the ability of VaeTM to classify reports to the Vaccine Adverse Event Reporting System (VAERS) of possible Guillain-Barré Syndrome (GBS). We used VaeTM to extract the key diagnostic features from the text of reports in VAERS. Then, we applied the Brighton Collaboration (BC) case definition for GBS, and an information retrieval strategy (i.e. the vector space model) to quantify the specific information that is included in the key features extracted by VaeTM and compared it with the encoded information that is already stored in VAERS as Medical Dictionary for Regulatory Activities (MedDRA) Preferred Terms (PTs). We also evaluated the contribution of the primary (diagnosis and cause of death) and secondary (second level diagnosis and symptoms) diagnostic VaeTM-based features to the total VaeTM-based information. MedDRA captured more information and better supported the classification of reports for GBS than VaeTM (AUC: 0.904 vs. 0.777); the lower performance of VaeTM is likely due to the lack of extraction by VaeTM of specific laboratory results that are included in the BC criteria for GBS. On the other hand, the VaeTM-based classification exhibited greater specificity than the MedDRA-based approach (94.96% vs. 87.65%). Most of the VaeTM-based information was contained in the secondary diagnostic features. For GBS, clinical signs and symptoms alone are not sufficient to match MedDRA coding for purposes of case classification, but are preferred if specificity is the priority.

  15. Automatic classification of the sub-techniques (gears) used in cross-country ski skating employing a mobile phone.

    PubMed

    Stöggl, Thomas; Holst, Anders; Jonasson, Arndt; Andersson, Erik; Wunsch, Tobias; Norström, Christer; Holmberg, Hans-Christer

    2014-10-31

    The purpose of the current study was to develop and validate an automatic algorithm for classification of cross-country (XC) ski-skating gears (G) using Smartphone accelerometer data. Eleven XC skiers (seven men, four women) with regional-to-international levels of performance carried out roller skiing trials on a treadmill using fixed gears (G2left, G2right, G3, G4left, G4right) and a 950-m trial using different speeds and inclines, applying gears and sides as they normally would. Gear classification by the Smartphone (worn on the chest) was compared with classification based on video recordings. For machine learning, a collective database was compared to individual data. The Smartphone application identified the trials with fixed gears correctly in all cases. In the 950-m trial, participants executed 140 ± 22 cycles as assessed by video analysis, with the automatic Smartphone application giving a similar value. Based on collective data, gears were identified correctly 86.0% ± 8.9% of the time, a value that rose to 90.3% ± 4.1% (P < 0.01) with machine learning from individual data. Classification was most often incorrect during transitions between gears, especially to or from G3. Identification was most often correct for skiers who made relatively few transitions between gears. The accuracy of the automatic procedure for identifying G2left, G2right, G3, G4left and G4right was 96%, 90%, 81%, 88% and 94%, respectively. The algorithm identified gears correctly 100% of the time when a single gear was used and 90% of the time when different gears were employed during a variable protocol. This algorithm could be improved with respect to identification of transitions between gears or the side employed within a given gear.

  16. Automatic Classification of the Sub-Techniques (Gears) Used in Cross-Country Ski Skating Employing a Mobile Phone

    PubMed Central

    Stöggl, Thomas; Holst, Anders; Jonasson, Arndt; Andersson, Erik; Wunsch, Tobias; Norström, Christer; Holmberg, Hans-Christer

    2014-01-01

    The purpose of the current study was to develop and validate an automatic algorithm for classification of cross-country (XC) ski-skating gears (G) using Smartphone accelerometer data. Eleven XC skiers (seven men, four women) with regional-to-international levels of performance carried out roller skiing trials on a treadmill using fixed gears (G2left, G2right, G3, G4left, G4right) and a 950-m trial using different speeds and inclines, applying gears and sides as they normally would. Gear classification by the Smartphone (worn on the chest) was compared with classification based on video recordings. For machine learning, a collective database was compared to individual data. The Smartphone application identified the trials with fixed gears correctly in all cases. In the 950-m trial, participants executed 140 ± 22 cycles as assessed by video analysis, with the automatic Smartphone application giving a similar value. Based on collective data, gears were identified correctly 86.0% ± 8.9% of the time, a value that rose to 90.3% ± 4.1% (P < 0.01) with machine learning from individual data. Classification was most often incorrect during transitions between gears, especially to or from G3. Identification was most often correct for skiers who made relatively few transitions between gears. The accuracy of the automatic procedure for identifying G2left, G2right, G3, G4left and G4right was 96%, 90%, 81%, 88% and 94%, respectively. The algorithm identified gears correctly 100% of the time when a single gear was used and 90% of the time when different gears were employed during a variable protocol. This algorithm could be improved with respect to identification of transitions between gears or the side employed within a given gear. PMID:25365459

  17. Automatic classification of gait in children with early-onset ataxia or developmental coordination disorder and controls using inertial sensors.

    PubMed

    Mannini, Andrea; Martinez-Manzanera, Octavio; Lawerman, Tjitske F; Trojaniello, Diana; Croce, Ugo Della; Sival, Deborah A; Maurits, Natasha M; Sabatini, Angelo Maria

    2017-02-01

    Early-Onset Ataxia (EOA) and Developmental Coordination Disorder (DCD) are two conditions that affect coordination in children. Phenotypic identification of impaired coordination plays an important role in their diagnosis. Gait is one of the tests included in rating scales that can be used to assess motor coordination. A practical problem is that the resemblance between EOA and DCD symptoms can hamper their diagnosis. In this study we employed inertial sensors and a supervised classifier to obtain an automatic classification of the condition of participants. Data from shank- and waist-mounted inertial measurement units were used to extract features during gait in children diagnosed with EOA or DCD and age-matched controls. We defined a set of features from the recorded signals and obtained the optimal features for classification using a backward sequential approach. We correctly classified 80.0%, 85.7%, and 70.0% of the control, DCD and EOA children, respectively. Overall, the automatic classifier correctly classified 78.4% of the participants, which is slightly better than the phenotypic assessment of gait by two pediatric neurologists (73.0%). These results demonstrate that automatic classification employing signals from inertial sensors obtained during gait may be used as a support tool in the differential diagnosis of EOA and DCD. Furthermore, future extension of the classifier's test domains may help to further improve the diagnostic accuracy of pediatric coordination impairment. In this sense, this study may provide a first step towards incorporating a clinically objective and viable biomarker for identification of EOA and DCD.
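
    The backward sequential feature selection used above is available directly in scikit-learn; this sketch runs it over synthetic gait features with an SVM, standing in for the study's real sensor features and three diagnostic groups.

      import numpy as np
      from sklearn.feature_selection import SequentialFeatureSelector
      from sklearn.svm import SVC

      rng = np.random.default_rng(1)
      X = rng.normal(size=(90, 12))    # 12 synthetic gait features per child
      y = rng.integers(0, 3, size=90)  # 0 = control, 1 = DCD, 2 = EOA

      svm = SVC(kernel="linear")
      # Backward elimination: start from all features and drop the
      # least useful one at a time, judged by cross-validated accuracy.
      selector = SequentialFeatureSelector(svm, n_features_to_select=5,
                                           direction="backward", cv=3)
      selector.fit(X, y)
      print(selector.get_support())    # mask of the retained features
      svm.fit(selector.transform(X), y)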

  18. Back-and-Forth Methodology for Objective Voice Quality Assessment: From/to Expert Knowledge to/from Automatic Classification of Dysphonia

    NASA Astrophysics Data System (ADS)

    Fredouille, Corinne; Pouchoulin, Gilles; Ghio, Alain; Revis, Joana; Bonastre, Jean-François; Giovanni, Antoine

    2009-12-01

    This paper addresses voice disorder assessment. It proposes an original back-and-forth methodology involving an automatic classification system as well as the knowledge of human experts (machine learning experts, phoneticians, and pathologists). The goal of this methodology is to bring a better understanding of the acoustic phenomena related to dysphonia. The automatic system was validated on a dysphonic corpus (80 female voices), rated according to the GRBAS perceptual scale by an expert jury. Focusing first on the frequency domain, the classification system showed the relevance of the 0-3000 Hz frequency band for the classification task based on the GRBAS scale. An automatic phonemic analysis then underlined the significance of consonants, and more surprisingly of unvoiced consonants, for the same classification task. Submitted to the human experts, these observations led to a manual analysis of unvoiced plosives, which highlighted a lengthening of voice onset time (VOT) with dysphonia severity, validated by a preliminary statistical analysis.

  19. Automatic recognition of disorders, findings, pharmaceuticals and body structures from clinical text: an annotation and machine learning study.

    PubMed

    Skeppstedt, Maria; Kvist, Maria; Nilsson, Gunnar H; Dalianis, Hercules

    2014-06-01

    Automatic recognition of clinical entities in the narrative text of health records is useful for constructing applications for documentation of patient care, as well as for secondary usage in the form of medical knowledge extraction. There are a number of named entity recognition studies on English clinical text, but less work has been carried out on clinical text in other languages. This study was performed on Swedish health records, and focused on four entities that are highly relevant for constructing a patient overview and for medical hypothesis generation, namely the entities: Disorder, Finding, Pharmaceutical Drug and Body Structure. The study had two aims: to explore how well named entity recognition methods previously applied to English clinical text perform on similar texts written in Swedish; and to evaluate whether it is meaningful to divide the more general category Medical Problem, which has been used in a number of previous studies, into the two more granular entities, Disorder and Finding. Clinical notes from a Swedish internal medicine emergency unit were annotated for the four selected entity categories, and the inter-annotator agreement between two pairs of annotators was measured, resulting in an average F-score of 0.79 for Disorder, 0.66 for Finding, 0.90 for Pharmaceutical Drug and 0.80 for Body Structure. A subset of the developed corpus was thereafter used for finding suitable features for training a conditional random fields model. Finally, a new model was trained on this subset, using the best features and settings, and its ability to generalise to held-out data was evaluated. This final model obtained an F-score of 0.81 for Disorder, 0.69 for Finding, 0.88 for Pharmaceutical Drug, 0.85 for Body Structure and 0.78 for the combined category Disorder+Finding. The obtained results, which are in line with or slightly lower than those for similar studies on English clinical text, many of them conducted using a larger training data set, show that

  20. Cross-over between discrete and continuous protein structure space: insights into automatic classification and networks of protein structures.

    PubMed

    Pascual-García, Alberto; Abia, David; Ortiz, Angel R; Bastolla, Ugo

    2009-03-01

    Structural classifications of proteins assume the existence of the fold, which is an intrinsic equivalence class of protein domains. Here, we test in which conditions such an equivalence class is compatible with objective similarity measures. We base our analysis on the transitive property of the equivalence relationship, requiring that similarity of A with B and B with C implies that A and C are also similar. Divergent gene evolution leads us to expect that the transitive property should approximately hold. However, if protein domains are a combination of recurrent short polypeptide fragments, as proposed by several authors, then similarity of partial fragments may violate the transitive property, favouring the continuous view of the protein structure space. We propose a measure to quantify the violations of the transitive property when a clustering algorithm joins elements into clusters, and we find that such violations present a well-defined and detectable cross-over point, from an approximately transitive regime at high structure similarity to a regime with large transitivity violations and large differences in length at low similarity. We argue that protein structure space is discrete and hierarchic classification is justified up to this cross-over point, whereas at lower similarities the structure space is continuous and should be represented as a network. We have tested the qualitative behaviour of this measure, varying all the choices involved in the automatic classification procedure, i.e., domain decomposition, alignment algorithm, similarity score, and clustering algorithm, and we have found that this behaviour is quite robust. The final classification depends on the chosen algorithms. We used the values of the clustering coefficient and the transitivity violations to select the optimal choices among those that we tested. Interestingly, this criterion also favours the agreement between automatic and expert classifications. As a domain set, we
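
    The transitivity-violation measure lends itself to a direct sketch: given a pairwise similarity matrix and a threshold, count the triples where A~B and B~C hold but A~C fails. The implementation below is a naive O(n^3) illustration on random data, not the authors' procedure.

      import numpy as np

      def transitivity_violation_rate(sim, thr):
          """Fraction of triples (a, b, c) with sim(a,b) and sim(b,c)
          above thr but sim(a,c) below it."""
          n = sim.shape[0]
          linked = sim >= thr
          violations = total = 0
          for b in range(n):
              neigh = [a for a in range(n) if a != b and linked[a, b]]
              for i in range(len(neigh)):
                  for j in range(i + 1, len(neigh)):
                      total += 1
                      if not linked[neigh[i], neigh[j]]:
                          violations += 1
          return violations / total if total else 0.0

      rng = np.random.default_rng(2)
      S = rng.uniform(size=(30, 30))
      S = (S + S.T) / 2                  # symmetric toy similarities
      for thr in (0.9, 0.7, 0.5):        # sweep toward lower similarity
          print(thr, transitivity_violation_rate(S, thr))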

  1. Automatic Classification Using Supervised Learning in a Medical Document Filtering Application.

    ERIC Educational Resources Information Center

    Mostafa, J.; Lam, W.

    2000-01-01

    Presents a multilevel model of the information filtering process that permits document classification. Evaluates a document classification approach based on a supervised learning algorithm, measures the accuracy of the algorithm in a neural network that was trained to classify medical documents on cell biology, and discusses filtering…

  2. A Semi-Automatic Rule Set Building Method for Urban Land Cover Classification Based on Machine Learning and Human Knowledge

    NASA Astrophysics Data System (ADS)

    Gu, H. Y.; Li, H. T.; Liu, Z. Y.; Shao, C. Y.

    2017-09-01

    The classification rule set is important for land cover classification; it comprises features and decision rules. The selection of features and decisions is usually based on an iterative trial-and-error approach in GEOBIA, which is time-consuming and generalizes poorly. This study puts forward a rule set building method for land cover classification based on human knowledge and machine learning. Machine learning is used to build rule sets efficiently, overcoming the iterative trial-and-error approach; human knowledge addresses a shortcoming of existing machine learning methods, their insufficient use of prior knowledge, and improves the versatility of the rule sets. A two-step workflow is introduced: first, an initial rule set is built based on Random Forest and a CART decision tree; second, the initial rule set is analyzed and validated with human knowledge, using statistical confidence intervals to determine its thresholds. The test site is located in Potsdam City, and we utilised the TOP, DSM and ground truth data. The results show that the method can determine a rule set for land cover classification semi-automatically, and that stable features exist for the different land cover classes.
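
    The first step, deriving an initial rule set from a trained tree, can be sketched with scikit-learn: a CART tree is fitted and its thresholds are dumped as readable rules for an expert to inspect and adjust. The three features and the toy labelling rule are invented for illustration.

      import numpy as np
      from sklearn.tree import DecisionTreeClassifier, export_text

      rng = np.random.default_rng(3)
      X = rng.uniform(size=(200, 3))   # e.g. NDVI, height, brightness
      y = (X[:, 1] > 0.5).astype(int)  # toy rule: tall objects = class 1

      tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
      # Human-readable decision rules: the starting point for manual
      # validation (e.g. against statistical confidence intervals).
      print(export_text(tree, feature_names=["ndvi", "height", "brightness"]))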

  3. LTRsift: a graphical user interface for semi-automatic classification and postprocessing of de novo detected LTR retrotransposons

    PubMed Central

    2012-01-01

    Background Long terminal repeat (LTR) retrotransposons are a class of eukaryotic mobile elements characterized by a distinctive sequence similarity-based structure. Hence they are well suited for computational identification. Current software allows for a comprehensive genome-wide de novo detection of such elements. The obvious next step is the classification of newly detected candidates resulting in (super-)families. Such a de novo classification approach based on sequence-based clustering of transposon features has been proposed before, resulting in a preliminary assignment of candidates to families as a basis for subsequent manual refinement. However, such a classification workflow is typically split across a heterogeneous set of glue scripts and generic software (for example, spreadsheets), making it tedious for a human expert to inspect, curate and export the putative families produced by the workflow. Results We have developed LTRsift, an interactive graphical software tool for semi-automatic postprocessing of de novo predicted LTR retrotransposon annotations. Its user-friendly interface offers customizable filtering and classification functionality, displaying the putative candidate groups, their members and their internal structure in a hierarchical fashion. To ease manual work, it also supports graphical user interface-driven reassignment, splitting and further annotation of candidates. Export of grouped candidate sets in standard formats is possible. In two case studies, we demonstrate how LTRsift can be employed in the context of a genome-wide LTR retrotransposon survey effort. Conclusions LTRsift is a useful and convenient tool for semi-automated classification of newly detected LTR retrotransposons based on their internal features. Its efficient implementation allows for convenient and seamless filtering and classification in an integrated environment. Developed for life scientists, it is helpful in postprocessing and refining the output of software

  4. The Automatic Method of EEG State Classification by Using Self-Organizing Map

    NASA Astrophysics Data System (ADS)

    Tamura, Kazuhiro; Shimada, Takamasa; Saito, Yoichi

    In psychiatry, the sleep stage is one of the most important pieces of evidence for diagnosing mental disease. However, diagnosing sleep stages requires much labor and skill from doctors, and a quantitative and objective method is required for more accurate diagnosis. For this reason, an automatic diagnosis system must be developed. In this paper, we propose an automatic sleep stage diagnosis method using Self-Organizing Maps (SOM). The neighborhood learning of a SOM maps input data with similar features to nearby outputs, which is effective for automatically classifying complex input data in an understandable way. We applied an Elman-type feedback SOM to the EEG of both normal subjects and subjects suffering from disease. The spectrum of characteristic waves in the EEG of diseased subjects often differs from that of normal subjects, so it is difficult to classify the EEG of diseased subjects with rules derived for normal subjects. The Elman-type feedback SOM, by contrast, classifies the EEG using features contained in the data, and the classification rule is formed automatically, so even the EEG of diseased subjects can be classified automatically. Moreover, the Elman-type feedback SOM has context units, allowing sleep stages to be diagnosed in consideration of the contextual information in the EEG. Experimental results indicate that the proposed method achieves sleep stage judgments consistent with doctors' diagnoses.
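
    A minimal SOM sketch using the third-party minisom package, assuming hypothetical EEG spectral features; the Elman-type feedback and context units described above are not modelled here, only the basic neighborhood learning.

      import numpy as np
      from minisom import MiniSom

      rng = np.random.default_rng(4)
      epochs = rng.uniform(size=(500, 16))  # hypothetical spectral features per EEG epoch

      # An 8x8 map: neighborhood learning places similar epochs on
      # nearby units, which can then be labelled by sleep stage.
      som = MiniSom(8, 8, 16, sigma=1.5, learning_rate=0.5, random_seed=4)
      som.train_random(epochs, 5000)
      print(som.winner(epochs[0]))          # best-matching unit of one epoch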

  5. Impact of the accuracy of automatic segmentation of cell nuclei clusters on classification of thyroid follicular lesions.

    PubMed

    Jung, Chanho; Kim, Changick

    2014-08-01

    Automatic segmentation of cell nuclei clusters is a key building block in systems for quantitative analysis of microscopy cell images. For that reason, it has received a great attention over the last decade, and diverse automatic approaches to segment clustered nuclei with varying levels of performance under different test conditions have been proposed in literature. To the best of our knowledge, however, so far there is no comparative study on the methods. This study is a first attempt to fill this research gap. More precisely, the purpose of this study is to present an objective performance comparison of existing state-of-the-art segmentation methods. Particularly, the impact of their accuracy on classification of thyroid follicular lesions is also investigated "quantitatively" under the same experimental condition, to evaluate the applicability of the methods. Thirteen different segmentation approaches are compared in terms of not only errors in nuclei segmentation and delineation, but also their impact on the performance of system to classify thyroid follicular lesions using different metrics (e.g., diagnostic accuracy, sensitivity, specificity, etc.). Extensive experiments have been conducted on a total of 204 digitized thyroid biopsy specimens. Our study demonstrates that significant diagnostic errors can be avoided using more advanced segmentation approaches. We believe that this comprehensive comparative study serves as a reference point and guide for developers and practitioners in choosing an appropriate automatic segmentation technique adopted for building automated systems for specifically classifying follicular thyroid lesions. © 2014 International Society for Advancement of Cytometry.

  7. Comparative analysis of different implementations of a parallel algorithm for automatic target detection and classification of hyperspectral images

    NASA Astrophysics Data System (ADS)

    Paz, Abel; Plaza, Antonio; Plaza, Javier

    2009-08-01

    Automatic target detection in hyperspectral images is a task that has attracted a lot of attention recently. In the last few years, several algorithms have been developed for this purpose, including the well-known RX algorithm for anomaly detection, and the automatic target detection and classification algorithm (ATDCA), which uses an orthogonal subspace projection (OSP) approach to extract a set of spectrally distinct targets automatically from the input hyperspectral data. Depending on the complexity and dimensionality of the analyzed image scene, the target/anomaly detection process may be computationally very expensive, a fact that limits the possibility of utilizing this process in time-critical applications. In this paper, we develop computationally efficient parallel versions of both the RX and ATDCA algorithms for near real-time exploitation. In the case of ATDCA, we use several distance metrics in addition to the OSP approach. The parallel versions are quantitatively compared in terms of target detection accuracy, using hyperspectral data collected by NASA's Airborne Visible Infra-Red Imaging Spectrometer (AVIRIS) over the World Trade Center in New York, five days after the terrorist attack of September 11th, 2001, and also in terms of parallel performance, using a massively parallel Beowulf cluster available at NASA's Goddard Space Flight Center in Maryland.
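
    The RX anomaly detector referred to above is, at its core, a Mahalanobis-distance test of each pixel spectrum against the background statistics. A minimal serial NumPy sketch follows (global background estimate; the paper's parallel implementations and the ATDCA algorithm are not reproduced).

    ```python
    import numpy as np

    def rx_anomaly_scores(cube):
        """Global RX detector: Mahalanobis distance of each pixel spectrum
        from the scene-wide background mean/covariance.
        cube: (rows, cols, bands) hyperspectral image."""
        rows, cols, bands = cube.shape
        X = cube.reshape(-1, bands).astype(float)
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        cov_inv = np.linalg.pinv(cov)      # pseudo-inverse for stability
        diff = X - mu
        scores = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
        return scores.reshape(rows, cols)

    # Pixels with the highest scores are flagged as anomalies/targets.
    ```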

  8. Triplex transfer learning: exploiting both shared and distinct concepts for text classification.

    PubMed

    Zhuang, Fuzhen; Luo, Ping; Du, Changying; He, Qing; Shi, Zhongzhi; Xiong, Hui

    2014-07-01

    Transfer learning focuses on learning scenarios in which the test data from target domains and the training data from source domains are drawn from similar but different distributions with respect to the raw features. Along this line, some recent studies revealed that high-level concepts, such as word clusters, can help model the differences between data distributions and are thus more appropriate for classification. In other words, these methods assume that all the data domains have the same set of shared concepts, which are used as the bridge for knowledge transfer. However, in addition to these shared concepts, each domain may have its own distinct concepts. In light of this, we systematically analyze the high-level concepts and propose a general transfer learning framework based on nonnegative matrix trifactorization, which makes it possible to explore both shared and distinct concepts among all the domains simultaneously. Since this model provides more flexibility in fitting the data, it can lead to better classification accuracy. Moreover, we propose to regularize the manifold structure in the target domains to improve prediction performance. To solve the proposed optimization problem, we also develop an iterative algorithm and theoretically analyze its convergence properties. Finally, extensive experiments show that the proposed model can outperform the baseline methods by a significant margin. In particular, we show that our method works much better on the more challenging tasks in which there are distinct concepts in the data.
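
    For readers unfamiliar with nonnegative matrix tri-factorization, a minimal sketch of the plain (unregularized) decomposition X ≈ FSGᵀ with multiplicative updates follows; this is a generic illustration of the factorization underlying such frameworks, not the paper's regularized transfer model.

    ```python
    import numpy as np

    def nmf_trifactorize(X, k1, k2, n_iter=200, eps=1e-9):
        """Plain nonnegative matrix tri-factorization X ~ F @ S @ G.T
        via multiplicative updates for the squared-error objective.
        Generic sketch only, not the paper's regularized transfer model."""
        m, n = X.shape
        rng = np.random.default_rng(0)
        F = rng.random((m, k1))   # row-cluster (e.g., document) loadings
        S = rng.random((k1, k2))  # concept association matrix
        G = rng.random((n, k2))   # column-cluster (e.g., word) loadings
        for _ in range(n_iter):
            F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
            S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
            G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        return F, S, G
    ```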

  9. ARC: automated resource classifier for agglomerative functional classification of prokaryotic proteins using annotation texts.

    PubMed

    Gnanamani, Muthiah; Kumar, Naveen; Ramachandran, Srinivasan

    2007-08-01

    Functional classification of proteins is central to comparative genomics. The need for algorithms tuned to enable integrative interpretation of analytical data is felt globally, and the availability of a general, automated software tool with built-in flexibility will significantly aid this activity. We have prepared ARC (Automated Resource Classifier), an open source software tool meeting the user requirements of flexibility. The default classification scheme, based on keyword matching, is agglomerative and directs entries into any of 7 basic non-overlapping functional classes: Cell wall, Cell membrane and Transporters (C); Cell division (D); Information (I); Translocation (L); Metabolism (M); Stress (R); Signal and communication (S); and 2 ancillary classes: Others (O) and Hypothetical (H). The keyword library of ARC was built serially by first drawing keywords from Bacillus subtilis and Escherichia coli K12. In subsequent steps, this library was further enriched by collecting terms from the archaeal representative Archaeoglobus fulgidus, Gene Ontology, and Gene Symbols. ARC is 94.04% successful on 675,663 annotated proteins from 348 prokaryotes. Three examples are provided to illuminate current perspectives on mycobacterial physiology and the costs of proteins in 333 prokaryotes. ARC is available at http://arc.igib.res.in.
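
    A keyword-match classification scheme of this kind reduces to counting library hits per class. The sketch below uses a tiny, purely hypothetical keyword library; ARC's real library was assembled from B. subtilis, E. coli K12, A. fulgidus, Gene Ontology terms and gene symbols.

    ```python
    import re

    # Hypothetical mini keyword library, for illustration only.
    KEYWORDS = {
        "M": ["kinase", "dehydrogenase", "synthase", "reductase"],
        "C": ["transporter", "permease", "membrane"],
        "I": ["polymerase", "ribosomal", "helicase"],
        "H": ["hypothetical", "uncharacterized"],
    }

    def classify_annotation(text):
        """Assign the class whose keywords match the annotation most often;
        fall back to Others (O) when nothing matches."""
        text = text.lower()
        scores = {c: sum(len(re.findall(k, text)) for k in kws)
                  for c, kws in KEYWORDS.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else "O"

    print(classify_annotation("NADH dehydrogenase subunit"))  # -> "M"
    ```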

  10. Automatic Classification of coarse density LiDAR data in urban area

    NASA Astrophysics Data System (ADS)

    Badawy, H. M.; Moussa, A.; El-Sheimy, N.

    2014-06-01

    The classification of different objects in urban areas using airborne LIDAR point clouds is a challenging problem, especially with low-density data. The problem is even more complicated if RGB information is not available with the point clouds. The aim of this paper is to present a framework for the classification of low-density LIDAR data in an urban area, with the objective of identifying buildings, vehicles, trees and roads without the use of RGB information. The approach is based on several steps: extraction of above-ground objects, classification using PCA, computation of the NDSM, and intensity analysis, for which a correction strategy was developed. The airborne LIDAR data used to test the research framework are of low density (1.41 pts/m2) and were taken over an urban area in San Diego, California, USA. The results showed that the proposed framework is efficient and robust for the classification of objects.

  11. Automatic classification of clouds on Meteosat imagery - Application to high-level clouds

    NASA Technical Reports Server (NTRS)

    Desbois, M.; Seze, G.; Szejwach, G.

    1982-01-01

    A statistical classification method based on clustering on three-dimensional histograms is applied to the three channels of the Meteosat imagery. The results of this classification are studied for different cloud cover cases over tropical regions. For high-level cloud classes, it is shown that the bidimensional IR-water vapor histogram allows one to deduce the cloud top temperature even for semi-transparent clouds.

  12. Automatic detection of clustered microcalcifications in digital mammograms based on wavelet features and neural network classification

    NASA Astrophysics Data System (ADS)

    Yu, Songyang; Guan, Ling; Brown, Stephen

    1998-06-01

    The appearance of clustered microcalcifications in mammogram films is one of the important early signs of breast cancer. This paper presents a new image processing system for the automatic detection of clustered microcalcifications in digitized mammogram films. The detection method uses wavelet features and a feed-forward neural network to find possible microcalcification pixels, and a set of features to locate individual microcalcifications.
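
    As a rough illustration of wavelet-based detection features, the following sketch computes per-patch sub-band energies with the PyWavelets library (an assumption; the paper's exact wavelet filters and feature definitions are not reproduced).

    ```python
    import numpy as np
    import pywt

    def patch_wavelet_features(patch, wavelet="db2", level=2):
        """Energy of each wavelet detail sub-band of a small image patch,
        used as a feature vector for a pixel-level classifier (a generic
        sketch, not the paper's exact feature set)."""
        coeffs = pywt.wavedec2(patch.astype(float), wavelet, level=level)
        feats = []
        for details in coeffs[1:]:    # (horizontal, vertical, diagonal) per level
            feats.extend(np.mean(np.square(band)) for band in details)
        return np.array(feats)        # 3 * level features per patch

    # Sliding small patches over the mammogram and feeding these vectors to
    # a feed-forward network yields a per-pixel microcalcification score.
    ```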

  13. Automatic Classification of Question & Answer Discourse Segments from Teacher's Speech in Classrooms

    ERIC Educational Resources Information Center

    Blanchard, Nathaniel; D'Mello, Sidney; Olney, Andrew M.; Nystrand, Martin

    2015-01-01

    Question-answer (Q&A) is fundamental for dialogic instruction, an important pedagogical technique based on the free exchange of ideas and open-ended discussion. Automatically detecting Q&A is key to providing teachers with feedback on appropriate use of dialogic instructional strategies. Accordingly, this paper studies the…

  14. Convolutional neural networks for an automatic classification of prostate tissue slides with high-grade Gleason score

    NASA Astrophysics Data System (ADS)

    Jiménez del Toro, Oscar; Atzori, Manfredo; Otálora, Sebastian; Andersson, Mats; Eurén, Kristian; Hedlund, Martin; Rönnquist, Peter; Müller, Henning

    2017-03-01

    The Gleason grading system was developed for assessing prostate histopathology slides. It is correlated with the outcome and incidence of relapse in prostate cancer. Although this grading is part of a standard protocol performed by pathologists, visual inspection of whole slide images (WSIs) has an inherent subjectivity when evaluated by different pathologists. Computer-aided pathology has been proposed to generate an objective and reproducible assessment that can help pathologists in their evaluation of new tissue samples. Deep convolutional neural networks are a promising approach for the automatic classification of histopathology images and can hierarchically learn subtle visual features from the data. However, a large number of manual annotations from pathologists are commonly required to obtain sufficient statistical generalization when training new models that can evaluate the large amounts of pathology data generated daily. A fully automatic approach that detects prostatectomy WSIs with high-grade Gleason score is proposed. We evaluate the performance of various deep learning architectures, training them with patches extracted from automatically generated regions-of-interest rather than from manually segmented ones. Relevant parameters for training the deep learning model, such as the size and number of patches and the inclusion or not of data augmentation, are compared between the tested deep learning architectures. 235 prostate tissue WSIs with their pathology reports from the publicly available TCGA data set were used. An accuracy of 78% was obtained on a balanced set of 46 unseen test images with different Gleason grades in a 2-class decision: high vs. low Gleason grade. Grades 7-8, which represent the boundary decision of the proposed task, were particularly well classified. The method is scalable to larger data sets, with straightforward re-training of the model to include data from multiple sources, scanners and acquisition techniques.

  15. Enhancing automatic classification of hepatocellular carcinoma images through image masking, tissue changes and trabecular features

    PubMed Central

    Aziz, Maulana Abdul; Kanazawa, Hiroshi; Murakami, Yuri; Kimura, Fumikazu; Yamaguchi, Masahiro; Kiyuna, Tomoharu; Yamashita, Yoshiko; Saito, Akira; Ishikawa, Masahiro; Kobayashi, Naoki; Abe, Tokiya; Hashiguchi, Akinori; Sakamoto, Michiie

    2015-01-01

    Background: Recent breakthroughs in computer vision and digital microscopy have prompted the application of such technologies in cancer diagnosis, especially in histopathological image analysis. Earlier, an attempt to classify hepatocellular carcinoma images based on nuclear and structural features was carried out on a set of surgically resected samples. Here, we propose methods to enhance the process and improve the classification performance. Methods: First, we segmented the histological components of the liver tissues and generated several masked images. By utilizing the masked images, a set of new features was introduced, producing three feature sets consisting of nuclei, trabecular and tissue-change features. Furthermore, we extended the classification process by using biopsy samples in addition to the surgical samples. Results: Experiments using a support vector machine (SVM) classifier with combinations of features and sample types showed that the proposed methods improve the classification rate in HCC detection by about 1-3%. Moreover, the detection rate for low-grade cancers increased when the new features were included in the classification process, although the rate worsened in the case of undifferentiated tumors. Conclusions: The masking process increased the reliability of the extracted nuclei features. The addition of new features improved the system, especially for early HCC detection. Likewise, the combination of surgical and biopsy samples as training data could also improve the classification rates. Therefore, the methods will extend the support available to pathologists in HCC diagnosis. PMID:26110093

  16. Automatic Cataract Classification based on Ultrasound Technique Using Machine Learning: A comparative Study

    NASA Astrophysics Data System (ADS)

    Caxinha, Miguel; Velte, Elena; Santos, Mário; Perdigão, Fernando; Amaro, João; Gomes, Marco; Santos, Jaime

    This paper addresses the use of a computer-aided diagnosis (CAD) system for cataract classification based on the ultrasound technique. Ultrasound A-scan signals were acquired from 220 porcine lenses, and B-mode and Nakagami images were constructed. Ninety-seven parameters were extracted from acoustical, spectral and image textural analyses and were subjected to feature selection by Principal Component Analysis (PCA). Bayes, K-Nearest Neighbors (KNN), Fisher Linear Discriminant (FLD) and Support Vector Machine (SVM) classifiers were tested. The classification of healthy versus cataractous lenses shows good performance for all four classifiers (F-measure ≥92.68%), with SVM showing the highest performance (90.62%) for initial versus severe cataract classification.
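
    The PCA-then-classifier workflow described here maps directly onto a standard scikit-learn pipeline; a minimal sketch with synthetic stand-in data follows (the dimensions mirror the 220 lenses and 97 parameters, but the values and hyperparameters are illustrative).

    ```python
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    # Hypothetical data: 220 lenses x 97 acoustical/spectral/textural features.
    X = np.random.rand(220, 97)
    y = np.random.randint(0, 2, size=220)   # 0 = healthy, 1 = cataractous

    # Standardize, reduce dimensionality with PCA, then classify with an SVM.
    model = make_pipeline(StandardScaler(), PCA(n_components=10), SVC(kernel="rbf"))
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print("F1: %.3f +/- %.3f" % (scores.mean(), scores.std()))
    ```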

  17. Automatic classification of the interferential tear film lipid layer using colour texture analysis.

    PubMed

    Remeseiro, B; Penas, M; Barreira, N; Mosquera, A; Novo, J; García-Resúa, C

    2013-07-01

    The tear film lipid layer is heterogeneous among the population. Its classification depends on its thickness and can be carried out using the interference pattern categories proposed by Guillon. This paper presents an exhaustive study of the characterisation of the interference phenomena as a texture pattern, using different feature extraction methods in different colour spaces. These methods are first analysed individually and then combined to achieve the best possible results. The principal component analysis (PCA) technique has also been tested to reduce the dimensionality of the feature vectors. The proposed methodologies have been tested on a dataset composed of 105 images from healthy subjects, with a classification rate of over 95% in some cases.

  18. Automatic classification of thermal patterns in diabetic foot based on morphological pattern spectrum

    NASA Astrophysics Data System (ADS)

    Hernandez-Contreras, D.; Peregrina-Barreto, H.; Rangel-Magdaleno, J.; Ramirez-Cortes, J.; Renero-Carrillo, F.

    2015-11-01

    This paper presents a novel approach to characterizing and identifying temperature patterns in thermographic images of the plantar surface of the human foot, in support of early diagnosis and follow-up of diabetic patients. Composite feature vectors based on the 3D morphological pattern spectrum (pecstrum) and relative position allow the system to quantitatively characterize and discriminate between non-diabetic (control) and diabetic (DM) groups. Non-linear classification using neural networks is used for that purpose. A classification rate of 94.33% on average was obtained with the composite feature extraction process proposed in this paper. Performance evaluation and the obtained results are presented.
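
    The morphological pattern spectrum (pecstrum) is the size distribution obtained from a granulometry, i.e., a series of openings with structuring elements of increasing size. A minimal 2-D grayscale sketch with scikit-image follows; the paper uses a 3-D variant combined with relative-position features.

    ```python
    import numpy as np
    from skimage.morphology import opening, disk

    def pattern_spectrum(image, max_radius=10):
        """Morphological pattern spectrum (granulometry): the amount of
        image 'mass' removed by openings with structuring elements of
        increasing radius. 2-D grayscale sketch; the paper uses a 3-D variant."""
        img = image.astype(float)
        volumes = [img.sum()]
        for r in range(1, max_radius + 1):
            volumes.append(opening(img, disk(r)).sum())
        # Size distribution: mass removed between consecutive radii.
        return -np.diff(np.array(volumes))

    # The normalized spectrum serves as a shape/texture feature vector
    # for the neural-network classifier.
    ```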

  19. Automatic migraine classification via feature selection committee and machine learning techniques over imaging and questionnaire data.

    PubMed

    Garcia-Chimeno, Yolanda; Garcia-Zapirain, Begonya; Gomez-Beldarrain, Marian; Fernandez-Ruanova, Begonya; Garcia-Monco, Juan Carlos

    2017-04-13

    Feature selection methods are commonly used to identify subsets of relevant features to facilitate the construction of models for classification, yet little is known about how feature selection methods perform on diffusion tensor images (DTIs). In this study, feature selection and machine learning classification methods were tested for the purpose of automating the diagnosis of migraine using both DTIs and questionnaire answers related to emotion and cognition, factors that influence pain perception. We selected 52 adult subjects for the study, divided into three groups: a control group (15), subjects with sporadic migraine (19) and subjects with chronic migraine and medication overuse (18). These subjects underwent diffusion tensor magnetic resonance imaging to assess the white matter pathway integrity of the regions of interest involved in pain and emotion; the tests also gathered data about the pathology. The DTI images and test results were then fed into feature selection algorithms (Gradient Tree Boosting, L1-based, Random Forest and Univariate) to reduce the features of the first dataset, and into classification algorithms (SVM (Support Vector Machine), Boosting (AdaBoost) and Naive Bayes) to classify the migraine group. Moreover, we implemented a committee method based on the feature selection algorithms to improve classification accuracy. When classifying the migraine group, the greatest improvements in accuracy were made using the proposed committee-based feature selection method. Using this approach, the accuracy of classification into three types improved from 67 to 93% with the Naive Bayes classifier, from 90 to 95% with the support vector machine classifier, and from 93 to 94% with boosting. The features determined to be most useful for classification were related to pain, analgesics and the left uncinate fasciculus (a tract connected with pain and emotions). The proposed feature selection committee method improved the performance of migraine diagnosis.
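
    A feature-selection committee of the kind described can be sketched as majority voting over the support masks returned by several selectors; the snippet below mirrors the four selector families named above (gradient tree boosting, L1-based, random forest, univariate), though the paper's exact voting rule may differ.

    ```python
    import numpy as np
    from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    def committee_feature_mask(X, y, min_votes=2):
        """Keep features chosen by at least `min_votes` of several selectors
        (a sketch of a feature-selection committee, not the paper's exact scheme)."""
        selectors = [
            SelectFromModel(GradientBoostingClassifier(n_estimators=50)),
            SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear")),
            SelectFromModel(RandomForestClassifier(n_estimators=100)),
            SelectKBest(f_classif, k=min(20, X.shape[1])),
        ]
        votes = np.zeros(X.shape[1], dtype=int)
        for sel in selectors:
            votes += sel.fit(X, y).get_support().astype(int)
        return votes >= min_votes
    ```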

  20. Automatic segmentation and classification of seven-segment display digits on auroral images

    NASA Astrophysics Data System (ADS)

    Savolainen, Tuomas; Whiter, Daniel Keith; Partamies, Noora

    2016-07-01

    In this paper we describe a new and fully automatic method for segmenting and classifying digits in seven-segment displays. The method is applied to a dataset consisting of about 7 million auroral all-sky images taken during the time period of 1973-1997 at camera stations centred around Sodankylä observatory in northern Finland. In each image there is a clock display for the date and time together with the reflection of the whole night sky through a spherical mirror. The digitised film images of the night sky contain valuable scientific information but are impractical to use without an automatic method for extracting the date-time from the display. We describe the implementation and the results of such a method in detail in this paper.

  1. Automatic Modulation Classification of Common Communication and Pulse Compression Radar Waveforms using Cyclic Features

    DTIC Science & Technology

    2013-03-01

    ...from estimated duty cycle, cyclic spectral correlation, and cyclic cumulants. The modulations considered in this research are BPSK, QPSK, 16-QAM, 64-QAM... spectrum sensing research, automatic modulation recognition has emerged as an important process in cognitive spectrum management and EW applications

  2. Automatic GPR image classification using a Support Vector Machine Pre-screener with Hidden Markov Model confirmation

    NASA Astrophysics Data System (ADS)

    Williams, R. M.; Ray, L. E.

    2012-12-01

    This paper presents methods to automatically classify ground penetrating radar (GPR) images of crevasses on ice sheets for use with a completely autonomous robotic system. We use a combination of support vector machines (SVM) and hidden Markov models (HMM) with appropriate un-biased processing that is suitable for real-time analysis and detection. We tested and evaluated three processing schemes on 96 examples of Antarctic GPR imagery from 2010 and 104 examples of Greenland imagery from 2011, collected by our robot and a Pisten Bully tractor. The Antarctic and Greenland data were collected in the shear zone near McMurdo Station and between Thule Air Base and Summit Station, respectively. Using a modified cross-validation technique, we correctly classified 86 of the Antarctic examples and 90 of the Greenland examples with a radial basis kernel SVM trained and evaluated on down-sampled and texture-mapped GPR images of crevasses, compared to a 60% classification rate using raw data. In order to reduce false positives, we use the SVM classification results as pre-screener flags that mark locations in the GPR files to evaluate with two Gaussian HMMs, and evaluate our results with a similar modified cross-validation technique. The combined SVM pre-screener/HMM confirmation method retains all the correct classifications made by the SVM and reduces the false positive rate to 4%. This method also reduces the computational burden in classifying GPR traces because the HMM is only evaluated on select pre-screened traces. Our experiments demonstrate the promise, robustness and reliability of real-time crevasse detection and classification with robotic GPR surveys.
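
    The two-stage pre-screener/confirmation design can be sketched with scikit-learn and the hmmlearn package (both assumptions): an SVM flags candidate windows, and two Gaussian HMMs arbitrate each flagged trace sequence. Data shapes and parameters are illustrative.

    ```python
    import numpy as np
    from sklearn.svm import SVC
    from hmmlearn.hmm import GaussianHMM

    # Hypothetical training data: feature vectors for GPR image windows.
    X_train = np.random.rand(200, 64)
    y_train = np.random.randint(0, 2, 200)       # 1 = crevasse, 0 = clear

    # Stage 1: RBF-kernel SVM pre-screener flags candidate windows.
    prescreener = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

    # Stage 2: Gaussian HMMs trained on trace sequences from each class;
    # a flagged window is confirmed only if the crevasse HMM explains its
    # trace sequence better than the background HMM.
    hmm_crevasse = GaussianHMM(n_components=3).fit(np.random.rand(500, 8))
    hmm_background = GaussianHMM(n_components=3).fit(np.random.rand(500, 8))

    def classify(window_features, trace_seq):
        if prescreener.predict(window_features.reshape(1, -1))[0] == 0:
            return 0                              # pre-screener rejects
        return int(hmm_crevasse.score(trace_seq) > hmm_background.score(trace_seq))
    ```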

  3. Automatic classification of athletes with residual functional deficits following concussion by means of EEG signal using support vector machine.

    PubMed

    Cao, Cheng; Tutwiler, Richard Laurence; Slobounov, Semyon

    2008-08-01

    There is a growing body of knowledge indicating long-lasting residual electroencephalography (EEG) abnormalities in concussed athletes that may persist up to 10 years post-injury. Most often, these abnormalities are initially overlooked when using traditional concussion assessment tools. Accordingly, premature return to sport participation may lead to recurrent episodes of concussion with more severe consequences. Sixty-one athletes at high risk for concussion (i.e., collegiate rugby and football players) were recruited and underwent EEG baseline assessment. Thirty of these athletes suffered a concussion and were retested at day 30 post-injury. A number of task-related EEG recordings were conducted. A novel classification algorithm, the support vector machine (SVM), was applied as a classifier to identify residual functional abnormalities in athletes suffering from concussion using a multichannel EEG data set. The total accuracy of the classifier using the 10 features was 77.1%. The classifier has a high sensitivity of 96.7% (linear SVM), 80.0% (nonlinear SVM), and a relatively lower but acceptable selectivity of 69.1% (linear SVM) and 75.0% (nonlinear SVM). The major findings of this report are as follows: 1) discriminative features were observed in the theta, alpha, and beta frequency bands; 2) the minimal redundancy relevance method was identified as superior to the univariate t-test method in selecting features for the model calculation; 3) the EEG features selected for the classification model are linked to temporal and occipital areas; and 4) postural parameters influence the EEG data set and can be used as discriminative features for the classification model. Overall, this report provides sufficient evidence that the 10 EEG features selected for final analysis and SVM may potentially be used in clinical practice for automatic classification of athletes with residual brain functional abnormalities following a concussion.

  4. A System to Automatically Classify and Name Any Individual Genome-Sequenced Organism Independently of Current Biological Classification and Nomenclature

    PubMed Central

    Song, Yuhyun; Leman, Scotland; Monteil, Caroline L.; Heath, Lenwood S.; Vinatzer, Boris A.

    2014-01-01

    A broadly accepted and stable biological classification system is a prerequisite for biological sciences. It provides the means to describe and communicate about life without ambiguity. Current biological classification and nomenclature use the species as the basic unit and require lengthy and laborious species descriptions before newly discovered organisms can be assigned to a species and be named. The current system is thus inadequate to classify and name the immense genetic diversity within species that is now being revealed by genome sequencing on a daily basis. To address this lack of a general intra-species classification and naming system adequate for today’s speed of discovery of new diversity, we propose a classification and naming system that is exclusively based on genome similarity and that is suitable for automatic assignment of codes to any genome-sequenced organism without requiring any phenotypic or phylogenetic analysis. We provide examples demonstrating that genome similarity-based codes largely align with current taxonomic groups at many different levels in bacteria, animals, humans, plants, and viruses. Importantly, the proposed approach is only slightly affected by the order of code assignment and can thus provide codes that reflect similarity between organisms and that do not need to be revised upon discovery of new diversity. We envision genome similarity-based codes to complement current biological nomenclature and to provide a universal means to communicate unambiguously about any genome-sequenced organism in fields as diverse as biodiversity research, infectious disease control, human and microbial forensics, animal breed and plant cultivar certification, and human ancestry research. PMID:24586551

  5. Automatic classification of schizophrenia using resting-state functional language network via an adaptive learning algorithm

    NASA Astrophysics Data System (ADS)

    Zhu, Maohu; Jie, Nanfeng; Jiang, Tianzi

    2014-03-01

    A reliable and precise classification of schizophrenia is significant for its diagnosis and treatment. Functional magnetic resonance imaging (fMRI) is a novel tool increasingly used in schizophrenia research. Recent advances in statistical learning theory have led to the application of pattern classification algorithms to assess the diagnostic value of functional brain networks discovered from resting-state fMRI data. The aim of this study was to propose an adaptive learning algorithm to distinguish schizophrenia patients from normal controls using the resting-state functional language network. Furthermore, the classification of schizophrenia was treated here as a sample selection problem in which a sparse subset of samples was chosen from the labeled training set. Using these selected samples, which we call informative vectors, a classifier for the clinical diagnosis of schizophrenia was established. We experimentally demonstrated that the proposed algorithm, incorporating the resting-state functional language network, achieved 83.6% leave-one-out accuracy on resting-state fMRI data from 27 schizophrenia patients and 28 normal controls. In contrast with K-Nearest-Neighbor (KNN), Support Vector Machine (SVM) and l1-norm methods, our method yielded better classification performance. Moreover, our results suggest that dysfunction of the resting-state functional language network plays an important role in the clinical diagnosis of schizophrenia.

  6. ASTErIsM: application of topometric clustering algorithms in automatic galaxy detection and classification

    NASA Astrophysics Data System (ADS)

    Tramacere, A.; Paraficz, D.; Dubath, P.; Kneib, J.-P.; Courbin, F.

    2016-12-01

    We present a study on galaxy detection and shape classification using topometric clustering algorithms. We first use the DBSCAN algorithm to extract, from CCD frames, groups of adjacent pixels with significant fluxes, and we then apply the DENCLUE algorithm to separate the contributions of overlapping sources. The DENCLUE separation is based on the localization of patterns of local maxima, through an iterative algorithm which associates each pixel with the closest local maximum. Our main classification goal is to distinguish elliptical from spiral galaxies. We introduce new sets of features derived from the computation of geometrical invariant moments of the pixel-group shape and from the statistics of the spatial distribution of the DENCLUE local-maxima patterns. Ellipticals are characterized by a single group of local maxima, related to the galaxy core, while spiral galaxies have additional groups related to segments of the spiral arms. We use two different supervised ensemble classification algorithms: Random Forest and Gradient Boosting. Using a sample of ≃24 000 galaxies taken from the Galaxy Zoo 2 main sample with spectroscopic redshifts, we test our classification against the Galaxy Zoo 2 catalogue. We find that features extracted from our pipeline give, on average, an accuracy of ≃93 per cent when testing on a test set with a size of 20 per cent of our full data set, with features derived from the angular distribution of density attractors ranking at the top in discrimination power.
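
    The first stage, grouping adjacent significant pixels with DBSCAN, can be sketched directly with scikit-learn; the DENCLUE de-blending stage is a separate algorithm not covered here, and the thresholding rule below is illustrative.

    ```python
    import numpy as np
    from sklearn.cluster import DBSCAN

    def detect_sources(frame, sigma_thresh=3.0, eps=1.5, min_samples=5):
        """Group significant pixels into candidate sources with DBSCAN
        (the paper's DENCLUE de-blending of overlapping sources is a
        separate step not sketched here)."""
        bg, noise = np.median(frame), frame.std()
        ys, xs = np.where(frame > bg + sigma_thresh * noise)
        coords = np.column_stack([xs, ys]).astype(float)
        if len(coords) == 0:
            return coords, np.array([])
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)
        return coords, labels   # label -1 marks unclustered noise pixels
    ```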

  7. Automatic classification of dyslexic children by applying machine learning to fMRI images.

    PubMed

    García Chimeno, Yolanda; García Zapirain, Begonya; Saralegui Prieto, Ibone; Fernandez-Ruanova, Begonya

    2014-01-01

    Functional Magnetic Resonance Imaging (fMRI) and Diffusion Tensor Imaging (DTI) are a source of information for studying different pathologies. These tools make it possible to classify the subjects under study, analysing in this case the functions related to language in young patients with dyslexia. Images are obtained using a scanner, and different tests are performed on the subjects. After processing the images, the areas activated by patients when performing the paradigms, as well as the anatomy of the tracts, were obtained. The main objective is ultimately to introduce a group of monocular-vision subjects whose brain activation model is unknown; this classification helps to assess whether these subjects are more akin to dyslexic or control subjects. Machine learning techniques study systems that learn how to perform non-linear classifications through supervised or unsupervised training, or a combination of both. Once the machine has been set up, it is validated with subjects who were not included in the training stage. The results are presented in a user-friendly chart. Finally, a new tool for the classification of subjects with dyslexia and monocular vision was obtained (achieving a success rate of 94.8718% with the Neural Network classifier), which can be extended to further classifications.

  8. Automatic classification of patients with idiopathic Parkinson's disease and progressive supranuclear palsy using diffusion MRI datasets

    NASA Astrophysics Data System (ADS)

    Talai, Sahand; Boelmans, Kai; Sedlacik, Jan; Forkert, Nils D.

    2017-03-01

    Parkinsonian syndromes encompass a spectrum of neurodegenerative diseases, which can be classified into various subtypes. The differentiation of these subtypes is typically conducted based on clinical criteria. Due to the overlap of intra-syndrome symptoms, accurate differential diagnosis based on clinical guidelines remains a challenge, with failure rates up to 25%. The aim of this study is to present an image-based method for classifying patients with Parkinson's disease (PD) and patients with progressive supranuclear palsy (PSP), an atypical variant of PD. To this end, apparent diffusion coefficient (ADC) parameter maps were calculated based on diffusion-tensor magnetic resonance imaging (MRI) datasets. Mean ADC values were determined in 82 brain regions using an atlas-based approach. The extracted mean ADC values for each patient were then used as features for classification using a linear-kernel support vector machine classifier. To increase the classification accuracy, feature selection was performed, which resulted in the top 17 attributes being used as the final input features. A leave-one-out cross-validation based on 56 PD and 21 PSP subjects revealed that the proposed method is capable of differentiating PD and PSP patients with an accuracy of 94.8%. In conclusion, the classification of PD and PSP patients based on ADC features obtained from diffusion MRI datasets is a promising new approach for the differentiation of Parkinsonian syndromes in the broader context of decision support systems.

  9. miRFam: an effective automatic miRNA classification method based on n-grams and a multiclass SVM

    PubMed Central

    2011-01-01

    Background MicroRNAs (miRNAs) are ~22 nt long integral elements responsible for post-transcriptional control of gene expression. After the identification of thousands of miRNAs, the challenge is now to explore their specific biological functions. To this end, it will be greatly helpful to construct a reasonable organization of these miRNAs according to their homologous relationships. Given an established miRNA family system (e.g. the miRBase family organization), this paper addresses the problem of automatically and accurately classifying newly found miRNAs into their corresponding families by supervised learning techniques. Concretely, we propose an effective method, miRFam, which uses only primary information from pre-miRNAs or mature miRNAs together with a multiclass SVM to automatically classify miRNA genes. Results An existing miRNA family system prepared by miRBase was downloaded online. We first employed n-grams to extract features from known precursor sequences, and then trained a multiclass SVM classifier to classify new miRNAs (i.e. miRNAs whose families are unknown). Compared with miRBase's sequence alignment and manual modification, our study shows that the application of machine learning techniques to miRNA family classification is a general and more effective approach. When the testing dataset contains more than 300 families (each of which holds no fewer than 5 members), the classification accuracy is around 98%. Even with the entire miRBase15 (1056 families, more than 650 of which hold fewer than 5 samples), the accuracy surprisingly reaches 90%. Conclusions Based on the experimental results, we argue that miRFam is suitable for application as an automated method of family classification, and that it is an important supplementary tool to the existing alignment-based small non-coding RNA (sncRNA) classification methods, since it only requires primary sequence information. Availability The source code of miRFam, written in C++, is freely and publicly available at: http
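
    The core of such an approach, n-gram counts over the primary sequence fed to a multiclass SVM, can be sketched in a few lines; the sequences, family labels and n-gram order below are illustrative, not miRFam's actual configuration.

    ```python
    import numpy as np
    from itertools import product
    from sklearn.svm import LinearSVC

    ALPHABET = "ACGU"

    def ngram_features(seq, n=3):
        """Normalized n-gram (k-mer) counts of an RNA sequence, the kind
        of primary-sequence feature described above (sketch only)."""
        vocab = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=n))}
        v = np.zeros(len(vocab))
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            if gram in vocab:
                v[vocab[gram]] += 1
        return v / max(1, len(seq) - n + 1)

    # Hypothetical training data: pre-miRNA sequences with known families.
    seqs = ["UGAGGUAGUAGGUUGUAUAGUU", "UAUUGCACUUGUCCCGGCCUGU"]
    families = ["let-7", "mir-92"]
    X = np.stack([ngram_features(s) for s in seqs])
    clf = LinearSVC().fit(X, families)   # one-vs-rest multiclass SVM
    ```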

  10. Semi-Automatic Classification of Birdsong Elements Using a Linear Support Vector Machine

    PubMed Central

    Tachibana, Ryosuke O.; Oosugi, Naoya; Okanoya, Kazuo

    2014-01-01

    Birdsong provides a unique model for understanding the behavioral and neural bases underlying complex sequential behaviors. However, birdsong analyses require laborious effort to make the data quantitatively analyzable. Previous attempts have succeeded in partially reducing the human effort involved in birdsong segment classification. The present study aimed to further reduce human effort while increasing classification performance. In the current proposal, a linear-kernel support vector machine was employed to minimize the amount of human-generated label samples needed for reliable element classification in birdsong, and to enable the classifier to handle high-dimensional acoustic features while avoiding the over-fitting problem. Bengalese finch songs, in which distinct elements (i.e., syllables) are aligned in complex sequential patterns, were used as a representative test case from the neuroscientific research field. Three evaluations were performed to test (1) algorithm validity and accuracy while exploring appropriate classifier settings, (2) the ability to maintain accuracy with a reduced amount of instruction data, and (3) the ability to classify a large dataset with minimal manual labeling. The results from evaluation (1) showed that the algorithm is 99.5% reliable in classifying song syllables. This accuracy was indeed maintained in evaluation (2), even when the human-classified instruction data were reduced to a one-minute excerpt (corresponding to 300–400 syllables) for classifying a two-minute excerpt. The reliability remained comparable, at 98.7% accuracy, when a large target dataset of whole-day recordings (∼30,000 syllables) was used. Use of a linear-kernel support vector machine thus showed sufficient accuracy with minimal manually generated instruction data in birdsong element classification. The proposed methodology would help reduce the laborious processes in birdsong analysis without sacrificing reliability, and therefore can help

  11. Food Safety by Using Machine Learning for Automatic Classification of Seeds of the South-American Incanut Plant

    NASA Astrophysics Data System (ADS)

    Lemanzyk, Thomas; Anding, Katharina; Linss, Gerhard; Rodriguez Hernández, Jorge; Theska, René

    2015-02-01

    The following paper deals with the classification of seeds and seed components of the South-American Incanut plant and the modification of a machine to handle this task. First, the state of the art is illustrated. The research was carried out in Germany, with a substantial part conducted in Peru and Ecuador. Theoretical considerations for an automatic analysis of the Incanut seeds are specified. The analysis software and the separation unit of the mechanical hardware were optimized based on the recognition results. In a final step, the practical application of the analysis of the Incanut seeds was tested on a trial basis and rated on the basis of statistical values.

  12. A Feature Mining Based Approach for the Classification of Text Documents into Disjoint Classes.

    ERIC Educational Resources Information Center

    Nieto Sanchez, Salvador; Triantaphyllou, Evangelos; Kraft, Donald

    2002-01-01

    Proposes a new approach for classifying text documents into two disjoint classes. Highlights include a brief overview of document clustering; a data mining approach called the One Clause at a Time (OCAT) algorithm which is based on mathematical logic; vector space model (VSM); and comparing the OCAT to the VSM. (Author/LRW)

  14. Shared Features of L2 Writing: Intergroup Homogeneity and Text Classification

    ERIC Educational Resources Information Center

    Crossley, Scott A.; McNamara, Danielle S.

    2011-01-01

    This study investigates intergroup homogeneity within high intermediate and advanced L2 writers of English from Czech, Finnish, German, and Spanish first language backgrounds. A variety of linguistic features related to lexical sophistication, syntactic complexity, and cohesion were used to compare texts written by L1 speakers of English to L2…

  15. Automatic knee cartilage segmentation from multi-contrast MR images using support vector machine classification with spatial dependencies.

    PubMed

    Zhang, Kunlei; Lu, Wenmiao; Marziliano, Pina

    2013-12-01

    Accurate segmentation of knee cartilage is required to obtain quantitative cartilage measurements, which is crucial for the assessment of knee pathology caused by musculoskeletal diseases or sudden injuries. This paper presents an automatic knee cartilage segmentation technique which exploits a rich set of image features from multi-contrast magnetic resonance (MR) images and the spatial dependencies between neighbouring voxels. The image features and the spatial dependencies are modelled into a support vector machine (SVM)-based association potential and a discriminative random field (DRF)-based interaction potential. Subsequently, both potentials are incorporated into an inference graphical model such that the knee cartilage segmentation is cast as an optimal labelling problem which can be efficiently solved by loopy belief propagation. The effectiveness of the proposed technique is validated on a database of multi-contrast MR images. The experimental results show that using diverse forms of image and anatomical structure information as features is helpful in improving the segmentation, and that the joint SVM-DRF model is superior to classification models based solely on DRF or SVM in terms of accuracy when the same features are used. The developed segmentation technique achieves good performance compared with gold standard segmentations and obtained higher average DSC values than state-of-the-art automatic cartilage segmentation studies.

  16. SVM-based classification selection algorithm for the automatic selection of guide star

    NASA Astrophysics Data System (ADS)

    Zheng, Sheng; Xiong, Chengyi; Wu, Weiren; Tian, Jinwen; Liu, Jian

    2003-09-01

    A new general method for the automatic selection of guide stars, based on a new dynamic Visual Magnitude Threshold (VMT) hyper-plane and Support Vector Machines (SVM), is introduced. The high-dimensional nonlinear VMT plane can be easily obtained using the SVM, and the guide star sets are then generated by the SVM classifier. The experimental results demonstrate that the catalog obtained by the proposed algorithm has several advantages, including a smaller total number of stars, smaller catalog size and better distribution uniformity.

  17. EOG and EMG: two important switches in automatic sleep stage classification.

    PubMed

    Estrada, E; Nazeran, H; Barragan, J; Burk, J R; Lucas, E A; Behbehani, K

    2006-01-01

    Sleep is a natural periodic state of rest for the body, in which the eyes are usually closed and consciousness is completely or partially lost. In this investigation we used the EOG and EMG signals acquired from 10 patients undergoing overnight polysomnography, with their sleep stages determined by expert sleep specialists based on R&K rules. Differentiating between the Stage 1, Awake and REM stages challenged a well-trained neural network classifier when only EEG-derived signal features were used. To meet this challenge and improve the classification rate, extra features extracted from the EOG and EMG signals were fed to the classifier. In this study, two simple feature extraction algorithms were applied to the EOG and EMG signals. The statistics of the results were calculated and displayed in an easy-to-visualize fashion to observe tendencies for each sleep stage. Inclusion of these features shows great promise in improving the classification rate towards the target rate of 100%.
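
    The abstract does not spell out the two feature extraction algorithms, so the sketch below shows two generic per-epoch features (energy and zero-crossing rate) of the kind often appended to EEG feature vectors in sleep staging; it is an assumption, not the paper's exact definition.

    ```python
    import numpy as np

    def epoch_features(signal, fs=250, epoch_sec=30):
        """Two simple per-epoch features from an EOG or EMG channel:
        signal energy and zero-crossing rate (illustrative only; the
        paper's exact feature definitions are not reproduced here)."""
        n = fs * epoch_sec
        epochs = signal[:len(signal) // n * n].reshape(-1, n)
        energy = np.mean(epochs ** 2, axis=1)
        zcr = np.mean(np.abs(np.diff(np.sign(epochs), axis=1)) > 0, axis=1)
        return np.column_stack([energy, zcr])   # append to EEG feature vectors
    ```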

  18. A software tool for automatic classification and segmentation of 2D/3D medical images

    NASA Astrophysics Data System (ADS)

    Strzelecki, Michal; Szczypinski, Piotr; Materka, Andrzej; Klepaczko, Artur

    2013-02-01

    Modern medical diagnosis utilizes techniques for visualizing human internal organs (CT, MRI) or their metabolism (PET). However, the evaluation of acquired images by human experts is usually subjective and only qualitative. Quantitative analysis of MR data, including tissue classification and segmentation, is necessary to perform e.g. attenuation compensation, motion detection, and correction of the partial volume effect in PET images acquired with PET/MR scanners. This article briefly presents the MaZda software package, which supports 2D and 3D medical image analysis aiming at the quantification of image texture. MaZda implements procedures for the evaluation, selection and extraction of highly discriminative texture attributes, combined with various classification, visualization and segmentation tools. Examples of MaZda application in medical studies are also provided.

  19. Automatic Classification of Land Cover on Smith Island, VA, Using HyMAP Imagery

    DTIC Science & Technology

    2002-10-01

    ...particular areas, labeled data consisted of isolated single-pixel waypoints. Both approaches to the classification problem produced consistent results for... based on 112 HyMAP spectra, labeled in ground surveys, obtained reasonably consistent results for many of the dominant categories, with a few... salt flats or salt pannes. Wash flats result, for example, from sudden storm surge events in which the dune line is breached. Salt pannes occur in

  20. Search strategies in a human water maze analogue analyzed with automatic classification methods.

    PubMed

    Schoenfeld, Robby; Moenich, Nadine; Mueller, Franz-Josef; Lehmann, Wolfgang; Leplow, Bernd

    2010-03-17

    Although human spatial cognition is the focus of intense research efforts, experimental evidence on how search strategies differ among age and gender groups remains elusive. To address this problem, we investigated the interaction between age, sex, and strategy usage within a novel virtual water maze-like procedure (VWM). We studied 28 young adults aged 20-29 years (14 males) and 30 middle-aged adults aged 50-59 years (15 males). The younger age group outperformed the older group with respect to place learning. We also observed a moderate sex effect, with males outperforming females. Unbiased classification of human search behavior within this paradigm was done by means of an exploratory method using sparse non-negative matrix factorization (SNMF) and a parameter-based algorithm as an a priori classifier. Analyses of search behavior with the SNMF and the parameter-based method showed that the older group relied on less efficient search strategies, although the performance of females did not drop as dramatically. Place learning was related to the adoption of elaborated search strategies: participants using place-directed strategies obtained the highest scores on place learning, and the deterioration of place learning in the elderly was due to the use of less efficient, non-specific strategies. A high convergence of the SNMF and the parameter-based classifications could be shown, and the SNMF classification was further cross-validated against the traditional eyeballing method. On the basis of this analysis, we conclude that SNMF is a robust exploratory method for the classification of search behavior in water maze procedures.

  1. Automatic classification and pattern discovery in high-throughput protein crystallization trials.

    PubMed

    Cumbaa, Christian; Jurisica, Igor

    2005-01-01

    Conceptually, protein crystallization can be divided into two phases: search and optimization. Robotic protein crystallization screening can speed up the search phase, and it has the potential to increase process quality. Automated image classification helps to increase throughput and consistently generate objective results. Although the classification accuracy can always be improved, our image analysis system can classify images from 1,536-well plates with high classification accuracy (85%) and ROC score (0.87), as evaluated on 127 human-classified protein screens containing 5,600 crystal images and 189,472 non-crystal images. Data mining can integrate results from high-throughput screens with information about crystallization conditions, intrinsic protein properties, and results from crystallization optimization. We apply association mining, a data mining approach that identifies frequently occurring patterns among variables and their values. This approach segregates proteins into groups based on how they react over a broad range of conditions, and clusters cocktails to reflect their potential to achieve crystallization. These results may lead to crystallization screen optimization, and may reveal associations between protein properties and crystallization conditions. We also postulate that past experience may lead us to the identification of initial conditions favorable to crystallization for novel proteins.

  2. HaploGrep: a fast and reliable algorithm for automatic classification of mitochondrial DNA haplogroups.

    PubMed

    Kloss-Brandstätter, Anita; Pacher, Dominic; Schönherr, Sebastian; Weissensteiner, Hansi; Binna, Robert; Specht, Günther; Kronenberg, Florian

    2011-01-01

    An ongoing source of controversy in mitochondrial DNA (mtDNA) research stems from the detection of numerous errors in mtDNA profiles that have led to erroneous conclusions and false disease associations. Most of these controversies could be avoided if the samples' haplogroup status were taken into consideration. Knowing the mtDNA haplogroup affiliation is a critical prerequisite for studying mechanisms of human evolution and discovering genes involved in complex diseases, and validating phylogenetic consistency using haplogroup classification is an important step in quality control. However, despite the availability of Phylotree, a regularly updated classification tree of global mtDNA variation, the process of haplogroup classification is still time-consuming and error-prone, as researchers have to manually compare the polymorphisms found in a population sample to those summarized in Phylotree, polymorphism by polymorphism, sample by sample. We present HaploGrep, a fast, reliable and straightforward algorithm implemented in a Web application to determine the haplogroup affiliation of thousands of mtDNA profiles genotyped for the entire mtDNA or any part of it. HaploGrep uses the latest version of Phylotree and offers an all-in-one solution for quality assessment of mtDNA profiles in clinical genetics, population genetics and forensics. HaploGrep can be accessed freely at http://haplogrep.uibk.ac.at.

  3. HClass: Automatic classification tool for health pathologies using artificial intelligence techniques.

    PubMed

    Garcia-Chimeno, Yolanda; Garcia-Zapirain, Begonya

    2015-01-01

    The classification of subjects' pathologies enables rigour to be applied to the treatment of certain pathologies, as doctors on occasion juggle so many variables that they can end up confusing some illnesses with others. Thanks to Machine Learning techniques applied to a health-record database, such a classification can be made using our algorithm. hClass contains a non-linear classifier of either a supervised, non-supervised or semi-supervised type. The machine is configured using other techniques such as validation of the set to be classified (cross-validation), feature reduction (PCA) and committees for assessing the various classifiers. The tool is easy to use: the sample matrix and the features that one wishes to classify, the number of iterations and the subjects who are going to be used to train the machine all need to be introduced as inputs. As a result, the success rate is shown either for a single classifier or for a committee if one has been formed. A 90% success rate is obtained with the AdaBoost classifier and 89.7% in the case of a committee (comprising three classifiers) when PCA is applied. This tool can be expanded to allow the user to fully characterise the classifiers by adjusting them to each classification use.
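
    The committee-of-classifiers idea combined with PCA maps naturally onto scikit-learn's VotingClassifier; a minimal sketch with synthetic stand-in data follows (the actual hClass member classifiers and settings may differ).

    ```python
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.ensemble import AdaBoostClassifier, VotingClassifier
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_score

    # Hypothetical health-record feature matrix and diagnosis labels.
    X = np.random.rand(300, 40)
    y = np.random.randint(0, 2, 300)

    # Three-member committee evaluated after PCA (sketch only).
    committee = make_pipeline(
        PCA(n_components=10),
        VotingClassifier([
            ("ada", AdaBoostClassifier()),
            ("svm", SVC(probability=True)),
            ("nb", GaussianNB()),
        ], voting="soft"),
    )
    print(cross_val_score(committee, X, y, cv=5).mean())
    ```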

  4. Semi-automatic classification of cementitious materials using scanning electron microscope images

    NASA Astrophysics Data System (ADS)

    Drumetz, L.; Dalla Mura, M.; Meulenyzer, S.; Lombard, S.; Chanussot, J.

    2015-04-01

    A new interactive approach for the segmentation and classification of cementitious materials using Scanning Electron Microscope images is presented in this paper. It is based on denoising the data with the Block Matching 3D (BM3D) algorithm, Binary Partition Tree (BPT) segmentation and Support Vector Machine (SVM) classification. The latter two operations are both performed in an interactive way. The BPT provides a hierarchical representation of the spatial regions of the data and, after appropriate pruning, yields a segmentation map which can be improved by the user. SVMs are used to obtain a classification map of the image, with which the user can interact to get better results. The interactivity is twofold: it allows the user to obtain a better segmentation by exploring the BPT structure, and to help the classifier better discriminate the classes. The latter is achieved by improving the representativity of the training set, adding new pixels from the segmented regions to the training samples. This approach performs similarly to or better than methods currently used in an industrial environment. The validation is performed on several cement samples, both qualitatively by visual examination and quantitatively by comparing experimental results with theoretical values.

  5. Hierarchical learning architecture with automatic feature selection for multiclass protein fold classification.

    PubMed

    Huang, Chuen-Der; Lin, Chin-Teng; Pal, Nikhil Ranjan

    2003-12-01

    The structural classification of proteins plays a very important role in bioinformatics, since the relationships and characteristics among known proteins can be exploited to predict the structure of new proteins. The success of a classification system depends heavily on two things: the tools being used and the features considered. In bioinformatics applications, the role of appropriate features has not been given adequate attention. In this investigation we use three novel ideas for multiclass protein fold classification. First, we use a gating neural network, where each input node is associated with a gate. This network can select important features in an online manner as the learning goes on. At the beginning of training, all gates are almost closed, i.e., no feature is allowed to enter the network. During training, gates corresponding to good features are completely opened while gates corresponding to bad features are closed more tightly, and some gates may be partially open. The second novel idea is to use a hierarchical learning architecture (HLA). The classifier at the first level of the HLA classifies the protein features into four major classes: all alpha, all beta, alpha + beta, and alpha/beta. At the next level, another set of classifiers further classifies the protein features into 27 folds. The third novel idea is to induce indirect coding features from the amino-acid composition sequence of proteins based on the N-gram concept. This provides more representative and discriminative new local features of protein sequences for multiclass protein fold classification. The proposed HLA with the new indirect coding features increases the protein fold classification accuracy by about 12%. Moreover, the gating neural network is found to reduce the number of features drastically. Using only half of the original features selected by the gating neural network can reach comparable test accuracy as that using all the
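
    The gating idea, one learnable gate per input feature that starts nearly closed, can be sketched in PyTorch as follows; the layer sizes and the 27-fold output mirror the text, while the gate initialization is an assumption approximating the "almost closed at the start" behavior.

    ```python
    import torch
    import torch.nn as nn

    class GatedInput(nn.Module):
        """Input layer with one learnable gate per feature: gates near 0
        shut a feature out, gates near 1 let it through. A minimal sketch
        of the gating idea; the negative initial logits approximate gates
        that are almost closed at the start of training."""
        def __init__(self, n_features):
            super().__init__()
            self.gate_logits = nn.Parameter(torch.full((n_features,), -3.0))

        def forward(self, x):
            return x * torch.sigmoid(self.gate_logits)

    # Gates corresponding to useful features open during ordinary training.
    model = nn.Sequential(GatedInput(125), nn.Linear(125, 64), nn.ReLU(),
                          nn.Linear(64, 27))   # 27 protein folds
    ```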

  6. Facility Detection and Popularity Assessment from Text Classification of Social Media and Crowdsourced Data

    SciTech Connect

    Sparks, Kevin A; Li, Roger G; Thakur, Gautam; Stewart, Robert N; Urban, Marie L

    2016-01-01

    Advances in technology have continually improved our understanding of where people are, how they use the environment around them, and why they are at their current location. Better knowledge of when various locations become popular through space and time could have large impacts on research fields like urban dynamics and energy consumption. In this paper, we discuss the ability to identify and locate various facility types (e.g. restaurants, airports, stadiums) using social media, and assess methods for determining when these facilities become popular over time. We use natural language processing tools and machine learning classifiers to interpret geotagged Twitter text and determine whether a user is seemingly at a location of interest when the tweet was sent. On average, our classifiers are approximately 85% accurate, varying across multiple facility types, with a peak precision of 98%. By using these methods to classify unstructured text, geotagged social media data can be an extremely useful tool for better understanding the composition of places and how and when people use them.
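
    A standard baseline for this kind of geotagged-text classification is TF-IDF n-grams with a linear classifier; the toy tweets and labels below are invented for illustration and are not from the paper's data.

    ```python
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Hypothetical labeled tweets: is the user seemingly at an airport?
    tweets = ["boarding now, see you on the other coast",
              "gate changed again, classic delay",
              "homemade pasta tonight",
              "grading papers all evening"]
    labels = ["airport", "airport", "not_at_facility", "not_at_facility"]

    # TF-IDF n-grams plus a linear classifier (sketch only).
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1),
                        LogisticRegression(max_iter=1000))
    clf.fit(tweets, labels)
    print(clf.predict(["flight delayed two hours at the gate"]))
    ```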

  7. Automatic rocks detection and classification on high resolution images of planetary surfaces

    NASA Astrophysics Data System (ADS)

    Aboudan, A.; Pacifici, A.; Murana, A.; Cannarsa, F.; Ori, G. G.; Dell'Arciprete, I.; Allemand, P.; Grandjean, P.; Portigliotti, S.; Marcer, A.; Lorenzoni, L.

    2013-12-01

    High-resolution images can be used to obtain rock locations and sizes on planetary surfaces. In particular, the rock size-frequency distribution is a key parameter for evaluating surface roughness, investigating the geologic processes that formed the surface, and assessing the hazards related to spacecraft landing. The manual search for rocks in high-resolution images (even for small areas) can be very labor-intensive, so an automatic or semi-automatic algorithm to identify rocks is mandatory to enable further processing, such as determining rock presence, size, height (by means of shadows) and spatial distribution over an area of interest. Accurate localization of rock and shadow contours is the key step in rock detection. An approach to contour detection based on morphological operators and statistical thresholding is presented in this work. The identified contours are then fitted using a suitable geometric model of the rocks or shadows and used to estimate salient rock parameters (position, size, area, height). The performance of this approach has been evaluated both on images of a Martian analogue area in the Moroccan desert and on HiRISE images. Results have been compared with ground truth obtained by means of manual rock mapping and prove the effectiveness of the algorithm. The rock abundance and rock size-frequency distributions derived from selected HiRISE images have been compared with the results of similar analyses performed for the landing site certification of Mars landers (Viking, Pathfinder, MER, MSL) and with the available thermal data from IRTM and TES.
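
    The shadow-contour stage, statistical thresholding plus morphological clean-up followed by fitting a region model, can be sketched with scikit-image; the paper combines several morphological operators and its own threshold selection, so this is only a schematic stand-in.

    ```python
    import numpy as np
    from skimage import measure, morphology
    from skimage.filters import threshold_otsu

    def detect_rock_shadows(image, min_area=4):
        """Sketch of the contour-extraction stage: thresholding picks out
        dark shadow blobs, morphology cleans them up, and each remaining
        blob is summarized by an ellipse-like region model (illustrative
        only; not the paper's exact operator sequence)."""
        dark = image < threshold_otsu(image)              # shadows are dark
        dark = morphology.binary_opening(dark, morphology.disk(1))
        dark = morphology.remove_small_objects(dark, min_size=min_area)
        rocks = []
        for region in measure.regionprops(measure.label(dark)):
            rocks.append({"centroid": region.centroid,
                          "area": region.area,
                          "major_axis": region.major_axis_length})
        return rocks
    ```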

  8. Automatic Generation of Data Types for Classification of Deep Web Sources

    SciTech Connect

    Ngu, A H; Buttler, D J; Critchlow, T J

    2005-02-14

    A Service Class Description (SCD) is an effective metadata-based approach for discovering Deep Web sources whose data exhibit regular patterns. However, creating an SCD manually is tedious and error-prone. Moreover, a manually created SCD does not adapt to the frequent changes of Web sources: it requires its creator to identify all the possible input and output types of a service a priori, and in many domains it is impossible to exhaustively list all the possible input and output data types of a source in advance. In this paper, we describe machine learning approaches for automatic generation of the data types of an SCD. We propose two different approaches for learning the data types of a class of Web sources. The Brute-Force Learner generates data types that achieve high recall but low precision; the Clustering-based Learner generates data types with a high precision rate but a lower recall rate. We demonstrate the feasibility of these two learning-based solutions for automatic generation of data types for citation Web sources and present a quantitative evaluation of both.

  9. Automatic segmentation and classification of the reflected laser dots during analytic measurement of mirror surfaces

    NASA Astrophysics Data System (ADS)

    Wang, ZhenZhou

    2016-08-01

    In past research, we proposed a one-shot-projection method for analytic measurement of the shapes of mirror surfaces, which uses the information in two captured laser-dot patterns to reconstruct the mirror surface. However, the automatic image processing algorithms needed to extract the laser-dot patterns had not been addressed. In this paper, a series of automatic image processing algorithms is proposed to segment and classify the projected laser dots robustly and efficiently while measuring the shapes of mirror surfaces, and each algorithm is indispensable to the finally achieved accuracy. First, the captured image is modeled and filtered by the designed frequency-domain filter. Then, it is segmented by a robust threshold selection method. A novel iterative erosion method is proposed to separate connected dots. Novel methods to remove noise blobs and retrieve missing dots are also proposed. An effective registration method helps select the SNF laser used and the dot generation pattern by analyzing whether the dot pattern obeys the principle of central projection. Experimental results verify the effectiveness of all the proposed algorithms.
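
    The iterative-erosion step can be sketched as below, assuming OpenCV/NumPy on a binary (8-bit) dot mask; the stopping rule used here (erode until the component count stops increasing) is an illustrative stand-in for the paper's method.

        import cv2
        import numpy as np

        def split_connected_dots(mask, max_iters=10):
            """Erode a binary mask until touching dots separate."""
            kernel = np.ones((3, 3), np.uint8)
            best, best_count = mask, cv2.connectedComponents(mask)[0]
            eroded = mask
            for _ in range(max_iters):
                eroded = cv2.erode(eroded, kernel)
                count = cv2.connectedComponents(eroded)[0]
                if count <= best_count:  # no further separation gained
                    break
                best, best_count = eroded, count
            return best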

  10. Hybrid three-dimensional and support vector machine approach for automatic vehicle tracking and classification using a single camera

    NASA Astrophysics Data System (ADS)

    Kachach, Redouane; Cañas, José María

    2016-05-01

    Using video in traffic monitoring is one of the most active research domains in the computer vision community. We present TrafficMonitor, a system that employs a hybrid approach for automatic vehicle tracking and classification on highways using a single stationary calibrated camera. The proposed system consists of three modules: vehicle detection, vehicle tracking, and vehicle classification. Moving vehicles are detected by an enhanced Gaussian mixture model background estimation algorithm. The design includes a technique to resolve the occlusion problem by combining a two-dimensional proximity tracking algorithm with the Kanade-Lucas-Tomasi feature tracking algorithm. The last module classifies the detected shapes into five vehicle categories (motorcycle, car, van, bus, and truck) using three-dimensional templates and an algorithm based on the histogram of oriented gradients and a support vector machine classifier. Several experiments have been performed using both real and simulated traffic in order to validate the system. The experiments were conducted on the GRAM-RTM dataset and on a real video dataset that is made publicly available as part of this work.
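
    The detection module can be approximated with OpenCV's Gaussian mixture background subtractor, as sketched below; "traffic.mp4" and the area cutoff are placeholders, and the tracking and HOG+SVM classification stages are omitted.

        import cv2

        cap = cv2.VideoCapture("traffic.mp4")
        bg = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=True)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            fg = bg.apply(frame)        # per-pixel foreground/shadow mask
            fg = cv2.medianBlur(fg, 5)  # suppress speckle noise
            contours, _ = cv2.findContours(fg, cv2.RETR_EXTERNAL,
                                           cv2.CHAIN_APPROX_SIMPLE)
            boxes = [cv2.boundingRect(c) for c in contours
                     if cv2.contourArea(c) > 400]  # candidate vehicles
        cap.release()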

  11. A semi-automatic method for quantification and classification of erythrocytes infected with malaria parasites in microscopic images.

    PubMed

    Díaz, Gloria; González, Fabio A; Romero, Eduardo

    2009-04-01

    Visual quantification of parasitemia in thin blood films is a very tedious, subjective and time-consuming task. This study presents an original method for quantification and classification of erythrocytes in stained thin blood films infected with Plasmodium falciparum. The proposed approach is composed of three main phases: a preprocessing step, which corrects luminance differences; a segmentation step, which uses the normalized RGB color space to classify pixels as either erythrocyte or background, followed by an inclusion-tree representation that structures the pixel information into objects, from which erythrocytes are found; and finally a two-step classification process that identifies infected erythrocytes and differentiates the infection stage, using a trained bank of classifiers. Additionally, user intervention is allowed when the approach cannot make a proper decision. Four hundred fifty malaria images were used for training and evaluating the method. Automatic identification of infected erythrocytes showed a specificity of 99.7% and a sensitivity of 94%. The infection stage was determined with an average sensitivity of 78.8% and an average specificity of 91.2%.
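
    The normalized RGB representation used in the segmentation step can be computed as below (a NumPy sketch; the erythrocyte/background classifier trained on top of it is omitted).

        import numpy as np

        def normalized_rgb(img):
            """img: H x W x 3 uint8 array -> chromaticity coordinates."""
            rgb = img.astype(np.float64)
            s = rgb.sum(axis=2, keepdims=True)
            s[s == 0] = 1.0  # avoid division by zero on black pixels
            return rgb / s   # each pixel's three channels now sum to 1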

  12. [Automatic classification method of star spectra data based on manifold-based discriminant analysis and Support Vector Machine].

    PubMed

    Liu, Zhong-Bao; Wang, Zhao-Ba; Zhao, Wen-Juan

    2014-01-01

    Although the Support Vector Machine (SVM) is widely used in astronomy, it takes only the margin between classes into consideration and neglects the data distribution within each class, which seriously limits classification performance. In view of this, a novel automatic classification method for star spectra data based on manifold-based discriminant analysis (MDA) and SVM is proposed in this paper. Two important concepts in MDA, the manifold-based within-class scatter (MWCS) and the manifold-based between-class scatter (MBCS), are introduced, and the separating hyperplane is found such that the MWCS is minimized and the MBCS is maximized. Based on this analysis, the corresponding optimization problem can be established; MDA then transforms the original optimization problem into its QP dual form, from which the support vectors and decision function are obtained. The classes of test samples are decided by the decision function. The advantage of the proposed method is that it not only focuses on the information between classes and the distribution characteristics, but also preserves the manifold structure of each class. Experiments on SDSS star spectra datasets verify the effectiveness of the proposed method.
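
    There is no off-the-shelf MDA implementation, but the overall pipeline can be approximated as below with scikit-learn, substituting ordinary linear discriminant analysis for the manifold-based scatter matrices; this is a rough stand-in, not the paper's method.

        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import SVC

        # LDA projects spectra onto a discriminative subspace (n_components
        # must be below the number of spectral classes); an SVM classifies.
        clf = make_pipeline(LinearDiscriminantAnalysis(n_components=2),
                            SVC(kernel="rbf"))
        # clf.fit(spectra_train, labels_train); clf.predict(spectra_test)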

  13. Automatic segmentation of textures on a database of remote-sensing images and classification by neural network

    NASA Astrophysics Data System (ADS)

    Durand, Philippe; Jaupi, Luan; Ghorbanzdeh, Dariush

    2012-11-01

    Analysis and automatic segmentation of texture is always a delicate problem. One can opt, quite naturally, for a statistical approach: based on higher moments, these techniques are very reliable and accurate but experimentally expensive. We propose in this paper a well-proven approach to texture analysis in remote sensing, based on geostatistics. The labeling of different textures, such as ice, clouds, water and forest, on a sample test image is learned by a neural network. The texture parameters are extracted from the shape of the autocorrelation function, calculated over window sizes appropriate for the optimal characterization of each texture. A mathematical model from fractal geometry is particularly well suited to characterizing the cloud texture, and provides a very fine segmentation between cloud and ice. The geostatistical parameters are assembled into a feature vector characterizing the textures. A robust multilayer neural network is then trained to classify all the images in the database from a correctly selected learning set. In the design phase, several alternatives were considered, and it turned out that a three-layer network is very suitable for the proposed classification; it therefore contains an input layer, an intermediate layer, and an output layer. After the learning phase, the classification results are very good. This approach can provide precious input for geographic information systems, for example by exploiting (or discarding) the cloud texture when focusing on other themes such as deforestation or changes in the ice.
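
    One way to read "texture parameters from the shape of the autocorrelation function" is sketched below, assuming NumPy/scikit-learn: a window's FFT-based autocorrelation profile becomes the feature vector for a three-layer perceptron. The profile choice is illustrative, not the authors' exact geostatistical parameterization.

        import numpy as np
        from sklearn.neural_network import MLPClassifier

        def autocorr_profile(window, n_lags=8):
            """Normalized autocorrelation of a 2-D window at small lags."""
            w = window - window.mean()
            f = np.fft.fft2(w)
            ac = np.fft.ifft2(f * np.conj(f)).real  # circular autocorrelation
            ac /= ac[0, 0]                          # 1 at zero lag
            return ac[0, :n_lags]                   # horizontal-lag profile

        # One hidden layer, matching the three-layer network described above.
        clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000)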

  14. Multistation alarm system for eruptive activity based on the automatic classification of volcanic tremor: specifications and performance

    NASA Astrophysics Data System (ADS)

    Langer, Horst; Falsaperla, Susanna; Messina, Alfio; Spampinato, Salvatore

    2015-04-01

    With over fifty eruptive episodes (Strombolian activity, lava fountains, and lava flows) between 2006 and 2013, Mt Etna, Italy, underscored its role as the most active volcano in Europe. Seven paroxysmal lava fountains occurred at the South East Crater in 2007-2008 and 46 at the New South East Crater between 2011 and 2013. Month-long lava emissions affected the upper eastern flank of the volcano in 2006 and 2008-2009. Against this background, effective monitoring and forecasting of volcanic phenomena are a first-order issue because of their potential socio-economic impact in a densely populated region like the town of Catania and its surroundings. For example, explosive activity has often formed thick ash clouds with widespread tephra fall able to disrupt air traffic and to cause severe problems for infrastructure such as highways and roads. For timely information on changes in the state of the volcano and the possible onset of dangerous eruptive phenomena, analysis of the continuous background seismic signal, the so-called volcanic tremor, turned out to be of paramount importance. Changes in the state of the volcano, as well as in its eruptive style, are usually concurrent with variations in the spectral characteristics (amplitude and frequency content) of the tremor. The huge amount of digital data continuously acquired by INGV's broadband seismic stations every day makes manual analysis difficult, so techniques for automatic classification of the tremor signal are applied. The application of unsupervised classification techniques to the tremor data revealed significant changes well before the onset of the eruptive episodes. This evidence led to the development of specific software packages for real-time processing of the tremor data. The operational characteristics of these tools (fail-safety, robustness with respect to noise and data outages, and computational efficiency) allowed the identification of criteria for automatic alarm flagging. The
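
    A minimal sketch of unsupervised tremor classification, assuming NumPy/SciPy/scikit-learn: log spectral amplitudes in the tremor band are computed per window and clustered with k-means. The sampling rate, band, and cluster count are assumptions; the INGV system's actual features and classifier differ.

        import numpy as np
        from scipy.signal import welch
        from sklearn.cluster import KMeans

        def spectral_features(trace, fs=100.0, win=4096):
            """Log power in the 0.5-10 Hz band for successive windows."""
            feats = []
            for i in range(0, len(trace) - win, win):
                f, p = welch(trace[i:i + win], fs=fs, nperseg=1024)
                feats.append(np.log10(p[(f > 0.5) & (f < 10.0)] + 1e-12))
            return np.array(feats)

        # labels = KMeans(n_clusters=4).fit_predict(spectral_features(tremor))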

  15. Automatic Pulmonary Artery-Vein Separation and Classification in Computed Tomography Using Tree Partitioning and Peripheral Vessel Matching.

    PubMed

    Charbonnier, Jean-Paul; Brink, Monique; Ciompi, Francesco; Scholten, Ernst T; Schaefer-Prokop, Cornelia M; van Rikxoort, Eva M

    2016-03-01

    We present a method for automatic separation and classification of pulmonary arteries and veins in computed tomography. Our method takes advantage of local information to separate segmented vessels, and of global information to perform the artery-vein classification. Given a vessel segmentation, a geometric graph is constructed that represents both the topology and the spatial distribution of the vessels. All nodes in the geometric graph where arteries and veins are potentially merged are identified based on graph pruning and individual branching patterns. At the identified nodes, the graph is split into subgraphs that each contain only arteries or only veins. Based on the anatomical fact that arteries and veins approach a common alveolar sac, an arterial subgraph is expected to be intertwined with a venous subgraph in the periphery of the lung. This relationship is quantified using periphery matching and is used to group subgraphs of the same artery-vein class. Artery-vein classification is performed on these grouped subgraphs based on the volumetric difference between arteries and veins. A quantitative evaluation was performed on 55 publicly available non-contrast CT scans. In all scans, two observers manually annotated randomly selected vessels as artery or vein. Our method was able to separate and classify arteries and veins with a median accuracy of 89%, closely approximating the inter-observer agreement. All CT scans used in this study, including all results of our system and all manual annotations, are publicly available at http://arteryvein.grand-challenge.org.
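
    The splitting step can be sketched with networkx, as below: removing the identified merge nodes cuts the vessel graph, and each remaining connected component is a candidate artery-only or vein-only subgraph. This is a simplified illustration of the paper's graph handling, not its full algorithm.

        import networkx as nx

        def split_at_merge_nodes(G, merge_nodes):
            """Cut the vessel graph at artery-vein merge points."""
            H = G.copy()
            H.remove_nodes_from(merge_nodes)
            return [H.subgraph(c).copy() for c in nx.connected_components(H)]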

  16. EVEREST: automatic identification and classification of protein domains in all protein sequences

    PubMed Central

    Portugaly, Elon; Harel, Amir; Linial, Nathan; Linial, Michal

    2006-01-01

    Background: Proteins are composed of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all-vs.-all pairwise sequence comparison. It then proceeds to cluster these segments into putative domain families. The selection of the best putative families is done using machine learning techniques. A statistical model is then created for each of the chosen families. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again. Results: Processing the Swiss-Prot section of the UniProt Knowledgebase, release 7.2, EVEREST defines 20,230 domains, covering 85% of the amino acids of the Swiss-Prot database. EVEREST annotates 11,852 proteins (6% of the database) that are not annotated by Pfam A. In addition, in 43,086 proteins (20% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests show that EVEREST recovers 56% of Pfam A families and 63% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 51% fidelity. EVEREST domains are often a combination of domains as defined by Pfam or SCOP and are frequently sub-domains of such domains. Conclusion: The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data. The

  17. EVEREST: automatic identification and classification of protein domains in all protein sequences.

    PubMed

    Portugaly, Elon; Harel, Amir; Linial, Nathan; Linial, Michal

    2006-06-02

    Proteins are composed of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all-vs.-all pairwise sequence comparison. It then proceeds to cluster these segments into putative domain families. The selection of the best putative families is done using machine learning techniques. A statistical model is then created for each of the chosen families. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again. Processing the Swiss-Prot section of the UniProt Knowledgebase, release 7.2, EVEREST defines 20,230 domains, covering 85% of the amino acids of the Swiss-Prot database. EVEREST annotates 11,852 proteins (6% of the database) that are not annotated by Pfam A. In addition, in 43,086 proteins (20% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests show that EVEREST recovers 56% of Pfam A families and 63% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 51% fidelity. EVEREST domains are often a combination of domains as defined by Pfam or SCOP and are frequently sub-domains of such domains. The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data. The EVEREST library of domain
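
    The clustering stage of a process like this can be sketched as below, assuming SciPy: given a precomputed pairwise similarity matrix over protein segments, average-linkage hierarchical clustering yields putative families. The similarity source (e.g., alignment scores) and the cutoff are assumptions; EVEREST's actual pipeline is far more elaborate.

        import numpy as np
        from scipy.cluster.hierarchy import fcluster, linkage
        from scipy.spatial.distance import squareform

        def cluster_segments(sim, cutoff=0.7):
            """sim: symmetric N x N segment-similarity matrix in [0, 1]."""
            dist = 1.0 - sim
            np.fill_diagonal(dist, 0.0)
            # Condensed distance vector -> average-linkage dendrogram.
            Z = linkage(squareform(dist, checks=False), method="average")
            return fcluster(Z, t=cutoff, criterion="distance")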

  18. Hyperspectral classification of grassland species: towards a UAS application for semi-automatic field surveys

    NASA Astrophysics Data System (ADS)

    Lopatin, Javier; Fassnacht, Fabian E.; Kattenborn, Teja; Schmidtlein, Seb