Duchrow, Timo; Shtatland, Timur; Guettler, Daniel; Pivovarov, Misha; Kramer, Stefan; Weissleder, Ralph
2009-01-01
Background The breadth of biological databases and their information content continues to increase exponentially. Unfortunately, our ability to query such sources is still often suboptimal. Here, we introduce and apply community voting, database-driven text classification, and visual aids as a means to incorporate distributed expert knowledge, to automatically classify database entries, and to retrieve them efficiently. Results Using a previously developed peptide database as an example, we compared several machine learning algorithms in their ability to classify abstracts of published literature into categories relevant to peptide research, such as whether or not they relate to cancer, angiogenesis, molecular imaging, etc. Ensembles of bagged decision trees met the requirements of our application best, and no other algorithm consistently performed better in comparative testing. Moreover, we show that the algorithm produces meaningful class probability estimates, which can be used to visualize the confidence of automatic classification during the retrieval process. To allow viewing of long lists of search results enriched by automatic classifications, we added a dynamic heat map to the web interface. We take advantage of community knowledge by enabling users to cast votes in Web 2.0 style in order to correct automated classification errors, which triggers reclassification of all entries. We used a novel framework in which the database "drives" the entire vote aggregation and reclassification process to increase speed while conserving computational resources and keeping the method scalable. In our experiments, we simulate community voting by adding various levels of noise to nearly perfectly labelled instances, and show that, under such conditions, classification can be improved significantly. Conclusion Using PepBank as a model database, we show how to build a classification-aided retrieval system that gathers training data from the community, is completely controlled by the database, scales well with concurrent change events, and can be adapted to add text classification capability to other biomedical databases. The system can be accessed at . PMID:19799796
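A minimal sketch of the core classifier described above (bagged decision trees over text features with class-probability output) can be assembled with scikit-learn. This is an illustration, not the PepBank implementation; the toy abstracts and the cancer label are invented.

# Minimal sketch, assuming scikit-learn; not the authors' code.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

abstracts = [
    "Peptide probe for tumor imaging in breast cancer xenografts.",
    "Angiogenesis inhibition by a cyclic RGD peptide.",
    "Synthesis of antimicrobial peptides from frog skin.",
    "Peptide library screening against prostate cancer cells.",
]
is_cancer = [1, 0, 0, 1]  # hypothetical labels for the "cancer" category

model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
)
model.fit(abstracts, is_cancer)

# Class-probability estimates (the fraction of trees voting for each class)
# are what could drive a confidence heat map in the retrieval interface.
print(model.predict_proba(["RGD peptide targeting tumor vasculature"]))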
Van Gijn, Marielle E; Ceccherini, Isabella; Shinar, Yael; Carbo, Ellen C; Slofstra, Mariska; Arostegui, Juan I; Sarrabay, Guillaume; Rowczenio, Dorota; Omoyinmi, Ebun; Balci-Peynircioglu, Banu; Hoffman, Hal M; Milhavet, Florian; Swertz, Morris A; Touitou, Isabelle
2018-03-29
Hereditary recurrent fevers (HRFs) are rare inflammatory diseases sharing similar clinical symptoms and effectively treated with anti-inflammatory biological drugs. Accurate diagnosis of HRF relies heavily on genetic testing. This study aimed to obtain an experts' consensus on the clinical significance of gene variants in four well-known HRF genes: MEFV, TNFRSF1A, NLRP3 and MVK. We configured a MOLGENIS web platform to share and analyse pathogenicity classifications of the variants and to manage a consensus-based classification process. Four experts in HRF genetics submitted independent classifications of 858 variants. Classifications were driven to consensus by recruiting four more expert opinions and by targeting discordant classifications in five iterative rounds. Consensus classification was reached for 804/858 variants (94%). None of the unsolved variants (6%) remained with opposite classifications (eg, pathogenic vs benign). New mutational hotspots were found in all genes. We noted a lower pathogenic variant load and a higher fraction of variants with unknown or unsolved clinical significance in the MEFV gene. Applying a consensus-driven process on the pathogenicity assessment of experts yielded rapid classification of almost all variants of four HRF genes. The high-throughput database will profoundly assist clinicians and geneticists in the diagnosis of HRFs. The configured MOLGENIS platform and consensus evolution protocol are usable for assembly of other variant pathogenicity databases. The MOLGENIS software is available for reuse at http://github.com/molgenis/molgenis; the specific HRF configuration is available at http://molgenis.org/said/. The HRF pathogenicity classifications will be published on the INFEVERS database at https://fmf.igh.cnrs.fr/ISSAID/infevers/. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2018. All rights reserved. No commercial use is permitted unless otherwise expressly granted.
Mujtaba, Ghulam; Shuib, Liyana; Raj, Ram Gopal; Rajandram, Retnagowri; Shaikh, Khairunisa; Al-Garadi, Mohammed Ali
2017-01-01
Objectives Widespread implementation of electronic databases has improved the accessibility of plaintext clinical information for supplementary use. Numerous machine learning techniques, such as supervised machine learning approaches or ontology-based approaches, have been employed to obtain useful information from plaintext clinical data. This study proposes an automatic multi-class classification system to predict accident-related causes of death from plaintext autopsy reports through expert-driven feature selection with supervised automatic text classification decision models. Methods Accident-related autopsy reports were obtained from one of the largest hospitals in Kuala Lumpur. These reports belong to nine different accident-related causes of death. A master feature vector was prepared by extracting features from the collected autopsy reports using unigrams with lexical categorization. This master feature vector was used to detect the cause of death [according to the International Classification of Diseases version 10 (ICD-10)] through five automated feature selection schemes, the proposed expert-driven approach, five feature subset sizes, and five machine learning classifiers. Model performance was evaluated using macro-averaged precision, recall, and F-measure, as well as accuracy and area under the ROC curve. Four baselines were used to compare the results with the proposed system. Results Random forest and J48 decision models parameterized using expert-driven feature selection yielded the highest evaluation measures (approaching 85% to 90% for most metrics) using a feature subset size of 30. The proposed system also showed an approximately 14% to 16% improvement in overall accuracy compared with the existing techniques and the four baselines. Conclusion The proposed system is feasible and practical to use for automatic classification of ICD-10-coded causes of death from autopsy reports. It assists pathologists in accurately and rapidly determining the underlying cause of death based on autopsy findings. Furthermore, the proposed expert-driven feature selection approach and the findings are generally applicable to other kinds of plaintext clinical reports. PMID:28166263
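For readers who want to reproduce the general shape of this pipeline, the following scikit-learn sketch combines unigram features, chi-squared feature selection, a random forest, and macro-averaged scoring. It is a hypothetical reconstruction, not the authors' code; the report texts and ICD-10 codes are placeholders, and k is shrunk to fit the toy data (the study's best subset size was 30).

# Hedged sketch of the pipeline shape; toy data, not the study's corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import precision_score

reports = ["fall from height, skull fracture", "road traffic accident, chest trauma",
           "drowning, water in lungs", "burn injuries, smoke inhalation"]
causes = ["W13", "V89", "W74", "X00"]  # placeholder ICD-10 external-cause codes

clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 1)),   # unigram master feature vector
    SelectKBest(chi2, k=3),                # k=30 in the study; 3 fits this toy data
    RandomForestClassifier(n_estimators=100, random_state=0),
)
clf.fit(reports, causes)
pred = clf.predict(reports)
print(precision_score(causes, pred, average="macro"))  # macro-averaged precision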
Toward an endovascular internal carotid artery classification system.
Shapiro, M; Becske, T; Riina, H A; Raz, E; Zumofen, D; Jafar, J J; Huang, P P; Nelson, P K
2014-02-01
Does the world need another ICA classification scheme? We believe so. The purpose of the proposed angiography-driven classification is to optimize description of the carotid artery from the endovascular perspective. A review of existing, predominantly surgically driven classifications is performed, and a new scheme, based on the study of NYU aneurysm angiographic and cross-sectional databases, is proposed. Seven segments - cervical, petrous, cavernous, paraophthalmic, posterior communicating, choroidal, and terminus - are named. This nomenclature recognizes intrinsic uncertainty in the precise angiographic and cross-sectional localization of aneurysms adjacent to the dural rings, regarding all lesions distal to the cavernous segment as potentially intradural. Rather than subdividing the various transitional, ophthalmic, and hypophyseal aneurysm subtypes, as necessitated by their varied surgical approaches and risks, the proposed classification emphasizes their common endovascular treatment features, while recognizing that many complex, trans-segmental, and fusiform aneurysms that are not readily classifiable into presently available, saccular aneurysm-driven schemes are increasingly being addressed by endovascular means. We believe this classification may find utility in standardizing nomenclature for outcome tracking, treatment trials, and physician communication.
Task-Driven Dynamic Text Summarization
ERIC Educational Resources Information Center
Workman, Terri Elizabeth
2011-01-01
The objective of this work is to examine the efficacy of natural language processing (NLP) in summarizing bibliographic text for multiple purposes. Researchers have noted the accelerating growth of bibliographic databases. Information seekers using traditional information retrieval techniques when searching large bibliographic databases are often…
NASA Astrophysics Data System (ADS)
Brodic, D.
2011-01-01
Text line segmentation represents the key element in the optical character recognition process; hence, testing of text line segmentation algorithms has substantial relevance. Previously proposed testing methods deal mainly with a text database as a template, used both for testing and for evaluating the segmentation algorithm. In this manuscript, a methodology for evaluating text line segmentation algorithms based on extended binary classification is proposed. It is established on various multiline text samples linked with text segmentation, whose results are distributed according to binary classification. The final result is obtained by comparative analysis of the cross-linked data. Its suitability for different types of scripts represents its main advantage.
Event Driven Messaging with Role-Based Subscriptions
NASA Technical Reports Server (NTRS)
Bui, Tung; Bui, Bach; Malhotra, Shantanu; Chen, Fannie; Kim, Rachel; Allen, Christopher; Luong, Ivy; Chang, George; Zendejas, Silvino; Sadaqathulla, Syed
2009-01-01
Event Driven Messaging with Role-Based Subscriptions (EDM-RBS) is a framework integrated into the Service Management Database (SMDB) to allow for role-based and subscription-based delivery of synchronous and asynchronous messages over JMS (Java Messaging Service), SMTP (Simple Mail Transfer Protocol), or SMS (Short Messaging Service). This allows for 24/7 operation with users in all parts of the world. The software classifies messages by triggering data type, application source, owner of the data triggering the event (mission), classification, sub-classification, and various other secondary classifying tags. Messages are routed to applications or users based on subscription rules using a combination of the above message attributes. This program provides a framework for identifying connected users and their applications for targeted delivery of messages over JMS to the client applications the user is logged into. EDM-RBS provides the ability to send notifications over e-mail or pager rather than having to rely on a live human to do so. It is implemented as an Oracle application that uses Oracle relational database management system intrinsic functions. It is configurable to use the Oracle AQ JMS API or an external JMS provider for messaging. It fully integrates into the event-logging framework of SMDB.
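The subscription-matching idea generalizes beyond Oracle. Below is an illustrative Python sketch of routing a classified message to subscribers whose rules match its attributes; the rule fields, roles, and channels are invented, and the real EDM-RBS logic runs inside the SMDB as an Oracle application.

# Illustrative sketch only; attribute names and roles are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Subscription:
    role: str
    channel: str                                # e.g. "jms", "smtp", "sms"
    rules: dict = field(default_factory=dict)   # attribute -> required value

    def matches(self, message: dict) -> bool:
        return all(message.get(k) == v for k, v in self.rules.items())

subscriptions = [
    Subscription("ops_engineer", "sms",  {"classification": "alarm", "mission": "MRO"}),
    Subscription("analyst",      "smtp", {"classification": "report"}),
]

message = {"classification": "alarm", "mission": "MRO", "source": "scheduler"}
for sub in subscriptions:
    if sub.matches(message):
        print(f"deliver via {sub.channel} to role {sub.role}")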
Protein Information Resource: a community resource for expert annotation of protein data
Barker, Winona C.; Garavelli, John S.; Hou, Zhenglin; Huang, Hongzhan; Ledley, Robert S.; McGarvey, Peter B.; Mewes, Hans-Werner; Orcutt, Bruce C.; Pfeiffer, Friedhelm; Tsugita, Akira; Vinayaka, C. R.; Xiao, Chunlin; Yeh, Lai-Su L.; Wu, Cathy
2001-01-01
The Protein Information Resource, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), produces the most comprehensive and expertly annotated protein sequence database in the public domain, the PIR-International Protein Sequence Database. To provide timely and high quality annotation and promote database interoperability, the PIR-International employs rule-based and classification-driven procedures based on controlled vocabulary and standard nomenclature and includes status tags to distinguish experimentally determined from predicted protein features. The database contains about 200 000 non-redundant protein sequences, which are classified into families and superfamilies and their domains and motifs identified. Entries are extensively cross-referenced to other sequence, classification, genome, structure and activity databases. The PIR web site features search engines that use sequence similarity and database annotation to facilitate the analysis and functional identification of proteins. The PIR-International databases and search tools are accessible on the PIR web site at http://pir.georgetown.edu/ and at the MIPS web site at http://www.mips.biochem.mpg.de. The PIR-International Protein Sequence Database and other files are also available by FTP. PMID:11125041
Automated compound classification using a chemical ontology.
Bobach, Claudia; Böhme, Timo; Laube, Ulf; Püschel, Anett; Weber, Lutz
2012-12-29
Classification of chemical compounds into compound classes by using structure-derived descriptors is a well-established method to aid the evaluation and abstraction of compound properties in chemical compound databases. MeSH and, more recently, ChEBI are examples of chemical ontologies that provide a hierarchical classification of compounds into general compound classes of biological interest based on their structural as well as property or use features. In these ontologies, compounds have been assigned manually to their respective classes. However, with the ever-increasing possibilities to extract new compounds from text documents using name-to-structure tools, and considering the large number of compounds deposited in databases, automated and comprehensive chemical classification methods are needed to avoid the error-prone and time-consuming manual classification of compounds. In the present work we implement principles and methods to construct a chemical ontology of classes that supports automated, high-quality compound classification in chemical databases or text documents. While SMARTS expressions have already been used to define chemical structure class concepts, in the present work we have extended the expressive power of such class definitions by expanding their structure-based reasoning logic. Thus, to achieve the required precision and granularity of chemical class definitions, sets of SMARTS class definitions are connected by OR and NOT logical operators. In addition, AND logic has been implemented to allow the concomitant use of flexible atom lists and stereochemistry definitions. The resulting chemical ontology is a multi-hierarchical taxonomy of concept nodes connected by directed, transitive relationships. A rule-based definition of chemical classes is proposed that allows chemical compound classes to be defined more precisely than before. The proposed structure-based reasoning logic allows chemistry expert knowledge to be translated into a computer-interpretable form, preventing erroneous compound assignments and allowing automatic compound classification. The automated assignment of compounds in databases, compound structure files, or text documents to their related ontology classes is possible through integration with a chemical structure search engine. As an application example, the annotation of chemical structure files with a prototypic ontology is demonstrated.
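As a rough illustration of such logically combined structure-based class definitions, the following sketch uses RDKit (not the authors' engine) to test SMARTS patterns joined by OR, NOT, and AND. The "primary aromatic amine, excluding nitro compounds" class is a made-up example.

# Hedged sketch using RDKit; the class definition is invented.
from rdkit import Chem

class_definition = {
    "any_of": ["[NX3H2][c]"],        # OR set: primary amine on an aromatic carbon
    "none_of": ["[N+](=O)[O-]"],     # NOT set: exclude nitro-substituted compounds
}

def in_class(smiles: str, definition: dict) -> bool:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    any_of = any(mol.HasSubstructMatch(Chem.MolFromSmarts(s)) for s in definition["any_of"])
    none_of = not any(mol.HasSubstructMatch(Chem.MolFromSmarts(s)) for s in definition["none_of"])
    return any_of and none_of        # AND connects the two rule sets

print(in_class("Nc1ccccc1", class_definition))                 # aniline -> True
print(in_class("Nc1ccc(cc1)[N+](=O)[O-]", class_definition))   # 4-nitroaniline -> False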
Symbolic rule-based classification of lung cancer stages from free-text pathology reports.
Nguyen, Anthony N; Lawley, Michael J; Hansen, David P; Bowman, Rayleen V; Clarke, Belinda E; Duhig, Edwina E; Colquist, Shoni
2010-01-01
To automatically classify lung tumor-node-metastasis (TNM) cancer stages from free-text pathology reports using symbolic rule-based classification. By exploiting report substructure and the symbolic manipulation of Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT) concepts in reports, statements in free text can be evaluated for relevance against factors relating to the staging guidelines. Post-coordinated SNOMED CT expressions based on templates were defined, populated by concepts in reports, and tested for subsumption by staging factors. The subsumption results were used to build logic according to the staging guidelines to calculate the TNM stage. The accuracy measure and confusion matrices were used to evaluate the TNM stages classified by the symbolic rule-based system. The system was evaluated against a database of multidisciplinary team staging decisions and against a machine learning-based text classification system using support vector machines. Overall accuracies on a corpus of pathology reports for 718 lung cancer patients, measured against a database of pathological TNM staging decisions, were 72%, 78%, and 94% for T, N, and M staging, respectively. The system's performance was also comparable to support vector machine classification approaches. A system to classify lung TNM stages from free-text pathology reports was developed, and it was verified that the symbolic rule-based approach using SNOMED CT can be used for the extraction of key lung cancer characteristics from free-text reports. Future work will investigate the applicability of the proposed methodology to extracting other cancer characteristics and types.
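The guideline logic itself reduces to simple rules once subsumption testing has determined which staging factors a report asserts. The sketch below is a hedged simplification with invented factor names and thresholds, not the actual TNM guideline encoding or the authors' SNOMED CT machinery.

# Simplified, hypothetical staging rules; not the real TNM guideline logic.
def t_stage(factors: set) -> str:
    if "tumor_gt_7cm" in factors or "invades_mediastinum" in factors:
        return "T3_or_higher"
    if "tumor_gt_3cm" in factors:
        return "T2"
    return "T1"

def m_stage(factors: set) -> str:
    return "M1" if "distant_metastasis" in factors else "M0"

report_factors = {"tumor_gt_3cm", "distant_metastasis"}  # output of subsumption tests
print(t_stage(report_factors), m_stage(report_factors))  # T2 M1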
Using statistical text classification to identify health information technology incidents
Chai, Kevin E K; Anthony, Stephen; Coiera, Enrico; Magrabi, Farah
2013-01-01
Objective To examine the feasibility of using statistical text classification to automatically identify health information technology (HIT) incidents in the USA Food and Drug Administration (FDA) Manufacturer and User Facility Device Experience (MAUDE) database. Design We used a subset of 570 272 incidents including 1534 HIT incidents reported to MAUDE between 1 January 2008 and 1 July 2010. Text classifiers using regularized logistic regression were evaluated with both ‘balanced’ (50% HIT) and ‘stratified’ (0.297% HIT) datasets for training, validation, and testing. Dataset preparation, feature extraction, feature selection, cross-validation, classification, performance evaluation, and error analysis were performed iteratively to further improve the classifiers. Feature-selection techniques such as removing short words and stop words, stemming, lemmatization, and principal component analysis were examined. Measurements κ statistic, F1 score, precision and recall. Results Classification performance was similar on both the stratified (0.954 F1 score) and balanced (0.995 F1 score) datasets. Stemming was the most effective technique, reducing the feature set size to 79% while maintaining comparable performance. Training with balanced datasets improved recall (0.989) but reduced precision (0.165). Conclusions Statistical text classification appears to be a feasible method for identifying HIT reports within large databases of incidents. Automated identification should enable more HIT problems to be detected, analyzed, and addressed in a timely manner. Semi-supervised learning may be necessary when applying machine learning to big data analysis of patient safety incidents and requires further investigation. PMID:23666777
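A minimal stand-in for this classifier family, TF-IDF features over stemmed tokens feeding a regularized logistic regression, looks as follows with scikit-learn and NLTK. The incident texts and labels are toys, and class weighting here stands in for the paper's balanced-training design.

# Minimal sketch, assuming scikit-learn and NLTK; toy data only.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

stemmer = PorterStemmer()

def stem_tokens(text):
    return [stemmer.stem(t) for t in text.split()]

reports = ["software froze during infusion pump update",
           "patient fell from bed during transfer",
           "EHR displayed wrong medication list",
           "wheelchair brake failed in hallway"]
is_hit = [1, 0, 1, 0]  # 1 = health IT incident

clf = make_pipeline(
    TfidfVectorizer(tokenizer=stem_tokens, token_pattern=None),
    LogisticRegression(C=1.0, class_weight="balanced"),  # L2-regularized
)
clf.fit(reports, is_hit)
print(clf.predict(["screen froze while charting in EHR"]))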
Transporter taxonomy - a comparison of different transport protein classification schemes.
Viereck, Michael; Gaulton, Anna; Digles, Daniela; Ecker, Gerhard F
2014-06-01
Currently, there are more than 800 well-characterized human membrane transport proteins (including channels and transporters), and it is estimated that about 10% (approx. 2000) of all human genes are related to transport. Membrane transport proteins are of interest as potential drug targets, for drug delivery, and as a cause of side effects and drug–drug interactions. In light of the development of Open PHACTS, which provides an open pharmacological space, we analyzed selected membrane transport protein classification schemes (Transporter Classification Database, ChEMBL, IUPHAR/BPS Guide to Pharmacology, and Gene Ontology) for their ability to serve as a basis for pharmacology-driven protein classification. A comparison of these membrane transport protein classification schemes using a set of clinically relevant transporters as a use case reveals the strengths and weaknesses of the different taxonomy approaches.
Automating document classification for the Immune Epitope Database
Wang, Peng; Morgan, Alexander A; Zhang, Qing; Sette, Alessandro; Peters, Bjoern
2007-01-01
Background The Immune Epitope Database contains information on immune epitopes curated manually from the scientific literature. Like similar projects in other knowledge domains, significant effort is spent on identifying which articles are relevant for this purpose. Results We here report our experience in automating this process using Naïve Bayes classifiers trained on 20,910 abstracts classified by domain experts. Improvements on the basic classifier performance were made by a) utilizing information stored in PubMed beyond the abstract itself, b) applying standard feature selection criteria, and c) extracting domain-specific feature patterns that, for example, identify peptide sequences. We have implemented the classifier into the curation process, determining whether abstracts are clearly relevant, clearly irrelevant, or whether no certain classification can be made, in which case the abstracts are classified manually. Testing this classification scheme on an independent dataset, we achieve 95% sensitivity and specificity in the 51.1% of abstracts that were automatically classified. Conclusion By implementing text classification, we have sped up the reference selection process without sacrificing the sensitivity or specificity of the human expert classification. This study provides both practical recommendations for users of text classification tools and a large dataset which can serve as a benchmark for tool developers. PMID:17655769
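The three-way decision rule described here is easy to express around any probabilistic classifier. The sketch below uses a Naive Bayes model with invented abstracts and thresholds; the actual cutoffs would be tuned to hit the desired sensitivity and specificity.

# Sketch of the triage rule; thresholds and data are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

abstracts = ["T cell epitope mapping of influenza hemagglutinin",
             "crystal structure of a bacterial transporter",
             "HLA-restricted epitopes in dengue infection",
             "soil microbiome survey of rice paddies"]
relevant = [1, 0, 1, 0]

nb = make_pipeline(CountVectorizer(), MultinomialNB())
nb.fit(abstracts, relevant)

def triage(abstract, lo=0.2, hi=0.8):   # cutoffs invented for illustration
    p = nb.predict_proba([abstract])[0, 1]
    if p >= hi:
        return "clearly relevant"
    if p <= lo:
        return "clearly irrelevant"
    return "manual classification"

print(triage("epitope prediction for HIV vaccine design"))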
Bréant, C.; Borst, F.; Campi, D.; Griesser, V.; Momjian, S.
1999-01-01
The use of a controlled vocabulary set in a hospital-wide clinical information system is of crucial importance for many departmental database systems to communicate and exchange information. In the absence of an internationally recognized clinical controlled vocabulary set, a new extension of the International statistical Classification of Diseases (ICD) is proposed. It expands the scope of the standard ICD beyond diagnosis and procedures to clinical terminology. In addition, the common Clinical Findings Dictionary (CFD) further records the definition of clinical entities. The construction of the vocabulary set and the CFD is incremental and manual. Tools have been implemented to facilitate the tasks of defining/maintaining/publishing dictionary versions. The design of database applications in the integrated clinical information system is driven by the CFD which is part of the Medical Questionnaire Designer tool. Several integrated clinical database applications in the field of diabetes and neuro-surgery have been developed at the HUG. PMID:10566451
EUCLID: automatic classification of proteins in functional classes by their database annotations.
Tamames, J; Ouzounis, C; Casari, G; Sander, C; Valencia, A
1998-01-01
A tool is described for the automatic classification of sequences in functional classes using their database annotations. The Euclid system is based on a simple learning procedure from examples provided by human experts. Euclid is freely available for academics at http://www.gredos.cnb.uam.es/EUCLID, with the corresponding dictionaries for the generation of three, eight and 14 functional classes. E-mail: valencia@cnb.uam.es The results of the EUCLID classification of different genomes are available at http://www.sander.ebi.ac.uk/genequiz/. A detailed description of the different applications mentioned in the text is available at http://www.gredos.cnb.uam.es/EUCLID/Full_Paper
Literature classification for semi-automated updating of biological knowledgebases
2013-01-01
Background As the output of biological assays increase in resolution and volume, the body of specialized biological data, such as functional annotations of gene and protein sequences, enables extraction of higher-level knowledge needed for practical application in bioinformatics. Whereas common types of biological data, such as sequence data, are extensively stored in biological databases, functional annotations, such as immunological epitopes, are found primarily in semi-structured formats or free text embedded in primary scientific literature. Results We defined and applied a machine learning approach for literature classification to support updating of TANTIGEN, a knowledgebase of tumor T-cell antigens. Abstracts from PubMed were downloaded and classified as either "relevant" or "irrelevant" for database update. Training and five-fold cross-validation of a k-NN classifier on 310 abstracts yielded classification accuracy of 0.95, thus showing significant value in support of data extraction from the literature. Conclusion We here propose a conceptual framework for semi-automated extraction of epitope data embedded in scientific literature using principles from text mining and machine learning. The addition of such data will aid in the transition of biological databases to knowledgebases. PMID:24564403
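A compact version of the described evaluation, a k-NN text classifier scored with five-fold cross-validation, can be written as follows; the texts and labels are dummies standing in for the 310 curated TANTIGEN abstracts.

# Minimal sketch of the evaluation setup; dummy corpus, not the study's data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

texts = ["tumor antigen epitope identified", "unrelated plant genomics study"] * 10
labels = [1, 0] * 10   # 1 = relevant for database update

knn = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
print(cross_val_score(knn, texts, labels, cv=5).mean())  # mean accuracy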
NASA Astrophysics Data System (ADS)
Ichii, Kazuhito; Ueyama, Masahito; Kondo, Masayuki; Saigusa, Nobuko; Kim, Joon; Alberto, Ma. Carmelita; Ardö, Jonas; Euskirchen, Eugénie S.; Kang, Minseok; Hirano, Takashi; Joiner, Joanna; Kobayashi, Hideki; Marchesini, Luca Belelli; Merbold, Lutz; Miyata, Akira; Saitoh, Taku M.; Takagi, Kentaro; Varlagin, Andrej; Bret-Harte, M. Syndonia; Kitamura, Kenzo; Kosugi, Yoshiko; Kotani, Ayumi; Kumar, Kireet; Li, Sheng-Gong; Machimura, Takashi; Matsuura, Yojiro; Mizoguchi, Yasuko; Ohta, Takeshi; Mukherjee, Sandipan; Yanagi, Yuji; Yasuda, Yukio; Zhang, Yiping; Zhao, Fenghua
2017-04-01
The lack of a standardized database of eddy covariance observations has been an obstacle for data-driven estimation of terrestrial CO2 fluxes in Asia. In this study, we developed such a standardized database from 54 sites drawn from various databases by applying consistent postprocessing for data-driven estimation of gross primary productivity (GPP) and net ecosystem CO2 exchange (NEE). Data-driven estimation was conducted using a machine learning algorithm, support vector regression (SVR), with remote sensing data for the 2000 to 2015 period. Site-level evaluation of the estimated CO2 fluxes shows that although performance varies across vegetation and climate classifications, GPP and NEE at 8-day resolution are reproduced (e.g., r2 = 0.73 and 0.42 for 8-day GPP and NEE). Evaluation of spatially estimated GPP against Global Ozone Monitoring Experiment 2 sensor-based Sun-induced chlorophyll fluorescence shows that monthly GPP variations at subcontinental scale were reproduced by SVR (r2 = 1.00, 0.94, 0.91, and 0.89 for Siberia, East Asia, South Asia, and Southeast Asia, respectively). Evaluation of spatially estimated NEE against net atmosphere-land CO2 fluxes of the Greenhouse Gases Observing Satellite (GOSAT) Level 4A product shows that monthly variations were consistent in Siberia and East Asia, while inconsistency was found in South Asia and Southeast Asia. Furthermore, differences between the land CO2 fluxes from SVR-NEE and GOSAT Level 4A were partially explained by accounting for differences in the definition of land CO2 fluxes. These data-driven estimates can provide a new opportunity to assess CO2 fluxes in Asia and to evaluate and constrain terrestrial ecosystem models.
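As a hedged sketch of the data-driven step, the snippet below fits a support vector regression from remote-sensing predictors to GPP. The predictor names and values are synthetic; the study used site-level satellite products and evaluated at 8-day resolution.

# Synthetic illustration of the SVR upscaling step; not the study's data.
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.random((200, 3))   # stand-ins for e.g. NDVI, land surface temperature, radiation
gpp = 8 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.5, 200)  # synthetic 8-day GPP

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
model.fit(X, gpp)
print(model.predict(X[:3]))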
Jiang, Xiangying; Ringwald, Martin; Blake, Judith; Shatkay, Hagit
2017-01-01
The Gene Expression Database (GXD) is a comprehensive online database within the Mouse Genome Informatics resource, aiming to provide available information about endogenous gene expression during mouse development. The information stems primarily from many thousands of biomedical publications that database curators must go through and read. Given the very large number of biomedical papers published each year, automatic document classification plays an important role in biomedical research. Specifically, an effective and efficient document classifier is needed for supporting the GXD annotation workflow. We present here an effective yet relatively simple classification scheme, which uses readily available tools while employing feature selection, aiming to assist curators in identifying publications relevant to GXD. We examine the performance of our method over a large manually curated dataset, consisting of more than 25 000 PubMed abstracts, of which about half are curated as relevant to GXD while the other half as irrelevant to GXD. In addition to text from title-and-abstract, we also consider image captions, an important information source that we integrate into our method. We apply a captions-based classifier to a subset of about 3300 documents, for which the full text of the curated articles is available. The results demonstrate that our proposed approach is robust and effectively addresses the GXD document classification. Moreover, using information obtained from image captions clearly improves performance, compared to title and abstract alone, affirming the utility of image captions as a substantial evidence source for automatically determining the relevance of biomedical publications to a specific subject area. www.informatics.jax.org. © The Author(s) 2017. Published by Oxford University Press.
Improving imbalanced scientific text classification using sampling strategies and dictionaries.
Borrajo, L; Romero, R; Iglesias, E L; Redondo Marey, C M
2011-09-15
Many real applications have the imbalanced class distribution problem, where one of the classes is represented by a very small number of cases compared to the other classes. Among the systems affected are those related to the retrieval and classification of scientific documentation. Sampling strategies such as oversampling and subsampling are popular in tackling the problem of class imbalance. In this work, we study their effects on three types of classifiers (k-NN, SVM, and Naive Bayes) when they are applied to searches of the PubMed scientific database. Another purpose of this paper is to study the use of dictionaries in the classification of biomedical texts. Experiments are conducted with three different dictionaries (BioCreative, NLPBA, and an ad-hoc subset of the UniProt database named Protein) using the mentioned classifiers and sampling strategies. Best results were obtained with the NLPBA and Protein dictionaries and the SVM classifier using the subsampling balancing technique. These results were compared with those obtained by other authors using the TREC Genomics 2005 public corpus. Copyright 2011 The Author(s). Published by Journal of Integrative Bioinformatics.
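The subsampling strategy that performed best above can be reproduced with the imbalanced-learn package (an assumption; the authors may have used their own implementation). The corpus and labels below are placeholders.

# Sketch of majority-class undersampling before SVM training; toy corpus.
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["relevant protein annotation abstract"] * 5 + ["unrelated biomedical abstract"] * 95
labels = [1] * 5 + [0] * 95   # heavily imbalanced

X = TfidfVectorizer().fit_transform(docs)
X_bal, y_bal = RandomUnderSampler(random_state=0).fit_resample(X, labels)
print(Counter(y_bal))         # classes now equally represented

svm = LinearSVC().fit(X_bal, y_bal)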
NASA Astrophysics Data System (ADS)
Liu, G.; Wu, C.; Li, X.; Song, P.
2013-12-01
The 3D urban geological information system has been a major part of the national urban geological survey project of the China Geological Survey in recent years. Large amounts of multi-source and multi-subject data are to be stored in urban geological databases. Various models and vocabularies have been drafted and applied by industrial companies for urban geological data, and issues such as duplicate and ambiguous definitions of terms and differing coding structures increase the difficulty of information sharing and data integration. To solve this problem, we proposed a national standard-driven information classification and coding method to effectively store and integrate urban geological data, and we applied data dictionary technology to achieve structured and standard data storage. The overall purpose of this work is to set up a common data platform that provides an information sharing service. Research progress is as follows: (1) A unified classification and coding method for multi-source data based on national standards. The underlying national standards include GB 9649-88 for geology and GB/T 13923-2006 for geography. Current industrial models are compared with the national standards to build a mapping table. The attributes of the various urban geological data entity models are reduced to several categories according to their application phases and domains; a logical data model is then set up as a standard format to design data file structures for a relational database. (2) A multi-level data dictionary for enforcing data standardization. Three levels of data dictionary are designed: the model data dictionary manages system database files and eases maintenance of the whole database system; the attribute dictionary organizes the fields used in database tables; and the term and code dictionary provides a standard for the urban information system by adopting appropriate classification and coding methods, while a comprehensive data dictionary manages system operation and security. (3) An extension of the system's data management functions based on the data dictionary. The constrained data-entry function uses the standard term and code dictionary to produce standardized input. The attribute dictionary organizes all the fields of an urban geological information database to ensure consistent use of terms for fields, and the model dictionary is used to automatically generate a database operation interface with standard semantic content via the term and code dictionary. The above method and technology have been applied to the construction of the Fuzhou Urban Geological Information System, South-East China, with satisfactory results.
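The term-and-code dictionary idea in point (3) can be sketched as follows: field values are validated against a controlled vocabulary before storage, so every record uses standard terms and codes. The lithology terms and codes here are invented for illustration.

# Illustrative sketch; the term -> code mapping is hypothetical.
LITHOLOGY_DICT = {
    "silty clay": "SC-01",
    "fine sand":  "FS-02",
    "gravel":     "GR-03",
}

def encode_field(term: str) -> str:
    # Constrain input to the controlled vocabulary; reject anything else.
    try:
        return LITHOLOGY_DICT[term.strip().lower()]
    except KeyError:
        raise ValueError(f"'{term}' is not in the standard dictionary; "
                         "add it through the dictionary-maintenance workflow")

print(encode_field("Fine sand"))   # -> FS-02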
Sahoo, Satya S; Zhang, Guo-Qiang; Lhatoo, Samden D
2013-08-01
The epilepsy community increasingly recognizes the need for a modern classification system that can also be easily integrated with effective informatics tools. The 2010 reports by the United States President's Council of Advisors on Science and Technology (PCAST) identified informatics as a critical resource to improve quality of patient care, drive clinical research, and reduce the cost of health services. An effective informatics infrastructure for epilepsy, underpinned by a formal knowledge model or ontology, can leverage an ever-increasing amount of multimodal data to improve (1) clinical decision support, (2) access to information for patients and their families, (3) ease of data sharing, and (4) accelerated secondary use of clinical data. Modeling the recommendations of the International League Against Epilepsy (ILAE) classification system in the form of an epilepsy domain ontology is essential for consistent use of terminology in a variety of applications, including electronic health records systems and clinical applications. In this review, we discuss the data management issues in epilepsy and explore the benefits of an ontology-driven informatics infrastructure and its role in the adoption of a "data-driven" paradigm in epilepsy research. Wiley Periodicals, Inc. © 2013 International League Against Epilepsy.
Broad phonetic class definition driven by phone confusions
NASA Astrophysics Data System (ADS)
Lopes, Carla; Perdigão, Fernando
2012-12-01
Intermediate representations between the speech signal and phones may be used to improve discrimination among phones that are often confused. These representations are usually found according to broad phonetic classes, which are defined by a phonetician. This article proposes an alternative data-driven method to generate these classes. Phone confusion information from the analysis of the output of a phone recognition system is used to find clusters at high risk of mutual confusion. A metric is defined to compute the distance between phones. The results, using TIMIT data, show that the proposed confusion-driven phone clustering method is an attractive alternative to the approaches based on human knowledge. A hierarchical classification structure to improve phone recognition is also proposed using a discriminative weight training method. Experiments show improvements in phone recognition on the TIMIT database compared to a baseline system.
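A minimal sketch of the confusion-driven clustering idea: turn a phone confusion matrix into a distance and cluster phones hierarchically. The four-phone confusion counts below are invented (TIMIT uses several dozen phones), and the distance definition is one plausible choice, not necessarily the paper's metric.

# Hedged sketch: confusion matrix -> distance -> hierarchical clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

phones = ["p", "b", "s", "z"]
C = np.array([[90, 8, 1, 1],     # rows: reference phone, cols: recognized phone
              [7, 88, 2, 3],
              [1, 2, 85, 12],
              [1, 3, 14, 82]], dtype=float)

P = C / C.sum(axis=1, keepdims=True)   # confusion probabilities per reference phone
sim = (P + P.T) / 2                     # symmetrize mutual confusion
dist = 1 - sim / sim.max()              # more mutual confusion -> smaller distance
np.fill_diagonal(dist, 0)

Z = linkage(squareform(dist, checks=False), method="average")
print(fcluster(Z, t=2, criterion="maxclust"))  # groups {p, b} and {s, z}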
Valkhoff, Vera E; Coloma, Preciosa M; Masclee, Gwen M C; Gini, Rosa; Innocenti, Francesco; Lapi, Francesco; Molokhia, Mariam; Mosseveld, Mees; Nielsson, Malene Schou; Schuemie, Martijn; Thiessard, Frantz; van der Lei, Johan; Sturkenboom, Miriam C J M; Trifirò, Gianluca
2014-08-01
To evaluate the accuracy of disease codes and free text in identifying upper gastrointestinal bleeding (UGIB) from electronic health-care records (EHRs). We conducted a validation study in four European EHR databases, namely Integrated Primary Care Information (IPCI), Health Search/CSD Patient Database (HSD), ARS, and Aarhus, in which we identified UGIB cases using free text or disease codes: (1) International Classification of Diseases (ICD)-9 (HSD, ARS); (2) ICD-10 (Aarhus); and (3) International Classification of Primary Care (ICPC) (IPCI). From each database, we randomly selected and manually reviewed 200 cases to calculate positive predictive values (PPVs). We employed different case definitions to assess the effect of outcome misclassification on estimation of the risk of drug-related UGIB. PPV was 22% [95% confidence interval (CI): 16, 28] for free text and 21% (95% CI: 16, 28) for ICPC codes in IPCI. PPV was 91% (95% CI: 86, 95) for ICD-9 codes and 47% (95% CI: 35, 59) for free text in HSD. PPV was 72% (95% CI: 65, 78) for ICD-9 codes in ARS and 77% (95% CI: 69, 83) for ICD-10 codes in Aarhus. More specific definitions did not have a significant impact on risk estimation of drug-related UGIB, except for wider CIs. ICD-9-CM and ICD-10 disease codes have good PPV in identifying UGIB from EHRs; less granular terminology (ICPC) may require additional strategies. Use of more specific UGIB definitions affects the precision, but not the magnitude, of risk estimates. Copyright © 2014 Elsevier Inc. All rights reserved.
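The validation metric here is straightforward to compute. The sketch below returns a PPV with a normal-approximation 95% CI from a manual review of sampled cases; the counts are illustrative, not the study's raw data.

# Sketch of PPV with a normal-approximation CI; counts are illustrative.
import math

def ppv_ci(true_pos: int, reviewed: int, z: float = 1.96):
    p = true_pos / reviewed
    half = z * math.sqrt(p * (1 - p) / reviewed)
    return p, max(0.0, p - half), min(1.0, p + half)

# e.g. 182 confirmed UGIB cases among 200 manually reviewed -> PPV ~0.91
print(ppv_ci(182, 200))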
Inayat-Hussain, Salmaan H; Fukumura, Masao; Muiz Aziz, A; Jin, Chai Meng; Jin, Low Wei; Garcia-Milian, Rolando; Vasiliou, Vasilis; Deziel, Nicole C
2018-08-01
Recent trends have witnessed the global growth of unconventional oil and gas (UOG) production. Epidemiologic studies have suggested associations between proximity to UOG operations and increased adverse birth outcomes and cancer, though specific potential etiologic agents have not yet been identified. To perform effective risk assessment of chemicals used in UOG production, the first step of hazard identification, followed by prioritization specifically for reproductive toxicity, carcinogenicity, and mutagenicity, is crucial in an evidence-based risk assessment approach. To date, there is no single hazard classification list based on the United Nations Globally Harmonized System (GHS), with countries applying the GHS standards to generate their own chemical hazard classification lists. A current challenge for chemical prioritization, particularly for a multi-national industry, is inconsistent hazard classification, which may result in misjudgment of the potential public health risks. We present a novel approach for hazard identification followed by prioritization of reproductive toxicants found in UOG operations using publicly available regulatory databases. GHS classification for reproductive toxicity of 157 UOG-related chemicals identified as potential reproductive or developmental toxicants in a previous publication was assessed using eleven governmental regulatory agency databases. If there was discordance in classifications across agencies, the most stringent classification was assigned. Chemicals in the category of known or presumed human reproductive toxicants were further evaluated for carcinogenicity and germ cell mutagenicity based on government classifications. A scoring system was utilized to assign numerical values for reproductive health, cancer, and germ cell mutation hazard endpoints. Using a Cytoscape analysis, both qualitative and quantitative results were presented visually to readily identify high-priority UOG chemicals with evidence of multiple adverse effects. We observed substantial inconsistencies in classification among the 11 databases. By adopting the most stringent classification within and across countries, 43 chemicals were classified as known or presumed human reproductive toxicants (GHS Category 1), while 31 chemicals were classified as suspected human reproductive toxicants (GHS Category 2). The 43 reproductive toxicants were further subjected to analysis for carcinogenic and mutagenic properties. Calculated hazard scores and Cytoscape visualization yielded several high-priority chemicals, including potassium dichromate, cadmium, benzene, and ethylene oxide. Our findings reveal diverging GHS classification outcomes for UOG chemicals across regulatory agencies. Adoption of the most stringent classification with application of hazard scores provides a useful approach to prioritize reproductive toxicants in UOG and other industries for exposure assessments and selection of safer alternatives. Copyright © 2018 Elsevier Ltd. All rights reserved.
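The most-stringent-classification rule and the endpoint scoring can be sketched as follows; the stringency ordering, agency entries, and scores are invented placeholders rather than the paper's actual scheme.

# Hypothetical prioritization logic; all values are placeholders.
STRINGENCY = {"1A": 4, "1B": 3, "2": 2, "not classified": 0}

def most_stringent(classifications):
    # Across agencies, keep the most conservative (highest-stringency) call.
    return max(classifications, key=lambda c: STRINGENCY[c])

chemical = {
    "name": "benzene",
    "reprotox_by_agency": ["2", "1B", "not classified"],  # placeholder entries
    "carcinogen_score": 4,
    "mutagen_score": 3,
}

reprotox = most_stringent(chemical["reprotox_by_agency"])
hazard_score = STRINGENCY[reprotox] + chemical["carcinogen_score"] + chemical["mutagen_score"]
print(chemical["name"], reprotox, hazard_score)  # benzene 1B 10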
Visual affective classification by combining visual and text features
Liu, Ningning; Wang, Kai; Jin, Xin; Gao, Boyang; Dellandréa, Emmanuel; Chen, Liming
2017-01-01
Affective analysis of images in social networks has drawn much attention, and the texts surrounding images have proven to provide valuable semantic meaning about image content, which can hardly be represented by low-level visual features. In this paper, we propose a novel approach for the visual affective classification (VAC) task. This approach combines visual representations with novel text features through a fusion scheme based on Dempster-Shafer (D-S) Evidence Theory. Specifically, we not only investigate different types of visual features and fusion methods for VAC, but also propose textual features to effectively capture emotional semantics from the short texts associated with images based on word similarity. Experiments are conducted on three publicly available databases: the International Affective Picture System (IAPS), the Artistic Photos, and the MirFlickr Affect set. The results demonstrate that the proposed approach combining visual and textual features provides promising results for the VAC task. PMID:28850566
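For readers unfamiliar with the fusion step, the following toy implements Dempster's combination rule for two classifiers (visual and text) over a two-class frame; the mass assignments are invented, and this shows only the combination rule, not the paper's full fusion scheme.

# Toy Dempster-Shafer combination over the frame {pos, neg}; masses invented.
def ds_combine(m1: dict, m2: dict) -> dict:
    frame = ["pos", "neg", "either"]   # "either" = the full frame (uncertainty mass)
    def meet(a, b):
        if a == b or b == "either":
            return a
        if a == "either":
            return b
        return None                    # pos and neg intersect in the empty set
    combined, conflict = {h: 0.0 for h in frame}, 0.0
    for a in frame:
        for b in frame:
            inter = meet(a, b)
            if inter is None:
                conflict += m1[a] * m2[b]
            else:
                combined[inter] += m1[a] * m2[b]
    return {h: v / (1 - conflict) for h, v in combined.items()}  # normalize away conflict

visual = {"pos": 0.6, "neg": 0.1, "either": 0.3}   # hypothetical evidence masses
text   = {"pos": 0.5, "neg": 0.3, "either": 0.2}
print(ds_combine(visual, text))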
Developing a theory driven text messaging intervention for addiction care with user driven content.
Muench, Frederick; Weiss, Rebecca A; Kuerbis, Alexis; Morgenstern, Jon
2013-03-01
The number of text messaging interventions designed to initiate and support behavioral health changes has been steadily increasing over the past 5 years. Messaging interventions can be tailored and adapted to an individual's needs in their natural environment, fostering just-in-time therapies and making them a logical intervention for addiction continuing care. This study assessed the acceptability of using text messaging for substance abuse continuing care and the intervention preferences of individuals in substance abuse treatment in order to develop an interactive mobile text messaging intervention. Fifty individuals enrolled in intensive outpatient substance abuse treatment completed an assessment battery relating to the preferred logistics of mobile interventions, behavior change strategies, and the types of messages they thought would be most helpful to them at different time points. Results indicated that 98% of participants were potentially interested in using text messaging as a continuing care strategy. Participants wrote different types of messages that they perceived might be most helpful, based on various hypothetical situations often encountered during the recovery process. Although individuals tended to prefer benefit-driven over consequence-driven messages, differences in the perceived benefits of change among individuals predicted message preference. Implications for the development of mobile messaging interventions for the addictions are discussed. (PsycINFO Database Record (c) 2013 APA, all rights reserved).
A computer-based information system for epilepsy and electroencephalography.
Finnerup, N B; Fuglsang-Frederiksen, A; Røssel, P; Jennum, P
1999-08-01
This paper describes a standardised computer-based information system for electroencephalography (EEG), focusing on epilepsy. The system was developed using a prototyping approach. It is based on international recommendations for EEG examination, interpretation and terminology, international guidelines for epidemiological studies on epilepsy, the classification of epileptic seizures and syndromes, and the international classification of diseases. It is divided into: (1) clinical information and epilepsy-relevant data; and (2) EEG data, which are hierarchically structured and include the description and interpretation of the EEG. Data are coded but supplemented with unrestricted text. The resulting patient database can be integrated with other clinical databases and with the patient record system, and may facilitate clinical and epidemiological research and the development of standards and guidelines for EEG description and interpretation. The system is currently used for teleconsultation between Gentofte and Lisbon.
Universal fragment descriptors for predicting properties of inorganic crystals
NASA Astrophysics Data System (ADS)
Isayev, Olexandr; Oses, Corey; Toher, Cormac; Gossett, Eric; Curtarolo, Stefano; Tropsha, Alexander
2017-06-01
Although historically materials discovery has been driven by a laborious trial-and-error process, knowledge-driven materials design can now be enabled by the rational combination of Machine Learning methods and materials databases. Here, data from the AFLOW repository for ab initio calculations is combined with Quantitative Materials Structure-Property Relationship models to predict important properties: metal/insulator classification, band gap energy, bulk/shear moduli, Debye temperature and heat capacities. The prediction's accuracy compares well with the quality of the training data for virtually any stoichiometric inorganic crystalline material, reciprocating the available thermomechanical experimental data. The universality of the approach is attributed to the construction of the descriptors: Property-Labelled Materials Fragments. The representations require only minimal structural input allowing straightforward implementations of simple heuristic design rules.
A practical guide to big data research in psychology.
Chen, Eric Evan; Wojcik, Sean P
2016-12-01
The massive volume of data that now covers a wide variety of human behaviors offers researchers in psychology an unprecedented opportunity to conduct innovative theory- and data-driven field research. This article is a practical guide to conducting big data research, covering data management, acquisition, processing, and analytics (including key supervised and unsupervised learning data mining methods). It is accompanied by walkthrough tutorials on data acquisition, text analysis with latent Dirichlet allocation topic modeling, and classification with support vector machines. Big data practitioners in academia, industry, and the community have built a comprehensive base of tools and knowledge that makes big data research accessible to researchers in a broad range of fields. However, big data research does require knowledge of software programming and a different analytical mindset. For those willing to acquire the requisite skills, innovative analyses of unexpected or previously untapped data sources can offer fresh ways to develop, test, and extend theories. When conducted with care and respect, big data research can become an essential complement to traditional research. (PsycINFO Database Record (c) 2016 APA, all rights reserved).
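In the spirit of the article's walkthroughs, a compact LDA topic-modeling example in scikit-learn follows (the article's own tutorials may use different tools and data):

# Compact LDA example; toy corpus, not the article's tutorial data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["stocks markets trading economy",
        "election votes policy government",
        "markets economy inflation prices",
        "government policy debate election"]

vec = CountVectorizer()
X = vec.fit_transform(docs)
vocab = vec.get_feature_names_out()

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-3:][::-1]          # three highest-weight words per topic
    print(f"topic {k}:", [vocab[i] for i in top])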
High-Reproducibility and High-Accuracy Method for Automated Topic Classification
NASA Astrophysics Data System (ADS)
Lancichinetti, Andrea; Sirer, M. Irmak; Wang, Jane X.; Acuna, Daniel; Körding, Konrad; Amaral, Luís A. Nunes
2015-01-01
Much of human knowledge sits in large databases of unstructured text. Leveraging this knowledge requires algorithms that extract and record metadata on unstructured text documents. Assigning topics to documents will enable intelligent searching, statistical characterization, and meaningful classification. Latent Dirichlet allocation (LDA) is the state of the art in topic modeling. Here, we perform a systematic theoretical and numerical analysis that demonstrates that current optimization techniques for LDA often yield results that are not accurate in inferring the most suitable model parameters. Adapting approaches from community detection in networks, we propose a new algorithm that displays high reproducibility and high accuracy and also has high computational efficiency. We apply it to a large set of documents in the English Wikipedia and reveal its hierarchical structure.
Racicki, Stephanie; Gerwin, Sarah; Diclaudio, Stacy; Reinmann, Samuel; Donaldson, Megan
2013-05-01
The purpose of this systematic review was to assess the effectiveness of conservative physical therapy management of cervicogenic headache (CGH). CGH affects 22-25% of the adult population, with women affected four times more often than men. CGHs are thought to arise from musculoskeletal impairments in the neck, with symptoms most commonly consisting of suboccipital neck pain, dizziness, and lightheadedness. Currently, both invasive and non-invasive techniques are available to address these symptoms; however, the efficacy of non-invasive treatment techniques has yet to be established. Computerized searches of CINAHL, ProQuest, PubMed, MEDLINE, and SportDiscus were performed to obtain a qualitative analysis of the literature. Inclusion criteria were: randomized controlled trial design, a population diagnosed with CGH using the International Headache Society classification, at least one baseline measurement and one outcome measure, and assessment of a conservative technique. The Physiotherapy Evidence Database (PEDro) scale was utilized for quality assessment. One computerized database search and two hand searches yielded six articles. All six included randomized controlled trials were considered to be of 'good quality' on the PEDro scale. The interventions utilized were: therapist-driven cervical manipulation and mobilization, self-applied cervical mobilization, cervico-scapular strengthening, and therapist-driven cervical and thoracic manipulation. With the exception of one study, all reported reductions in pain and disability, as well as improvement in function. Calculated effect sizes allowed comparison of intervention groups between studies. A combination of therapist-driven cervical manipulation and mobilization with cervico-scapular strengthening was most effective for decreasing pain outcomes in those with CGH.
Gradishar, William; Johnson, KariAnne; Brown, Krystal; Mundt, Erin; Manley, Susan
2017-07-01
There is a growing move to consult public databases following receipt of a genetic test result from a clinical laboratory; however, the well-documented limitations of these databases call into question how often clinicians will encounter discordant variant classifications that may introduce uncertainty into patient management. Here, we evaluate discordance in BRCA1 and BRCA2 variant classifications between a single commercial testing laboratory and a public database commonly consulted in clinical practice. BRCA1 and BRCA2 variant classifications were obtained from ClinVar and compared with the classifications from a reference laboratory. Full concordance and discordance were determined for variants whose ClinVar entries were of the same pathogenicity (pathogenic, benign, or uncertain). Variants with conflicting ClinVar classifications were considered partially concordant if ≥1 of the listed classifications agreed with the reference laboratory classification. Four thousand two hundred and fifty unique BRCA1 and BRCA2 variants were available for analysis. Overall, 73.2% of classifications were fully concordant and 12.3% were partially concordant. The remaining 14.5% of variants had discordant classifications, most of which had a definitive classification (pathogenic or benign) from the reference laboratory compared with an uncertain classification in ClinVar (14.0%). Here, we show that discrepant classifications between a public database and single reference laboratory potentially account for 26.7% of variants in BRCA1 and BRCA2. The time and expertise required of clinicians to research these discordant classifications call into question the practicality of checking all test results against a database and suggest that discordant classifications should be interpreted with these limitations in mind. With the increasing use of clinical genetic testing for hereditary cancer risk, accurate variant classification is vital to ensuring appropriate medical management. There is a growing move to consult public databases following receipt of a genetic test result from a clinical laboratory; however, we show that up to 26.7% of variants in BRCA1 and BRCA2 have discordant classifications between ClinVar and a reference laboratory. The findings presented in this paper serve as a note of caution regarding the utility of database consultation. © AlphaMed Press 2017.
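To make the concordance definitions above concrete, here is a small sketch applying them to invented variant entries; the variant names and classifications are placeholders, not data from the study.

    # Full / partial / discordant concordance, per the definitions above.
    clinvar = {      # variant -> list of ClinVar classifications (may conflict)
        "BRCA1:c.68_69delAG": ["pathogenic"],
        "BRCA2:c.100G>A": ["uncertain", "benign"],
        "BRCA1:c.5096G>A": ["uncertain"],
    }
    reference = {    # variant -> single reference-laboratory classification
        "BRCA1:c.68_69delAG": "pathogenic",
        "BRCA2:c.100G>A": "benign",
        "BRCA1:c.5096G>A": "pathogenic",
    }

    def concordance(variant):
        ref, cv = reference[variant], clinvar[variant]
        if all(c == ref for c in cv):
            return "full"
        if ref in cv:
            return "partial"    # >=1 conflicting ClinVar entry agrees
        return "discordant"

    for v in reference:
        print(v, "->", concordance(v))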
LMSD: LIPID MAPS structure database
Sud, Manish; Fahy, Eoin; Cotter, Dawn; Brown, Alex; Dennis, Edward A.; Glass, Christopher K.; Merrill, Alfred H.; Murphy, Robert C.; Raetz, Christian R. H.; Russell, David W.; Subramaniam, Shankar
2007-01-01
The LIPID MAPS Structure Database (LMSD) is a relational database encompassing structures and annotations of biologically relevant lipids. Structures of lipids in the database come from four sources: (i) LIPID MAPS Consortium's core laboratories and partners; (ii) lipids identified by LIPID MAPS experiments; (iii) computationally generated structures for appropriate lipid classes; (iv) biologically relevant lipids manually curated from LIPID BANK, LIPIDAT and other public sources. All the lipid structures in LMSD are drawn in a consistent fashion. In addition to a classification-based retrieval of lipids, users can search LMSD using either text-based or structure-based search options. The text-based search implementation supports data retrieval by any combination of these data fields: LIPID MAPS ID, systematic or common name, mass, formula, category, main class, and subclass. The structure-based search, in conjunction with optional data fields, provides the capability to perform a substructure search or exact match for the structure drawn by the user. Search results, in addition to structure and annotations, also include relevant links to external databases. The LMSD is publicly available at PMID:17098933
Sahoo, Satya S.; Zhang, Guo-Qiang; Lhatoo, Samden D.
2013-01-01
The epilepsy community increasingly recognizes the need for a modern classification system that can also be easily integrated with effective informatics tools. The 2010 reports by the United States President's Council of Advisors on Science and Technology (PCAST) identified informatics as a critical resource to improve quality of patient care, drive clinical research, and reduce the cost of health services. An effective informatics infrastructure for epilepsy, which is underpinned by a formal knowledge model or ontology, can leverage an ever-increasing amount of multimodal data to (1) improve clinical decision support, (2) improve access to information for patients and their families, (3) ease data sharing, and (4) accelerate secondary use of clinical data. Modeling the recommendations of the International League Against Epilepsy (ILAE) classification system in the form of an epilepsy domain ontology is essential for consistent use of terminology in a variety of applications, including electronic health record systems and clinical applications. In this review, we discuss the data management issues in epilepsy and explore the benefits of an ontology-driven informatics infrastructure and its role in the adoption of a “data-driven” paradigm in epilepsy research. PMID:23647220
A systematic literature review of automated clinical coding and classification systems
Stanfill, Mary H; Williams, Margaret; Fenton, Susan H; Jenders, Robert A; Hersh, William R
2010-01-01
Clinical coding and classification processes transform natural language descriptions in clinical text into data that can subsequently be used for clinical care, research, and other purposes. This systematic literature review examined studies that evaluated all types of automated coding and classification systems to determine the performance of such systems. Studies indexed in Medline or other relevant databases prior to March 2009 were considered. The 113 studies included in this review show that automated tools exist for a variety of coding and classification purposes, focus on various healthcare specialties, and handle a wide variety of clinical document types. Automated coding and classification systems themselves are not generalizable, nor are the results of the studies evaluating them. Published research shows these systems hold promise, but these data must be considered in context, with performance relative to the complexity of the task and the desired outcome. PMID:20962126
Service Management Database for DSN Equipment
NASA Technical Reports Server (NTRS)
Zendejas, Silvino; Bui, Tung; Bui, Bach; Malhotra, Shantanu; Chen, Fannie; Wolgast, Paul; Allen, Christopher; Luong, Ivy; Chang, George; Sadaqathulla, Syed
2009-01-01
This data- and event-driven persistent storage system leverages commercial software provided by Oracle for portability, ease of maintenance, scalability, and ease of integration with embedded, client-server, and multi-tiered applications. In this role, the Service Management Database (SMDB) is a key component of the overall end-to-end process involved in the scheduling, preparation, and configuration of the Deep Space Network (DSN) equipment needed to perform the various telecommunication services the DSN provides to its customers worldwide. SMDB makes efficient use of triggers, stored procedures, queuing functions, e-mail capabilities, data management, and Java integration features provided by the Oracle relational database management system. SMDB uses a third normal form schema design that allows for simple data maintenance procedures and thin layers of integration with client applications. The software provides an integrated event logging system with the ability to publish events to a JMS messaging system for synchronous and asynchronous delivery to subscribed applications. It provides a structured classification of events and application-level messages stored in database tables that are accessible by monitoring applications for real-time monitoring or for troubleshooting and analysis over historical archives.
Vishnyakova, Dina; Pasche, Emilie; Ruch, Patrick
2012-01-01
We report on the original integration of an automatic text categorization pipeline, ToxiCat (Toxicogenomic Categorizer), which we developed to perform biomedical document classification and prioritization in order to speed up the curation of the Comparative Toxicogenomics Database (CTD). The task is essentially a binary classification task, in which a scoring function is used to rank a selected set of articles; components of a question-answering system are then used to extract CTD-specific annotations from the ranked list of articles. The ranking function is generated using a Support Vector Machine, which combines three main modules: an information retrieval engine for MEDLINE (EAGLi), a gene normalization service (NormaGene) developed for a previous BioCreative campaign, and a set of question-answering components and entity recognizers for diseases and chemicals. The main components of the pipeline are publicly available both as a web application and as web services. The specific integration performed for the BioCreative competition is available via a web user interface at http://pingu.unige.ch:8080/Toxicat.
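A minimal sketch of the core ranking idea follows, assuming TF-IDF features and a linear SVM whose decision scores order documents for curation; the documents, labels and resulting scores are invented placeholders, not CTD data.

    # Rank new abstracts for curation by SVM decision score (toy example).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    train_docs = ["chemical X induces liver toxicity", "weather report for tuesday"]
    train_labels = [1, 0]        # 1 = relevant for curation, 0 = not relevant
    new_docs = ["gene Y expression altered by chemical Z", "stock market update"]

    vec = TfidfVectorizer()
    clf = LinearSVC().fit(vec.fit_transform(train_docs), train_labels)
    scores = clf.decision_function(vec.transform(new_docs))
    for score, doc in sorted(zip(scores, new_docs), reverse=True):
        print(f"{score:+.2f}  {doc}")   # curate the top-ranked articles first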
Page layout analysis and classification for complex scanned documents
NASA Astrophysics Data System (ADS)
Erkilinc, M. Sezer; Jaber, Mustafa; Saber, Eli; Bauer, Peter; Depalov, Dejan
2011-09-01
A framework for region/zone classification in color and gray-scale scanned documents is proposed in this paper. The algorithm includes modules for extracting text, photo, and strong edge/line regions. First, a text detection module based on wavelet analysis and a Run Length Encoding (RLE) technique is employed. Local and global energy maps in high-frequency bands of the wavelet domain are generated and used as initial text maps; further analysis using RLE yields a final text map. The second module detects image/photo and pictorial regions in the input document. A block-based classifier using basis vector projections is employed to identify candidate photo regions, and a final photo map is obtained by applying a Markov random field (MRF) model with maximum a posteriori (MAP) optimization via iterated conditional modes (ICM). The final module detects lines and strong edges using the Hough transform and edge-linkage analysis, respectively. The text, photo, and strong edge/line maps are combined to generate a page layout classification of the scanned target document. Experimental results and objective evaluation show that the proposed technique performs very effectively on a variety of simple and complex scanned document types from the MediaTeam Oulu document database. The proposed page layout classifier can be used in systems for efficient document storage, content-based document retrieval, optical character recognition, mobile phone imagery, and augmented reality.
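The wavelet step can be illustrated in a few lines with PyWavelets: high-frequency subband energy tends to be large over textured (often text) regions. The page is random stand-in data, and the crude percentile threshold replaces the paper's energy-map analysis and RLE refinement.

    # Sketch: one-level 2-D wavelet transform -> high-frequency energy map.
    import numpy as np
    import pywt

    page = np.random.rand(256, 256)                # stand-in for a scanned page
    cA, (cH, cV, cD) = pywt.dwt2(page, "haar")     # approximation + detail bands
    energy = cH**2 + cV**2 + cD**2                 # local high-frequency energy
    text_map = energy > np.percentile(energy, 90)  # crude initial text mask
    print("candidate text pixels:", int(text_map.sum()))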
River reach classification for the Greater Mekong Region at high spatial resolution
NASA Astrophysics Data System (ADS)
Ouellet Dallaire, C.; Lehner, B.
2014-12-01
River classifications have been used in river health and ecological assessments as coarse proxies for aquatic biodiversity when comprehensive biological and/or species data are unavailable. Currently, no river classifications or biological data are available in a consistent format for the extent of the Greater Mekong Region (GMR; including the Irrawaddy, the Salween, the Chao Praya, the Mekong and the Red River basins). The current project proposes a new river habitat classification for the region, facilitated by the HydroSHEDS (HYDROlogical SHuttle Elevation Derivatives at multiple Scales) database at 500 m pixel resolution. The classification project is based on the Global River Classification framework, which relies on the creation of multiple sub-classifications based on different disciplines; the resulting classes from the sub-classifications are later combined into final classes to create a holistic river reach classification. For the GMR, a final habitat classification was created based on three sub-classifications: a hydrological sub-classification based only on discharge indices (river size and flow variability); a physio-climatic sub-classification based on large-scale indices of climate and elevation (biomes, ecoregions and elevation); and a geomorphological sub-classification based on local morphology (presence of floodplains, reach gradient and sand transport). Key variables and thresholds were identified in collaboration with local experts to ensure that regional knowledge was included. The final classification is composed of 54 unique final classes based on 3 sub-classifications with fewer than 15 classes each. The resulting classifications are driven by abiotic variables and do not include biological data, but they represent a state-of-the-art product based on the best available data (mostly global data). The most common river habitat type is the "dry broadleaf, low gradient, very small river". These classifications could be applied in a wide range of hydro-ecological assessments and would be useful for a variety of stakeholders such as NGOs, governments and researchers.
Just-in-time Database-Driven Web Applications
2003-01-01
"Just-in-time" database-driven Web applications are inexpensive, quickly-developed software that can be put to many uses within a health care organization. Database-driven Web applications garnered 73873 hits on our system-wide intranet in 2002. They enabled collaboration and communication via user-friendly Web browser-based interfaces for both mission-critical and patient-care-critical functions. Nineteen database-driven Web applications were developed. The application categories that comprised 80% of the hits were results reporting (27%), graduate medical education (26%), research (20%), and bed availability (8%). The mean number of hits per application was 3888 (SD = 5598; range, 14-19879). A model is described for just-in-time database-driven Web application development and an example given with a popular HTML editor and database program. PMID:14517109
The immune epitope database: a historical retrospective of the first decade.
Salimi, Nima; Fleri, Ward; Peters, Bjoern; Sette, Alessandro
2012-10-01
As the amount of biomedical information available in the literature continues to increase, databases that aggregate this information continue to grow in importance and scope. The population of databases can occur either through fully automated text mining approaches or through manual curation by human subject experts. We here report our experiences in populating the National Institute of Allergy and Infectious Diseases sponsored Immune Epitope Database and Analysis Resource (IEDB, http://iedb.org), which was created in 2003, and as of 2012 captures the epitope information from approximately 99% of all papers published to date that describe immune epitopes (with the exception of cancer and HIV data). This was achieved using a hybrid model based on automated document categorization and extensive human expert involvement. This task required automated scanning of over 22 million PubMed abstracts followed by classification and curation of over 13 000 references, including over 7000 infectious disease-related manuscripts, over 1000 allergy-related manuscripts, roughly 4000 related to autoimmunity, and 1000 transplant/alloantigen-related manuscripts. The IEDB curation involves an unprecedented level of detail, capturing for each paper the actual experiments performed for each different epitope structure. Key to enabling this process was the extensive use of ontologies to ensure rigorous and consistent data representation as well as interoperability with other bioinformatics resources, including the Protein Data Bank, Chemical Entities of Biological Interest, and the NIAID Bioinformatics Resource Centers. A growing fraction of the IEDB data derives from direct submissions by research groups engaged in epitope discovery, and is being facilitated by the implementation of novel data submission tools. The present explosion of information contained in biological databases demands effective query and display capabilities to optimize the user experience. Accordingly, the development of original ways to query the database, on the basis of ontologically driven hierarchical trees, and display of epitope data in aggregate in a biologically intuitive yet rigorous fashion is now at the forefront of the IEDB efforts. We also highlight advances made in the realm of epitope analysis and predictive tools available in the IEDB. © 2012 The Authors. Immunology © 2012 Blackwell Publishing Ltd.
Feature generation and representations for protein-protein interaction classification.
Lan, Man; Tan, Chew Lim; Su, Jian
2009-10-01
Automatically detecting protein-protein interaction (PPI)-relevant articles is a crucial step for large-scale biological database curation. Previous work adopted POS tagging, shallow parsing and sentence-splitting techniques, but these achieved worse performance than a simple bag-of-words representation. In this paper, we generated and investigated multiple types of feature representations in order to further improve the performance of the PPI text classification task. Besides the traditional domain-independent bag-of-words approach and term weighting methods, we also explored other domain-dependent features, i.e. protein-protein interaction trigger keywords, protein named entities and advanced ways of incorporating Natural Language Processing (NLP) output. The integration of these multiple features was evaluated on the BioCreAtIvE II corpus. The experimental results showed that both the advanced use of NLP output and the integration of bag-of-words and NLP output improved the performance of text classification. Specifically, in comparison with the best performance achieved in the BioCreAtIvE II IAS, the feature-level and classifier-level integration of multiple features improved classification performance by 2.71% and 3.95%, respectively.
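Feature-level integration of this kind can be sketched in scikit-learn by concatenating a bag-of-words representation with counts of hand-picked trigger keywords; the trigger list, documents and labels below are invented, and the sketch is schematic rather than the authors' exact feature set.

    # Combine bag-of-words features with domain trigger-keyword counts.
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    from sklearn.svm import LinearSVC

    triggers = ["interacts", "binds", "phosphorylates", "complex"]
    features = FeatureUnion([
        ("bow", TfidfVectorizer()),
        ("triggers", CountVectorizer(vocabulary=triggers)),  # fixed keyword set
    ])
    model = Pipeline([("features", features), ("clf", LinearSVC())])

    docs = ["protein A binds protein B", "the weather was sunny"]
    model.fit(docs, [1, 0])
    print(model.predict(["kinase C phosphorylates substrate D"]))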
Ghazizadeh, Mahtab; McDonald, Anthony D; Lee, John D
2014-09-01
This study applies text mining to extract clusters of vehicle problems and associated trends from free-response data in the National Highway Traffic Safety Administration's vehicle owner's complaint database. As the automotive industry adopts new technologies, it is important to systematically assess the effect of these changes on traffic safety. Driving simulators, naturalistic driving data, and crash databases all contribute to a better understanding of how drivers respond to changing vehicle technology, but other approaches, such as automated analysis of incident reports, are needed. Free-response data from incidents representing two severity levels (fatal incidents and incidents involving injury) were analyzed using a text mining approach: latent semantic analysis (LSA). LSA and hierarchical clustering identified clusters of complaints for each severity level, which were compared and analyzed across time. Cluster analysis identified eight clusters of fatal incidents and six clusters of incidents involving injury. Comparisons showed that although the airbag clusters across the two severity levels have the same most frequent terms, the circumstances around the incidents differ. The time trends show clear increases in complaints surrounding the Ford/Firestone tire recall and the Toyota unintended acceleration recall. Increases in complaints may be partially driven by these recall announcements and the associated media attention. Text mining can reveal useful information from free-response databases that would otherwise be prohibitively time-consuming and difficult to summarize manually. Text mining can extend human analysis capabilities for large free-response databases to support earlier detection of problems and more timely safety interventions.
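A compact sketch of the LSA-plus-clustering pipeline on toy complaint texts, assuming TF-IDF input, truncated SVD for the latent semantic space, and agglomerative clustering; component and cluster counts are arbitrary here.

    # TF-IDF -> truncated SVD (LSA) -> hierarchical clustering of complaints.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import AgglomerativeClustering

    complaints = [
        "airbag failed to deploy in crash",
        "airbag warning light stays on",
        "vehicle accelerated unintentionally",
        "sudden acceleration while braking",
    ]
    X = TfidfVectorizer().fit_transform(complaints)
    Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
    labels = AgglomerativeClustering(n_clusters=2).fit_predict(Z)
    for text, label in zip(complaints, labels):
        print(label, text)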
Hancock, Matthew C; Magnan, Jerry F
2016-10-01
In the assessment of nodules in CT scans of the lungs, a number of image-derived features are diagnostically relevant. Currently, many of these features are defined only qualitatively, so they are difficult to quantify from first principles. Nevertheless, these features (through their qualitative definitions and interpretations thereof) are often quantified via a variety of mathematical methods for the purpose of computer-aided diagnosis (CAD). To determine the potential usefulness of quantified diagnostic image features as inputs to a CAD system, we investigate the predictive capability of statistical learning methods for classifying nodule malignancy. We utilize the Lung Image Database Consortium dataset and only employ the radiologist-assigned diagnostic feature values for the lung nodules therein, as well as our derived estimates of the diameter and volume of the nodules from the radiologists' annotations. We calculate theoretical upper bounds on the classification accuracy that are achievable by an ideal classifier that only uses the radiologist-assigned feature values, and we obtain an accuracy of 85.74%, which is, on average, 4.43% below the theoretical maximum of 90.17%. The corresponding area-under-the-curve (AUC) score is 0.932, which increases to 0.949 when diameter and volume features are included, with a corresponding accuracy of 88.08%. Our results are comparable to those in the literature that use algorithmically derived image-based features, which supports our hypothesis that lung nodules can be classified as malignant or benign using only quantified, diagnostic image features, and indicates the competitiveness of this approach. We also analyze how the classification accuracy depends on specific features and feature subsets, and we rank the features according to their predictive power, statistically demonstrating the top four to be spiculation, lobulation, subtlety, and calcification.
Logo detection and classification in a sport video: video indexing for sponsorship revenue control
NASA Astrophysics Data System (ADS)
Kovar, Bohumil; Hanjalic, Alan
2001-12-01
This paper presents a novel approach to detecting and classifying a trademark logo in frames of a sport video. Because we attempt to detect and recognize a logo in a natural scene, the algorithm developed in this paper differs from traditional techniques for logo detection and classification, which are applicable either to well-structured general text documents (e.g. invoices, memos, bank cheques) or to specialized trademark logo databases, where logos appear isolated on a clear background and where their detection and classification are not disturbed by surrounding visual detail. Although the development of our algorithm is still in its early phase, experimental results obtained so far on a set of soccer TV broadcasts are very encouraging.
SFINX-a drug-drug interaction database designed for clinical decision support systems.
Böttiger, Ylva; Laine, Kari; Andersson, Marine L; Korhonen, Tuomas; Molin, Björn; Ovesjö, Marie-Louise; Tirkkonen, Tuire; Rane, Anders; Gustafsson, Lars L; Eiermann, Birgit
2009-06-01
The aim was to develop a drug-drug interaction database (SFINX) to be integrated into decision support systems or to be used in website solutions for clinical evaluation of interactions. Key elements such as substance properties and names, drug formulations, text structures and references were defined before development of the database. Standard operating procedures for literature searches, text writing rules and a classification system for clinical relevance and documentation level were determined. ATC codes, CAS numbers and country-specific codes for substances were identified and quality assured to ensure safe integration of SFINX into other data systems. Much effort was put into giving short and practical advice regarding clinically relevant drug-drug interactions. SFINX includes over 8,000 interaction pairs and is integrated into Swedish and Finnish computerised decision support systems. Over 31,000 physicians and pharmacists are receiving interaction alerts through SFINX. User feedback is collected for continuous improvement of the content. SFINX is a potentially valuable tool delivering instant information on drug interactions during prescribing and dispensing.
Vail, Paris J; Morris, Brian; van Kan, Aric; Burdett, Brianna C; Moyes, Kelsey; Theisen, Aaron; Kerr, Iain D; Wenstrup, Richard J; Eggington, Julie M
2015-10-01
Genetic variants of uncertain clinical significance (VUSs) are a common outcome of clinical genetic testing. Locus-specific variant databases (LSDBs) have been established for numerous disease-associated genes as research tools that facilitate the interpretation of genetic sequence variants via aggregated data. If LSDBs are to be used for clinical practice, consistent and transparent criteria regarding the deposition and interpretation of variants are vital, as variant classifications are often used to make important and irreversible clinical decisions. In this study, we performed a retrospective analysis of 2,017 consecutive BRCA1 and BRCA2 genetic variants identified from 24,650 consecutive patient samples referred to our laboratory, to establish an unbiased dataset representative of the types of variants seen in the US patient population, submitted by clinicians and researchers for BRCA1 and BRCA2 testing. We compared the clinical classifications of these variants among five publicly accessible BRCA1 and BRCA2 variant databases: BIC, ClinVar, HGMD (paid version), LOVD, and the UMD databases. Our results show substantial disparity of variant classifications among publicly accessible databases. Furthermore, it appears that discrepant classifications are not the result of a single outlier but of widespread disagreement among databases. This study also shows that databases sometimes favor a clinical classification when current best practice guidelines (ACMG/AMP/CAP) would suggest an uncertain classification. Although LSDBs have been well established for research applications, our results suggest several challenges preclude their wider use in clinical practice.
Schomburg, Ida; Chang, Antje; Placzek, Sandra; Söhngen, Carola; Rother, Michael; Lang, Maren; Munaretto, Cornelia; Ulas, Susanne; Stelzer, Michael; Grote, Andreas; Scheer, Maurice; Schomburg, Dietmar
2013-01-01
The BRENDA (BRaunschweig ENzyme DAtabase) enzyme portal (http://www.brenda-enzymes.org) is the main information system of functional biochemical and molecular enzyme data and provides access to seven interconnected databases. BRENDA contains 2.7 million manually annotated data on enzyme occurrence, function, kinetics and molecular properties. Each entry is connected to a reference and the source organism. Enzyme ligands are stored with their structures and can be accessed via their names, synonyms or via a structure search. FRENDA (Full Reference ENzyme DAta) and AMENDA (Automatic Mining of ENzyme DAta) are based on text mining methods and represent a complete survey of PubMed abstracts with information on enzymes in different organisms, tissues or organelles. The supplemental database DRENDA provides more than 910 000 new EC number-disease relations in more than 510 000 references from automatic search and a classification of enzyme-disease-related information. KENDA (Kinetic ENzyme DAta), a new amendment extracts and displays kinetic values from PubMed abstracts. The integration of the EnzymeDetector offers an automatic comparison, evaluation and prediction of enzyme function annotations for prokaryotic genomes. The biochemical reaction database BKM-react contains non-redundant enzyme-catalysed and spontaneous reactions and was developed to facilitate and accelerate the construction of biochemical models.
Chan, Vincy; Thurairajah, Pravheen; Colantonio, Angela
2013-11-13
Although healthcare administrative data are commonly used for traumatic brain injury research, there is currently no consensus or consistency on using International Classification of Diseases version 10 codes to define traumatic brain injury among children and youth. This protocol is for a systematic review of the literature to explore the range of International Classification of Diseases version 10 codes that are used to define traumatic brain injury in this population. The databases MEDLINE, MEDLINE In-Process, Embase, PsycINFO, CINAHL, SPORTDiscus, and the Cochrane Database of Systematic Reviews will be systematically searched. Grey literature will be searched using Grey Matters and Google. Reference lists of included articles will also be searched. Articles will be screened using predefined inclusion and exclusion criteria, and all full-text articles that meet the predefined inclusion criteria will be included for analysis. The study selection process and reasons for exclusion at the full-text level will be presented using a PRISMA study flow diagram. Information on the data source of included studies, year and location of study, age of study population, range of incidence, and study purpose will be abstracted into a separate table and synthesized for analysis. All International Classification of Diseases version 10 codes will be listed in tables, and the codes used to define concussion, acquired traumatic brain injury, head injury, or head trauma will be identified. The identification of the optimal International Classification of Diseases version 10 codes to define this population in administrative data is crucial, as it has implications for policy, resource allocation, planning of healthcare services, and prevention strategies. It also allows for comparisons across countries and studies. This protocol is for a review that identifies the range of and most common diagnoses used to conduct surveillance for traumatic brain injury in children and youth. This is an important first step in reaching an appropriate definition using International Classification of Diseases version 10 codes and can inform future work on reaching consensus on the codes to define traumatic brain injury for this vulnerable population.
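Applying a candidate code set to administrative records reduces to simple filtering; the sketch below uses a small, hypothetical ICD-10 code list (it is not the consensus definition the review seeks to establish) and invented records.

    # Flag records of children/youth whose diagnoses hit a candidate code set.
    TBI_CODES = {"S06.0", "S06.5", "S06.9"}   # hypothetical example code set

    records = [
        {"id": 1, "age": 12, "dx": ["S06.0", "J45.9"]},
        {"id": 2, "age": 16, "dx": ["M54.5"]},
    ]
    cases = [r for r in records
             if r["age"] <= 19 and any(code in TBI_CODES for code in r["dx"])]
    print([r["id"] for r in cases])           # -> [1]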
Shao, Wei; Liu, Mingxia; Zhang, Daoqiang
2016-01-01
The systematic study of subcellular location patterns is very important for fully characterizing the human proteome. With the great advances in automated microscopic imaging, accurate bioimage-based classification methods to predict protein subcellular locations are now highly desired. All existing models were constructed on the independent-parallel hypothesis, in which the cellular component classes are positioned independently in a multi-class classification engine; the important structural information of cellular compartments is thus missed. To deal with this problem and develop more accurate models, we propose a novel cell structure-driven classifier construction approach (SC-PSorter) that employs prior biological structural information in the learning model. Specifically, the structural relationship among the cellular components is reflected by a new codeword matrix under the error-correcting output coding framework. Then, we construct multiple SC-PSorter-based classifiers corresponding to the columns of the error-correcting output coding codeword matrix using a multi-kernel support vector machine classification approach. Finally, we perform classifier ensembling by combining those multiple SC-PSorter-based classifiers via majority voting. We evaluate our method on a collection of 1636 immunohistochemistry images from the Human Protein Atlas database. The experimental results show that our method achieves an overall accuracy of 89.0%, which is 6.4% higher than the state-of-the-art method. The dataset and code can be downloaded from https://github.com/shaoweinuaa/. dqzhang@nuaa.edu.cn Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
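The error-correcting output coding idea can be sketched with scikit-learn's generic ECOC wrapper, which uses random codewords; the paper instead designs the codeword matrix from cell-structure knowledge and uses multi-kernel SVMs, so this is only the scaffolding of the technique on synthetic data.

    # Generic ECOC multi-class classification with an SVM base learner.
    from sklearn.datasets import make_classification
    from sklearn.multiclass import OutputCodeClassifier
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, n_classes=4, n_informative=6,
                               random_state=0)
    ecoc = OutputCodeClassifier(SVC(kernel="rbf"), code_size=2.0, random_state=0)
    print("train accuracy:", ecoc.fit(X, y).score(X, y))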
Investigating trends in acoustics research from 1970-1999.
Viator, J A; Pestorius, F M
2001-05-01
Text data mining is a burgeoning field in which new information is extracted from existing text databases. Computational methods are used to compare relationships between database elements to yield new information about the existing data. Text data mining software was used to determine research trends in acoustics for the years 1970, 1980, 1990, and 1999. Trends were indicated by the number of published articles in the categories of acoustics using the Journal of the Acoustical Society of America (JASA) as the article source. Research was classified using a method based on the Physics and Astronomy Classification Scheme (PACS). Research was further subdivided into world regions, including North and South America, Eastern and Western Europe, Asia, Africa, Middle East, and Australia/New Zealand. In order to gauge the use of JASA as an indicator of international acoustics research, three subjects, underwater sound, nonlinear acoustics, and bioacoustics, were further tracked in 1999, using all journals in the INSPEC database. Research trends indicated a shift in emphasis of certain areas, notably underwater sound, audition, and speech. JASA also showed steady growth, with increasing participation by non-US authors, from about 20% in 1970 to nearly 50% in 1999.
Srivastava, Saurabh Kumar; Singh, Sandeep Kumar; Suri, Jasjit S
2018-04-13
A machine learning (ML)-based text classification system has several classifiers. The performance evaluation (PE) of the ML system is typically driven by the training data size and the partition protocols used. Such systems suffer low accuracy because text classification systems lack the ability to model the input text data in terms of noise characteristics. This research study proposes the concept of a misrepresentation ratio (MRR) on input healthcare text data and models the PE criteria for validating the hypothesis. Further, such a novel system provides a platform to amalgamate several attributes of the ML system, such as data size, classifier type, partitioning protocol and percentage MRR. Our comprehensive data analysis consisted of five types of text data sets (TwitterA, WebKB4, Disease, Reuters (R8), and SMS); five kinds of classifiers (support vector machine with linear kernel (SVM-L), MLP-based neural network, AdaBoost, stochastic gradient descent and decision tree); and five types of training protocols (K2, K4, K5, K10 and JK). Using the decreasing order of MRR, our ML system demonstrates mean classification accuracies of 70.13 ± 0.15%, 87.34 ± 0.06%, 93.73 ± 0.03%, 94.45 ± 0.03% and 97.83 ± 0.01%, respectively, using all the classifiers and protocols. The corresponding AUC is 0.98 for SMS data using a Multi-Layer Perceptron (MLP)-based neural network. Among all the classifiers, the best accuracy of 91.84 ± 0.04% was achieved by the MLP-based neural network, which is 6% better than previously published results. Further, we observed that as MRR decreases, system robustness increases, as validated by the standard deviations. The overall text system accuracy using all data types, classifiers and protocols is 89%, showing the entire ML system to be novel, robust and unique. The system was also tested for stability and reliability.
Virus Database and Online Inquiry System Based on Natural Vectors.
Dong, Rui; Zheng, Hui; Tian, Kun; Yau, Shek-Chung; Mao, Weiguang; Yu, Wenping; Yin, Changchuan; Yu, Chenglong; He, Rong Lucy; Yang, Jie; Yau, Stephen St
2017-01-01
We construct a virus database called VirusDB (http://yaulab.math.tsinghua.edu.cn/VirusDB/) and an online inquiry system to serve people who are interested in viral classification and prediction. The database stores all viral genomes, their corresponding natural vectors, and the classification information of the single/multiple-segmented viral reference sequences downloaded from the National Center for Biotechnology Information. The online inquiry system serves the purpose of computing natural vectors and their distances based on submitted genomes, providing an online interface for accessing and using the database for viral classification and prediction, and running back-end processes for automatic and manual updating of database content to synchronize with GenBank. Submitted genome data in FASTA format are processed, and the prediction results, with the 5 closest neighbors and their classifications, are returned by email. Considering the one-to-one correspondence between sequence and natural vector, its time efficiency, and its high accuracy, the natural vector method is a significant advance over alignment methods, which makes VirusDB a useful database for further research.
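For illustration, here is a minimal sketch of one common 12-dimensional natural-vector construction (per-nucleotide count, mean position, and a normalized second central moment); VirusDB's exact formulation and distance measure may differ in detail.

    # Natural vector of a DNA sequence: (count, mean position, normalized
    # second central moment) for each of A, C, G, T -> 12 numbers.
    def natural_vector(seq):
        L = len(seq)
        vec = []
        for base in "ACGT":
            pos = [i + 1 for i, b in enumerate(seq) if b == base]
            n = len(pos)
            mu = sum(pos) / n if n else 0.0
            d2 = sum((p - mu) ** 2 for p in pos) / (n * L) if n else 0.0
            vec += [n, mu, d2]
        return vec

    print(natural_vector("ATGCGTACGTTA"))

Classification then reduces to nearest-neighbor search on distances between such vectors, consistent with the 5-closest-neighbors output described above.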
Gao, Xiang; Lin, Huaiying; Revanna, Kashi; Dong, Qunfeng
2017-05-10
Species-level classification for 16S rRNA gene sequences remains a serious challenge for microbiome researchers, because existing taxonomic classification tools for 16S rRNA gene sequences either do not provide species-level classification, or their classification results are unreliable. The unreliable results are due to the limitations in the existing methods which either lack solid probabilistic-based criteria to evaluate the confidence of their taxonomic assignments, or use nucleotide k-mer frequency as the proxy for sequence similarity measurement. We have developed a method that shows significantly improved species-level classification results over existing methods. Our method calculates true sequence similarity between query sequences and database hits using pairwise sequence alignment. Taxonomic classifications are assigned from the species to the phylum levels based on the lowest common ancestors of multiple database hits for each query sequence, and further classification reliabilities are evaluated by bootstrap confidence scores. The novelty of our method is that the contribution of each database hit to the taxonomic assignment of the query sequence is weighted by a Bayesian posterior probability based upon the degree of sequence similarity of the database hit to the query sequence. Our method does not need any training datasets specific for different taxonomic groups. Instead only a reference database is required for aligning to the query sequences, making our method easily applicable for different regions of the 16S rRNA gene or other phylogenetic marker genes. Reliable species-level classification for 16S rRNA or other phylogenetic marker genes is critical for microbiome research. Our software shows significantly higher classification accuracy than the existing tools and we provide probabilistic-based confidence scores to evaluate the reliability of our taxonomic classification assignments based on multiple database matches to query sequences. Despite its higher computational costs, our method is still suitable for analyzing large-scale microbiome datasets for practical purposes. Furthermore, our method can be applied for taxonomic classification of any phylogenetic marker gene sequences. Our software, called BLCA, is freely available at https://github.com/qunfengdong/BLCA .
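The weighting idea can be caricatured in a few lines: each database hit votes for its lineage in proportion to its similarity to the query. This is a deliberate simplification with invented hits; BLCA's actual contribution weights are Bayesian posterior probabilities and its confidence scores come from bootstrapping the alignment.

    # Similarity-weighted taxonomy votes from database hits (toy example).
    from collections import defaultdict

    hits = [   # (lineage as (rank, taxon) pairs, percent identity to query)
        ((("genus", "Lactobacillus"), ("species", "L. gasseri")), 99.1),
        ((("genus", "Lactobacillus"), ("species", "L. johnsonii")), 98.7),
        ((("genus", "Lactobacillus"), ("species", "L. gasseri")), 97.9),
    ]
    votes = defaultdict(float)
    for lineage, identity in hits:
        for rank, taxon in lineage:
            votes[(rank, taxon)] += identity   # weight vote by similarity

    for rank in ("genus", "species"):
        candidates = [t for r, t in votes if r == rank]
        best = max(candidates, key=lambda t: votes[(rank, t)])
        total = sum(v for (r, _), v in votes.items() if r == rank)
        print(rank, best, f"support ~ {votes[(rank, best)] / total:.2f}")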
Analysis of a Bibliographic Database Enhanced with a Library Classification.
ERIC Educational Resources Information Center
Drabenstott, Karen Markey; And Others
1990-01-01
Describes a project that examined the effects of incorporating subject terms from the Dewey Decimal Classification (DDC) into a bibliographic database. It is concluded that the incorporation of DDC and possibly other library classifications into online catalogs can enhance subject access and provide additional subject searching strategies. (11…
Data fusion and classification using a hybrid intrinsic cellular inference network
NASA Astrophysics Data System (ADS)
Woodley, Robert; Walenz, Brett; Seiffertt, John; Robinette, Paul; Wunsch, Donald
2010-04-01
Hybrid Intrinsic Cellular Inference Network (HICIN) is designed for battlespace decision support applications. We developed an automatic method of generating hypotheses for an entity-attribute classifier. The capability and effectiveness of a domain-specific ontology was used to generate automatic categories for data classification. Heterogeneous data are clustered using an Adaptive Resonance Theory (ART) inference engine on a sample (unclassified) data set. The data set is the Lahman baseball database. The actual data are immaterial to the architecture; however, parallels in the data can easily be drawn (i.e., "Team" maps to organization, "Runs scored/allowed" to measure of organization performance (positive/negative), "Payroll" to organization resources, etc.). Results show that HICIN classifiers create known inferences from the heterogeneous data. These inferences are not explicitly stated in the ontological description of the domain and are strictly data driven. HICIN uses data uncertainty handling to reduce errors in the classification. The uncertainty handling is based on subjective logic. The belief mass allows evidence from multiple sources to be mathematically combined to increase or discount an assertion. In military operations the ability to reduce uncertainty will be vital in the data fusion operation.
Ontology-based knowledge representation for resolution of semantic heterogeneity in GIS
NASA Astrophysics Data System (ADS)
Liu, Ying; Xiao, Han; Wang, Limin; Han, Jialing
2017-07-01
Lack of semantic interoperability in geographical information systems has been identified as the main obstacle to data sharing and database integration. New methods are needed to overcome the problems of semantic heterogeneity. Ontologies are considered to be one approach to supporting geographic information sharing. This paper presents an ontology-driven integration approach to help detect and possibly resolve semantic conflicts. Its originality is that each data source participating in the integration process contains an ontology that defines the meaning of its own data. This approach ensures the automation of the integration through a regulated semantic integration algorithm. Finally, land classification in field GIS is described as an example.
Conceptual biology, hypothesis discovery, and text mining: Swanson's legacy.
Bekhuis, Tanja
2006-04-03
Innovative biomedical librarians and information specialists who want to expand their roles as expert searchers need to know about profound changes in biology and parallel trends in text mining. In recent years, conceptual biology has emerged as a complement to empirical biology. This is partly in response to the availability of massive digital resources such as the network of databases for molecular biologists at the National Center for Biotechnology Information. Developments in text mining and hypothesis discovery systems based on the early work of Swanson, a mathematician and information scientist, are coincident with the emergence of conceptual biology. Very little has been written to introduce biomedical digital librarians to these new trends. In this paper, background for data and text mining, as well as for knowledge discovery in databases (KDD) and in text (KDT) is presented, then a brief review of Swanson's ideas, followed by a discussion of recent approaches to hypothesis discovery and testing. 'Testing' in the context of text mining involves partially automated methods for finding evidence in the literature to support hypothetical relationships. Concluding remarks follow regarding (a) the limits of current strategies for evaluation of hypothesis discovery systems and (b) the role of literature-based discovery in concert with empirical research. Report of an informatics-driven literature review for biomarkers of systemic lupus erythematosus is mentioned. Swanson's vision of the hidden value in the literature of science and, by extension, in biomedical digital databases, is still remarkably generative for information scientists, biologists, and physicians.
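Swanson's classic ABC pattern (fish oil → blood viscosity → Raynaud's syndrome) can be sketched as set intersection over term co-occurrence; the term sets below are tiny invented stand-ins for what a real system would mine from MEDLINE.

    # ABC model: B-terms shared by the A- and C-literatures suggest a
    # hidden A-C connection worth testing.
    a_literature = {"fish oil": {"blood viscosity", "platelet aggregation"}}
    c_literature = {"raynaud's syndrome": {"blood viscosity", "vasoconstriction"}}

    def b_terms(a, c):
        return a_literature[a] & c_literature[c]   # shared intermediates

    print(b_terms("fish oil", "raynaud's syndrome"))   # {'blood viscosity'}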
A data-driven modeling approach to stochastic computation for low-energy biomedical devices.
Lee, Kyong Ho; Jang, Kuk Jin; Shoeb, Ali; Verma, Naveen
2011-01-01
Low-power devices that can detect clinically relevant correlations in physiologically-complex patient signals can enable systems capable of closed-loop response (e.g., controlled actuation of therapeutic stimulators, continuous recording of disease states, etc.). In ultra-low-power platforms, however, hardware error sources are becoming increasingly limiting. In this paper, we present how data-driven methods, which allow us to accurately model physiological signals, also allow us to effectively model and overcome prominent hardware error sources with nearly no additional overhead. Two applications, EEG-based seizure detection and ECG-based arrhythmia-beat classification, are synthesized to a logic-gate implementation, and two prominent error sources are introduced: (1) SRAM bit-cell errors and (2) logic-gate switching errors ('stuck-at' faults). Using patient data from the CHB-MIT and MIT-BIH databases, performance similar to error-free hardware is achieved even for very high fault rates (up to 0.5 for SRAMs and 7 × 10⁻² for logic) that cause computational bit error rates as high as 50%.
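The evaluation strategy (inject hardware-like faults, then measure end-to-end accuracy) can be sketched generically as below; the data, model and fault model are simplified stand-ins for the paper's EEG/ECG applications and gate-level error sources.

    # Corrupt stored features at a given fault rate and re-test the model.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 16))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # synthetic labels
    clf = LogisticRegression().fit(X[:300], y[:300])

    for fault_rate in (0.0, 0.05, 0.5):
        Xt = X[300:].copy()
        mask = rng.random(Xt.shape) < fault_rate    # which entries are faulted
        Xt[mask] = rng.normal(size=int(mask.sum())) # overwrite with garbage
        print(f"fault rate {fault_rate}: accuracy {clf.score(Xt, y[300:]):.2f}")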
NASA Astrophysics Data System (ADS)
Oses, Corey; Isayev, Olexandr; Toher, Cormac; Curtarolo, Stefano; Tropsha, Alexander
Historically, materials discovery has been driven by a laborious trial-and-error process. The growth of materials databases and emerging informatics approaches finally offer the opportunity to transform this practice into data- and knowledge-driven rational design, accelerating the discovery of novel materials exhibiting desired properties. By using data from the AFLOW repository for high-throughput ab-initio calculations, we have generated Quantitative Materials Structure-Property Relationship (QMSPR) models to predict critical materials properties, including metal/insulator classification, band gap energy, and bulk modulus. The prediction accuracy obtained with these QMSPR models approaches that of the training data for virtually any stoichiometric inorganic crystalline material. We attribute the success and universality of these models to the construction of new materials descriptors, referred to as universal Property-Labeled Material Fragments (PLMF). This representation affords straightforward model interpretation in terms of simple heuristic design rules that could guide rational materials design. This proof-of-concept study demonstrates the power of materials informatics to dramatically accelerate the search for new materials.
Development of an Engineering Soil Database
2017-12-27
Only fragments of this record survive (pieces of the abstract, table of contents and acronym list). The recoverable content indicates the report covers converting agricultural and geological soil classification data into equivalent USCS classifications, drawing on sources including the USDA textural soil classification, the European Soil Database (ESDB) and the Food and Agriculture Organization (FAO).
Lhermitte, L; Mejstrikova, E; van der Sluijs-Gelling, A J; Grigore, G E; Sedek, L; Bras, A E; Gaipa, G; Sobral da Costa, E; Novakova, M; Sonneveld, E; Buracchi, C; de Sá Bacelar, T; te Marvelde, J G; Trinquand, A; Asnafi, V; Szczepanski, T; Matarraz, S; Lopez, A; Vidriales, B; Bulsa, J; Hrusak, O; Kalina, T; Lecrevisse, Q; Martin Ayuso, M; Brüggemann, M; Verde, J; Fernandez, P; Burgos, L; Paiva, B; Pedreira, C E; van Dongen, J J M; Orfao, A; van der Velden, V H J
2018-01-01
Precise classification of acute leukemia (AL) is crucial for adequate treatment. EuroFlow has previously designed an AL orientation tube (ALOT) to guide users towards the relevant classification panel (T-cell acute lymphoblastic leukemia (T-ALL), B-cell precursor (BCP)-ALL and/or acute myeloid leukemia (AML)) and final diagnosis. We have now built a reference database with 656 typical AL samples (145 T-ALL, 377 BCP-ALL, 134 AML), processed and analyzed via standardized protocols. Using principal component analysis (PCA)-based plots and automated classification algorithms for direct comparison of single cells from individual patients against the database, another 783 cases were subsequently evaluated. Depending on the database-guided results, patients were categorized as: (i) typical T, B or myeloid without, or (ii) with, a transitional component to another lineage; (iii) atypical; or (iv) mixed-lineage. Using this automated algorithm, the right panel was selected in 781/783 cases (99.7%), and data comparable to the final WHO diagnosis were already provided in >93% of cases (85% T-ALL, 97% BCP-ALL, 95% AML and 87% mixed-phenotype AL patients), even without data from the full-characterization panels. Our results show that database-guided analysis facilitates standardized interpretation of ALOT results and allows accurate selection of the relevant classification panels, hence providing a solid basis for designing future WHO AL classifications. PMID:29089646
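The database-guided comparison can be caricatured as projecting a new case into a PCA space fitted on reference cases and picking the nearest class; marker intensities below are synthetic stand-ins, and the EuroFlow pipeline adds standardized staining, per-cell comparison and richer category logic.

    # Project a new case into a reference PCA space; assign nearest centroid.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    ref = {lab: rng.normal(loc=i, size=(50, 8))     # stand-in marker data
           for i, lab in enumerate(["T-ALL", "BCP-ALL", "AML"])}
    pca = PCA(n_components=2).fit(np.vstack(list(ref.values())))
    centroids = {lab: pca.transform(m).mean(axis=0) for lab, m in ref.items()}

    new_case = pca.transform(rng.normal(loc=1, size=(50, 8))).mean(axis=0)
    label = min(centroids, key=lambda l: np.linalg.norm(centroids[l] - new_case))
    print("suggested classification panel:", label)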
Ronald, L A; Ling, D I; FitzGerald, J M; Schwartzman, K; Bartlett-Esquilant, G; Boivin, J-F; Benedetti, A; Menzies, D
2017-05-01
An increasing number of studies are using health administrative databases for tuberculosis (TB) research. However, there are limitations to using such databases for identifying patients with TB. Our objective was to summarise validated methods for identifying TB in health administrative databases. We conducted a systematic literature search in two databases (Ovid Medline and Embase, January 1980-January 2016). We limited the search to diagnostic accuracy studies assessing algorithms derived from drug prescription, International Classification of Diseases (ICD) diagnostic code and/or laboratory data for identifying patients with TB in health administrative databases. The search identified 2413 unique citations. Of the 40 full-text articles reviewed, we included 14 in our review. Algorithms and diagnostic accuracy outcomes for identifying TB varied widely across studies, with positive predictive value ranging from 1.3% to 100% and sensitivity ranging from 20% to 100%. Diagnostic accuracy measures of algorithms using out-patient, in-patient and/or laboratory data to identify patients with TB in health administrative databases vary widely across studies. Use solely of ICD diagnostic codes to identify TB, particularly when using out-patient records, is likely to lead to incorrect estimates of case numbers, given the current limitations of ICD systems in coding TB.
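The two headline accuracy measures reduce to simple ratios over a validation table; a tiny sketch with invented counts:

    # PPV and sensitivity of a TB case-finding algorithm vs. a gold standard.
    def ppv_sensitivity(tp, fp, fn):
        ppv = tp / (tp + fp)    # of patients flagged, fraction truly TB
        sens = tp / (tp + fn)   # of true TB patients, fraction flagged
        return ppv, sens

    ppv, sens = ppv_sensitivity(tp=80, fp=40, fn=20)
    print(f"PPV = {ppv:.2f}, sensitivity = {sens:.2f}")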
NASA Astrophysics Data System (ADS)
Kale, Mandar; Mukhopadhyay, Sudipta; Dash, Jatindra K.; Garg, Mandeep; Khandelwal, Niranjan
2016-03-01
Interstitial lung disease (ILD) is a complicated group of pulmonary disorders. High-resolution computed tomography (HRCT) is considered the best imaging technique for the analysis of different pulmonary disorders. HRCT findings can be categorised into several patterns, viz. consolidation, emphysema, ground glass opacity, nodular, normal, etc., based on their texture-like appearance. Clinicians often find it difficult to diagnose these patterns because of their complex nature. In such scenarios, a computer-aided diagnosis system could help clinicians identify the patterns. Several approaches have been proposed for the classification of ILD patterns, including the computation of textural features and the training/testing of classifiers such as artificial neural networks (ANN) and support vector machines (SVM). In this paper, wavelet features are calculated from two different ILD databases, the publicly available MedGIFT ILD database and a private ILD database, followed by performance evaluation of ANN and SVM classifiers in terms of average accuracy. It is found that the average classification accuracy of the SVM is greater than that of the ANN when trained and tested on the same database. The investigation continued further by testing the variation in classifier accuracy when training and testing are performed on alternate databases, and when classifiers are trained and tested on a database formed by merging samples from the same classes of the two individual databases. The average classification accuracy drops when two independent databases are used for training and testing, respectively, while there is significant improvement in average accuracy when classifiers are trained and tested with the merged database. This indicates the dependency of classification accuracy on the training data. It is observed that the SVM outperforms the ANN when the same database is used for training and testing.
Decision Manifold Approximation for Physics-Based Simulations
NASA Technical Reports Server (NTRS)
Wong, Jay Ming; Samareh, Jamshid A.
2016-01-01
With the recent surge of success in big-data-driven deep learning, many frameworks focus on architecture design and utilizing massive databases. However, in some scenarios massive sets of data may be difficult, and in some cases infeasible, to acquire. In this paper we discuss a trajectory-based framework that quickly learns the underlying decision manifold of binary simulation classifications while judiciously selecting exploratory target states to minimize the number of required simulations. Furthermore, we draw particular attention to the simulation prediction application, idealized to the case where failures in simulations can be predicted and avoided, providing machine intelligence to novice analysts. We demonstrate this framework on various forms of simulations and discuss its efficacy.
Multimodal Task-Driven Dictionary Learning for Image Classification
Bahrampour, Soheil; Nasrabadi, Nasser M; Ray, Asok; Jenkins, W Kenneth
2015-12-18
Dictionary learning algorithms have been successfully used for both reconstructive and discriminative tasks, where an input signal is represented with a sparse linear combination of dictionary atoms.
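The underlying sparse-coding step can be sketched with scikit-learn (single-modality, purely reconstructive case); the paper's task-driven, multimodal formulation adds a supervised objective and couples dictionaries across modalities, which this toy omits.

    # Learn a dictionary and sparse-code signals over its atoms.
    import numpy as np
    from sklearn.decomposition import DictionaryLearning

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))                  # stand-in training signals
    dl = DictionaryLearning(n_components=12, transform_algorithm="lasso_lars",
                            transform_alpha=0.1, random_state=0)
    codes = dl.fit(X).transform(X)                  # sparse codes over atoms
    print("mean nonzeros per signal:", (codes != 0).sum(axis=1).mean())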
Audio stream classification for multimedia database search
NASA Astrophysics Data System (ADS)
Artese, M.; Bianco, S.; Gagliardi, I.; Gasparini, F.
2013-03-01
Search and retrieval in huge archives of multimedia data is a challenging task. A classification step is often used to reduce the number of entries on which to perform the subsequent search. In particular, when new entries are continuously added to the database, a fast classification based on simple threshold evaluation is desirable. In this work we present a CART-based (Classification And Regression Tree [1]) classification framework for audio streams belonging to multimedia databases. The database considered is the Archive of Ethnography and Social History (AESS) [2], which is mainly composed of popular songs and other audio records describing popular traditions handed down generation by generation, such as traditional fairs and customs. The peculiarities of this database are that it is continuously updated, the audio recordings are acquired in unconstrained environments, and it is difficult for non-expert human users to create ground truth labels. In our experiments, half of all the available audio files were randomly extracted and used as the training set; the remaining ones were used as the test set. The classifier was trained to distinguish among three different classes: speech, music, and song. All the audio files in the dataset were previously labeled manually by domain experts into the three classes defined above.
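A minimal CART sketch consistent with this framework, under the assumption that per-clip features and expert labels are already computed, might look as follows; the learned threshold rules can then be applied cheaply to newly added recordings.

```python
# Sketch only: a decision tree classifies audio clips into speech/music/song
# from assumed per-clip descriptors (e.g. zero-crossing rate, energy).
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

def train_cart(X, y):
    # Half of the files for training, half for testing, as in the abstract.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
    tree = DecisionTreeClassifier(max_depth=4).fit(X_tr, y_tr)   # CART
    print("held-out accuracy:", tree.score(X_te, y_te))
    print(export_text(tree))   # simple threshold rules usable at insert time
    return tree
```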
UCbase 2.0: ultraconserved sequences database (2014 update)
Lomonaco, Vincenzo; Martoglia, Riccardo; Mandreoli, Federica; Anderlucci, Laura; Emmett, Warren; Bicciato, Silvio; Taccioli, Cristian
2014-01-01
UCbase 2.0 (http://ucbase.unimore.it) is an update, extension and evolution of UCbase, a Web tool dedicated to the analysis of ultraconserved sequences (UCRs). UCRs are 481 sequences >200 bases sharing 100% identity among human, mouse and rat genomes. They are frequently located in genomic regions known to be involved in cancer or differentially expressed in human leukemias and carcinomas. UCbase 2.0 is a platform-independent Web resource that includes the updated version of the human genome annotation (hg19), information linking disorders to chromosomal coordinates based on the Systematized Nomenclature of Medicine classification, a query tool to search for Single Nucleotide Polymorphisms (SNPs) and a new text box to directly interrogate the database using a MySQL interface. To facilitate the interactive visual interpretation of UCR chromosomal positioning, UCbase 2.0 now includes a graph visualization interface directly linked to UCSC genome browser. Database URL: http://ucbase.unimore.it PMID:24951797
Fernandes, Andrea C; Dutta, Rina; Velupillai, Sumithra; Sanyal, Jyoti; Stewart, Robert; Chandran, David
2018-05-09
Research into suicide prevention has been hampered by methodological limitations such as low sample size and recall bias. Recently, Natural Language Processing (NLP) strategies have been used with Electronic Health Records to increase information extraction from free-text notes as well as structured fields concerning suicidality, and this allows access to much larger cohorts than previously possible. This paper presents two novel NLP approaches: a rule-based approach to classify the presence of suicide ideation and a hybrid machine learning and rule-based approach to identify suicide attempts in a psychiatric clinical database. The good performance of the two classifiers in the evaluation study suggests that they can be used to accurately detect mentions of suicide ideation and attempt within free-text documents in this psychiatric database. The novelty of the two approaches lies in the malleability of each classifier if a need arises to refine performance or to meet alternate classification requirements. The algorithms can also be adapted to fit the infrastructures of other clinical datasets, given sufficient knowledge of clinical recording practice, without dependency on medical codes or additional data extraction of known risk factors to predict suicidal behaviour.
NASA Astrophysics Data System (ADS)
Klump, J. F.; Huber, R.; Robertson, J.; Cox, S. J. D.; Woodcock, R.
2014-12-01
Despite the recent explosion of quantitative geological data, geology remains a fundamentally qualitative science. Numerical data constitute only a certain part of data collection in the geosciences. In many cases, geological observations are compiled as text into reports and annotations on drill cores, thin sections or drawings of outcrops. The observations are classified into concepts such as lithology, stratigraphy, geological structure, etc. These descriptions are semantically rich and are generally supported by more quantitative observations using geochemical analyses, XRD, hyperspectral scanning, etc., but the goal is geological semantics. In practice it has been difficult to bring the different observations together, due to differing perception or granularity of classification in human observation, or the partial observation of only some characteristics by quantitative sensors. In recent years many geological classification schemas have been transferred into ontologies and vocabularies, formalized using RDF and OWL, and published through SPARQL endpoints. Several lithological ontologies were compiled by stratigraphy.net and published through a SPARQL endpoint. This work is complemented by the development of a Python API to integrate this vocabulary into Python-based text mining applications. The applications for the lithological vocabulary and Python API are automated semantic tagging of geochemical data and descriptions of drill cores, machine learning of geochemical compositions that are diagnostic for lithological classifications, and text mining for lithological concepts in reports and geological literature. This combination of applications can be used to identify anomalies in databases, where composition and lithological classification do not match. It can also be used to identify lithological concepts in the literature and infer quantitative values. The resulting semantic tagging opens new possibilities for linking these diverse sources of data.
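A hedged sketch of the querying step, using the SPARQLWrapper library against a hypothetical endpoint URL (the actual stratigraphy.net service details are not assumed here), could look like this:

```python
# Sketch: retrieve lithological concepts and labels from a SPARQL endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/sparql")   # hypothetical endpoint
sparql.setQuery("""
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT ?concept ?label WHERE {
        ?concept a skos:Concept ;
                 skos:prefLabel ?label .
        FILTER(langMatches(lang(?label), "en"))
    } LIMIT 20
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["concept"]["value"], "->", row["label"]["value"])
```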
Data-driven RBE parameterization for helium ion beams
NASA Astrophysics Data System (ADS)
Mairani, A.; Magro, G.; Dokic, I.; Valle, S. M.; Tessonnier, T.; Galm, R.; Ciocca, M.; Parodi, K.; Ferrari, A.; Jäkel, O.; Haberer, T.; Pedroni, P.; Böhlen, T. T.
2016-01-01
Helium ion beams are expected to be available again in the near future for clinical use. A suitable formalism to obtain relative biological effectiveness (RBE) values for treatment planning (TP) studies is needed. In this work we developed a data-driven RBE parameterization based on published in vitro experimental values. The RBE parameterization has been developed within the framework of the linear-quadratic (LQ) model as a function of the helium linear energy transfer (LET), dose and the tissue-specific parameter (α/β)_ph of the LQ model for the reference radiation. Analytic expressions are provided, derived from the collected database, describing the ratios RBE_α = α_He/α_ph and R_β = β_He/β_ph as a function of LET. Calculated RBE values at 2 Gy photon dose and at 10% survival (RBE_10) are compared with the experimental ones. Pearson's correlation coefficients were, respectively, 0.85 and 0.84, confirming the soundness of the introduced approach. Moreover, due to the lack of experimental data at low LET, clonogenic experiments have been performed irradiating the A549 cell line with (α/β)_ph = 5.4 Gy at the entrance of a 56.4 MeV/u helium beam at the Heidelberg Ion Beam Therapy Center. The proposed parameterization reproduces the measured cell survival within the experimental uncertainties. An RBE formula which depends only on dose, LET and (α/β)_ph as input parameters is proposed, allowing a straightforward implementation in a TP system.
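As a worked numeric sketch of the LQ-model relations quoted above (with illustrative parameter values, not the paper's fitted parameterization), RBE_10 follows from solving αD + βD² = -ln(S) for each radiation:

```python
# Worked LQ-model example: dose at 10% survival for photons vs helium ions,
# then RBE_10 = D_ph / D_He. All parameter values are illustrative only.
import numpy as np

def dose_at_survival(alpha, beta, survival=0.10):
    # solve alpha*D + beta*D**2 = -ln(S) for the positive root D
    return (-alpha + np.sqrt(alpha**2 - 4 * beta * np.log(survival))) / (2 * beta)

alpha_ph, beta_ph = 0.2, 0.037            # illustrative photon LQ parameters
rbe_alpha, r_beta = 1.8, 1.1              # illustrative LET-dependent ratios
alpha_he, beta_he = rbe_alpha * alpha_ph, r_beta * beta_ph
rbe10 = dose_at_survival(alpha_ph, beta_ph) / dose_at_survival(alpha_he, beta_he)
print(f"RBE10 = {rbe10:.2f}")
```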
Implementing a Dynamic Database-Driven Course Using LAMP
ERIC Educational Resources Information Center
Laverty, Joseph Packy; Wood, David; Turchek, John
2011-01-01
This paper documents the formulation of a database driven open source architecture web development course. The design of a web-based curriculum faces many challenges: a) relative emphasis of client and server-side technologies, b) choice of a server-side language, and c) the cost and efficient delivery of a dynamic web development, database-driven…
Gobeill, Julien; Pasche, Emilie; Vishnyakova, Dina; Ruch, Patrick
2013-01-01
The available curated data lag behind current biological knowledge contained in the literature. Text mining can assist biologists and curators to locate and access this knowledge, for instance by characterizing the functional profile of publications. Gene Ontology (GO) category assignment in free text already supports various applications, such as powering ontology-based search engines, finding curation-relevant articles (triage) or helping the curator to identify and encode functions. Popular text mining tools for GO classification are based on so-called thesaurus-based, or dictionary-based, approaches, which exploit similarities between the input text and GO terms themselves. But their effectiveness remains limited owing to the complex nature of GO terms, which rarely occur in text. In contrast, machine learning approaches exploit similarities between the input text and already curated instances contained in a knowledge base to infer a functional profile. GO Annotations (GOA) and MEDLINE make it possible to exploit a growing amount of curated abstracts (97 000 in November 2012) for populating this knowledge base. Our study compares a state-of-the-art thesaurus-based system with a machine learning system (based on a k-Nearest Neighbours algorithm) for the task of proposing a functional profile for unseen MEDLINE abstracts, and shows how resources and performances have evolved. Systems are evaluated on their ability to propose for a given abstract the GO terms (2.8 on average) used for curation in GOA. We show that since 2006, although a massive effort was put into adding synonyms in GO (+300%), the effectiveness of our thesaurus-based system has remained rather constant, moving only from 0.28 to 0.31 for Recall at 20 (R20). In contrast, thanks to the growth of its knowledge base, our machine learning system has steadily improved, from 0.38 in 2006 to 0.56 for R20 in 2012. Integrated in semi-automatic workflows or in fully automatic pipelines, such systems are increasingly effective in providing assistance to biologists. DATABASE URL: http://eagl.unige.ch/GOCat/
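A minimal sketch of the machine-learning route described above, in which a k-NN classifier proposes GO terms for an unseen abstract from its most similar curated neighbours, might look as follows; the data structures are hypothetical and GOCat's actual features and weighting are richer.

```python
# Sketch: propose GO terms for a new abstract by voting over the GO terms of
# its k nearest curated abstracts (cosine similarity on TF-IDF vectors).
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def propose_go_terms(curated_texts, curated_go_terms, new_abstract, k=20, top=5):
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(curated_texts)
    nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(X)
    _, idx = nn.kneighbors(vec.transform([new_abstract]))
    votes = Counter(t for i in idx[0] for t in curated_go_terms[i])
    return [term for term, _ in votes.most_common(top)]   # ranked GO proposals
```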
Concurrent Tumor Segmentation and Registration with Uncertainty-based Sparse non-Uniform Graphs
Parisot, Sarah; Wells, William; Chemouny, Stéphane; Duffau, Hugues; Paragios, Nikos
2014-01-01
In this paper, we present a graph-based framework for concurrent brain tumor segmentation and atlas-to-diseased-patient registration. Both segmentation and registration problems are modeled using a unified pairwise discrete Markov Random Field model on a sparse grid superimposed on the image domain. Segmentation is addressed based on pattern classification techniques, while registration is performed by maximizing the similarity between volumes and is modular with respect to the matching criterion. The two problems are coupled by relaxing the registration term in the tumor area, corresponding to areas of high classification score and high dissimilarity between volumes. In order to overcome the main shortcomings of discrete approaches regarding appropriate sampling of the solution space as well as important memory requirements, content-driven samplings of the discrete displacement set and the sparse grid are considered, based on the local segmentation and registration uncertainties recovered by the min-marginal energies. State-of-the-art results on a substantial low-grade glioma database demonstrate the potential of our method, which shows maintained performance with strongly reduced model complexity. PMID:24717540
Research on Classification of Chinese Text Data Based on SVM
NASA Astrophysics Data System (ADS)
Lin, Yuan; Yu, Hongzhi; Wan, Fucheng; Xu, Tao
2017-09-01
Data mining has important application value in today's industry and academia, and text classification is a very important technology within it. At present, there are many mature algorithms for text classification: KNN, NB, AB, SVM, decision trees and other classification methods all show good classification performance. The Support Vector Machine (SVM) is a good classifier in machine learning research. This paper studies the classification performance of the SVM method on Chinese text data, applying support vector machines to classify Chinese text and aiming to combine academic research with practical application.
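A hedged illustration of such a classifier (not the paper's system): character n-gram TF-IDF features avoid the need for a separate Chinese word segmentation step, and a linear SVM is trained on top.

```python
# Sketch: SVM text classification for Chinese using character n-grams, so no
# word segmentation is required; (texts, labels) training pairs are assumed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_chinese_svm(texts, labels):
    model = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(1, 2)),  # char n-grams
        LinearSVC(),
    )
    return model.fit(texts, labels)

# usage: clf = train_chinese_svm(train_texts, train_labels); clf.predict(test_texts)
```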
Object-Oriented Approach to Integrating Database Semantics. Volume 4.
1987-12-01
schemata for: 1. Object Classification Schema -- Entities; 2. Object Structure and Relationship Schema -- Relations; 3. Operation Classification and... relationships are represented in a database is non-intuitive for naive users. It is difficult to access and combine information in multiple databases. ... from the CURRENT-CLASSES table. Choosing a selected item de-selects it. Choose 0 to exit. 1. STUDENTS 2. CURRENT-CLASSES 3. MANAGEMENT-CLASS
NASA Astrophysics Data System (ADS)
Linsebarth, A.; Moscicka, A.
2010-01-01
The article describes the influence of the peculiarities of Bible geographic objects on a spatiotemporal geoinformation system of Bible events. In the proposed concept of this system, special attention is given to the Bible geographic objects and the interrelations between the names of these objects and their locations in geospace. In the Bible, both in the Old and New Testament, there are hundreds of geographical names, but selecting these names from the Bible text is not easy: the same names are applied to persons and to geographic objects. The next problem is the classification of the geographical object, because in several cases the same name is used for towns, mountains, hills, valleys, etc. Another serious problem relates to changes of names over time. The interrelation between an object's name and its location is also complicated: geographic objects with the same name are located in various places, which should be properly correlated with the Bible text. The above-mentioned peculiarities of Bible geographic objects influenced the concept of the proposed system, which consists of three databases: reference, geographic object, and subject/thematic. The crucial component of this system is the proper architecture of the geographic object database, of which the paper presents a detailed description. The interrelation between the databases allows Bible readers to connect the Bible text with the geography of the terrain on which the Bible events occurred and, additionally, to access other geographical and historical information related to the geographic objects.
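One possible reading of the described geographic object database, sketched as a many-to-many schema between names and locations (an illustrative design, not the authors' actual one), is:

```python
# Illustrative schema sketch: one Bible name may refer to several places (and
# object types), and a place may carry different names over time.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE geo_name (name_id INTEGER PRIMARY KEY, name TEXT, object_type TEXT);
CREATE TABLE location (loc_id INTEGER PRIMARY KEY, lat REAL, lon REAL);
CREATE TABLE name_location (             -- many-to-many, time-dependent
    name_id INTEGER REFERENCES geo_name,
    loc_id  INTEGER REFERENCES location,
    valid_period TEXT,                   -- time-dependence of the name
    bible_refs TEXT                      -- verses tying this place to the text
);
""")
```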
Classifications for Cesarean Section: A Systematic Review
Torloni, Maria Regina; Betran, Ana Pilar; Souza, Joao Paulo; Widmer, Mariana; Allen, Tomas; Gulmezoglu, Metin; Merialdi, Mario
2011-01-01
Background Rising cesarean section (CS) rates are a major public health concern and cause worldwide debates. To propose and implement effective measures to reduce or increase CS rates where necessary requires an appropriate classification. Despite several existing CS classifications, there has not yet been a systematic review of these. This study aimed to 1) identify the main CS classifications used worldwide, 2) analyze advantages and deficiencies of each system. Methods and Findings Three electronic databases were searched for classifications published 1968–2008. Two reviewers independently assessed classifications using a form created based on items rated as important by international experts. Seven domains (ease, clarity, mutually exclusive categories, totally inclusive classification, prospective identification of categories, reproducibility, implementability) were assessed and graded. Classifications were tested in 12 hypothetical clinical case-scenarios. From a total of 2948 citations, 60 were selected for full-text evaluation and 27 classifications identified. Indications classifications present important limitations and their overall score ranged from 2–9 (maximum grade = 14). Degree of urgency classifications also had several drawbacks (overall scores 6–9). Woman-based classifications performed best (scores 5–14). Other types of classifications require data not routinely collected and may not be relevant in all settings (scores 3–8). Conclusions This review and critical appraisal of CS classifications is a methodologically sound contribution to establish the basis for the appropriate monitoring and rational use of CS. Results suggest that women-based classifications in general, and Robson's classification, in particular, would be in the best position to fulfill current international and local needs and that efforts to develop an internationally applicable CS classification would be most appropriately placed in building upon this classification. The use of a single CS classification will facilitate auditing, analyzing and comparing CS rates across different settings and help to create and implement effective strategies specifically targeted to optimize CS rates where necessary. PMID:21283801
Olier, Ivan; Springate, David A; Ashcroft, Darren M; Doran, Tim; Reeves, David; Planner, Claire; Reilly, Siobhan; Kontopantelis, Evangelos
2016-01-01
The use of Electronic Health Records databases for medical research has become mainstream. In the UK, increasing use of Primary Care Databases is largely driven by almost complete computerisation and uniform standards within the National Health Service. Electronic Health Records research often begins with the development of a list of clinical codes with which to identify cases with a specific condition. We present a methodology and accompanying Stata and R commands (pcdsearch/Rpcdsearch) to help researchers in this task. We present severe mental illness (SMI) as an example. We used the Clinical Practice Research Datalink, a UK Primary Care Database in which clinical information is largely organised using Read codes, a hierarchical clinical coding system. Pcdsearch is used to identify potentially relevant clinical codes and/or product codes from word-stubs and code-stubs suggested by clinicians. The returned code-lists are reviewed and codes relevant to the condition of interest are selected. The final code-list is then used to identify patients. We identified 270 Read codes linked to SMI and used them to identify cases in the database. We observed that our approach identified cases that would have been missed with a simpler approach using SMI registers defined within the UK Quality and Outcomes Framework. We described a framework for researchers of Electronic Health Records databases for identifying patients with a particular condition or matching certain clinical criteria. The method is invariant to coding system or database and can be used with SNOMED CT, ICD or other medical classification code-lists.
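The stub-search step can be sketched in a simplified form (the published pcdsearch/Rpcdsearch commands are Stata and R; this Python analogue is ours, with a hypothetical code-dictionary DataFrame):

```python
# Sketch: find candidate clinical codes from clinician-suggested word-stubs
# and code-stubs; `dictionary` is an assumed DataFrame with columns
# "read_code" and "description".
import pandas as pd

def stub_search(dictionary, word_stubs=(), code_stubs=()):
    desc = dictionary["description"].str.lower()
    hit = pd.Series(False, index=dictionary.index)
    for stub in word_stubs:                       # e.g. "schizo", "psychot"
        hit |= desc.str.contains(stub.lower(), regex=False)
    for stub in code_stubs:                       # prefix match on the code
        hit |= dictionary["read_code"].str.startswith(stub)
    return dictionary[hit]                        # candidate list for review
```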
Momo, Kenji
2018-01-01
Hospital-prepared drugs (HP), known as In'Naiseizai in Japan, are custom-prepared formulations which offer medical professionals an alternative administration pathway by changing the formulation of existing drugs according to a patient's needs. Preparing HP is one of several roles of pharmacists in providing personalized medicine at hospitals in Japan. In 2012, the Japanese Society of Hospital Pharmacists provided guidelines for the appropriate use of hospital-prepared drugs. The guide included the following: 1) documentation of the proper procedures, materials, prescription practices, etc., 2) required approval from the institutional review board for each HP according to its risk-based classification, and 3) assessment of the stability, efficacy, and safety of each HP. However, several problems persist for pharmacists trying to prepare or use HP appropriately; the most common is insufficient manpower to both assess and prepare these drugs during routine hospital work. To resolve this problem, we are developing an evidence database for HP based on surveys of the current literature. This database has been developed for 109 drugs to date. Data-driven assessment showed that 52 of the 109 drugs examined (47.7%) had supporting stability data. Notably, only 6 of the 109 HP (5.5%) in the database had all three characteristics of "stability", "safety", and "efficacy". In conclusion, the application of this database will save manpower hours for hospital pharmacists in the preparation of HP. In the near future, we will make this database available to the wider medical community via the web or through the literature.
Thompson, Bryony A; Spurdle, Amanda B; Plazzer, John-Paul; Greenblatt, Marc S; Akagi, Kiwamu; Al-Mulla, Fahd; Bapat, Bharati; Bernstein, Inge; Capellá, Gabriel; den Dunnen, Johan T; du Sart, Desiree; Fabre, Aurelie; Farrell, Michael P; Farrington, Susan M; Frayling, Ian M; Frebourg, Thierry; Goldgar, David E; Heinen, Christopher D; Holinski-Feder, Elke; Kohonen-Corish, Maija; Robinson, Kristina Lagerstedt; Leung, Suet Yi; Martins, Alexandra; Moller, Pal; Morak, Monika; Nystrom, Minna; Peltomaki, Paivi; Pineda, Marta; Qi, Ming; Ramesar, Rajkumar; Rasmussen, Lene Juel; Royer-Pokora, Brigitte; Scott, Rodney J; Sijmons, Rolf; Tavtigian, Sean V; Tops, Carli M; Weber, Thomas; Wijnen, Juul; Woods, Michael O; Macrae, Finlay; Genuardi, Maurizio
2014-02-01
The clinical classification of hereditary sequence variants identified in disease-related genes directly affects clinical management of patients and their relatives. The International Society for Gastrointestinal Hereditary Tumours (InSiGHT) undertook a collaborative effort to develop, test and apply a standardized classification scheme to constitutional variants in the Lynch syndrome-associated genes MLH1, MSH2, MSH6 and PMS2. Unpublished data submission was encouraged to assist in variant classification and was recognized through microattribution. The scheme was refined by multidisciplinary expert committee review of the clinical and functional data available for variants, applied to 2,360 sequence alterations, and disseminated online. Assessment using validated criteria altered classifications for 66% of 12,006 database entries. Clinical recommendations based on transparent evaluation are now possible for 1,370 variants that were not obviously protein truncating from nomenclature. This large-scale endeavor will facilitate the consistent management of families suspected to have Lynch syndrome and demonstrates the value of multidisciplinary collaboration in the curation and classification of variants in public locus-specific databases.
Food Composition Database Format and Structure: A User Focused Approach
Clancy, Annabel K.; Woods, Kaitlyn; McMahon, Anne; Probst, Yasmine
2015-01-01
This study aimed to investigate the needs of Australian food composition database users regarding database format, and to relate this to the format of databases available globally. Three semi-structured synchronous online focus groups (M = 3, F = 11) and n = 6 female key informant interviews were recorded. Beliefs surrounding the use, training, understanding, benefits and limitations of food composition data and databases were explored. Verbatim transcriptions underwent preliminary coding followed by thematic analysis with NVivo qualitative analysis software to extract the final themes. Schematic analysis was applied to the final themes related to database format. Desktop analysis also examined the format of six key globally available databases. Twenty-four dominant themes were established, of which five related to format: database use, food classification, framework, accessibility and availability, and data derivation. Desktop analysis revealed that food classification systems varied considerably between databases. Microsoft Excel was a common file format used in all databases, and available software varied between countries. Users also recognised that food composition database format should ideally be designed specifically for the intended use, have a user-friendly food classification system, incorporate accurate data with clear explanation of data derivation, and feature user input. However, such databases are limited by data availability and resources. Further exploration of data sharing options should be considered. Furthermore, users' understanding of the limitations of food composition data and databases is inherent to the correct application of non-specific databases. Therefore, further exploration of user FCDB training should also be considered. PMID:26554836
Barroso, João; Pfannenbecker, Uwe; Adriaens, Els; Alépée, Nathalie; Cluzel, Magalie; De Smedt, Ann; Hibatallah, Jalila; Klaric, Martina; Mewes, Karsten R; Millet, Marion; Templier, Marie; McNamee, Pauline
2017-02-01
A thorough understanding of which of the effects assessed in the in vivo Draize eye test are responsible for driving UN GHS/EU CLP classification is critical for an adequate selection of chemicals to be used in the development and/or evaluation of alternative methods/strategies and for properly assessing their predictive capacity and limitations. For this reason, Cosmetics Europe has compiled a database of Draize data (Draize eye test Reference Database, DRD) from external lists that were created to support past validation activities. This database contains 681 independent in vivo studies on 634 individual chemicals representing a wide range of chemical classes. A description of all the ocular effects observed in vivo, i.e. degree of severity and persistence of corneal opacity (CO), iritis, and/or conjunctiva effects, was added for each individual study in the database, and the studies were categorised according to their UN GHS/EU CLP classification and the main effect driving the classification. An evaluation of the various in vivo drivers of classification compiled in the database was performed to establish which of these are most important from a regulatory point of view. These analyses established that the most important drivers for Cat 1 Classification are (1) CO mean ≥ 3 (days 1-3) (severity) and (2) CO persistence on day 21 in the absence of severity, and those for Cat 2 classification are (3) CO mean ≥ 1 and (4) conjunctival redness mean ≥ 2. Moreover, it is shown that all classifiable effects (including persistence and CO = 4) should be present in ≥60 % of the animals to drive a classification. As a consequence, our analyses suggest the need for a critical revision of the UN GHS/EU CLP decision criteria for the Cat 1 classification of chemicals. Finally, a number of key criteria are identified that should be taken into consideration when selecting reference chemicals for the development, evaluation and/or validation of alternative methods and/or strategies for serious eye damage/eye irritation testing. Most important, the DRD is an invaluable tool for any future activity involving the selection of reference chemicals.
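Transcribed directly into code as a hedged sketch, with study-level summary statistics assumed precomputed, the listed drivers yield a simple decision function:

```python
# Sketch of the UN GHS/EU CLP drivers identified above; inputs are assumed
# study-level summaries, and the >= 60%-of-animals condition is simplified to
# a single fraction covering the triggering effect.
def ghs_category(co_mean_d1_3, co_persist_day21, co_mean, conj_red_mean,
                 frac_animals_with_effect):
    if frac_animals_with_effect < 0.60:       # effect needed in >= 60% of animals
        return "No Cat"
    if co_mean_d1_3 >= 3 or co_persist_day21:  # drivers (1) and (2) -> Cat 1
        return "Cat 1"
    if co_mean >= 1 or conj_red_mean >= 2:     # drivers (3) and (4) -> Cat 2
        return "Cat 2"
    return "No Cat"
```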
1984-12-01
Prepared for the Air Force Office of Scientific Research under Grant No. AFOSR 82-0322, December 1984. Unclassified.
Betrán, Ana Pilar; Vindevoghel, Nadia; Souza, Joao Paulo; Gülmezoglu, A. Metin; Torloni, Maria Regina
2014-01-01
Background Caesarean section (CS) rates continue to increase worldwide without a clear understanding of the main drivers and consequences. The lack of a standardized internationally-accepted classification system to monitor and compare CS rates is one of the barriers to a better understanding of this trend. The Robson's 10-group classification is based on simple obstetrical parameters (parity, previous CS, gestational age, onset of labour, fetal presentation and number of fetuses) and does not involve the indication for CS. This classification has become very popular over the last years in many countries. We conducted a systematic review to synthesize the experience of users on the implementation of this classification and proposed adaptations. Methods Four electronic databases were searched. A three-step thematic synthesis approach and a qualitative metasummary method were used. Results 232 unique reports were identified, 97 were selected for full-text evaluation and 73 were included. These publications reported on the use of Robson's classification in over 33 million women from 31 countries. According to users, the main strengths of the classification are its simplicity, robustness, reliability and flexibility. However, missing data, misclassification of women and lack of definition or consensus on core variables of the classification are challenges. To improve the classification for local use and to decrease heterogeneity within groups, several subdivisions in each of the 10 groups have been proposed. Group 5 (women with previous CS) received the largest number of suggestions. Conclusions The use of the Robson classification is increasing rapidly and spontaneously worldwide. Despite some limitations, this classification is easy to implement and interpret. Several suggested modifications could be useful to help facilities and countries as they work towards its implementation. PMID:24892928
Protein classification based on text document classification techniques.
Cheng, Betty Yee Man; Carbonell, Jaime G; Klein-Seetharaman, Judith
2005-03-01
The need for accurate, automated protein classification methods continues to increase as advances in biotechnology uncover new proteins. G-protein coupled receptors (GPCRs) are a particularly difficult superfamily of proteins to classify due to extreme diversity among its members. Previous comparisons of BLAST, k-nearest neighbor (k-NN), hidden Markov model (HMM) and support vector machine (SVM) classifiers using alignment-based features have suggested that classifiers at the complexity of SVM are needed to attain high accuracy. Here, analogous to document classification, we applied Decision Tree and Naive Bayes classifiers with chi-square feature selection on counts of n-grams (i.e. short peptide sequences of length n) to this classification task. Using the GPCR dataset and evaluation protocol from the previous study, the Naive Bayes classifier attained an accuracy of 93.0 and 92.4% in level I and level II subfamily classification respectively, while SVM has a reported accuracy of 88.4 and 86.3%. This is a 39.7 and 44.5% reduction in residual error for level I and level II subfamily classification, respectively. The Decision Tree, while inferior to SVM, outperforms HMM in both level I and level II subfamily classification. For those GPCR families whose profiles are stored in the Protein FAMilies database of alignments and HMMs (PFAM), our method performs comparably to a search against those profiles. Finally, our method can be generalized to other protein families by applying it to the superfamily of nuclear receptors, with 94.5, 97.8 and 93.6% accuracy in family, level I and level II subfamily classification respectively. Copyright 2005 Wiley-Liss, Inc.
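A reimplementation sketch of the document-classification analogy, with illustrative rather than tuned parameters, combines n-gram counts, chi-square feature selection and Naive Bayes in one pipeline:

```python
# Sketch: classify protein sequences like documents, using peptide n-gram
# counts, chi-square feature selection, and a Naive Bayes classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def gpcr_classifier(sequences, subfamilies):
    model = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(2, 3)),  # n-gram counts
        SelectKBest(chi2, k=2000),                             # chi-square selection
        MultinomialNB(),
    )
    return model.fit(sequences, subfamilies)   # sequences as plain strings
```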
Database Driven 6-DOF Trajectory Simulation for Debris Transport Analysis
NASA Technical Reports Server (NTRS)
West, Jeff
2008-01-01
Debris mitigation and risk assessment have been carried out by NASA and its contractors supporting Space Shuttle Return-To-Flight (RTF). As a part of this assessment, analysis of transport potential for debris that may be liberated from the vehicle or from pad facilities prior to tower clear (Lift-Off Debris) is being performed by MSFC. This class of debris includes plume driven and wind driven sources for which lift as well as drag are critical for the determination of the debris trajectory. As a result, NASA MSFC has a need for a debris transport or trajectory simulation that supports the computation of lift effect in addition to drag without the computational expense of fully coupled CFD with 6-DOF. A database driven 6-DOF simulation that uses aerodynamic force and moment coefficients for the debris shape that are interpolated from a database has been developed to meet this need. The design, implementation, and verification of the database driven six degree of freedom (6-DOF) simulation addition to the Lift-Off Debris Transport Analysis (LODTA) software are discussed in this paper.
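The database-driven step can be sketched conceptually: tabulated aerodynamic coefficients are interpolated at each integration step. The table axes and contents below are invented for illustration; LODTA's actual database is richer.

```python
# Sketch: interpolate a debris shape's lift coefficient from a table over
# Mach number and angle of attack, as queried inside a 6-DOF integration loop.
import numpy as np
from scipy.interpolate import RegularGridInterpolator

mach = np.linspace(0.0, 2.0, 9)                       # table axes (assumed)
alpha = np.radians(np.linspace(-180, 180, 37))
cl_table = np.random.default_rng(0).random((9, 37))   # stand-in coefficient data

cl_interp = RegularGridInterpolator((mach, alpha), cl_table)

def lift_coefficient(m, a):
    return float(cl_interp([[m, a]]))                 # one (Mach, alpha) query

print(lift_coefficient(0.8, np.radians(12.0)))
```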
Statistical classification of drug incidents due to look-alike sound-alike mix-ups.
Wong, Zoie Shui Yee
2016-06-01
It has been recognised that medication names that look or sound similar are a cause of medication errors. This study builds statistical classifiers for identifying medication incidents due to look-alike sound-alike mix-ups. A total of 227 patient safety incident advisories related to medication were obtained from the Canadian Patient Safety Institute's Global Patient Safety Alerts system. Eight feature selection strategies based on frequent terms, frequent drug terms and constituent terms were performed. Statistical text classifiers based on logistic regression, support vector machines with linear, polynomial, radial-basis and sigmoid kernels and decision tree were trained and tested. The models developed achieved an average accuracy of above 0.8 across all the model settings. The receiver operating characteristic curves indicated the classifiers performed reasonably well. The results obtained in this study suggest that statistical text classification can be a feasible method for identifying medication incidents due to look-alike sound-alike mix-ups based on a database of advisories from Global Patient Safety Alerts. © The Author(s) 2014.
Probabilistic topic modeling for the analysis and classification of genomic sequences
2015-01-01
Background Studies on genomic sequences for classification and taxonomic identification have a leading role in the biomedical field and in the analysis of biodiversity. These studies are focusing on the so-called barcode genes, representing a well defined region of the whole genome. Recently, alignment-free techniques are gaining more importance because they are able to overcome the drawbacks of sequence alignment techniques. In this paper a new alignment-free method for DNA sequences clustering and classification is proposed. The method is based on k-mers representation and text mining techniques. Methods The presented method is based on Probabilistic Topic Modeling, a statistical technique originally proposed for text documents. Probabilistic topic models are able to find in a document corpus the topics (recurrent themes) characterizing classes of documents. This technique, applied on DNA sequences representing the documents, exploits the frequency of fixed-length k-mers and builds a generative model for a training group of sequences. This generative model, obtained through the Latent Dirichlet Allocation (LDA) algorithm, is then used to classify a large set of genomic sequences. Results and conclusions We performed classification of over 7000 16S DNA barcode sequences taken from the Ribosomal Database Project (RDP) repository, training probabilistic topic models. The proposed method is compared to the RDP tool and the Support Vector Machine (SVM) classification algorithm in an extensive set of trials using both complete sequences and short sequence snippets (from 400 bp to 25 bp). Our method reaches very similar results to the RDP classifier and SVM for complete sequences. The most interesting results are obtained when short sequence snippets are considered. In these conditions the proposed method outperforms RDP and SVM with ultra-short sequences, and it exhibits a smooth decrease of performance, at every taxonomic level, when the sequence length is decreased. PMID:25916734
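A simplified sketch of this pipeline (our own, with parameters and protocol differing from the study's): sequences become documents of overlapping k-mers, LDA learns topic mixtures, and a simple classifier operates on them.

```python
# Sketch: k-mer "documents" -> LDA topic mixtures -> taxon classifier.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def kmers(seq, k=8):
    # turn a DNA sequence into a space-separated document of overlapping k-mers
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

def topic_model_classifier(train_seqs, train_taxa, n_topics=50):
    model = make_pipeline(
        CountVectorizer(),                                 # k-mer counts
        LatentDirichletAllocation(n_components=n_topics, random_state=0),
        LogisticRegression(max_iter=1000),                 # on topic mixtures
    )
    return model.fit([kmers(s) for s in train_seqs], train_taxa)
```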
Consumption Database. The California Energy Commission has created this on-line database for informal reporting ... classifications. The database also provides easy downloading of energy consumption data into Microsoft Excel (XLSX) files.
OntoMate: a text-mining tool aiding curation at the Rat Genome Database
Liu, Weisong; Laulederkind, Stanley J. F.; Hayman, G. Thomas; Wang, Shur-Jen; Nigam, Rajni; Smith, Jennifer R.; De Pons, Jeff; Dwinell, Melinda R.; Shimoyama, Mary
2015-01-01
The Rat Genome Database (RGD) is the premier repository of rat genomic, genetic and physiologic data. Converting data from free text in the scientific literature to a structured format is one of the main tasks of all model organism databases. RGD spends considerable effort manually curating gene, Quantitative Trait Locus (QTL) and strain information. The rapidly growing volume of biomedical literature and the active research in the biological natural language processing (bioNLP) community have given RGD the impetus to adopt text-mining tools to improve curation efficiency. Recently, RGD has initiated a project to use OntoMate, an ontology-driven, concept-based literature search engine developed at RGD, as a replacement for the PubMed (http://www.ncbi.nlm.nih.gov/pubmed) search engine in the gene curation workflow. OntoMate tags abstracts with gene names, gene mutations, organism name and most of the 16 ontologies/vocabularies used at RGD. All terms/entities tagged to an abstract are listed with the abstract in the search results. All listed terms are linked both to data entry boxes and a term browser in the curation tool. OntoMate also provides user-activated filters for species, date and other parameters relevant to the literature search. Using the system for literature search and import has streamlined the process compared to using PubMed. The system was built with a scalable and open architecture, including features specifically designed to accelerate the RGD gene curation process. With the use of bioNLP tools, RGD has added more automation to its curation workflow. Database URL: http://rgd.mcw.edu PMID:25619558
Sentiment classification technology based on Markov logic networks
NASA Astrophysics Data System (ADS)
He, Hui; Li, Zhigang; Yao, Chongchong; Zhang, Weizhe
2016-07-01
With diverse online media emerging, there is growing concern with the sentiment classification problem. At present, text sentiment classification mainly utilizes supervised machine learning methods, which exhibit a degree of domain dependency. On the basis of Markov logic networks (MLNs), this study proposed a cross-domain multi-task text sentiment classification method rooted in transfer learning. Through many-to-one knowledge transfer, labeled text sentiment classification knowledge was successfully transferred into other domains, and the precision of text sentiment orientation analysis was improved. The experimental results revealed the following: (1) the model based on an MLN demonstrated higher precision than the individual single-domain learning model; (2) multi-task transfer learning based on Markov logic networks could acquire more knowledge than self-domain learning. The cross-domain text sentiment classification model could significantly improve the precision and efficiency of text sentiment classification.
The Transporter Classification Database: recent advances.
Saier, Milton H; Yen, Ming Ren; Noto, Keith; Tamang, Dorjee G; Elkan, Charles
2009-01-01
The Transporter Classification Database (TCDB), freely accessible at http://www.tcdb.org, is a relational database containing sequence, structural, functional and evolutionary information about transport systems from a variety of living organisms, based on the International Union of Biochemistry and Molecular Biology-approved transporter classification (TC) system. It is a curated repository for factual information compiled largely from published references. It uses a functional/phylogenetic system of classification, and currently encompasses about 5000 representative transporters and putative transporters in more than 500 families. We here describe novel software designed to support and extend the usefulness of TCDB. Our recent efforts render it more user friendly, incorporate machine learning to input novel data in a semiautomatic fashion, and allow analyses that are more accurate and less time consuming. The availability of these tools has resulted in recognition of distant phylogenetic relationships and tremendous expansion of the information available to TCDB users.
Dictionary learning-based CT detection of pulmonary nodules
NASA Astrophysics Data System (ADS)
Wu, Panpan; Xia, Kewen; Zhang, Yanbo; Qian, Xiaohua; Wang, Ge; Yu, Hengyong
2016-10-01
Segmentation of lung features is one of the most important steps for computer-aided detection (CAD) of pulmonary nodules with computed tomography (CT). However, irregular shapes, complicated anatomical background and poor pulmonary nodule contrast make CAD a very challenging problem. Here, we propose a novel scheme for feature extraction and classification of pulmonary nodules through dictionary learning from training CT images, which does not require accurately segmented pulmonary nodules. Specifically, two classification-oriented dictionaries and one background dictionary are learnt to solve a two-category problem. In terms of the classification-oriented dictionaries, we calculate sparse coefficient matrices to extract intrinsic features for pulmonary nodule classification. The support vector machine (SVM) classifier is then designed to optimize the performance. Our proposed methodology is evaluated with the lung image database consortium and image database resource initiative (LIDC-IDRI) database, and the results demonstrate that the proposed strategy is promising.
MimoSA: a system for minimotif annotation
2010-01-01
Background Minimotifs are short peptide sequences within one protein, which are recognized by other proteins or molecules. While there are now several minimotif databases, they are incomplete. There are reports of many minimotifs in the primary literature which have yet to be annotated, while entirely novel minimotifs continue to be published on a weekly basis. Our recently proposed function and sequence syntax for minimotifs enables us to build a general tool that will facilitate structured annotation and management of minimotif data from the biomedical literature. Results We have built the MimoSA application for minimotif annotation. The application supports management of the Minimotif Miner database, literature tracking, and annotation of new minimotifs. MimoSA enables the visualization, organization, selection and editing functions of minimotifs and their attributes in the MnM database. For the literature components, MimoSA provides paper status tracking and scoring of papers for annotation through a freely available machine learning approach based on word correlation. The paper scoring algorithm is also available as a separate program, TextMine. Form-driven annotation of minimotif attributes enables entry of new minimotifs into the MnM database. Several supporting features increase the efficiency of annotation. The layered architecture of MimoSA allows for extensibility by separating the functions of paper scoring, minimotif visualization, and database management. MimoSA is readily adaptable to other annotation efforts that manually curate literature into a MySQL database. Conclusions MimoSA is an extensible application that facilitates minimotif annotation and integrates with the Minimotif Miner database. We have built MimoSA as an application that integrates dynamic abstract scoring with a high performance relational model of minimotif syntax. MimoSA's TextMine, an efficient paper-scoring algorithm, can be used to dynamically rank papers with respect to context. PMID:20565705
The process and utility of classification and regression tree methodology in nursing research
Kuhn, Lisa; Page, Karen; Ward, John; Worrall-Carter, Linda
2014-01-01
Aim This paper presents a discussion of classification and regression tree analysis and its utility in nursing research. Background Classification and regression tree analysis is an exploratory research method used to illustrate associations between variables not suited to traditional regression analysis. Complex interactions are demonstrated between covariates and variables of interest in inverted tree diagrams. Design Discussion paper. Data sources English language literature was sourced from eBooks, Medline Complete and CINAHL Plus databases, Google and Google Scholar, hard copy research texts and retrieved reference lists for terms including classification and regression tree* and derivatives and recursive partitioning from 1984–2013. Discussion Classification and regression tree analysis is an important method used to identify previously unknown patterns amongst data. Whilst there are several reasons to embrace this method as a means of exploratory quantitative research, issues regarding quality of data as well as the usefulness and validity of the findings should be considered. Implications for Nursing Research Classification and regression tree analysis is a valuable tool to guide nurses to reduce gaps in the application of evidence to practice. With the ever-expanding availability of data, it is important that nurses understand the utility and limitations of the research method. Conclusion Classification and regression tree analysis is an easily interpreted method for modelling interactions between health-related variables that would otherwise remain obscured. Knowledge is presented graphically, providing insightful understanding of complex and hierarchical relationships in an accessible and useful way to nursing and other health professions. PMID:24237048
Kongsholm, Gertrud Gansmo; Nielsen, Anna Katrine Toft; Damkier, Per
2015-11-01
It is well documented that drug-drug interaction databases (DIDs) differ substantially with respect to their classification of drug-drug interactions (DDIs). The aim of this study was to assess the online-available transparency of ownership, funding, information sources, classification systems, staff training, and underlying documentation of the five most commonly used open access English-language online DIDs and the three most commonly used subscription English-language online DIDs in the literature. We conducted a systematic literature search to identify the five most commonly used open access and the three most commonly used subscription DIDs in the medical literature. The following parameters were assessed for each of the databases: ownership, classification of interactions, primary information sources, and staff qualification. We compared the overall proportion of yes/no answers from open access databases and subscription databases by Fisher's exact test, both prior to and after requesting missing information. Among open access DIDs, 20/60 items could be verified from the webpage directly, compared to 24/36 for the subscription DIDs (p = 0.0028). Following personal request, these numbers rose to 22/60 and 30/36, respectively (p < 0.0001). For items within the "classification of interaction" domain, the proportions were 3/25 versus 11/15 available from the webpage (p = 0.0001) and 3/25 versus 15/15 (p < 0.0001) available upon personal request. The transparency of ownership, funding, information sources, classification systems, staff training, and underlying documentation available online varies substantially among DIDs. Open access DIDs scored statistically significantly lower on the parameters assessed.
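The headline comparison can be checked with a 2x2 Fisher's exact test in SciPy; the computed p-value should be close to the reported 0.0028.

```python
# Reproduce the 2x2 comparison: 20 of 60 items verifiable for open access
# DIDs vs 24 of 36 for subscription DIDs.
from scipy.stats import fisher_exact

table = [[20, 60 - 20],    # open access: verified vs not verified
         [24, 36 - 24]]    # subscription: verified vs not verified
odds_ratio, p_value = fisher_exact(table)
print(odds_ratio, p_value)   # p should be ~0.003, matching the reported 0.0028
```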
The LSST Data Mining Research Agenda
NASA Astrophysics Data System (ADS)
Borne, K.; Becla, J.; Davidson, I.; Szalay, A.; Tyson, J. A.
2008-12-01
We describe features of the LSST science database that are amenable to scientific data mining, object classification, outlier identification, anomaly detection, image quality assurance, and survey science validation. The data mining research agenda includes: scalability (at petabyte scales) of existing machine learning and data mining algorithms; development of grid-enabled parallel data mining algorithms; designing a robust system for brokering classifications from the LSST event pipeline (which may produce 10,000 or more event alerts per night); multi-resolution methods for exploration of petascale databases; indexing of multi-attribute multi-dimensional astronomical databases (beyond spatial indexing) for rapid querying of petabyte databases; and more.
Nishio, Shin-Ya; Usami, Shin-Ichi
2017-03-01
Recent advances in next-generation sequencing (NGS) have given rise to new challenges due to the difficulties in variant pathogenicity interpretation and large dataset management, including many kinds of public population databases as well as public or commercial disease-specific databases. Here, we report a new database development tool, named the "Clinical NGS Database," for improving clinical NGS workflow through the unified management of variant information and clinical information. This database software offers a two-feature approach to variant pathogenicity classification. The first is a phenotype similarity-based approach. The database allows easy comparison of the detailed phenotype of each patient with the average phenotype associated with the same gene mutation, at the variant or gene level. It is also possible to quickly browse patients carrying the same gene mutation. The other approach is a statistical approach to variant pathogenicity classification based on the odds ratio for comparisons between cases and controls for each inheritance mode (families with apparently autosomal dominant inheritance vs. controls, and families with apparently autosomal recessive inheritance vs. controls). A number of case studies are also presented to illustrate the utility of this database. © 2016 The Authors. Human Mutation published by Wiley Periodicals, Inc.
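A minimal sketch of the odds-ratio comparison described above, in plain Python; the counts and the Haldane-Anscombe zero-cell correction are illustrative assumptions, not values from the paper:

```python
import math

def odds_ratio_ci(case_alt, case_ref, ctrl_alt, ctrl_ref, z=1.96):
    """Odds ratio and 95% CI for carrying a variant in cases vs. controls.
    Adds 0.5 to every cell (Haldane-Anscombe correction) if any cell is zero."""
    a, b, c, d = case_alt, case_ref, ctrl_alt, ctrl_ref
    if 0 in (a, b, c, d):
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo, hi = math.exp(math.log(or_) - z * se), math.exp(math.log(or_) + z * se)
    return or_, (lo, hi)

# Hypothetical counts: 12 of 40 dominant-inheritance families carry the variant,
# versus 3 of 200 controls.
print(odds_ratio_ci(12, 28, 3, 197))
```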
NASA Astrophysics Data System (ADS)
Buta, Ronald J.
2017-11-01
Rings are important and characteristic features of disc-shaped galaxies. This paper is the first in a series that revisits galactic rings, with the goals of further understanding the nature of these features and examining their role in the secular evolution of galaxy structure. The series begins with a new sample of 3962 galaxies drawn from the Galaxy Zoo 2 citizen science database, selected because zoo volunteers recognized a ring-shaped pattern in the morphology as seen in Sloan Digital Sky Survey colour images. The galaxies are classified within the framework of the Comprehensive de Vaucouleurs revised Hubble-Sandage system. It is found that zoo volunteers cued on the same kinds of ring-like features that were recognized in the 1995 Catalogue of Southern Ringed Galaxies. This paper presents the full catalogue of morphological classifications, comparisons with other sources of classifications and some histograms designed mainly to highlight the content of the catalogue. The advantages of the sample are its large size and the generally good quality of the images; the main disadvantage is the low physical resolution, which limits the detectability of linearly small rings such as nuclear rings. The catalogue includes mainly inner and outer disc rings and lenses. Cataclysmic ('encounter-driven') rings (such as ring and polar ring galaxies) are recognized in less than 1 per cent of the sample.
TEXTINFO: a tool for automatic determination of patient clinical profiles using text analysis.
Borst, F.; Lyman, M.; Nhàn, N. T.; Tick, L. J.; Sager, N.; Scherrer, J. R.
1991-01-01
The clinical data contained in narrative patient documents is made available via grammatical and semantic processing. Retrievals from the resulting relational database tables are matched against a set of clinical descriptors to obtain clinical profiles of the patients in terms of the descriptors present in the documents. Discharge summaries of 57 Dept. of Digestive Surgery patients were processed in this manner. Factor analysis and discriminant analysis procedures were then applied, showing the profiles to be useful for diagnosis definitions (by establishing relations between diagnoses and clinical findings), for diagnosis assessment (by viewing the match between a definition and observed events recorded in a patient text), and potentially for outcome evaluation based on the classification abilities of clinical signs. PMID:1807679
Haptic Classification of Common Objects: Knowledge-Driven Exploration.
ERIC Educational Resources Information Center
Lederman, Susan J.; Klatzky, Roberta L.
1990-01-01
Theoretical and empirical issues relating to haptic exploration and the representation of common objects during haptic classification were investigated in 3 experiments involving a total of 112 college students. Results are discussed in terms of a computational model of human haptic object classification with implications for dextrous robot…
Development and characterization of a 3D high-resolution terrain database
NASA Astrophysics Data System (ADS)
Wilkosz, Aaron; Williams, Bryan L.; Motz, Steve
2000-07-01
A top-level description of methods used to generate elements of a high resolution 3D characterization database is presented. The database elements are defined as ground plane elevation map, vegetation height elevation map, material classification map, discrete man-made object map, and temperature radiance map. The paper will cover data collection by means of aerial photography, techniques of soft photogrammetry used to derive the elevation data, and the methodology followed to generate the material classification map. The discussion will feature the development of the database elements covering Fort Greely, Alaska. The developed databases are used by the US Army Aviation and Missile Command to evaluate the performance of various missile systems.
Structure and needs of global loss databases about natural disaster
NASA Astrophysics Data System (ADS)
Steuer, Markus
2010-05-01
Global loss databases are used for trend analyses and statistics in scientific projects, in studies for governmental and nongovernmental organizations, and in the insurance and finance industry as well. At the moment three global data sets are established: EM-DAT (CRED), Sigma (Swiss Re) and NatCatSERVICE (Munich Re). Together with the Asian Disaster Reduction Center (ADRC) and the United Nations Development Program (UNDP), the providers of these data sets started a collaborative initiative in 2007 with the aim of agreeing on and implementing a common "Disaster Category Classification and Peril Terminology for Operational Databases". This common classification has been established through several technical meetings and working groups and represents a first and important step in the development of a standardized international classification of disasters and terminology of perils. Concretely, this means setting up a common hierarchy and terminology for all global and regional databases on natural disasters and establishing a common and agreed definition of disaster groups, main types and sub-types of events. Georeferencing, temporal aspects, methodology and sourcing are further issues that have been identified and will be discussed. The new structure for global loss databases has already been implemented for Munich Re NatCatSERVICE. In the following oral session we will show the structure of the global databases as defined and, in addition, give more transparency on the data sets behind published statistics and analyses. The special focus will be on the catastrophe classification, from a moderate loss event up to a great natural catastrophe, on the quality of sources, and on insight into the assessment of overall and insured losses. Keywords: disaster category classification, peril terminology, overall and insured losses, definition
A taxonomy has been developed for outcomes in medical research to help improve knowledge discovery.
Dodd, Susanna; Clarke, Mike; Becker, Lorne; Mavergames, Chris; Fish, Rebecca; Williamson, Paula R
2018-04-01
There is increasing recognition that insufficient attention has been paid to the choice of outcomes measured in clinical trials. The lack of a standardized outcome classification system results in inconsistencies due to ambiguity and variation in how outcomes are described across different studies. Being able to classify by outcome would increase efficiency in searching sources such as clinical trial registries, patient registries, the Cochrane Database of Systematic Reviews, and the Core Outcome Measures in Effectiveness Trials (COMET) database of core outcome sets (COS), thus aiding knowledge discovery. A literature review was carried out to determine existing outcome classification systems, none of which were sufficiently comprehensive or granular for classification of all potential outcomes from clinical trials. A new taxonomy for outcome classification was developed, and as proof of principle, outcomes extracted from all published COS in the COMET database, selected Cochrane reviews, and clinical trial registry entries were classified using this new system. Application of this new taxonomy to COS in the COMET database revealed that 274/299 (92%) COS include at least one physiological outcome, whereas only 177 (59%) include at least one measure of impact (global quality of life or some measure of functioning) and only 105 (35%) made reference to adverse events. This outcome taxonomy will be used to annotate outcomes included in COS within the COMET database and is currently being piloted for use in Cochrane Reviews within the Cochrane Linked Data Project. Wider implementation of this standard taxonomy in trial and systematic review databases and registries will further promote efficient searching, reporting, and classification of trial outcomes. Copyright © 2018 The Authors. Published by Elsevier Inc. All rights reserved.
Asynchronous Data-Driven Classification of Weapon Systems
2009-10-01
Jin, Xin; Mukherjee, Kushal; Gupta, Shalabh; Ray, Asok; Phoha, Shashi; Damarla, Thyagaraju
Superordinate Shape Classification Using Natural Shape Statistics
ERIC Educational Resources Information Center
Wilder, John; Feldman, Jacob; Singh, Manish
2011-01-01
This paper investigates the classification of shapes into broad natural categories such as "animal" or "leaf". We asked whether such coarse classifications can be achieved by a simple statistical classification of the shape skeleton. We surveyed databases of natural shapes, extracting shape skeletons and tabulating their…
Building an Ontology-driven Database for Clinical Immune Research
Ma, Jingming
2006-01-01
Clinical research on the immune response usually generates a huge amount of biomedical testing data over a certain period of time. User-friendly data management systems based on relational databases will help immunologists/clinicians to fully manage these data. On the other hand, the same biological assays, such as ELISPOT and flow cytometric assays, are involved in immunological experiments regardless of study purpose. The reuse of biological knowledge is one of the driving forces behind ontology-driven data management. An ontology-driven database will therefore help to handle different clinical immune research studies and help immunologists/clinicians easily understand each other's immunological data. We discuss an outline for building an ontology-driven data management system for clinical immune research (ODMim). PMID:17238637
UCbase 2.0: ultraconserved sequences database (2014 update).
Lomonaco, Vincenzo; Martoglia, Riccardo; Mandreoli, Federica; Anderlucci, Laura; Emmett, Warren; Bicciato, Silvio; Taccioli, Cristian
2014-01-01
UCbase 2.0 (http://ucbase.unimore.it) is an update, extension and evolution of UCbase, a Web tool dedicated to the analysis of ultraconserved sequences (UCRs). UCRs are 481 sequences >200 bases sharing 100% identity among human, mouse and rat genomes. They are frequently located in genomic regions known to be involved in cancer or differentially expressed in human leukemias and carcinomas. UCbase 2.0 is a platform-independent Web resource that includes the updated version of the human genome annotation (hg19), information linking disorders to chromosomal coordinates based on the Systematized Nomenclature of Medicine classification, a query tool to search for Single Nucleotide Polymorphisms (SNPs) and a new text box to directly interrogate the database using a MySQL interface. To facilitate the interactive visual interpretation of UCR chromosomal positioning, UCbase 2.0 now includes a graph visualization interface directly linked to UCSC genome browser. Database URL: http://ucbase.unimore.it. © The Author(s) 2014. Published by Oxford University Press.
The present report describes a strategy to refine the current Cramer classification of the TTC concept using a broad database (DB) termed TTC RepDose. Cramer classes 1-3 overlap to some extent, indicating a need for a better separation of structural classes likely to be toxic, mo...
Galaxy Classifications with Deep Learning
NASA Astrophysics Data System (ADS)
Lukic, Vesna; Brüggen, Marcus
2017-06-01
Machine learning techniques have proven to be increasingly useful in astronomical applications over the last few years, for example in object classification, estimating redshifts and data mining. One example of object classification is classifying galaxy morphology. This is a tedious task to do manually, especially as the datasets become larger with surveys that have a broader and deeper search-space. The Kaggle Galaxy Zoo competition presented the challenge of writing an algorithm to find the probability that a galaxy belongs in a particular class, based on SDSS optical imaging data. The use of convolutional neural networks (convnets) proved to be a popular solution to the problem, as they have also produced unprecedented classification accuracies on other image databases such as the database of handwritten digits (MNIST) and the CIFAR database of images. We experiment with the convnets that comprised the winning solution, but using broad classifications. The effect of changing the number of layers is explored, as well as using a different activation function, to help in developing an intuition of how the networks function and to see how they can be applied to radio galaxy images.
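A minimal sketch of a small convnet of the kind described, assuming TensorFlow/Keras; the input size, class count, and layer widths are illustrative assumptions, not the competition-winning architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Small convnet for broad galaxy morphology classes (e.g. smooth / disc / artifact).
num_classes = 3
model = models.Sequential([
    layers.Input(shape=(64, 64, 3)),          # downsampled colour cutouts
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),  # adding/removing such blocks probes depth
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),     # swap "relu" to test other activations
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```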
What are the most effective strategies for improving quality and safety of health care?
Scott, I
2009-06-01
There is now a plethora of different quality improvement strategies (QIS) for optimizing health care, some clinician/patient driven, others manager/policy-maker driven. Which of these are most effective remains unclear despite expressed concerns about potential for QIS-related patient harm and wasting of resources. The objective of this study was to review published literature assessing the relative effectiveness of different QIS. Data sources comprising PubMed Clinical Queries, Cochrane Library and its Effective Practice and Organization of Care database, and HealthStar were searched for studies of QIS between January 1985 and February 2008 using search terms based on an a priori QIS classification suggested by experts. Systematic reviews of controlled trials were selected in determining effect sizes for specific QIS, which were compared as a narrative meta-review. Clinician/patient driven QIS were associated with stronger evidence of efficacy and larger effect sizes than manager/policy-maker driven QIS. The most effective strategies (>10% absolute increase in appropriate care or equivalent measure) included clinician-directed audit and feedback cycles, clinical decision support systems, specialty outreach programmes, chronic disease management programmes, continuing professional education based on interactive small-group case discussions, and patient-mediated clinician reminders. Pay-for-performance schemes directed to clinician groups and organizational process redesign were modestly effective. Other manager/policy-maker driven QIS including continuous quality improvement programmes, risk and safety management systems, public scorecards and performance reports, external accreditation, and clinical governance arrangements have not been adequately evaluated with regard to effectiveness. QIS are heterogeneous and methodological flaws in much of the evaluative literature limit validity and generalizability of results. Based on current best available evidence, clinician/patient driven QIS appear to be more effective than manager/policy-maker driven QIS although the latter have, in many instances, attracted insufficient robust evaluations to accurately determine their comparative effectiveness.
Olier, Ivan; Springate, David A.; Ashcroft, Darren M.; Doran, Tim; Reeves, David; Planner, Claire; Reilly, Siobhan; Kontopantelis, Evangelos
2016-01-01
Background The use of Electronic Health Records databases for medical research has become mainstream. In the UK, increasing use of Primary Care Databases is largely driven by almost complete computerisation and uniform standards within the National Health Service. Electronic Health Records research often begins with the development of a list of clinical codes with which to identify cases with a specific condition. We present a methodology and accompanying Stata and R commands (pcdsearch/Rpcdsearch) to help researchers in this task. We present severe mental illness (SMI) as an example. Methods We used the Clinical Practice Research Datalink, a UK Primary Care Database in which clinical information is largely organised using Read codes, a hierarchical clinical coding system. Pcdsearch is used to identify potentially relevant clinical codes and/or product codes from word-stubs and code-stubs suggested by clinicians. The returned code-lists are reviewed and codes relevant to the condition of interest are selected. The final code-list is then used to identify patients. Results We identified 270 Read codes linked to SMI and used them to identify cases in the database. We observed that our approach identified cases that would have been missed with a simpler approach using SMI registers defined within the UK Quality and Outcomes Framework. Conclusion We described a framework for researchers of Electronic Health Records databases, for identifying patients with a particular condition or matching certain clinical criteria. The method is invariant to coding system or database and can be used with SNOMED CT, ICD or other medical classification code-lists. PMID:26918439
Natural image classification driven by human brain activity
NASA Astrophysics Data System (ADS)
Zhang, Dai; Peng, Hanyang; Wang, Jinqiao; Tang, Ming; Xue, Rong; Zuo, Zhentao
2016-03-01
Natural image classification has been a hot topic in computer vision and pattern recognition research. Since the performance of an image classification system can be improved by feature selection, many image feature selection methods have been developed. However, the existing supervised feature selection methods are typically driven by class label information that is identical for different samples from the same class, ignoring within-class image variability and therefore degrading feature selection performance. In this study, we propose a novel feature selection method driven by human brain activity signals collected using the fMRI technique while human subjects were viewing natural images of different categories. The fMRI signals associated with subjects viewing different images encode the human perception of natural images, and therefore may capture image variability within and across categories. We then select image features with the guidance of fMRI signals from brain regions with active response to image viewing. In particular, bag-of-words features based on the GIST descriptor are extracted from natural images for classification, and a sparse regression-based feature selection method is adapted to select the image features that best predict the fMRI signals. Finally, a classification model is built on the selected image features to classify images without fMRI signals. Validation experiments, classifying images from 4 categories viewed by two subjects, demonstrated that our method achieves much better classification performance than classifiers built on image features selected by traditional feature selection methods.
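A minimal sketch of the sparse-regression selection step, assuming scikit-learn, with synthetic arrays standing in for the real GIST bag-of-words features and fMRI responses:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))        # image features (e.g. GIST bag-of-words)
y_fmri = X[:, :20] @ rng.normal(size=20) + 0.1 * rng.normal(size=200)  # one ROI signal
y_class = (y_fmri > np.median(y_fmri)).astype(int)                     # toy labels

# Select the features that best predict the fMRI response: nonzero Lasso weights.
lasso = Lasso(alpha=0.05).fit(X, y_fmri)
selected = np.flatnonzero(lasso.coef_)
print(f"{selected.size} features selected")

# Build the final classifier on the selected features only (no fMRI needed at test time).
clf = SVC().fit(X[:, selected], y_class)
```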
Mycobacteriophage genome database.
Joseph, Jerrine; Rajendran, Vasanthi; Hassan, Sameer; Kumar, Vanaja
2011-01-01
Mycobacteriophage genome database (MGDB) is an exclusive repository of the 64 completely sequenced mycobacteriophages with annotated information. It is a comprehensive compilation of various gene parameters captured from several databases, pooled together to empower mycobacteriophage researchers. The MGDB (Version 1.0) comprises 6086 genes from 64 mycobacteriophages classified into 72 families based on the ACLAME database. Manual curation was aided by information available from public databases, which was enriched further by analysis. Its web interface allows browsing as well as querying the classification. The main objective is to collect and organize the complexity inherent to mycobacteriophage protein classification in a rational way. The other objective is to allow browsing of existing and new genomes and to describe their functional annotation. The database is available for free at http://mpgdb.ibioinformatics.org/mpgdb.php.
Strickler, Jeffery C; Lopiano, Kenneth K
2016-11-01
This study profiles an innovative approach to capturing patient satisfaction data from emergency department (ED) patients by implementing an electronic survey method, and compares responders to nonresponders. Our hypothesis is that the cohort of survey respondents will be similar to nonresponders in terms of the key characteristics of age, gender, race, ethnicity, ED disposition, and payor status. This study is a cross-sectional design using secondary data from the database and provides an opportunity for univariate analysis of the key characteristics for each group. The data elements were abstracted from the database and compared with the same key characteristics from a similar sample of nonresponders to the ED satisfaction survey. Age showed a statistically significant difference between responders and nonresponders. Comparison by disposition status showed no substantial difference between responders and nonresponders. Gender distribution showed a greater number of female than male responders. Race distribution showed a greater number of responses from white and Asian patients compared with African American patients. A review of ethnicity showed that fewer Hispanic patients responded. An evaluation by payor classification showed a greater number and response rate among those with a commercial or Workers' Comp payor source. The response rate by Medicare recipients was stronger than expected; however, the response rates by Medicaid recipients and self-pay patients could be a concern for underrepresentation of lower socioeconomic groups. Finally, the evaluation of the method of notification showed that notification by both e-mail and text substantially improved response rates. The evaluation of key characteristics showed no difference related to disposition, but differences related to age, gender, race, ethnicity, and payor classification. These results point to a potential concern for underrepresentation of lower socioeconomic groups.
ERIC Educational Resources Information Center
Kunina-Habenicht, Olga; Rupp, André A.; Wilhelm, Oliver
2017-01-01
Diagnostic classification models (DCMs) hold great potential for applications in summative and formative assessment by providing discrete multivariate proficiency scores that yield statistically driven classifications of students. Using data from a newly developed diagnostic arithmetic assessment that was administered to 2032 fourth-grade students…
Bibliometric trend and patent analysis in nano-alloys research for period 2000-2013.
Živković, Dragana; Niculović, Milica; Manasijević, Dragan; Minić, Duško; Ćosović, Vladan; Sibinović, Maja
2015-01-01
This paper presents an overview of current situation in nano-alloys investigations based on bibliometric and patent analysis. Bibliometric analysis data, for the period 2000 to 2013, were obtained using Scopus database as selected index database, whereas analyzed parameters were: number of scientific papers per year, authors, countries, affiliations, subject areas and document types. Analysis of nano-alloys patents was done with specific database, using the International Patent Classification and Patent Scope for the period 2003 to 2013. Information found in this database was the number of patents, patent classification by country, patent applicators, main inventors and publication date.
Concurrent tumor segmentation and registration with uncertainty-based sparse non-uniform graphs.
Parisot, Sarah; Wells, William; Chemouny, Stéphane; Duffau, Hugues; Paragios, Nikos
2014-05-01
In this paper, we present a graph-based concurrent brain tumor segmentation and atlas-to-diseased-patient registration framework. Both segmentation and registration problems are modeled using a unified pairwise discrete Markov Random Field model on a sparse grid superimposed on the image domain. Segmentation is addressed based on pattern classification techniques, while registration is performed by maximizing the similarity between volumes and is modular with respect to the matching criterion. The two problems are coupled by relaxing the registration term in the tumor area, corresponding to areas of high classification score and high dissimilarity between volumes. In order to overcome the main shortcomings of discrete approaches regarding appropriate sampling of the solution space as well as important memory requirements, content-driven sampling of the discrete displacement set and the sparse grid is considered, based on the local segmentation and registration uncertainties recovered by the min-marginal energies. State-of-the-art results on a substantial low-grade glioma database demonstrate the potential of our method, while the proposed approach shows maintained performance with strongly reduced model complexity. Copyright © 2014 Elsevier B.V. All rights reserved.
Pedoinformatics Approach to Soil Text Analytics
NASA Astrophysics Data System (ADS)
Furey, J.; Seiter, J.; Davis, A.
2017-12-01
The several extant schema for the classification of soils rely on differing criteria, but the major soil science taxonomies, including the United States Department of Agriculture (USDA) and the international harmonized World Reference Base for Soil Resources systems, are based principally on inferred pedogenic properties. These taxonomies largely result from compiled individual observations of soil morphologies within soil profiles, and the vast majority of this pedologic information is contained in qualitative text descriptions. We present text mining analyses of hundreds of gigabytes of parsed text and other data in the digitally available USDA soil taxonomy documentation, the Soil Survey Geographic (SSURGO) database, and the National Cooperative Soil Survey (NCSS) soil characterization database. These analyses implemented iPython calls to Gensim modules for topic modelling, with latent semantic indexing completed down to the lowest taxon level (soil series) paragraphs. Via a custom extension of the Natural Language Toolkit (NLTK), approximately one percent of the USDA soil series descriptions were used to train a classifier for the remainder of the documents, essentially by treating soil science words as comprising a novel language. While location-specific descriptors at the soil series level are amenable to geomatics methods, unsupervised clustering of the occurrence of other soil science words did not closely follow the usual hierarchy of soil taxa. We present preliminary phrasal analyses that may account for some of these effects.
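A minimal sketch of the kind of Gensim topic-modelling pipeline described above, assuming tokenized soil series description paragraphs; the example documents are invented stand-ins for the SSURGO/NCSS text:

```python
from gensim import corpora, models

# Tokenized soil series description paragraphs (stand-ins for the real corpus).
docs = [
    ["fine", "loamy", "mixed", "mesic", "typic", "hapludalfs"],
    ["sandy", "skeletal", "mixed", "frigid", "typic", "cryorthents"],
    ["fine", "silty", "mixed", "mesic", "aquic", "hapludalfs"],
]

dictionary = corpora.Dictionary(docs)                  # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]     # bag-of-words vectors
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)  # latent semantic indexing

for topic_id, terms in lsi.print_topics():
    print(topic_id, terms)
```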
Performance of an open-source heart sound segmentation algorithm on eight independent databases.
Liu, Chengyu; Springer, David; Clifford, Gari D
2017-08-01
Heart sound segmentation is a prerequisite step for the automatic analysis of heart sound signals, facilitating the subsequent identification and classification of pathological events. Recently, hidden Markov model-based algorithms have received increased interest due to their robustness in processing noisy recordings. In this study we aim to evaluate the performance of the recently published logistic regression based hidden semi-Markov model (HSMM) heart sound segmentation method, by using a wider variety of independently acquired data of varying quality. Firstly, we constructed a systematic evaluation scheme based on a new collection of heart sound databases, which we assembled for the PhysioNet/CinC Challenge 2016. This collection includes a total of more than 120 000 s of heart sounds recorded from 1297 subjects (including both healthy subjects and cardiovascular patients) and comprises eight independent heart sound databases sourced from multiple independent research groups around the world. Then, the HSMM-based segmentation method was evaluated using the assembled eight databases. The common evaluation metrics of sensitivity, specificity, accuracy, as well as the F1 measure were used. In addition, the effect of varying the tolerance window for determining a correct segmentation was evaluated. The results confirm the high accuracy of the HSMM-based algorithm on a separate test dataset comprising 102 306 heart sounds. An average F1 score of 98.5% for segmenting S1 and systole intervals and 97.2% for segmenting S2 and diastole intervals was observed. The F1 score was shown to increase with an increase in the tolerance window size, as expected. The high segmentation accuracy of the HSMM-based algorithm on a large database confirmed the algorithm's effectiveness. The described evaluation framework, combined with the largest collection of open access heart sound data, provides essential resources for evaluators who need to test their algorithms with realistic data and share reproducible results.
Chandonia, John-Marc; Fox, Naomi K; Brenner, Steven E
2017-02-03
SCOPe (Structural Classification of Proteins-extended, http://scop.berkeley.edu) is a database of relationships between protein structures that extends the Structural Classification of Proteins (SCOP) database. SCOP is an expert-curated ordering of domains from the majority of proteins of known structure in a hierarchy according to structural and evolutionary relationships. SCOPe classifies the majority of protein structures released since SCOP development concluded in 2009, using a combination of manual curation and highly precise automated tools, aiming to have the same accuracy as fully hand-curated SCOP releases. SCOPe also incorporates and updates the ASTRAL compendium, which provides several databases and tools to aid in the analysis of the sequences and structures of proteins classified in SCOPe. SCOPe continues high-quality manual classification of new superfamilies, a key feature of SCOP. Artifacts such as expression tags are now separated into their own class, in order to distinguish them from the homology-based annotations in the remainder of the SCOPe hierarchy. SCOPe 2.06 contains 77,439 Protein Data Bank entries, double the 38,221 structures classified in SCOP. Copyright © 2016 The Author(s). Published by Elsevier Ltd. All rights reserved.
The value of protein structure classification information-Surveying the scientific literature
Fox, Naomi K.; Brenner, Steven E.; Chandonia, John -Marc
2015-08-27
The Structural Classification of Proteins (SCOP) and Class, Architecture, Topology, Homology (CATH) databases have been valuable resources for protein structure classification for over 20 years. Development of SCOP (version 1) concluded in June 2009 with SCOP 1.75. The SCOPe (SCOP-extended) database offers continued development of the classic SCOP hierarchy, adding over 33,000 structures. We have attempted to assess the impact of these two decades-old resources and guide future development. To this end, we surveyed recent articles to learn how structure classification data are used. Of 571 articles published in 2012-2013 that cite SCOP, 439 actually use data from the resource. We found that the type of use was fairly evenly distributed among four top categories: A) study protein structure or evolution (27% of articles), B) train and/or benchmark algorithms (28% of articles), C) augment non-SCOP datasets with SCOP classification (21% of articles), and D) examine the classification of one protein/a small set of proteins (22% of articles). Most articles described computational research, although 11% described purely experimental research, and a further 9% included both. We examined how CATH and SCOP were used in 158 articles that cited both databases: while some studies used only one dataset, the majority used data from both resources. Protein structure classification remains highly relevant for a diverse range of problems and settings.
Concentrations of indoor pollutants (CIP) database user's manual (Version 4. 0)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Apte, M.G.; Brown, S.R.; Corradi, C.A.
1990-10-01
This is the latest release of the database and the user manual. The user manual is a tutorial and reference for utilizing the CIP Database system. An installation guide is included to cover various hardware configurations. Numerous examples and explanations of the dialogue between the user and the database program are provided. It is hoped that this resource will, along with on-line help and the menu-driven software, make for a quick and easy learning curve. For the purposes of this manual, it is assumed that the user is acquainted with the goals of the CIP Database, which are: (1) to collect existing measurements of concentrations of indoor air pollutants in a user-oriented database and (2) to provide a repository of references citing measured field results openly accessible to a wide audience of researchers, policy makers, and others interested in the issues of indoor air quality. The database software, as distinct from the data, is contained in two files, CIP.EXE and PFIL.COM. CIP.EXE is made up of a number of programs written in dBase III command code and compiled using Clipper into a single, executable file. PFIL.COM is a program written in Turbo Pascal that handles the output of summary text files and is called from CIP.EXE. Version 4.0 of the CIP Database is current through March 1990.
SVM-RFE based feature selection and Taguchi parameters optimization for multiclass SVM classifier.
Huang, Mei-Ling; Hung, Yung-Hsiang; Lee, W M; Li, R K; Jiang, Bo-Ru
2014-01-01
Recently, support vector machine (SVM) has shown excellent performance on classification and prediction and is widely used in disease diagnosis and medical assistance. However, SVM only functions well on two-group classification problems. This study combines feature selection and SVM recursive feature elimination (SVM-RFE) to investigate the classification accuracy of multiclass problems for the Dermatology and Zoo databases. The Dermatology dataset contains 33 feature variables, 1 class variable, and 366 testing instances; the Zoo dataset contains 16 feature variables, 1 class variable, and 101 testing instances. The feature variables in the two datasets were sorted in descending order by explanatory power, and different feature sets were selected by SVM-RFE to explore classification accuracy. Meanwhile, the Taguchi method was combined with the SVM classifier in order to optimize the parameters C and γ and increase classification accuracy for multiclass classification. The experimental results show that classification accuracy can exceed 95% after SVM-RFE feature selection and Taguchi parameter optimization for the Dermatology and Zoo databases. PMID:25295306
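A minimal sketch of the SVM-RFE plus parameter-search procedure, assuming scikit-learn; grid search stands in for the Taguchi design, and the built-in iris data is a stand-in for the Dermatology/Zoo datasets:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # stand-in multiclass dataset

# SVM-RFE: recursively drop the weakest features using a linear SVM's weights.
rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=2).fit(X, y)
X_sel = X[:, rfe.support_]

# Parameter search over C and gamma (grid search here, a Taguchi design in the paper).
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
                    cv=5).fit(X_sel, y)
print(grid.best_params_,
      cross_val_score(grid.best_estimator_, X_sel, y, cv=5).mean())
```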
NASA Astrophysics Data System (ADS)
Zhang, Y.; Wen, J.; Xiao, Q.; You, D.
2016-12-01
Operational algorithms for land surface BRDF/Albedo products are mainly developed from kernel-driven models, combining atmospherically corrected, multidate, multiband surface reflectance to extract BRDF parameters. The Angular and Spectral Kernel-Driven model (ASK model), which incorporates the component spectra as a priori knowledge, provides a potential way to make full use of multi-sensor data with multispectral information and accumulated observations. However, the ASK model is still not feasible for global BRDF/Albedo inversions due to the lack of sufficient field measurements of component spectra at the large scale. This research outlines a parameterization scheme for the component spectra for global-scale BRDF/Albedo inversions within the ASK framework. The parameter γ(λ) can be derived from the ratio of the leaf reflectance to the soil reflectance, supported by a globally distributed soil spectral library and the ANGERS and LOPEX leaf optical properties databases. To account for the intrinsic variability in both the land cover and spectral dimensions, the mean and standard deviation of γ(λ) for 28 soil units and 4 leaf types in seven MODIS bands were calculated, with a world soil map used for global BRDF/Albedo product retrieval. Compared to retrievals from BRF datasets simulated by the PROSAIL model, the ASK model shows acceptable accuracy under this parameterization strategy, with an RMSE at most 0.007 higher than inversion with the true component spectra. The results indicate that the classification of the ratio helps capture the spectral characteristics in BRDF/Albedo retrieval, whereas the ratio range should be controlled within 8% in each band. Ground-based measurements in the Heihe river basin were used to validate the accuracy of the improved ASK model, and the generated broadband albedo products show good agreement with in situ data, which suggests that the component-spectra improvement to the ASK model has potential for global-scale BRDF/Albedo inversions.
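A minimal sketch of a kernel-driven BRDF inversion of the general Ross-Li form (not the ASK model itself), assuming precomputed volumetric and geometric kernel values for each observation; the numbers are invented:

```python
import numpy as np

# Multi-angular observations of one pixel in one band: reflectance plus
# precomputed kernel values K_vol and K_geo for each sun/view geometry.
refl  = np.array([0.21, 0.25, 0.19, 0.28, 0.23])
k_vol = np.array([0.05, 0.18, -0.02, 0.25, 0.11])   # e.g. RossThick kernel
k_geo = np.array([-1.1, -0.8, -1.3, -0.6, -0.9])    # e.g. LiSparse kernel

# Linear kernel-driven model: R = f_iso + f_vol*K_vol + f_geo*K_geo,
# solved for the three BRDF parameters by least squares.
A = np.column_stack([np.ones_like(refl), k_vol, k_geo])
(f_iso, f_vol, f_geo), *_ = np.linalg.lstsq(A, refl, rcond=None)
print(f"f_iso={f_iso:.3f}, f_vol={f_vol:.3f}, f_geo={f_geo:.3f}")
```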
Toews, Matthew; Wells, William M.; Collins, Louis; Arbel, Tal
2013-01-01
This paper presents feature-based morphometry (FBM), a new, fully data-driven technique for identifying group-related differences in volumetric imagery. In contrast to most morphometry methods which assume one-to-one correspondence between all subjects, FBM models images as a collage of distinct, localized image features which may not be present in all subjects. FBM thus explicitly accounts for the case where the same anatomical tissue cannot be reliably identified in all subjects due to disease or anatomical variability. A probabilistic model describes features in terms of their appearance, geometry, and relationship to sub-groups of a population, and is automatically learned from a set of subject images and group labels. Features identified indicate group-related anatomical structure that can potentially be used as disease biomarkers or as a basis for computer-aided diagnosis. Scale-invariant image features are used, which reflect generic, salient patterns in the image. Experiments validate FBM clinically in the analysis of normal (NC) and Alzheimer’s (AD) brain images using the freely available OASIS database. FBM automatically identifies known structural differences between NC and AD subjects in a fully data-driven fashion, and obtains an equal error classification rate of 0.78 on new subjects. PMID:20426102
DOT National Transportation Integrated Search
2006-01-01
The Transportation-Markings Database project (within the T-M Monograph Series) began in 1997 with the publishing of the initial component, Transportation-Markings Database: Marine. That study was joined by T-M Database: Traffic Control Devices (1998)...
The Classification of Romanian High-Schools
ERIC Educational Resources Information Center
Ivan, Ion; Milodin, Daniel; Naie, Lucian
2006-01-01
The article tackles the issue of classifying the high schools of a city or district, or of Romania as a whole. The classification criteria are presented. The National Database of Education is also presented and the application of the criteria is illustrated. An algorithm for high-school multi-rank classification is proposed in order to build classes of…
Searching bioremediation patents through Cooperative Patent Classification (CPC).
Prasad, Rajendra
2016-03-01
Patent classification systems have traditionally evolved independently at each patent jurisdiction to classify patents handled by their examiners, so that previous patents can be searched while dealing with new patent applications. As the patent databases maintained by these jurisdictions went online, for free access by the public and for global search of prior art by examiners, the need arose for a common platform and a uniform structure of patent databases. The diversity of the different classifications, however, posed problems for integrating and searching relevant patents across patent jurisdictions. To address this problem of comparability of data from different sources and of searching patents, WIPO in the recent past developed what is known as the International Patent Classification (IPC) system, which most countries readily adopted to code their patents with IPC codes along with their own codes. The Cooperative Patent Classification (CPC) is the latest patent classification system, based on the IPC/European Classification (ECLA) system and developed by the European Patent Office (EPO) and the United States Patent and Trademark Office (USPTO), and is likely to become a global standard. This paper discusses this new classification system with reference to patents on bioremediation.
Text Classification for Organizational Researchers
Kobayashi, Vladimer B.; Mol, Stefan T.; Berkers, Hannah A.; Kismihók, Gábor; Den Hartog, Deanne N.
2017-01-01
Organizations are increasingly interested in classifying texts or parts thereof into categories, as this enables more effective use of their information. Manual procedures for text classification work well for up to a few hundred documents. However, when the number of documents is larger, manual procedures become laborious, time-consuming, and potentially unreliable. Techniques from text mining facilitate the automatic assignment of text strings to categories, making classification expedient, fast, and reliable, which creates potential for its application in organizational research. The purpose of this article is to familiarize organizational researchers with text mining techniques from machine learning and statistics. We describe the text classification process in several roughly sequential steps, namely training data preparation, preprocessing, transformation, application of classification techniques, and validation, and provide concrete recommendations at each step. To help researchers develop their own text classifiers, the R code associated with each step is presented in a tutorial. The tutorial draws from our own work on job vacancy mining. We end the article by discussing how researchers can validate a text classification model and the associated output. PMID:29881249
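The article's tutorial presents R code; as a rough parallel, here is a minimal scikit-learn sketch of the same pipeline steps (training data, preprocessing/transformation, classification, validation), with invented toy texts standing in for, e.g., job vacancy data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy labelled strings standing in for vacancy texts (training data preparation).
texts = ["seeking senior java developer", "nurse needed for night shifts",
         "backend engineer python sql", "registered nurse icu experience",
         "data engineer cloud pipelines", "clinical nurse specialist wanted"]
labels = ["tech", "health", "tech", "health", "tech", "health"]

# Preprocessing + transformation (TF-IDF) + classification in one pipeline.
clf = make_pipeline(TfidfVectorizer(lowercase=True, stop_words="english"),
                    LogisticRegression(max_iter=1000))

# Validation via cross-validation, the tutorial's final step.
print(cross_val_score(clf, texts, labels, cv=3).mean())
```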
Lisan, Quentin; Moya-Plana, Antoine; Bonfils, Pierre
2017-11-01
The risk factors for the recurrence of sinonasal inverted papilloma are still unclear. To investigate the potential association between the Krouse classification and the recurrence rates of sinonasal inverted papilloma. The EMBASE and MEDLINE databases were searched for the period January 1, 1964, through September 30, 2016, using the following search strategy: (paranasal sinuses [Medical Subject Headings (MeSH) terms] OR sinonasal [all fields]) AND (inverted papilloma [MeSH terms] OR (inverted [all fields] AND papilloma [all fields])). The inclusion criteria were (1) studies including sinonasal inverted papilloma only and no other forms of papillomas, such as oncocytic papilloma; (2) minimum follow-up of 1 year after the surgery; and (3) clear report of cases (recurrence) and controls according to the Krouse classification system or deducible from the full-text article. Literature search was performed by 2 reviewers. Of the 625 articles retrieved in the literature, 97 full-text articles were reviewed. Observational cohort studies or randomized controlled trials were included, and the following variables were extracted from full-text articles: authors of the study, publication year, follow-up data, and number of cases (recurrence) and controls (no recurrence) in each of the 4 stages of the Krouse classification system. The Meta-analysis of Observational Studies in Epidemiology (MOOSE) guidelines were followed. Odds ratios (ORs) and 95% CIs were estimated, and data of included studies were pooled using a random-effects model. The main outcome was recurrence after surgical removal of sinonasal inverted papilloma according to each stage of the Krouse classification system. Thirteen studies comprising 1787 patients were analyzed. A significant increased risk of recurrence (51%) was highlighted for Krouse stage T3 disease when compared with stage T2 (pooled OR, 1.51; 95% CI, 1.09-2.09). No significant difference in risk of recurrence was found between Krouse stages T1 and T2 disease (pooled OR, 1.14; 95% CI, 0.63-2.04) or between stages T3 and T4 (pooled OR, 1.27; 95% CI, 0.72-2.26). Inverted papillomas classified as stage T3 according to the Krouse classification system presented a 51% higher likelihood of recurrence. Head and neck surgeons must be aware of this higher likelihood of recurrence when planning and performing surgery for sinonasal inverted papilloma.
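A minimal sketch of random-effects pooling of study odds ratios via the DerSimonian-Laird method, assuming per-study 2x2 counts; the three studies shown are invented stand-ins, not the 13 included ones:

```python
import numpy as np

# Per-study 2x2 counts: (recurrence_T3, no_recur_T3, recurrence_T2, no_recur_T2).
studies = [(18, 42, 12, 55), (9, 30, 7, 41), (25, 60, 15, 70)]  # invented counts

y = np.array([np.log((a * d) / (b * c)) for a, b, c, d in studies])  # log ORs
v = np.array([1/a + 1/b + 1/c + 1/d for a, b, c, d in studies])      # variances

# DerSimonian-Laird between-study variance tau^2.
w = 1 / v
q = np.sum(w * (y - np.sum(w * y) / np.sum(w)) ** 2)
tau2 = max(0.0, (q - (len(y) - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))

# Random-effects pooled OR and 95% CI.
w_star = 1 / (v + tau2)
mu = np.sum(w_star * y) / np.sum(w_star)
se = 1 / np.sqrt(np.sum(w_star))
print(f"pooled OR {np.exp(mu):.2f} "
      f"(95% CI {np.exp(mu - 1.96*se):.2f}-{np.exp(mu + 1.96*se):.2f})")
```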
Improving healthcare services using web based platform for management of medical case studies.
Ogescu, Cristina; Plaisanu, Claudiu; Udrescu, Florian; Dumitru, Silviu
2008-01-01
The paper presents a web-based platform for the management of medical cases, a support for healthcare specialists in taking the best clinical decision. Research has been oriented mostly toward multimedia data management and classification algorithms for querying, retrieving and processing different medical data types (text and images). The medical case studies can be accessed by healthcare specialists, and by students as anonymous case studies, providing trust and confidentiality in the Internet virtual environment. The MIDAS platform develops an intelligent framework to manage sets of medical data (text, static or dynamic images) in order to optimize the diagnosis and decision process, which will reduce medical errors and increase the quality of the medical act. MIDAS is an integrated project working on medical information retrieval from heterogeneous, distributed medical multimedia databases.
Location-Driven Image Retrieval for Images Collected by a Mobile Robot
NASA Astrophysics Data System (ADS)
Tanaka, Kanji; Hirayama, Mitsuru; Okada, Nobuhiro; Kondo, Eiji
Mobile robot teleoperation is a method for a human user to interact with a mobile robot over time and distance. Successful teleoperation depends on how well images taken by the mobile robot are visualized to the user. To enhance the efficiency and flexibility of the visualization, an image retrieval system over such a robot's image database would be very useful. The main difference between the robot's image database and standard image databases is that various relevant images exist due to the variety of viewing conditions. The main contribution of this paper is to propose an efficient retrieval approach, named the location-driven approach, utilizing the correlation between visual features and real-world locations of images. Combining the location-driven approach with the conventional feature-driven approach, our goal can be viewed as finding an optimal classifier between relevant and irrelevant feature-location pairs. An active learning technique based on support vector machines is extended for this aim.
Databases applicable to quantitative hazard/risk assessment-Towards a predictive systems toxicology
DOE Office of Scientific and Technical Information (OSTI.GOV)
Waters, Michael; Jackson, Marcus
2008-11-15
The Workshop on The Power of Aggregated Toxicity Data addressed the requirement for distributed databases to support quantitative hazard and risk assessment. The authors have conceived and constructed, with federal support, several databases that have been used in hazard identification and risk assessment. The first of these databases, the EPA Gene-Tox Database, was developed for the EPA Office of Toxic Substances by the Oak Ridge National Laboratory, and is currently hosted by the National Library of Medicine. This public resource is based on the collaborative evaluation, by government, academia, and industry, of short-term tests for the detection of mutagens and presumptive carcinogens. The two-phased evaluation process resulted in more than 50 peer-reviewed publications on test system performance and a qualitative database on thousands of chemicals. Subsequently, the graphic and quantitative EPA/IARC Genetic Activity Profile (GAP) Database was developed in collaboration with the International Agency for Research on Cancer (IARC). A chemical database driven by consideration of the lowest effective dose, GAP has served IARC for many years in support of hazard classification of potential human carcinogens. The Toxicological Activity Profile (TAP) prototype database was patterned after GAP and utilized acute, subchronic, and chronic data from the Office of Air Quality Planning and Standards. TAP demonstrated the flexibility of the GAP format for air toxics, water pollutants and other environmental agents. The GAP format was also applied to developmental toxicants and was modified to represent quantitative results from the rodent carcinogen bioassay. More recently, the authors have constructed: 1) the NIEHS Genetic Alterations in Cancer (GAC) Database, which quantifies specific mutations found in cancers induced by environmental agents, and 2) the NIEHS Chemical Effects in Biological Systems (CEBS) Knowledgebase, which integrates genomic and other biological data including dose-response studies in toxicology and pathology. Each of the public databases has been discussed in prior publications. They will be briefly described in the present report from the perspective of aggregating datasets to augment the data and information contained within them.
NASA Astrophysics Data System (ADS)
Mallepudi, Sri Abhishikth; Calix, Ricardo A.; Knapp, Gerald M.
2011-02-01
In recent years there has been a rapid increase in the size of video and image databases. Effective searching and retrieval of images from these databases is a significant current research area. In particular, there is a growing interest in query capabilities based on semantic image features such as objects, locations, and materials, known as content-based image retrieval. This study investigated mechanisms for identifying materials present in an image. Such capabilities provide additional information that affects conditional probabilities about images (e.g., objects made of steel are more likely to be buildings), and are useful in Building Information Modeling (BIM) and in automatic enrichment of images. Image-to-text (I2T) methodologies enrich an image by generating text descriptions based on image analysis. In this work, a learning model is trained to detect certain materials in images. To train the model, an image dataset was constructed containing single-material images of bricks, cloth, grass, sand, stones, and wood. For generalization purposes, an additional set of 50 images containing multiple materials (some not used in training) was constructed. Two different supervised learning classification models were investigated: a single multi-class SVM classifier, and multiple binary SVM classifiers (one per material). Image features included Gabor filter parameters for texture and color histogram data for RGB components. All classification accuracy scores using the SVM-based method were above 85%. The second model gathered more information from the images, since it could assign multiple classes to a single image. A framework for the I2T methodology is presented.
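To make the two designs concrete, here is a hedged sketch contrasting a single multi-class SVM with one binary SVM per material, assuming scikit-learn; the feature extraction (Gabor filters, RGB histograms) is stubbed out with random values and all names are illustrative.

```python
# Sketch contrasting the study's two models: one multi-class SVM versus
# one binary SVM per material. Features and labels are stand-ins.
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

X = np.random.rand(120, 64)       # stand-in for Gabor + color-histogram features
y = np.random.randint(0, 6, 120)  # 6 materials: brick, cloth, grass, sand, stone, wood

# Model 1: a single multi-class SVM (exactly one label per image).
multiclass = SVC(kernel="rbf").fit(X, y)

# Model 2: one binary classifier per material, so an image containing
# several materials can receive several labels at once.
Y_multi = MultiLabelBinarizer().fit_transform([[label] for label in y])
per_material = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, Y_multi)
print(multiclass.predict(X[:3]))    # one class index per image
print(per_material.predict(X[:3]))  # one 0/1 indicator per material per image
```

The per-material design returns an indicator vector, so an image can be tagged with several materials, which is what allowed the second model to gather more information from multi-material images.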
ASDB: a resource for probing protein functions with small molecules.
Liu, Zhihong; Ding, Peng; Yan, Xin; Zheng, Minghao; Zhou, Huihao; Xu, Yuehua; Du, Yunfei; Gu, Qiong; Xu, Jun
2016-06-01
Identifying chemical probes or seeking scaffolds for a specific biological target is important for protein function studies. Therefore, we created the Annotated Scaffold Database (ASDB), a computer-readable, systematically target-annotated scaffold database, to serve such needs. The scaffolds in ASDB were derived from public databases including ChEMBL, DrugBank and TCMSP, with a scaffold-based classification approach. Each scaffold is assigned an InChIKey as its unique identifier, energy-minimized 3D conformations, and other calculated properties. A scaffold is also associated with drugs, natural products, drug targets and medical indications. The database can be queried through text or structure query tools. ASDB collects 333 601 scaffolds, which are associated with 4368 targets; these include 3032 scaffolds derived from drugs and 5163 scaffolds derived from natural products. For given scaffolds, scaffold-target networks can be generated from the database to demonstrate the relations between scaffolds and targets. ASDB is freely available at http://www.rcdd.org.cn/asdb/ with the major web browsers. Contact: junxu@biochemomes.com or xujun9@mail.sysu.edu.cn. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Brodic, Darko; Milivojevic, Dragan R.; Milivojevic, Zoran N.
2011-01-01
The paper introduces a testing framework for the evaluation and validation of text line segmentation algorithms. Text line segmentation represents the key step for correct optical character recognition. Many of the tests for the evaluation of text line segmentation algorithms use text databases as reference templates; because of the mismatch between such templates and real material, a reliable testing framework is required. Hence, a new approach to a comprehensive experimental framework for the evaluation of text line segmentation algorithms is proposed. It consists of synthetic multiline text samples as well as real handwritten text. Although the tests are mutually independent, the results are cross-linked. The proposed method can be used for different types of scripts and languages. Furthermore, two different procedures for the evaluation of algorithm efficiency, based on the obtained error-type classification, are proposed. The first is based on the segmentation-line error description, while the second incorporates well-known signal detection theory. Each has different capabilities and conveniences, but they can be used as complements to make the evaluation process efficient. Overall, the proposed procedure based on the segmentation-line error description has some advantages, characterized by five measures that describe the measurement procedures. PMID:22164106
NASA Astrophysics Data System (ADS)
Sun, Ziheng; Fang, Hui; Di, Liping; Yue, Peng
2016-09-01
Fully automatic image classification, requiring no input parameter values, has long been an unattainable dream for remote sensing experts, who usually spend hours tuning the input parameters of classification algorithms in order to obtain the best results. With the rapid development of knowledge engineering and cyberinfrastructure, many data processing and knowledge reasoning capabilities have become accessible, shareable and interoperable online. Building on these recent improvements, this paper presents the idea of parameterless automatic classification, which requires only an image and automatically outputs a labeled vector; no parameters or operations are needed from endpoint consumers. An approach is proposed to realize the idea. It adopts an ontology database to store the experience of tuning values for classifiers. A sample database is used to record training samples of image segments. Geoprocessing Web services are used as functionality blocks to perform the basic classification steps, and workflow technology turns the overall image classification into a fully automatic process. A Web-based prototypical system named PACS (Parameterless Automatic Classification System) was implemented, and a number of images were fed into the system for evaluation. The results show that the approach can automatically classify remote sensing images with fairly good average accuracy, and they indicate that the classified results will be more accurate if the two databases are of higher quality. Once the experience and samples accumulated in the databases match those of a human expert, the approach should be able to produce results of similar quality to the expert's. Since the approach is fully automatic and parameterless, it can not only relieve remote sensing workers of the heavy, time-consuming parameter-tuning work, but also significantly shorten the waiting time for consumers and make it easier for them to engage in image classification activities. Currently, the approach is used only on high-resolution optical three-band remote sensing imagery. The feasibility of using the approach on other kinds of remote sensing images, or of involving additional bands in classification, will be studied in future work.
Löpprich, Martin; Krauss, Felix; Ganzinger, Matthias; Senghas, Karsten; Riezler, Stefan; Knaup, Petra
2016-08-05
In the Multiple Myeloma clinical registry at Heidelberg University Hospital, most data are extracted from discharge letters. Our aim was to analyze whether the manual documentation process can be made more efficient by using natural language processing methods for multiclass classification of free-text diagnostic reports, so as to automatically document the diagnosis and state of disease of myeloma patients. The first objective was to create a corpus consisting of free-text diagnosis paragraphs of patients with multiple myeloma from German diagnostic reports, with manual annotation of the relevant data elements by documentation specialists. The second objective was to construct and evaluate a framework using different NLP methods to enable automatic multiclass classification of relevant data elements from free-text diagnostic reports. The main diagnosis paragraph was extracted from the clinical reports of one third of the patients, randomly selected from the multiple myeloma research database at Heidelberg University Hospital (737 patients in total). An electronic data capture (EDC) system was set up, and two data entry specialists independently performed manual documentation of at least nine specific data elements for multiple myeloma characterization. Both data entries were compared and assessed by a third specialist, and an annotated text corpus was created. A framework was constructed, consisting of a self-developed package to split multiple diagnosis sequences into several subsequences, four different preprocessing steps to normalize the input data, and two classifiers: a maximum entropy classifier (MEC) and a support vector machine (SVM). In total, 15 different pipelines were examined and assessed by ten-fold cross-validation, reiterated 100 times. The average error rate and the average F1-score were computed as quality indicators, and the approximate randomization test was used for significance testing. The created annotated corpus consists of 737 diagnosis paragraphs with a total of 865 coded diagnoses. The dataset is publicly available in the supplementary online files for training and testing of further NLP methods. Both classifiers showed low average error rates (MEC: 1.05; SVM: 0.84) and high F1-scores (MEC: 0.89; SVM: 0.92). However, the results varied widely depending on the classified data element. Preprocessing methods increased this effect and had a significant impact on classification, both positive and negative. The automatic diagnosis splitter increased the average error rate significantly, even though the F1-score decreased only slightly. The low average error rates and high average F1-scores of each pipeline demonstrate the suitability of the investigated NLP methods. However, it was also shown that there is no single best practice for automatic classification of data elements from free-text diagnostic reports.
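The evaluation design, a text classification pipeline scored by repeated ten-fold cross-validation on error rate and F1, can be sketched as follows, assuming scikit-learn; the toy corpus and the linear SVM stand in for the paper's German reports and its specific MEC/SVM implementations.

```python
# Minimal sketch: normalize free-text diagnoses, train a linear SVM, and
# score it with repeated 10-fold cross-validation. Toy corpus throughout.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

texts = ["multiples myelom typ igg kappa", "plasmozytom stadium iii",
         "multiples myelom typ iga", "mgus igg kappa"] * 25
labels = ["IgG", "other", "IgA", "other"] * 25

pipe = Pipeline([("tfidf", TfidfVectorizer(lowercase=True)),
                 ("svm", LinearSVC())])
# 10-fold CV; the paper reiterates 100 times, here 5 repeats for brevity.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
f1 = cross_val_score(pipe, texts, labels, cv=cv, scoring="f1_macro")
acc = cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")
print(f"avg F1: {f1.mean():.2f}, avg error rate: {1 - acc.mean():.2f}")
```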
The value of protein structure classification information—Surveying the scientific literature
Fox, Naomi K.; Brenner, Steven E.
2015-01-01
The Structural Classification of Proteins (SCOP) and Class, Architecture, Topology, Homology (CATH) databases have been valuable resources for protein structure classification for over 20 years. Development of SCOP (version 1) concluded in June 2009 with SCOP 1.75. The SCOPe (SCOP-extended) database offers continued development of the classic SCOP hierarchy, adding over 33,000 structures. We have attempted to assess the impact of these two-decade-old resources and guide future development. To this end, we surveyed recent articles to learn how structure classification data are used. Of 571 articles published in 2012–2013 that cite SCOP, 439 actually use data from the resource. We found that the type of use was fairly evenly distributed among four top categories: A) study protein structure or evolution (27% of articles), B) train and/or benchmark algorithms (28% of articles), C) augment non-SCOP datasets with SCOP classification (21% of articles), and D) examine the classification of one protein or a small set of proteins (22% of articles). Most articles described computational research, although 11% described purely experimental research, and a further 9% included both. We examined how CATH and SCOP were used in 158 articles that cited both databases: while some studies used only one dataset, the majority used data from both resources. Protein structure classification remains highly relevant for a diverse range of problems and settings. Proteins 2015; 83:2025–2038. © 2015 The Authors. Proteins: Structure, Function, and Bioinformatics Published by Wiley Periodicals, Inc. PMID:26313554
Classification of ECG beats using deep belief network and active learning.
G, Sayantan; T, Kien P; V, Kadambari K
2018-04-12
A new semi-supervised approach based on deep learning and active learning for classification of electrocardiogram (ECG) signals is proposed. The objective of the proposed work is to model a scientific method for classification of cardiac irregularities using electrocardiogram beats. The model follows the Association for the Advancement of Medical Instrumentation (AAMI) standards and consists of three phases. In phase I, a feature representation of the ECG is learnt using a Gaussian-Bernoulli deep belief network, followed by linear support vector machine (SVM) training in the consecutive phase. This yields three deep models based on the AAMI-defined classes N, V, S, and F. In the last phase, a query generator is introduced to interact with the expert, who labels a few beats to improve accuracy and sensitivity. The proposed approach shows significant improvement in accuracy with minimal queries posed to the expert and fast online training, as tested on the MIT-BIH Arrhythmia Database and the MIT-BIH Supraventricular Arrhythmia Database (SVDB). With 100 queries labeled by the expert in phase III, the method achieves an accuracy of 99.5% in "S" versus all classifications (SVEB) and 99.4% in "V" versus all classifications (VEB) on the MIT-BIH Arrhythmia Database. Similarly, accuracies of 97.5% for SVEB and 98.6% for VEB are achieved on the SVDB database. Graphical Abstract: Deep belief network augmented by active learning for efficient prediction of arrhythmia.
Algorithms and methodology used in constructing high-resolution terrain databases
NASA Astrophysics Data System (ADS)
Williams, Bryan L.; Wilkosz, Aaron
1998-07-01
This paper presents a top-level description of the methods used to generate high-resolution 3D IR digital terrain databases using soft photogrammetry. The 3D IR database is derived from aerial photography and is made up of a digital ground-plane elevation map, a vegetation-height elevation map, a material classification map, object data (tanks, buildings, etc.), and a temperature radiance map. Steps required to generate some of these elements are outlined. The use of metric photogrammetry is discussed in the context of elevation map development, and the methods employed to generate the material classification maps are given. The developed databases are used by the US Army Aviation and Missile Command to evaluate the performance of various missile systems. A discussion is also presented on database certification, which consists of the validation, verification, and accreditation procedures followed to certify that the developed databases give a true representation of the area of interest and are fully compatible with the targeted digital simulators.
NASA Astrophysics Data System (ADS)
Stuhlmacher, M.; Wang, C.; Georgescu, M.; Tellman, B.; Balling, R.; Clinton, N. E.; Collins, L.; Goldblatt, R.; Hanson, G.
2016-12-01
Global representations of modern-day urban land use and land cover (LULC) extent are becoming increasingly prevalent. Yet considerable uncertainties in the representation of built-environment extent (e.g., in global classifications generated from 250 m resolution MODIS imagery, or in the United States' National Land Cover Database) remain because of the lack of a systematic, globally consistent methodological approach. We aim to increase resolution and accuracy, and to improve upon past efforts, by establishing a data-driven definition of the urban landscape based on Landsat 5, 7 and 8 imagery and ancillary data sets. Continuous and discrete machine learning classification algorithms have been developed in Google Earth Engine (GEE), a powerful online cloud-based geospatial storage and parallel-computing platform. Additionally, thousands of ground-truth points have been selected from high-resolution imagery to address the previous lack of accurate training and validation data. We will present preliminary classification and accuracy assessments for select cities in the United States and Mexico. Our approach has direct implications for the development of projected urban growth that is grounded in realistic identification of urbanizing hot-spots, with consequences for local- to regional-scale climate change, energy demand, water stress, human health, urban-ecological interactions, and efforts to prioritize adaptation and mitigation strategies to offset large-scale climate change. Future work to apply the built-up detection algorithm globally and yearly is underway in a partnership between GEE, the University of California, San Diego, and Arizona State University.
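As a hedged illustration of this kind of workflow, a supervised built-up classification in the present-day Earth Engine Python API might look as follows; the collection ID, band names, area of interest, and the 'built' property are assumptions for the sketch, not details taken from the study.

```python
# Hedged sketch of a supervised built-up classification in Google Earth
# Engine (Python API). Asset names and bands are illustrative assumptions.
import ee
ee.Initialize()

region = ee.Geometry.Rectangle([-112.3, 33.2, -111.6, 33.7])  # hypothetical AOI
bands = ['SR_B2', 'SR_B3', 'SR_B4', 'SR_B5', 'SR_B6', 'SR_B7']
image = (ee.ImageCollection('LANDSAT/LC08/C02/T1_L2')
         .filterBounds(region)
         .filterDate('2016-01-01', '2016-12-31')
         .median()
         .select(bands))

# Hand-labeled ground-truth points with a 0/1 'built' property (hypothetical asset).
points = ee.FeatureCollection('users/example/ground_truth_points')
training = image.sampleRegions(collection=points, properties=['built'], scale=30)

classifier = ee.Classifier.smileRandomForest(50).train(
    features=training, classProperty='built', inputProperties=bands)
built_map = image.classify(classifier)  # server-side classified image
```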
NASA Astrophysics Data System (ADS)
Sharma, Manu; Bhatt, Jignesh S.; Joshi, Manjunath V.
2018-04-01
Lung cancer is one of the most common causes of cancer death worldwide. It has a low survival rate, mainly due to late diagnosis. With the hardware advancements in computed tomography (CT) technology, it is now possible to capture high-resolution images of the lung region. However, this needs to be augmented by efficient algorithms that detect lung cancer at earlier stages using the acquired CT images. To this end, we propose a two-step algorithm for early detection of lung cancer. Given the CT image, we first extract the patch at the center location of the nodule and segment the lung nodule region. We propose to use the Otsu method followed by morphological operations for the segmentation. This step enables accurate segmentation due to the use of a data-driven threshold. Unlike other methods, we perform the segmentation without using the complete contour information of the nodule. In the second step, a deep convolutional neural network (CNN) is used for better classification (malignant or benign) of the nodule present in the segmented patch. Accurate segmentation of even a tiny nodule, followed by better classification using a deep CNN, enables the early detection of lung cancer. Experiments have been conducted using 6306 CT images from the LIDC-IDRI database. We achieved a test accuracy of 84.13%, with sensitivity of 91.69% and specificity of 73.16%, clearly outperforming state-of-the-art algorithms.
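The first step, Otsu's data-driven threshold plus morphological clean-up, is straightforward to sketch, here with scikit-image; the patch itself is stubbed with random values and the names are illustrative.

```python
# Minimal sketch of data-driven (Otsu) thresholding of a CT patch followed
# by morphological clean-up. The patch loading is stubbed out.
import numpy as np
from skimage.filters import threshold_otsu
from skimage.morphology import binary_opening, binary_closing, disk

patch = np.random.rand(64, 64)   # stand-in for a CT patch around the nodule
t = threshold_otsu(patch)        # threshold chosen from the data, not hand-tuned
mask = patch > t
# Opening removes speckle; closing fills small gaps in the nodule region.
mask = binary_closing(binary_opening(mask, disk(1)), disk(2))
# 'mask' would then delimit the patch passed to the CNN classification stage.
```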
NASA Technical Reports Server (NTRS)
Shull, Sarah A.; Gralla, Erica L.; deWeck, Olivier L.; Shishko, Robert
2006-01-01
One of the major logistical challenges in human space exploration is asset management. This paper presents observations on the practice of asset management in support of human space flight to date and discusses a functional-based supply classification and a framework for an integrated database that could be used to improve asset management and logistics for human missions to the Moon, Mars and beyond.
Impact of training sets on classification of high-throughput bacterial 16s rRNA gene surveys
Werner, Jeffrey J; Koren, Omry; Hugenholtz, Philip; DeSantis, Todd Z; Walters, William A; Caporaso, J Gregory; Angenent, Largus T; Knight, Rob; Ley, Ruth E
2012-01-01
Taxonomic classification of the thousands to millions of 16S rRNA gene sequences generated in microbiome studies is often achieved using a naïve Bayesian classifier (for example, the Ribosomal Database Project II (RDP) classifier), due to favorable trade-offs among automation, speed and accuracy. The resulting classification depends on the reference sequences and taxonomic hierarchy used to train the model; although the influence of primer sets and classification algorithms has been explored in detail, the influence of the training set has not been characterized. We compared classification results obtained using three different publicly available databases as training sets, applied to five different bacterial 16S rRNA gene pyrosequencing data sets (generated from human body, mouse gut, python gut, soil and anaerobic digester samples). We observed numerous advantages to using the largest, most diverse training set available, which we constructed from the Greengenes (GG) bacterial/archaeal 16S rRNA gene sequence database and the latest GG taxonomy. Phylogenetic clusters of previously unclassified experimental sequences were identified with notable improvements (for example, a 50% reduction in reads unclassified at the phylum level in mouse gut, soil and anaerobic digester samples), especially for phylotypes belonging to specific phyla (Tenericutes, Chloroflexi, Synergistetes and Candidate phyla TM6, TM7). Trimming the reference sequences to the primer region resulted in systematic improvements in classification depth, with the greatest gains at higher confidence thresholds. Phylotypes unclassified at the genus level represented a greater proportion of the total community variation than classified operational taxonomic units in mouse gut and anaerobic digester samples, underscoring the need for greater diversity in existing reference databases. PMID:21716311
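The RDP-style classifier at the heart of this comparison is a naive Bayes model over k-mer features, which can be sketched as follows, assuming scikit-learn; the toy sequences and taxa are illustrative, and a real training set would come from one of the databases compared here.

```python
# Hedged sketch of an RDP-style naive Bayes classifier over k-mer counts
# of 16S rRNA reads. Toy sequences and taxa are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def kmers(seq, k=8):
    # Represent a sequence as its overlapping k-mers, space-separated.
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

train_seqs = ["ACGTACGTGGCCAA", "TTGGCCAATTACGT", "ACGTACGTTTGGCC"]
train_taxa = ["Firmicutes", "Bacteroidetes", "Firmicutes"]

vec = CountVectorizer()  # each k-mer is treated as one "word"
X = vec.fit_transform(kmers(s) for s in train_seqs)
clf = MultinomialNB().fit(X, train_taxa)

read = "ACGTACGTGGCC"
print(clf.predict(vec.transform([kmers(read)])))
```

Because the model is trained on whichever reference set is supplied, swapping the training sequences and taxonomy, exactly the variable studied here, changes the resulting classifications.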
Kimura, Shinya; Sato, Toshihiko; Ikeda, Shunya; Noda, Mitsuhiko; Nakayama, Takeo
2010-01-01
Health insurance claims (ie, receipts) record patient health care treatments and expenses and, although created for the health care payment system, are potentially useful for research. Combining different types of receipts generated for the same patient would dramatically increase the utility of these receipts. However, technical problems, including standardization of disease names and classifications, and anonymous linkage of individual receipts, must be addressed. In collaboration with health insurance societies, all information from receipts (inpatient, outpatient, and pharmacy) was collected. To standardize disease names and classifications, we developed a computer-aided post-entry standardization method using a disease name dictionary based on International Classification of Diseases (ICD)-10 classifications. We also developed an anonymous linkage system by using an encryption code generated from a combination of hash values and stream ciphers. Using different sets of the original data (data set 1: insurance certificate number, name, and sex; data set 2: insurance certificate number, date of birth, and relationship status), we compared the percentage of successful record matches obtained by using data set 1 to generate key codes with the percentage obtained when both data sets were used. The dictionary's automatic conversion of disease names successfully standardized 98.1% of approximately 2 million new receipts entered into the database. The percentage of anonymous matches was higher for the combined data sets (98.0%) than for data set 1 (88.5%). The use of standardized disease classifications and anonymous record linkage substantially contributed to the construction of a large, chronologically organized database of receipts. This database is expected to aid in epidemiologic and health services research using receipt information.
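The anonymous-linkage idea can be sketched with a keyed hash: identifying fields are combined and hashed into a linkage code that matches across receipt types without exposing the identifiers. HMAC-SHA256 below stands in for the paper's hash-plus-stream-cipher construction, and the field names only loosely mirror data set 1.

```python
# Hedged sketch of anonymous record linkage via a keyed hash. HMAC-SHA256
# stands in for the paper's hash-plus-stream-cipher scheme; field names
# (certificate number, name, sex) loosely mirror the paper's data set 1.
import hmac, hashlib

SECRET_KEY = b"held-by-the-insurer-only"  # hypothetical; never stored with the data

def linkage_code(cert_number: str, name: str, sex: str) -> str:
    msg = "|".join([cert_number.strip(), name.strip(), sex.strip()]).encode("utf-8")
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()

# The same patient on an inpatient and a pharmacy receipt yields the same code,
# so the receipts can be linked without revealing who the patient is.
assert linkage_code("123-456", "Taro Yamada", "M") == \
       linkage_code("123-456", "Taro Yamada", "M")
```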
Reflective random indexing for semi-automatic indexing of the biomedical literature.
Vasuki, Vidya; Cohen, Trevor
2010-10-01
The rapid growth of biomedical literature is evident in the increasing size of the MEDLINE research database. Medical Subject Headings (MeSH), a controlled set of keywords, are used to index all the citations contained in the database to facilitate search and retrieval. This volume of citations calls for efficient tools to assist indexers at the US National Library of Medicine (NLM). Currently, the Medical Text Indexer (MTI) system provides assistance by recommending MeSH terms based on the title and abstract of an article using a combination of distributional and vocabulary-based methods. In this paper, we evaluate a novel approach toward indexer assistance by using nearest neighbor classification in combination with Reflective Random Indexing (RRI), a scalable alternative to the established methods of distributional semantics. On a test set provided by the NLM, our approach significantly outperforms the MTI system, suggesting that the RRI approach would make a useful addition to the current methodologies.
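The nearest-neighbour recommendation step can be sketched with an ordinary random projection standing in for Reflective Random Indexing (which iterates the projection to capture indirect associations); everything below is toy data, assuming scikit-learn.

```python
# Hedged sketch: project bag-of-words vectors into a low-dimensional random
# space (a simple stand-in for Reflective Random Indexing), then recommend
# the MeSH terms of the closest already-indexed citations. Toy data only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.random_projection import SparseRandomProjection
from sklearn.neighbors import NearestNeighbors

docs = ["myocardial infarction risk", "stent placement in infarction",
        "influenza vaccine trial", "vaccine efficacy in children"]
mesh = [{"Myocardial Infarction"}, {"Myocardial Infarction", "Stents"},
        {"Influenza Vaccines"}, {"Influenza Vaccines", "Child"}]

X = CountVectorizer().fit_transform(docs)
Z = SparseRandomProjection(n_components=2, random_state=0).fit_transform(X)

nn = NearestNeighbors(n_neighbors=2).fit(Z)
_, idx = nn.kneighbors(Z[0:1])               # neighbours of the first citation
recommended = set().union(*(mesh[i] for i in idx[0]))
print(recommended)                            # terms pooled from the neighbours
```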
NASA Astrophysics Data System (ADS)
Hu, Ruiguang; Xiao, Liping; Zheng, Wenjuan
2015-12-01
In this paper, multi-kernel learning (MKL) is used for the classification of drug-related webpages. First, body text and image-label text are extracted through HTML parsing, and valid images are chosen by the FOCARSS algorithm. Second, a text-based bag-of-words (BOW) model is used to generate the text representation, and an image-based BOW model is used to generate the image representation. Last, the text and image representations are fused by several methods. Experimental results demonstrate that the classification accuracy of MKL is higher than that of all other fusion methods at both the decision level and the feature level, and much higher than the accuracy of single-modality classification.
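The simplest form of the multi-kernel idea, feature-level fusion of a text kernel and an image kernel, can be sketched as follows, assuming scikit-learn; true MKL also learns the kernel weights, which are fixed here, and all data are stand-ins.

```python
# Hedged sketch of feature-level kernel fusion: one kernel on text features,
# one on image features, combined as a weighted sum and fed to an SVM with
# a precomputed kernel. True MKL would learn the weight w as well.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X_text = np.random.rand(60, 300)   # stand-in for the text BOW features
X_img = np.random.rand(60, 500)    # stand-in for the image BOW features
y = np.random.randint(0, 2, 60)    # 1 = drug-related page

w = 0.6                            # fixed kernel weight; MKL would learn this
K = w * rbf_kernel(X_text) + (1 - w) * rbf_kernel(X_img)

clf = SVC(kernel="precomputed").fit(K, y)
print(clf.predict(K[:5]))          # rows of K against all training points
```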
Hill, Ryan M; Oosterhoff, Benjamin; Kaplow, Julie B
2017-07-01
Although a large number of risk markers for suicide ideation have been identified, little guidance has been provided to prospectively identify adolescents at risk for suicide ideation within community settings. The current study addressed this gap in the literature by utilizing classification tree analysis (CTA) to provide a decision-making model for screening adolescents at risk for suicide ideation. Participants were N = 4,799 youth (mean age = 16.15 years, SD = 1.63) who completed both Waves 1 and 2 of the National Longitudinal Study of Adolescent to Adult Health. CTA was used to generate a series of decision rules for identifying adolescents at risk for reporting suicide ideation at Wave 2. Findings revealed 3 distinct solutions with varying sensitivity and specificity for identifying adolescents who reported suicide ideation. Sensitivity of the classification trees ranged from 44.6% to 77.6%. The tree with the greatest specificity and lowest sensitivity was based on a history of suicide ideation. The tree with moderate sensitivity and high specificity was based on depressive symptoms, suicide attempts or suicide among family and friends, and social support. The most sensitive but least specific tree utilized these factors plus gender, ethnicity, hours of sleep, school-related factors, and future orientation. These classification trees offer community organizations options for instituting large-scale screening for suicide ideation risk, depending on the available resources and the modality of services to be provided. This study provides a theoretically and empirically driven model for prospectively identifying adolescents at risk for suicide ideation and has implications for preventive interventions among at-risk youth. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
Development of Vision Based Multiview Gait Recognition System with MMUGait Database
Ng, Hu; Tan, Wooi-Haw; Tong, Hau-Lee
2014-01-01
This paper describes the acquisition setup and development of a new gait database, MMUGait. This database consists of 82 subjects walking under normal conditions and 19 subjects walking with 11 covariate factors, captured under two views. The paper also proposes a multiview model-based gait recognition system with a joint detection approach that performs well under different walking trajectories and covariate factors, including self-occluded or externally occluded silhouettes. In the proposed system, the process begins by enhancing the human silhouette to remove artifacts. Next, the width and height of the body are obtained. Subsequently, the joint angular trajectories are determined once the body joints are automatically detected. Lastly, the crotch height and step size of the walking subject are determined. The extracted features are smoothed by a Gaussian filter to eliminate the effect of outliers, then normalized with linear scaling, followed by feature selection prior to the classification process. The classification experiments carried out on the MMUGait database were benchmarked against the SOTON Small DB from the University of Southampton. Results showed correct classification rates above 90% for all the databases. The proposed approach is found to outperform other approaches on SOTON Small DB in most cases. PMID:25143972
Using text mining to link journal articles to neuroanatomical databases
French, Leon; Pavlidis, Paul
2013-01-01
The electronic linking of neuroscience information, including data embedded in the primary literature, would permit powerful queries and analyses driven by structured databases. This task would be facilitated by automated procedures which can identify biological concepts in journals. Here we apply an approach for automatically mapping formal identifiers of neuroanatomical regions to text found in journal abstracts, and apply it to a large body of abstracts from the Journal of Comparative Neurology (JCN). The analyses yield over one hundred thousand brain region mentions which we map to 8,225 brain region concepts in multiple organisms. Based on the analysis of a manually annotated corpus, we estimate mentions are mapped at 95% precision and 63% recall. Our results provide insights into the patterns of publication on brain regions and species of study in the Journal, but also point to important challenges in the standardization of neuroanatomical nomenclatures. We find that many terms in the formal terminologies never appear in a JCN abstract, while conversely, many terms authors use are not reflected in the terminologies. To improve the terminologies we deposited 136 unrecognized brain regions into the Neuroscience Lexicon (NeuroLex). The training data, terminologies, normalizations, evaluations and annotated journal abstracts are freely available at http://www.chibi.ubc.ca/WhiteText/. PMID:22120205
Harris, Eric S J; Erickson, Sean D; Tolopko, Andrew N; Cao, Shugeng; Craycroft, Jane A; Scholten, Robert; Fu, Yanling; Wang, Wenquan; Liu, Yong; Zhao, Zhongzhen; Clardy, Jon; Shamu, Caroline E; Eisenberg, David M
2011-05-17
Ethnobotanically driven drug-discovery programs include data related to many aspects of the preparation of botanical medicines, from initial plant collection to chemical extraction and fractionation. The Traditional Medicine Collection Tracking System (TM-CTS) was created to organize and store data of this type for an international collaborative project involving the systematic evaluation of commonly used Traditional Chinese Medicinal plants. The system was developed using domain-driven design techniques, and is implemented using Java, Hibernate, PostgreSQL, Business Intelligence and Reporting Tools (BIRT), and Apache Tomcat. The TM-CTS relational database schema contains over 70 data types, comprising over 500 data fields. The system incorporates a number of unique features that are useful in the context of ethnobotanical projects such as support for information about botanical collection, method of processing, quality tests for plants with existing pharmacopoeia standards, chemical extraction and fractionation, and historical uses of the plants. The database also accommodates data provided in multiple languages and integration with a database system built to support high throughput screening based drug discovery efforts. It is accessed via a web-based application that provides extensive, multi-format reporting capabilities. This new database system was designed to support a project evaluating the bioactivity of Chinese medicinal plants. The software used to create the database is open source, freely available, and could potentially be applied to other ethnobotanically driven natural product collection and drug-discovery programs. Copyright © 2011 Elsevier Ireland Ltd. All rights reserved.
Hassanpour, Saeed; Langlotz, Curtis P; Amrhein, Timothy J; Befera, Nicholas T; Lungren, Matthew P
2017-04-01
The purpose of this study is to evaluate the performance of a natural language processing (NLP) system in classifying a database of free-text knee MRI reports at two separate academic radiology practices. An NLP system that uses terms and patterns in manually classified narrative knee MRI reports was constructed. The NLP system was trained and tested on expert-classified knee MRI reports from two major health care organizations. Radiology reports in the training set were modeled as vectors, and a support vector machine framework was used to train the classifier. A separate test set from each organization was used to evaluate the performance of the system, both within and across organizations. Standard evaluation metrics, such as accuracy, precision, recall, and F1 score (the harmonic mean of precision and recall), and their respective 95% CIs were used to measure the efficacy of the classification system. The accuracy for radiology reports belonging to the model's clinically significant concept classes, after training on data from the same institution, was good, yielding an F1 score greater than 90% (95% CI, 84.6-97.3%). Cross-institutional application without institution-specific training data yielded F1 scores of 77.6% (95% CI, 69.5-85.7%) and 90.2% (95% CI, 84.5-95.9%) at the two organizations studied. The results show excellent accuracy by the NLP machine learning classifier in classifying free-text knee MRI reports, supporting the institution-independent reproducibility of knee MRI report classification. Furthermore, the machine learning classifier performed well on free-text knee MRI reports from another institution. These data support the feasibility of multi-institutional classification of radiologic imaging text reports with a single machine learning classifier without requiring institution-specific training data.
Torii, Manabu; Yin, Lanlan; Nguyen, Thang; Mazumdar, Chand T.; Liu, Hongfang; Hartley, David M.; Nelson, Noele P.
2014-01-01
Purpose: Early detection of infectious disease outbreaks is crucial to protecting the public health of a society. Online news articles provide timely information on disease outbreaks worldwide. In this study, we investigated automated detection of articles relevant to disease outbreaks using machine learning classifiers. In a real-life setting, it is expensive to prepare a training data set for classifiers, which usually consists of manually labeled relevant and irrelevant articles. To mitigate this challenge, we examined the use of randomly sampled unlabeled articles alongside labeled relevant articles. Methods: Naïve Bayes and Support Vector Machine (SVM) classifiers were trained on 149 relevant and 149 or more randomly sampled unlabeled articles. Diverse classifiers were trained by varying the number of sampled unlabeled articles and the number of word features. The trained classifiers were applied to 15 thousand articles published over 15 days. Top-ranked articles from each classifier were pooled, and the resulting set of 1337 articles was reviewed by an expert analyst to evaluate the classifiers. Results: Daily averages of areas under ROC curves (AUCs) over the 15-day evaluation period were 0.841 and 0.836 for the naïve Bayes and SVM classifiers, respectively. We referenced a database of disease outbreak reports to confirm that the evaluation data set resulting from the pooling method indeed covered incidents recorded in the database during the evaluation period. Conclusions: The proposed text classification framework utilizing randomly sampled unlabeled articles can facilitate a cost-effective approach to training machine learning classifiers in a real-life Internet-based biosurveillance project. We plan to examine this framework further using larger data sets and articles in non-English languages. PMID:21134784
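The training scheme, labeled relevant articles against randomly sampled unlabeled ones, can be sketched as follows, assuming scikit-learn; the toy snippets are illustrative, and the ranking step mirrors how top-ranked articles were pooled for expert review.

```python
# Hedged sketch: a naive Bayes classifier fit on labeled relevant articles
# versus randomly sampled *unlabeled* articles used as the contrast class.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

relevant = ["cholera outbreak reported in the region",
            "avian influenza cases confirmed in poultry"] * 50
unlabeled = ["stock markets rallied on tuesday",
             "local team wins championship final"] * 50

texts = relevant + unlabeled
y = [1] * len(relevant) + [0] * len(unlabeled)  # 0 means "unlabeled", not "irrelevant"

vec = TfidfVectorizer(max_features=5000)
clf = MultinomialNB().fit(vec.fit_transform(texts), y)

# Rank new articles by P(relevant); top-ranked ones go to the analyst.
new = ["suspected measles cluster under investigation"]
print(clf.predict_proba(vec.transform(new))[:, 1])
```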
77 FR 60475 - Draft of SWGDOC Standard Classification of Typewritten Text
Federal Register 2010, 2011, 2012, 2013, 2014
2012-10-03
... DEPARTMENT OF JUSTICE Office of Justice Programs [OJP (NIJ) Docket No. 1607] Draft of SWGDOC Standard Classification of Typewritten Text AGENCY: National Institute of Justice, DOJ. ACTION: Notice and..., ``SWGDOC Standard Classification of Typewritten Text''. The opportunity to provide comments on this...
ImmunemiR - A Database of Prioritized Immune miRNA Disease Associations and its Interactome.
Prabahar, Archana; Natarajan, Jeyakumar
2017-01-01
MicroRNAs are key regulators of gene expression, and their abnormal expression in the immune system may be associated with several human diseases, such as inflammation, cancer and autoimmune diseases. Elucidation of miRNA disease associations through the interactome will deepen the understanding of disease mechanisms. A specialized database for immune miRNAs is highly desirable to demonstrate immune miRNA disease associations in the interactome. miRNAs specific to immune-related diseases were retrieved from curated databases such as HMDD and miR2disease, and from PubMed literature, based on the MeSH classification of immune system diseases. Additional data, such as miRNA target genes and protein-protein interaction information for the coding genes, were compiled from related resources. Further, miRNAs were prioritized for specific immune diseases using a random-walk ranking algorithm. In total, 245 immune miRNAs associated with 92 OMIM disease categories were identified from external databases. The resultant data were compiled as ImmunemiR, a database of prioritized immune miRNA disease associations. This database provides both text-based annotation information and network visualization of its interactome. To our knowledge, ImmunemiR is the first available database to provide a comprehensive repository of human immune disease associated miRNAs with network visualization options for its target genes, protein-protein interactions (PPI) and disease associations. It is freely available at http://www.biominingbu.org/immunemir/. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
Wang, Yin; Li, Rudong; Zhou, Yuhua; Ling, Zongxin; Guo, Xiaokui; Xie, Lu; Liu, Lei
2016-01-01
Text data of 16S rRNA are informative for classification of microbiota-associated diseases. However, the raw text data need to be systematically processed so that features for classification can be defined and extracted; moreover, the high-dimensional feature spaces generated by the text data pose an additional difficulty. Here we present a Phylogenetic Tree-Based Motif Finding algorithm (PMF) to analyze 16S rRNA text data. By integrating phylogenetic rules and other statistical indexes for classification, we can effectively reduce the dimension of the large feature spaces generated by the text datasets. Using the retrieved motifs in combination with common classification methods, we can discriminate different samples of both pneumonia and dental caries better than existing methods. We extend the phylogenetic approaches to perform supervised learning on microbiota text data to discriminate the pathological states of pneumonia and dental caries. The results show that PMF may enhance the efficiency and reliability of analyzing high-dimensional text data.
Mining the Galaxy Zoo Database: Machine Learning Applications
NASA Astrophysics Data System (ADS)
Borne, Kirk D.; Wallin, J.; Vedachalam, A.; Baehr, S.; Lintott, C.; Darg, D.; Smith, A.; Fortson, L.
2010-01-01
The new Zooniverse initiative is addressing the data flood in the sciences through a transformative partnership between professional scientists, volunteer citizen scientists, and machines. As part of this project, we are exploring the application of machine learning techniques to data mining problems associated with the large and growing database of volunteer science results gathered by the Galaxy Zoo citizen science project. We describe the basic challenge, some machine learning approaches, and early results. One of the motivators for this study is the acquisition (through the Galaxy Zoo results database) of approximately 100 million classification labels for roughly one million galaxies, yielding a tremendously large and rich set of training examples for improving automated galaxy morphological classification algorithms. In our first case study, the goal is to learn which morphological and photometric features in the Sloan Digital Sky Survey (SDSS) database correlate most strongly with user-selected galaxy morphological class. As a corollary, we also aim to identify which galaxy parameters in the SDSS database correspond to galaxies that have been the most difficult to classify (based upon large dispersion in their volunteer-provided classifications). Our second case study focuses on similar data mining analyses and machine learning algorithms applied to the Galaxy Zoo catalog of merging and interacting galaxies. The outcomes of this project will have applications in future large sky surveys, such as the LSST (Large Synoptic Survey Telescope) project, which will generate a catalog of 20 billion galaxies and will produce an additional astronomical alert database of approximately 100 thousand events each night for 10 years -- the capabilities and algorithms that we are exploring will assist in the rapid characterization and classification of such massive data streams. This research has been supported in part through NSF award #0941610.
Wang, Weiqi; Wang, Yanbo Justin; Bañares-Alcántara, René; Coenen, Frans; Cui, Zhanfeng
2009-12-01
In this paper, data mining is used to analyze data on the differentiation of mammalian Mesenchymal Stem Cells (MSCs), aiming to discover known and hidden rules governing MSC differentiation, following the establishment of a web-based public database containing experimental data on MSC proliferation and differentiation. To this end, a web-based public interactive database comprising the key parameters that influence the fate and destiny of mammalian MSCs has been constructed and analyzed using Classification Association Rule Mining (CARM) as the data-mining technique. The results show that the proposed approach is technically feasible and performs well with respect to the accuracy of (classification) prediction. Key rules mined from the constructed MSC database are consistent with experimental observations, indicating the validity of the method developed and a first step in the application of data mining to the study of MSCs.
The COG database: new developments in phylogenetic classification of proteins from complete genomes
Tatusov, Roman L.; Natale, Darren A.; Garkavtsev, Igor V.; Tatusova, Tatiana A.; Shankavaram, Uma T.; Rao, Bachoti S.; Kiryutin, Boris; Galperin, Michael Y.; Fedorova, Natalie D.; Koonin, Eugene V.
2001-01-01
The database of Clusters of Orthologous Groups of proteins (COGs), which represents an attempt on a phylogenetic classification of the proteins encoded in complete genomes, currently consists of 2791 COGs including 45 350 proteins from 30 genomes of bacteria, archaea and the yeast Saccharomyces cerevisiae (http://www.ncbi.nlm.nih.gov/COG). In addition, a supplement to the COGs is available, in which proteins encoded in the genomes of two multicellular eukaryotes, the nematode Caenorhabditis elegans and the fruit fly Drosophila melanogaster, and shared with bacteria and/or archaea were included. The new features added to the COG database include information pages with structural and functional details on each COG and literature references, improvements of the COGNITOR program that is used to fit new proteins into the COGs, and classification of genomes and COGs constructed by using principal component analysis. PMID:11125040
Some sequential, distribution-free pattern classification procedures with applications
NASA Technical Reports Server (NTRS)
Poage, J. L.
1971-01-01
Some sequential, distribution-free pattern classification techniques are presented. The decision problem to which the proposed classification methods are applied is that of discriminating between two kinds of electroencephalogram responses recorded from a human subject: spontaneous EEG and EEG driven by a stroboscopic light stimulus at the alpha frequency. The classification procedures proposed make use of the theory of order statistics. Estimates of the probabilities of misclassification are given. The procedures were tested on Gaussian samples and the EEG responses.
Hu, Weiming; Hu, Ruiguang; Xie, Nianhua; Ling, Haibin; Maybank, Stephen
2014-04-01
In this paper, we propose saliency-driven multiscale nonlinear diffusion filtering of images. The resulting scale space in general preserves or even enhances semantically important structures such as edges, lines, or flow-like structures in the foreground, and inhibits and smoothes clutter in the background. The image is classified using multiscale information fusion based on the original image, the image at the final scale at which the diffusion process converges, and the image at a midscale. Our algorithm emphasizes the foreground features, which are important for image classification. The background image regions, whether considered as context for the foreground or as noise, can be handled globally by fusing information from different scales. Experimental tests of the effectiveness of the multiscale space for image classification were conducted on the following publicly available datasets: 1) the PASCAL 2005 dataset; 2) the Oxford 102 flowers dataset; and 3) the Oxford 17 flowers dataset, with high classification rates.
Ecker, David J; Sampath, Rangarajan; Willett, Paul; Wyatt, Jacqueline R; Samant, Vivek; Massire, Christian; Hall, Thomas A; Hari, Kumar; McNeil, John A; Büchen-Osmond, Cornelia; Budowle, Bruce
2005-01-01
Background Thousands of different microorganisms affect the health, safety, and economic stability of populations. Many different medical and governmental organizations have created lists of the pathogenic microorganisms relevant to their missions; however, the nomenclature for biological agents on these lists and pathogens described in the literature is inexact. This ambiguity can be a significant block to effective communication among the diverse communities that must deal with epidemics or bioterrorist attacks. Results We have developed a database known as the Microbial Rosetta Stone. The database relates microorganism names, taxonomic classifications, diseases, specific detection and treatment protocols, and relevant literature. The database structure facilitates linkage to public genomic databases. This paper focuses on the information in the database for pathogens that impact global public health, emerging infectious organisms, and bioterrorist threat agents. Conclusion The Microbial Rosetta Stone is available at . The database provides public access to up-to-date taxonomic classifications of organisms that cause human diseases, improves the consistency of nomenclature in disease reporting, and provides useful links between different public genomic and public health databases. PMID:15850481
A novel processed food classification system applied to Australian food composition databases.
O'Halloran, S A; Lacy, K E; Grimes, C A; Woods, J; Campbell, K J; Nowson, C A
2017-08-01
The extent of food processing can affect the nutritional quality of foodstuffs. Categorising foods by the level of processing emphasises the differences in nutritional quality between foods within the same food group and is likely useful for determining dietary processed food consumption. The present study aimed to categorise foods within Australian food composition databases according to the level of food processing using a processed food classification system, as well as to assess the variation in the levels of processing within food groups. A processed food classification system was applied to food and beverage items contained within Australian Food and Nutrient (AUSNUT) 2007 (n = 3874) and AUSNUT 2011-13 (n = 5740). The proportions of Minimally Processed (MP), Processed Culinary Ingredients (PCI), Processed (P) and Ultra Processed (ULP) foods by AUSNUT food group, and the overall proportions of the four processed food categories across AUSNUT 2007 and AUSNUT 2011-13, were calculated. Across the food composition databases, the overall proportions of foods classified as MP, PCI, P and ULP were 27%, 3%, 26% and 44% for AUSNUT 2007 and 38%, 2%, 24% and 36% for AUSNUT 2011-13. Although there was wide variation in the classifications of food processing within the food groups, approximately one-third of foodstuffs were classified as ULP food items across both the 2007 and 2011-13 AUSNUT databases. This Australian processed food classification system will allow researchers to easily quantify the contribution of processed foods within the Australian food supply, assisting in the assessment of the nutritional quality of the dietary intake of population groups. © 2017 The British Dietetic Association Ltd.
Van Berkel, Gary J.; Kertesz, Vilmos
2016-11-15
An “Open Access”-like mass spectrometric platform to fully utilize the simplicity of the manual open port sampling interface for rapid characterization of unprocessed samples by liquid introduction atmospheric pressure ionization mass spectrometry has been lacking. The in-house developed integrated software, together with the simple, small and relatively low-cost mass spectrometry system introduced here, fills this void. Software was developed to operate the mass spectrometer, to collect and process mass spectrometric data files, to build a database, and to classify samples using such a database. These tasks were accomplished via the vendor-provided software libraries. Sample classification based on spectral comparison utilized the spectral contrast angle method. Using the developed software platform, near real-time sample classification is exemplified with a series of commercially available blue ink rollerball pens and vegetable oils. In the case of the inks, full-scan positive and negative ion ESI mass spectra were both used for database generation and sample classification. For the vegetable oils, full-scan positive ion mode APCI mass spectra were recorded. The overall accuracy of the employed spectral contrast angle statistical model was 95.3% and 98% for the inks and oils, respectively, using leave-one-out cross-validation. In conclusion, this work illustrates that an open port sampling interface/mass spectrometer combination, with appropriate instrument control and data processing software, is a viable direct liquid extraction sampling and analysis system suitable for the non-expert user and for near real-time sample classification via database matching.
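The spectral contrast angle itself reduces to the angle between two intensity vectors on a shared m/z grid, as in this minimal sketch (toy spectra, plain NumPy):

```python
# Minimal sketch of the spectral contrast angle used for classification:
# 0 means identical spectra, pi/2 means orthogonal (no shared peaks).
import numpy as np

def spectral_contrast_angle(a: np.ndarray, b: np.ndarray) -> float:
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

ref = np.array([0.0, 10.0, 55.0, 100.0, 5.0])     # library spectrum (toy)
unknown = np.array([1.0, 12.0, 50.0, 98.0, 4.0])  # measured spectrum (toy)
# Classify by the library entry giving the smallest angle.
print(spectral_contrast_angle(ref, unknown))
```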
Reading the lesson: eliciting requirements for a mammography training application
NASA Astrophysics Data System (ADS)
Hartswood, M.; Blot, L.; Taylor, P.; Anderson, S.; Procter, R.; Wilkinson, L.; Smart, L.
2009-02-01
Demonstrations of a prototype training tool were used to elicit requirements for an intelligent training system for screening mammography. The prototype allowed senior radiologists (mentors) to select cases from a distributed database of images to meet the specific training requirements of junior colleagues (trainees), and then provided automated feedback in response to trainees' attempts at interpretation. The tool was demonstrated to radiologists and radiographers working in the breast screening service at four evaluation sessions. Participants highlighted ease of selecting cases that can deliver specific learning objectives as important for delivering effective training. To usefully structure a large data set of training images, we undertook a classification exercise on mentor-authored free-text 'learning points' attached to training cases obtained from two screening centres (n=333 and n=129, respectively). We were able to adduce a hierarchy of abstract categories representing the classes of lesson that groups of cases were intended to convey (e.g., Temporal change, Misleading juxtapositions, Position of lesion, Typical/Atypical presentation, and so on). In this paper we present the method used to devise this classification, the classification scheme itself, initial user feedback, and our plans to incorporate it into a software tool to aid case selection.
NASA Astrophysics Data System (ADS)
Costache, G. N.; Gavat, I.
2004-09-01
Along with the aggressive growth in the amount of available digital data (text, audio samples, digital photos and digital movies, all joined in the multimedia domain), the need for classification, recognition and retrieval of such data has become very important. This paper presents a system structure for handling multimedia data from a recognition perspective. The main processing steps for the multimedia objects of interest are: first, parameterization by analysis, in order to obtain a feature-based description forming the parameter vector; second, classification, generally with a hierarchical structure, to make the necessary decisions. For audio signals, both speech and music, the derived perceptual features are the mel-cepstral (MFCC) and perceptual linear predictive (PLP) coefficients. For images, the derived features are the geometric parameters of the speaker's mouth. The hierarchical classifier generally consists of a clustering stage, based on Kohonen Self-Organizing Maps (SOM), and a final stage based on a powerful classification algorithm, the Support Vector Machine (SVM). The system, in specific variants, is applied with good results to two tasks: the first is bimodal speech recognition, which fuses features obtained from the speech signal with features obtained from the speaker's image; the second is music retrieval from a large music database.
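One plausible arrangement of such a two-stage hierarchical classifier is sketched below, with scikit-learn's KMeans standing in for the SOM clustering stage and synthetic vectors standing in for the MFCC/PLP features; all names and sizes are illustrative assumptions, not the authors' configuration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for MFCC/PLP parameter vectors with class labels.
X, y = make_classification(n_samples=600, n_features=13, n_informative=8,
                           n_classes=3, random_state=0)

# Stage 1: unsupervised clustering (KMeans standing in for a SOM).
clusterer = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Stage 2: one SVM per cluster makes the final decision.
models = {}
for c in range(clusterer.n_clusters):
    mask = clusterer.labels_ == c
    classes = np.unique(y[mask])
    # A pure cluster needs no SVM; remember its single label instead.
    models[c] = classes[0] if classes.size == 1 else SVC().fit(X[mask], y[mask])

def predict(x):
    c = int(clusterer.predict(x.reshape(1, -1))[0])
    m = models[c]
    return m if not hasattr(m, "predict") else m.predict(x.reshape(1, -1))[0]

print(predict(X[0]), "true label:", y[0])
```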
ESTuber db: an online database for Tuber borchii EST sequences.
Lazzari, Barbara; Caprera, Andrea; Cosentino, Cristian; Stella, Alessandra; Milanesi, Luciano; Viotti, Angelo
2007-03-08
The ESTuber database (http://www.itb.cnr.it/estuber) includes 3,271 Tuber borchii expressed sequence tags (EST). The dataset consists of 2,389 sequences from an in-house prepared cDNA library from truffle vegetative hyphae, and 882 sequences downloaded from GenBank and representing four libraries from white truffle mycelia and ascocarps at different developmental stages. An automated pipeline was prepared to process EST sequences using public software integrated by in-house developed Perl scripts. Data were collected in a MySQL database, which can be queried via a php-based web interface. Sequences included in the ESTuber db were clustered and annotated against three databases: the GenBank nr database, the UniProtKB database and a third in-house prepared database of fungi genomic sequences. An algorithm was implemented to infer statistical classification among Gene Ontology categories from the ontology occurrences deduced from the annotation procedure against the UniProtKB database. Ontologies were also deduced from the annotation of more than 130,000 EST sequences from five filamentous fungi, for intra-species comparison purposes. Further analyses were performed on the ESTuber db dataset, including tandem repeats search and comparison of the putative protein dataset inferred from the EST sequences to the PROSITE database for protein patterns identification. All the analyses were performed both on the complete sequence dataset and on the contig consensus sequences generated by the EST assembly procedure. The resulting web site is a resource of data and links related to truffle expressed genes. The Sequence Report and Contig Report pages are the web interface core structures which, together with the Text search utility and the Blast utility, allow easy access to the data stored in the database.
Wang, Lizhu; Riseng, Catherine M.; Mason, Lacey; Werhrly, Kevin; Rutherford, Edward; McKenna, James E.; Castiglione, Chris; Johnson, Lucinda B.; Infante, Dana M.; Sowa, Scott P.; Robertson, Mike; Schaeffer, Jeff; Khoury, Mary; Gaiot, John; Hollenhurst, Tom; Brooks, Colin N.; Coscarelli, Mark
2015-01-01
Managing the world's largest and most complex freshwater ecosystem, the Laurentian Great Lakes, requires a spatially hierarchical basin-wide database of ecological and socioeconomic information that is comparable across the region. To meet such a need, we developed a spatial classification framework and database — Great Lakes Aquatic Habitat Framework (GLAHF). GLAHF consists of catchments, coastal terrestrial, coastal margin, nearshore, and offshore zones that encompass the entire Great Lakes Basin. The catchments captured in the database as river pour points or coastline segments are attributed with data known to influence physicochemical and biological characteristics of the lakes from the catchments. The coastal terrestrial zone consists of 30-m grid cells attributed with data from the terrestrial region that has direct connection with the lakes. The coastal margin and nearshore zones consist of 30-m grid cells attributed with data describing the coastline conditions, coastal human disturbances, and moderately to highly variable physicochemical and biological characteristics. The offshore zone consists of 1.8-km grid cells attributed with data that are spatially less variable compared with the other aquatic zones. These spatial classification zones and their associated data are nested within lake sub-basins and political boundaries and allow the synthesis of information from grid cells to classification zones, within and among political boundaries, lake sub-basins, Great Lakes, or within the entire Great Lakes Basin. This spatially structured database could help the development of basin-wide management plans, prioritize locations for funding and specific management actions, track protection and restoration progress, and conduct research for science-based decision making.
NASA Scope and Subject Category Guide
NASA Technical Reports Server (NTRS)
2011-01-01
This guide provides a simple, effective tool to assist aerospace information analysts and database builders in the high-level subject classification of technical materials. Each of the 76 subject categories comprising the classification scheme is presented with a description of category scope, a listing of subtopics, cross references, and an indication of particular areas of NASA interest. The guide also includes an index of nearly 3,000 specific research topics cross referenced to the subject categories. The portable document format (PDF) version of the guide contains links in the index from each input subject to its corresponding categories. In addition to subject classification, the guide can serve as an aid to searching databases that use the classification scheme, and is also an excellent selection guide for those involved in the acquisition of aerospace literature. The CD-ROM contains both HTML and PDF versions.
3D multi-view convolutional neural networks for lung nodule classification
Kang, Guixia; Hou, Beibei; Zhang, Ningbo
2017-01-01
The 3D convolutional neural network (CNN) is able to make full use of the spatial 3D context information of lung nodules, and the multi-view strategy has been shown to be useful for improving the performance of 2D CNN in classifying lung nodules. In this paper, we explore the classification of lung nodules using the 3D multi-view convolutional neural networks (MV-CNN) with both chain architecture and directed acyclic graph architecture, including 3D Inception and 3D Inception-ResNet. All networks employ the multi-view-one-network strategy. We conduct a binary classification (benign and malignant) and a ternary classification (benign, primary malignant and metastatic malignant) on Computed Tomography (CT) images from Lung Image Database Consortium and Image Database Resource Initiative database (LIDC-IDRI). All results are obtained via 10-fold cross validation. As regards the MV-CNN with chain architecture, results show that the performance of 3D MV-CNN surpasses that of 2D MV-CNN by a significant margin. Finally, a 3D Inception network achieved an error rate of 4.59% for the binary classification and 7.70% for the ternary classification, both of which represent superior results for the corresponding task. We compare the multi-view-one-network strategy with the one-view-one-network strategy. The results reveal that the multi-view-one-network strategy can achieve a lower error rate than the one-view-one-network strategy. PMID:29145492
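The chain-architecture idea can be sketched at toy scale in PyTorch; the single-view 3D CNN below uses an illustrative 32³ nodule crop and layer widths that are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class Small3DCNN(nn.Module):
    """Toy 3D CNN for binary (benign/malignant) nodule classification."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                      # 32^3 -> 16^3
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                      # 16^3 -> 8^3
        )
        self.classifier = nn.Linear(32 * 8 * 8 * 8, n_classes)

    def forward(self, x):                         # x: (N, 1, 32, 32, 32)
        h = self.features(x)
        return self.classifier(h.flatten(1))

net = Small3DCNN()
crop = torch.randn(4, 1, 32, 32, 32)              # four random CT crops
print(net(crop).shape)                            # torch.Size([4, 2])
```

In the multi-view-one-network strategy described, several such views of the same nodule would share one network rather than each view getting its own.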
Tanihara, Shinichi
2015-01-01
Uncoded diagnoses in health insurance claims (HICs) may introduce bias into Japanese health statistics dependent on computerized HICs. This study's aim was to identify the causes and characteristics of uncoded diagnoses. Uncoded diagnoses from computerized HICs (outpatient, inpatient, and the diagnosis procedure-combination per-diem payment system [DPC/PDPS]) submitted to the National Health Insurance Organization of Kumamoto Prefecture in May 2010 were analyzed. The text documentation accompanying the uncoded diagnoses was used to classify diagnoses in accordance with the International Classification of Diseases-10 (ICD-10). The text documentation was also classified into four categories using the standard descriptions of diagnoses defined in the master files of the computerized HIC system: 1) standard descriptions of diagnoses, 2) standard descriptions with a modifier, 3) non-standard descriptions of diagnoses, and 4) unclassifiable text documentation. Using these classifications, the proportions of uncoded diagnoses by ICD-10 disease category were calculated. Of the uncoded diagnoses analyzed (n = 363 753), non-standard descriptions of diagnoses for outpatient, inpatient, and DPC/PDPS HICs comprised 12.1%, 14.6%, and 1.0% of uncoded diagnoses, respectively. The proportion of uncoded diagnoses with standard descriptions with a modifier for Diseases of the eye and adnexa was significantly higher than the overall proportion of uncoded diagnoses among every HIC type. The pattern of uncoded diagnoses differed by HIC type and disease category. Evaluating the proportion of uncoded diagnoses in all medical facilities and developing effective coding methods for diagnoses with modifiers, prefixes, and suffixes should reduce the number of uncoded diagnoses in computerized HICs and improve the quality of HIC databases.
Kalium: a database of potassium channel toxins from scorpion venom.
Kuzmenkov, Alexey I; Krylov, Nikolay A; Chugunov, Anton O; Grishin, Eugene V; Vassilevski, Alexander A
2016-01-01
Kalium (http://kaliumdb.org/) is a manually curated database that accumulates data on potassium channel toxins purified from scorpion venom (KTx). This database is an open-access resource, and provides easy access to pages of other databases of interest, such as UniProt, PDB, NCBI Taxonomy Browser, and PubMed. Key achievements of Kalium include strict yet straightforward regulation of KTx classification based on the unified nomenclature supported by researchers in the field; removal of peptides with partial sequences and of entries supported only by transcriptomic information; classification of β-family toxins; and addition of a novel λ-family. Molecules presented in the database can be processed by the Clustal Omega server using a one-click option. Molecular masses of mature peptides are calculated, and available activity data are compiled for all KTx. We believe that Kalium is not only of high interest to professional toxinologists, but also of general utility to the scientific community. Database URL: http://kaliumdb.org/. © The Author(s) 2016. Published by Oxford University Press.
Fujimura, Tomomi; Umemura, Hiroyuki
2018-01-15
The present study describes the development and validation of a facial expression database comprising five different horizontal face angles in dynamic and static presentations. The database includes twelve expression types portrayed by eight Japanese models. This database was inspired by the dimensional and categorical model of emotions: surprise, fear, sadness, anger with open mouth, anger with closed mouth, disgust with open mouth, disgust with closed mouth, excitement, happiness, relaxation, sleepiness, and neutral (static only). The expressions were validated using emotion classification and Affect Grid rating tasks [Russell, Weiss, & Mendelsohn, 1989. Affect Grid: A single-item scale of pleasure and arousal. Journal of Personality and Social Psychology, 57(3), 493-502]. The results indicate that most of the expressions were recognised as the intended emotions and could systematically represent affective valence and arousal. Furthermore, face angle and facial motion information influenced emotion classification and valence and arousal ratings. Our database will be available online at the following URL: https://www.dh.aist.go.jp/database/face2017/.
Exploring Deep Learning and Transfer Learning for Colonic Polyp Classification
Uhl, Andreas; Wimmer, Georg; Häfner, Michael
2016-01-01
Recently, Deep Learning, especially through Convolutional Neural Networks (CNNs), has been widely used to enable the extraction of highly representative features. This is done among the network layers by filtering, selecting, and using these features in the last fully connected layers for pattern classification. However, CNN training for automated endoscopic image classification still provides a challenge due to the lack of large and publicly available annotated databases. In this work we explore Deep Learning for the automated classification of colonic polyps using different configurations for training CNNs from scratch (or full training) and distinct architectures of pretrained CNNs tested on eight HD-endoscopic image databases acquired using different modalities. We compare our results with some commonly used features for colonic polyp classification and the good results suggest that features learned by CNNs trained from scratch and the “off-the-shelf” CNNs features can be highly relevant for automated classification of colonic polyps. Moreover, we also show that the combination of classical features and “off-the-shelf” CNNs features can be a good approach to further improve the results. PMID:27847543
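The "off-the-shelf" feature idea can be sketched with a torchvision ResNet-18 pretrained on ImageNet as a fixed extractor; the endoscopic data loader and the downstream classifier are omitted here as placeholders:

```python
import torch
import torch.nn as nn
from torchvision import models

# Pretrained network with its final fully connected layer removed,
# leaving a 512-dimensional feature extractor.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = nn.Identity()
backbone.eval()

@torch.no_grad()
def extract_features(batch):          # batch: (N, 3, 224, 224), normalized
    return backbone(batch)            # -> (N, 512)

feats = extract_features(torch.randn(8, 3, 224, 224))
print(feats.shape)                    # torch.Size([8, 512])
# These vectors can then be fed to a conventional classifier, alone or
# concatenated with classical hand-crafted features as the paper suggests.
```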
Future-oriented tweets predict lower county-level HIV prevalence in the United States.
Ireland, Molly E; Schwartz, H Andrew; Chen, Qijia; Ungar, Lyle H; Albarracín, Dolores
2015-12-01
Future orientation promotes health and well-being at the individual level. Computerized text analysis of a dataset encompassing billions of words used across the United States on Twitter tested whether community-level rates of future-oriented messages correlated with lower human immunodeficiency virus (HIV) rates and moderated the association between behavioral risk indicators and HIV. Over 150 million tweets mapped to U.S. counties were analyzed using 2 methods of text analysis. First, county-level HIV rates (cases per 100,000) were regressed on aggregate usage of future-oriented language (e.g., will, gonna). A second data-driven method regressed HIV rates on individual words and phrases. Results showed that counties with higher rates of future tense on Twitter had fewer HIV cases, independent of strong structural predictors of HIV such as population density. Future-oriented messages also appeared to buffer health risk: Sexually transmitted infection rates and references to risky behavior on Twitter were associated with higher HIV prevalence in all counties except those with high rates of future orientation. Data-driven analyses likewise showed that words and phrases referencing the future (e.g., tomorrow, would be) correlated with lower HIV prevalence. Integrating big data approaches to text analysis and epidemiology with psychological theory may provide an inexpensive, real-time method of anticipating outbreaks of HIV and etiologically similar diseases. (PsycINFO Database Record (c) 2015 APA, all rights reserved).
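The first analysis is an ordinary county-level regression; the sketch below uses synthetic data, and all variable names and coefficients are fabricated for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_counties = 200
future_tense = rng.random(n_counties)           # share of future-oriented tweets
pop_density = rng.lognormal(size=n_counties)    # structural control variable
hiv_rate = (50 - 20 * future_tense + 3 * np.log(pop_density)
            + rng.normal(size=n_counties))      # fabricated outcome

X = np.column_stack([future_tense, np.log(pop_density)])
fit = LinearRegression().fit(X, hiv_rate)
print(fit.coef_)  # negative first coefficient: more future talk, fewer cases
```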
Rossen, Janne; Lucovnik, Miha; Eggebø, Torbjørn Moe; Tul, Natasa; Murphy, Martina; Vistad, Ingvild; Robson, Michael
2017-07-12
Internationally, the 10-Group Classification System (TGCS) has been used to report caesarean section rates, but analysis of other outcomes is also recommended. We now aim to present the TGCS as a method to assess outcomes of labour and delivery using routine collection of perinatal information. This research is a methodological study describing the use of the TGCS in three settings: Stavanger University Hospital (SUH), Norway; the National Maternity Hospital Dublin, Ireland; and the Slovenian National Perinatal Database (SLO), Slovenia, covering 9848, 9250 and 106 167 women, respectively. All women were classified according to the TGCS, within which caesarean section, oxytocin augmentation, epidural analgesia, operative vaginal deliveries, episiotomy, sphincter rupture, postpartum haemorrhage, blood transfusion, maternal age >35 years, body mass index >30, Apgar score, umbilical cord pH, hypoxic-ischaemic encephalopathy, and antepartum and perinatal deaths were incorporated. There were significant differences in the sizes of the groups of women and the incidences of events and outcomes within the TGCS between the three perinatal databases. The TGCS is a standardised objective classification system into which events and outcomes of labour and delivery can be incorporated. Obstetric core events and outcomes should be agreed and defined to set standards of care. This method provides continuous and readily available observations from delivery wards that can be used for further interpretation, questions and international comparisons. The definition of quality may vary between units and can only be ascertained when all the necessary information is available and considered together. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2017. All rights reserved. No commercial use is permitted unless otherwise expressly granted.
Thomas, Paul D; Kejariwal, Anish; Campbell, Michael J; Mi, Huaiyu; Diemer, Karen; Guo, Nan; Ladunga, Istvan; Ulitsky-Lazareva, Betty; Muruganujan, Anushya; Rabkin, Steven; Vandergriff, Jody A; Doremieux, Olivier
2003-01-01
The PANTHER database was designed for high-throughput analysis of protein sequences. One of the key features is a simplified ontology of protein function, which allows browsing of the database by biological functions. Biologist curators have associated the ontology terms with groups of protein sequences rather than individual sequences. Statistical models (Hidden Markov Models, or HMMs) are built from each of these groups. The advantage of this approach is that new sequences can be automatically classified as they become available. To ensure accurate functional classification, HMMs are constructed not only for families, but also for functionally distinct subfamilies. Multiple sequence alignments and phylogenetic trees, including curator-assigned information, are available for each family. The current version of the PANTHER database includes training sequences from all organisms in the GenBank non-redundant protein database, and the HMMs have been used to classify gene products across the entire genomes of human and Drosophila melanogaster. The ontology terms and protein families and subfamilies, as well as Drosophila gene classifications, can be browsed and searched for free. Due to outstanding contractual obligations, access to human gene classifications and to protein family trees and multiple sequence alignments will temporarily require a nominal registration fee. PANTHER is publicly available on the web at http://panther.celera.com.
Identifying sports videos using replay, text, and camera motion features
NASA Astrophysics Data System (ADS)
Kobla, Vikrant; DeMenthon, Daniel; Doermann, David S.
1999-12-01
Automated classification of digital video is emerging as an important piece of the puzzle in the design of content management systems for digital libraries. The ability to classify videos into various classes such as sports, news, movies, or documentaries increases the efficiency of indexing, browsing, and retrieval of video in large databases. In this paper, we discuss the extraction of features that enable identification of sports videos directly from the compressed domain of MPEG video. These features include detecting the presence of action replays, determining the amount of scene text in video, and calculating various statistics on camera and/or object motion. The features are derived from the macroblock, motion, and bit-rate information that is readily accessible from MPEG video with very minimal decoding, leading to substantial gains in processing speeds. Full decoding of selective frames is required only for text analysis. A decision tree classifier built using these features is able to identify sports clips with an accuracy of about 93 percent.
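A toy version of the final step is a decision tree over the three feature families named above; all feature values below are fabricated for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Each clip: [replay_detected (0/1), scene_text_fraction, mean_motion_magnitude]
X = [[1, 0.30, 7.2], [1, 0.25, 6.8], [0, 0.02, 1.1],
     [0, 0.40, 2.0], [1, 0.35, 8.0], [0, 0.05, 1.5]]
y = [1, 1, 0, 0, 1, 0]   # 1 = sports, 0 = non-sports

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(clf.predict([[1, 0.28, 7.5]]))  # -> [1]
```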
An automatic graph-based approach for artery/vein classification in retinal images.
Dashtbozorg, Behdad; Mendonça, Ana Maria; Campilho, Aurélio
2014-03-01
The classification of retinal vessels into artery/vein (A/V) is an important phase for automating the detection of vascular changes, and for the calculation of characteristic signs associated with several systemic diseases such as diabetes, hypertension, and other cardiovascular conditions. This paper presents an automatic approach for A/V classification based on the analysis of a graph extracted from the retinal vasculature. The proposed method classifies the entire vascular tree deciding on the type of each intersection point (graph nodes) and assigning one of two labels to each vessel segment (graph links). Final classification of a vessel segment as A/V is performed through the combination of the graph-based labeling results with a set of intensity features. The results of this proposed method are compared with manual labeling for three public databases. Accuracy values of 88.3%, 87.4%, and 89.8% are obtained for the images of the INSPIRE-AVR, DRIVE, and VICAVR databases, respectively. These results demonstrate that our method outperforms recent approaches for A/V classification.
Marking Importance in Lectures: Interactive and Textual Orientation
ERIC Educational Resources Information Center
Deroey, Katrien L. B.
2015-01-01
This paper provides a comprehensive overview of lexicogrammatical markers of important lecture points and proposes a classification in terms of their interactive and textual orientation. The importance markers were extracted from the British Academic Spoken English corpus using corpus-driven and corpus-based methods. The classification is based on…
Inter-Coder Agreement in One-to-Many Classification: Fuzzy Kappa.
Kirilenko, Andrei P; Stepchenkova, Svetlana
2016-01-01
Content analysis involves classification of textual, visual, or audio data. The inter-coder agreement is estimated by having two or more coders classify the same data units, with subsequent comparison of their results. The existing methods of agreement estimation, e.g., Cohen's kappa, require that coders place each unit of content into one and only one category (one-to-one coding) from the pre-established set of categories. However, in certain data domains (e.g., maps, photographs, databases of texts and images), this requirement seems overly restrictive. The restriction could be lifted, provided that there is a measure to calculate the inter-coder agreement in the one-to-many protocol. Building on the existing approaches to one-to-many coding in geography and biomedicine, such a measure, fuzzy kappa, an extension of Cohen's kappa, is proposed. It is argued that the measure is especially compatible with data from certain domains, when holistic reasoning of human coders is utilized in order to describe the data and assess the meaning of communication.
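A simplified sketch of the one-to-many idea follows; it is not the paper's exact fuzzy kappa formula. Per-unit agreement is scored by set overlap (Jaccard), and chance agreement is estimated here by averaging over all cross-unit pairings:

```python
import itertools
import numpy as np

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def fuzzy_kappa_sketch(coder1, coder2):
    """coder1, coder2: lists of category sets, one set per content unit."""
    # Observed agreement: mean overlap on matched units.
    po = np.mean([jaccard(a, b) for a, b in zip(coder1, coder2)])
    # Chance agreement: mean overlap over all cross-unit pairings.
    pe = np.mean([jaccard(a, b) for a, b in itertools.product(coder1, coder2)])
    return (po - pe) / (1 - pe)

c1 = [{"nature"}, {"people", "nature"}, {"city"}]
c2 = [{"nature"}, {"people"}, {"city", "people"}]
print(round(fuzzy_kappa_sketch(c1, c2), 3))
```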
Mehryary, Farrokh; Kaewphan, Suwisa; Hakala, Kai; Ginter, Filip
2016-01-01
Biomedical event extraction is one of the key tasks in biomedical text mining, supporting various applications such as database curation and hypothesis generation. Several systems, some of which have been applied at a large scale, have been introduced to solve this task. Past studies have shown that the identification of the phrases describing biological processes, also known as trigger detection, is a crucial part of event extraction, and notable overall performance gains can be obtained by solely focusing on this sub-task. In this paper we propose a novel approach for filtering falsely identified triggers from large-scale event databases, thus improving the quality of knowledge extraction. Our method relies on state-of-the-art word embeddings, event statistics gathered from the whole biomedical literature, and both supervised and unsupervised machine learning techniques. We focus on EVEX, an event database covering the whole PubMed and PubMed Central Open Access literature containing more than 40 million extracted events. The top most frequent EVEX trigger words are hierarchically clustered, and the resulting cluster tree is pruned to identify words that can never act as triggers regardless of their context. For rarely occurring trigger words we introduce a supervised approach trained on the combination of trigger word classification produced by the unsupervised clustering method and manual annotation. The method is evaluated on the official test set of BioNLP Shared Task on Event Extraction. The evaluation shows that the method can be used to improve the performance of the state-of-the-art event extraction systems. This successful effort also translates into removing 1,338,075 of potentially incorrect events from EVEX, thus greatly improving the quality of the data. The method is not solely bound to the EVEX resource and can be thus used to improve the quality of any event extraction system or database. The data and source code for this work are available at: http://bionlp-www.utu.fi/trigger-clustering/.
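The clustering step can be sketched with SciPy's agglomerative tools; random vectors stand in for the word embeddings and the cut threshold is illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
words = ["expression", "binding", "the", "of", "phosphorylation", "and"]
vectors = rng.normal(size=(len(words), 50))   # stand-in word embeddings

# Build the cluster tree on cosine distance and cut it at a threshold.
tree = linkage(vectors, method="average", metric="cosine")
labels = fcluster(tree, t=0.9, criterion="distance")

for cluster in sorted(set(labels)):
    members = [w for w, c in zip(words, labels) if c == cluster]
    print(cluster, members)
# Clusters dominated by function words would be pruned as never-triggers.
```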
A Descriptive and Interpretative Information System for the IODP
NASA Astrophysics Data System (ADS)
Blum, P.; Foster, P. A.; Mateo, Z.
2006-12-01
The ODP/IODP has a long and rich history of collecting descriptive and interpretative information (DESCINFO) from rock and sediment cores from the world's oceans. Unlike instrumental data, DESCINFO generated by subject experts is biased by the scientific and cultural background of the observers and their choices of classification schemes. As a result, global searches of DESCINFO and its integration with other data are problematic. To address this issue, the IODP-USIO is in the process of designing and implementing a DESCINFO system for IODP Phase 2 (2007-2013) that meets the user expectations expressed over the past decade. The requirements include support of (1) detailed, material property-based descriptions as well as classification-based descriptions; (2) global searches by physical sample and digital data sources as well as any of the descriptive parameters; (3) user-friendly data capture tools for a variety of workflows; (4) extensive visualization of DESCINFO data along with instrumental data and images; and (5) portability/interoperability such that the system can work with database schemas of other organizations - a specific challenge given the schema and semantic heterogeneity not only among the three IODP operators but within the geosciences in general. The DESCINFO approach is based on the definition of a set of generic observable parameters that are populated with numeric or text values. Text values are derived from controlled, extensible hierarchical value lists that allow descriptions at the appropriate level of detail and ensure successful data searches. Material descriptions can be completed independently of domain-specific classifications, genetic concepts, and interpretative frameworks.
Iris Image Classification Based on Hierarchical Visual Codebook.
Zhenan Sun; Hui Zhang; Tieniu Tan; Jianyu Wang
2014-06-01
Iris recognition as a reliable method for personal identification has been well studied, with the objective of assigning the class label of each iris image to a unique subject. In contrast, iris image classification aims to classify an iris image into an application-specific category, e.g., iris liveness detection (classification of genuine and fake iris images), race classification (e.g., classification of iris images of Asian and non-Asian subjects), or coarse-to-fine iris identification (classification of all iris images in the central database into multiple categories). This paper proposes a general framework for iris image classification based on texture analysis. A novel texture pattern representation method called the Hierarchical Visual Codebook (HVC) is proposed to encode the texture primitives of iris images. The proposed HVC method is an integration of two existing Bag-of-Words models, namely the Vocabulary Tree (VT) and Locality-constrained Linear Coding (LLC). The HVC adopts a coarse-to-fine visual coding strategy and takes advantage of both VT and LLC for accurate and sparse representation of iris texture. Extensive experimental results demonstrate that the proposed iris image classification method achieves state-of-the-art performance for iris liveness detection, race classification, and coarse-to-fine iris identification. A comprehensive fake iris image database simulating four types of iris spoof attacks is developed as the benchmark for research on iris liveness detection.
Benchmarking protein classification algorithms via supervised cross-validation.
Kertész-Farkas, Attila; Dhir, Somdutta; Sonego, Paolo; Pacurar, Mircea; Netoteia, Sergiu; Nijveen, Harm; Kuzniar, Arnold; Leunissen, Jack A M; Kocsor, András; Pongor, Sándor
2008-04-24
Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups vastly different in the number of members, in average protein size, similarity within group, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates on how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database, has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced-size model datasets, suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection that currently includes protein sequence (including protein domains and entire proteins), protein structure and reading-frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with comparison algorithms, BLAST, Smith-Waterman, Needleman-Wunsch, as well as the 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic, estimates of classifier performance than do random cross-validation schemes.
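Holding out whole subtypes rather than random samples maps naturally onto scikit-learn's GroupKFold; a sketch on synthetic data with hypothetical subtype labels:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
# Each sample's known subtype (e.g., a family within a superfamily);
# GroupKFold never splits a subtype between train and test.
subtype = np.repeat(np.arange(10), 30)

scores = cross_val_score(SVC(), X, y, cv=GroupKFold(n_splits=5), groups=subtype)
print(scores.mean())  # the paper argues such estimates are more realistic
```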
A machine learning approach to multi-level ECG signal quality classification.
Li, Qiao; Rajagopalan, Cadathur; Clifford, Gari D
2014-12-01
Current electrocardiogram (ECG) signal quality assessment studies have aimed to provide a two-level classification: clean or noisy. However, clinical usage demands more specific noise-level classification for varying applications. This work outlines a five-level ECG signal quality classification algorithm. A total of 13 signal quality metrics were derived from segments of ECG waveforms, which were labeled by experts. A support vector machine (SVM) was trained to perform the classification and tested on a simulated dataset and was validated using data from the MIT-BIH arrhythmia database (MITDB). The simulated training and test datasets were created by selecting clean segments of the ECG in the 2011 PhysioNet/Computing in Cardiology Challenge database, and adding three types of real ECG noise at different signal-to-noise ratio (SNR) levels from the MIT-BIH Noise Stress Test Database (NSTDB). The MITDB was re-annotated for five levels of signal quality. Different combinations of the 13 metrics were trained and tested on the simulated datasets and the best combination that produced the highest classification accuracy was selected and validated on the MITDB. Performance was assessed using classification accuracy (Ac), and a single-class overlap accuracy (OAc), which assumes that an individual type classified into an adjacent class is acceptable. An Ac of 80.26% and an OAc of 98.60% on the test set were obtained by selecting 10 metrics, while 57.26% (Ac) and 94.23% (OAc) were the numbers for the unseen MITDB validation data without retraining. By performing five-fold cross-validation, an Ac of 88.07±0.32% and an OAc of 99.34±0.07% were gained on the validation fold of MITDB. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
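The single-class overlap accuracy (OAc) treats a prediction into an adjacent quality level as acceptable; a small sketch of both metrics on made-up labels:

```python
import numpy as np

def accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(y_true == y_pred)

def overlap_accuracy(y_true, y_pred):
    """OAc: prediction into the true or an adjacent quality level counts."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs(y_true - y_pred) <= 1)

y_true = [1, 2, 3, 4, 5, 3, 2]   # five quality levels
y_pred = [1, 3, 3, 5, 5, 1, 2]
print(accuracy(y_true, y_pred), overlap_accuracy(y_true, y_pred))
# -> 0.571... and 0.857...
```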
Data-Driven Learning of Q-Matrix
ERIC Educational Resources Information Center
Liu, Jingchen; Xu, Gongjun; Ying, Zhiliang
2012-01-01
The recent surge of interests in cognitive assessment has led to developments of novel statistical models for diagnostic classification. Central to many such models is the well-known "Q"-matrix, which specifies the item-attribute relationships. This article proposes a data-driven approach to identification of the "Q"-matrix and estimation of…
Construction accident narrative classification: An evaluation of text mining techniques.
Goh, Yang Miang; Ubeynarayana, C U
2017-11-01
Learning from past accidents is fundamental to accident prevention. Thus, accident and near-miss reporting are encouraged by organizations and regulators. However, for organizations managing large safety databases, the time taken to accurately classify accident and near-miss narratives will be very significant. This study aims to evaluate the utility of various text mining classification techniques in classifying 1000 publicly available construction accident narratives obtained from the US OSHA website. The study evaluated six machine learning algorithms, including support vector machine (SVM), linear regression (LR), random forest (RF), k-nearest neighbor (KNN), decision tree (DT) and Naive Bayes (NB), and found that SVM produced the best performance in classifying the test set of 251 cases. Further experimentation with tokenization of the processed text and non-linear SVM were also conducted. In addition, a grid search was conducted on the hyperparameters of the SVM models. It was found that the best performing classifiers were the linear SVM with unigram tokenization and the radial basis function (RBF) SVM with unigram tokenization. In view of its relative simplicity, the linear SVM is recommended. Across the 11 labels of accident causes or types, the precision of the linear SVM ranged from 0.5 to 1, recall ranged from 0.36 to 0.9 and the F1 score was between 0.45 and 0.92. The reasons for misclassification were discussed and suggestions on ways to improve the performance were provided. Copyright © 2017 Elsevier Ltd. All rights reserved.
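The recommended configuration, a linear SVM over unigram features with a hyperparameter grid, can be sketched as a scikit-learn pipeline; the narratives and labels below are placeholders, not OSHA data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

narratives = [
    "worker fell from scaffold", "slipped and fell on wet floor",
    "struck by falling beam", "struck by reversing truck",
    "electrocuted while wiring panel", "contact with live power line",
]
labels = ["fall", "fall", "struck_by", "struck_by",
          "electrocution", "electrocution"]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 1))),  # unigram tokenization
    ("svm", LinearSVC()),
])
grid = GridSearchCV(pipe, {"svm__C": [0.1, 1, 10]}, cv=2)
grid.fit(narratives, labels)
print(grid.best_params_, grid.predict(["fell off a ladder"]))
```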
A Hybrid Semi-supervised Classification Scheme for Mining Multisource Geospatial Data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vatsavai, Raju; Bhaduri, Budhendra L
2011-01-01
Supervised learning methods such as Maximum Likelihood (ML) are often used in land cover (thematic) classification of remote sensing imagery. The ML classifier relies exclusively on spectral characteristics of thematic classes whose statistical distributions (class conditional probability densities) are often overlapping. The spectral response distributions of thematic classes are dependent on many factors including elevation, soil types, and ecological zones. A second problem with statistical classifiers is the requirement of a large number of accurate training samples (10 to 30 |dimensions|), which are often costly and time consuming to acquire over large geographic regions. With the increasing availability of geospatial databases, it is possible to exploit the knowledge derived from these ancillary datasets to improve classification accuracies even when the class distributions are highly overlapping. Likewise, newer semi-supervised techniques can be adopted to improve the parameter estimates of the statistical model by utilizing a large number of easily available unlabeled training samples. Unfortunately, there is no convenient multivariate statistical model that can be employed for multisource geospatial databases. In this paper we present a hybrid semi-supervised learning algorithm that effectively exploits freely available unlabeled training samples from multispectral remote sensing images and also incorporates ancillary geospatial databases. We have conducted several experiments on real datasets, and our new hybrid approach shows over 25 to 35% improvement in overall classification accuracy over conventional classification schemes.
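The core semi-supervised idea, letting abundant unlabeled samples refine a model trained on few labels, can be sketched with scikit-learn's SelfTrainingClassifier; this is a generic stand-in, not the paper's hybrid algorithm:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
# Pretend only ~5% of the training samples are labeled (-1 = unlabeled).
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) > 0.05] = -1

# Confident pseudo-labels from the base model are folded back into training.
model = SelfTrainingClassifier(GaussianNB(), threshold=0.9).fit(X, y_partial)
print((model.predict(X) == y).mean())
```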
Deep transfer learning for automatic target classification: MWIR to LWIR
NASA Astrophysics Data System (ADS)
Ding, Zhengming; Nasrabadi, Nasser; Fu, Yun
2016-05-01
When dealing with sparse or no labeled data in the target domain, transfer learning shows its appealing performance by borrowing the supervised knowledge from external domains. Recently deep structure learning has been exploited in transfer learning due to its attractive power in extracting effective knowledge through multi-layer strategy, so that deep transfer learning is promising to address the cross-domain mismatch. In general, cross-domain disparity can be resulted from the difference between source and target distributions or different modalities, e.g., Midwave IR (MWIR) and Longwave IR (LWIR). In this paper, we propose a Weighted Deep Transfer Learning framework for automatic target classification through a task-driven fashion. Specifically, deep features and classifier parameters are obtained simultaneously for optimal classification performance. In this way, the proposed deep structures can extract more effective features with the guidance of the classifier performance; on the other hand, the classifier performance is further improved since it is optimized on more discriminative features. Furthermore, we build a weighted scheme to couple source and target output by assigning pseudo labels to target data, therefore we can transfer knowledge from source (i.e., MWIR) to target (i.e., LWIR). Experimental results on real databases demonstrate the superiority of the proposed algorithm by comparing with others.
Automatic classification and detection of clinically relevant images for diabetic retinopathy
NASA Astrophysics Data System (ADS)
Xu, Xinyu; Li, Baoxin
2008-03-01
We proposed a novel approach to automatic classification of Diabetic Retinopathy (DR) images and retrieval of clinically-relevant DR images from a database. Given a query image, our approach first classifies the image into one of the three categories: microaneurysm (MA), neovascularization (NV) and normal, and then it retrieves DR images that are clinically-relevant to the query image from an archival image database. In the classification stage, the query DR images are classified by the Multi-class Multiple-Instance Learning (McMIL) approach, where images are viewed as bags, each of which contains a number of instances corresponding to non-overlapping blocks, and each block is characterized by low-level features including color, texture, histogram of edge directions, and shape. McMIL first learns a collection of instance prototypes for each class that maximizes the Diverse Density function using the Expectation-Maximization algorithm. A nonlinear mapping is then defined using the instance prototypes, mapping every bag to a point in a new multi-class bag feature space. Finally a multi-class Support Vector Machine is trained in the multi-class bag feature space. In the retrieval stage, we retrieve images from the archival database that bear the same label as the query image and that are the top K nearest neighbors of the query image in terms of similarity in the multi-class bag feature space. The classification approach achieves high classification accuracy, and the retrieval of clinically-relevant images not only facilitates utilization of the vast amount of hidden diagnostic knowledge in the database, but also improves the efficiency and accuracy of DR lesion diagnosis and assessment.
Van Berkel, Gary J; Kertesz, Vilmos
2017-02-15
An "Open Access"-like mass spectrometric platform to fully utilize the simplicity of the manual open port sampling interface for rapid characterization of unprocessed samples by liquid introduction atmospheric pressure ionization mass spectrometry has been lacking. The in-house developed integrated software with a simple, small and relatively low-cost mass spectrometry system introduced here fills this void. Software was developed to operate the mass spectrometer, to collect and process mass spectrometric data files, to build a database and to classify samples using such a database. These tasks were accomplished via the vendor-provided software libraries. Sample classification based on spectral comparison utilized the spectral contrast angle method. Using the developed software platform near real-time sample classification is exemplified using a series of commercially available blue ink rollerball pens and vegetable oils. In the case of the inks, full scan positive and negative ion ESI mass spectra were both used for database generation and sample classification. For the vegetable oils, full scan positive ion mode APCI mass spectra were recorded. The overall accuracy of the employed spectral contrast angle statistical model was 95.3% and 98% in case of the inks and oils, respectively, using leave-one-out cross-validation. This work illustrates that an open port sampling interface/mass spectrometer combination, with appropriate instrument control and data processing software, is a viable direct liquid extraction sampling and analysis system suitable for the non-expert user and near real-time sample classification via database matching. Published in 2016. This article is a U.S. Government work and is in the public domain in the USA. Published in 2016. This article is a U.S. Government work and is in the public domain in the USA.
Aman, Malin; Forssblad, Magnus; Henriksson-Larsén, Karin
2014-06-12
Before preventive actions can be suggested for sports injuries at the national level, a solid surveillance system is required in order to study their epidemiology, risk factors and mechanisms. There are guidelines for sports injury data collection and classifications in the literature for that purpose. In Sweden, 90% of all athletes (57/70 sports federations) are insured with the same insurance company and data from their database could be a foundation for studies on acute sports injuries at the national level. To evaluate the usefulness of sports injury insurance claims data in sports injury surveillance at the national level. A database with 27 947 injuries was exported to an Excel file. Access to the corresponding text files was also obtained. Data were reviewed on available information, missing information and dropouts. Comparison with ASIDD (Australian Sports Injury Data Dictionary) and existing consensus statements in the literature (football (soccer), rugby union, tennis, cricket and thoroughbred horse racing) was performed in a structured manner. Comparison with ASIDD showed that 93% of the suggested data items were present in the database to at least some extent. Compliance with the consensus statements was generally high (13/18). Almost all claims (83%) contained text information concerning the injury. Relatively high-quality sports injury data can be obtained from a specific insurance company at the national level in Sweden. The database has the potential to be a solid base for research on acute sports injuries in different sports at the national level. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.
Davis, Philip A.; Grolier, Maurice J.
1984-01-01
Landsat multispectral scanner (MSS) band and band-ratio databases of two scenes covering the Midyan region of northwestern Saudi Arabia were examined quantitatively and qualitatively to determine which databases best discriminate the geologic units of this semi-arid and arid region. Unsupervised linear-discriminant cluster analysis was performed on two band-ratio combinations and on the MSS bands for both scenes. The results for granitoid-rock discrimination indicated that the classification images using the MSS bands are superior to the band-ratio classification images, for two reasons discussed in the paper. Yet the effects of topography and material type (including desert varnish) on the MSS-band data produced ambiguities in the MSS-band classification results. However, these ambiguities were clarified by using a simulated natural-color image in conjunction with the MSS-band classification image.
Exploring the CAESAR database using dimensionality reduction techniques
NASA Astrophysics Data System (ADS)
Mendoza-Schrock, Olga; Raymer, Michael L.
2012-06-01
The Civilian American and European Surface Anthropometry Resource (CAESAR) database containing over 40 anthropometric measurements on over 4000 humans has been extensively explored for pattern recognition and classification purposes using the raw, original data [1-4]. However, some of the anthropometric variables would be impossible to collect in an uncontrolled environment. Here, we explore the use of dimensionality reduction methods in concert with a variety of classification algorithms for gender classification using only those variables that are readily observable in an uncontrolled environment. Several dimensionality reduction techniques are employed to learn the underlining structure of the data. These techniques include linear projections such as the classical Principal Components Analysis (PCA) and non-linear (manifold learning) techniques, such as Diffusion Maps and the Isomap technique. This paper briefly describes all three techniques, and compares three different classifiers, Naïve Bayes, Adaboost, and Support Vector Machines (SVM), for gender classification in conjunction with each of these three dimensionality reduction approaches.
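The comparison pipeline, each dimensionality reducer feeding each classifier, can be sketched as follows; synthetic data replaces the CAESAR variables and the component count is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=40, random_state=0)

# Every (reducer, classifier) pair is scored by cross-validation.
for reducer in [PCA(n_components=10), Isomap(n_components=10)]:
    for clf in [GaussianNB(), SVC()]:
        score = cross_val_score(make_pipeline(reducer, clf), X, y, cv=5).mean()
        print(type(reducer).__name__, type(clf).__name__, round(score, 3))
```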
Free-Text Disease Classification
2011-09-01
Naval Postgraduate School thesis, Monterey, California, September 2011. Author: Craig Maxey. Thesis Advisor: Lyn R. Whitaker; Second Reader: Samuel E. Buttrey; Department Chair: Robert F. Dell. Approved for public release.
Full-Text Databases in Medicine.
ERIC Educational Resources Information Center
Sievert, MaryEllen C.; And Others
1995-01-01
Describes types of full-text databases in medicine; discusses features for searching full-text journal databases available through online vendors; reviews research on full-text databases in medicine; and describes the MEDLINE/Full-Text Research Project at the University of Missouri (Columbia) which investigated precision, recall, and relevancy.…
On Mixed Data and Event Driven Design for Adaptive-Critic-Based Nonlinear $H_{\\infty}$ Control.
Wang, Ding; Mu, Chaoxu; Liu, Derong; Ma, Hongwen
2018-04-01
In this paper, based on the adaptive critic learning technique, the control for a class of unknown nonlinear dynamic systems is investigated by adopting a mixed data- and event-driven design approach. The nonlinear control problem is formulated as a two-player zero-sum differential game and the adaptive critic method is employed to cope with the data-based optimization. The novelty lies in combining the data-driven learning identifier with the event-driven design formulation in order to develop the adaptive critic controller, thereby accomplishing the nonlinear control. The event-driven optimal control law and the time-driven worst-case disturbance law are approximated by constructing and tuning a critic neural network. Applying the event-driven feedback control, the closed-loop system is built with stability analysis. Simulation studies are conducted to verify the theoretical results and illustrate the control performance. It is significant to observe that the present research provides a new avenue for integrating data-based control and event-triggering mechanisms into establishing advanced adaptive critic systems.
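For reference, the standard two-player zero-sum formulation behind such nonlinear $H_{\infty}$ designs, in generic notation rather than the paper's exact symbols:

```latex
% Dynamics and cost for the zero-sum game: the control u minimizes
% the cost while the disturbance w maximizes it.
\dot{x} = f(x) + g(x)\,u + k(x)\,w, \qquad
J(u,w) = \int_0^\infty \left( Q(x) + u^\top R\,u - \gamma^2 w^\top w \right) \mathrm{d}t .
% Saddle-point value function and the policies obtained from the
% Hamilton--Jacobi--Isaacs equation:
V^*(x) = \min_u \max_w J(u,w), \qquad
u^* = -\tfrac{1}{2} R^{-1} g(x)^\top \nabla V^*(x), \qquad
w^* = \tfrac{1}{2\gamma^2} k(x)^\top \nabla V^*(x).
```

In the adaptive critic approach described, the critic neural network approximates $V^*$, from which both policies follow.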
Visual Systems for Interactive Exploration and Mining of Large-Scale Neuroimaging Data Archives
Bowman, Ian; Joshi, Shantanu H.; Van Horn, John D.
2012-01-01
While technological advancements in neuroimaging scanner engineering have improved the efficiency of data acquisition, electronic data capture methods will likewise significantly expedite the populating of large-scale neuroimaging databases. As they do and these archives grow in size, a particular challenge lies in examining and interacting with the information that these resources contain through the development of compelling, user-driven approaches for data exploration and mining. In this article, we introduce the informatics visualization for neuroimaging (INVIZIAN) framework for the graphical rendering of, and dynamic interaction with the contents of large-scale neuroimaging data sets. We describe the rationale behind INVIZIAN, detail its development, and demonstrate its usage in examining a collection of over 900 T1-anatomical magnetic resonance imaging (MRI) image volumes from across a diverse set of clinical neuroimaging studies drawn from a leading neuroimaging database. Using a collection of cortical surface metrics and means for examining brain similarity, INVIZIAN graphically displays brain surfaces as points in a coordinate space and enables classification of clusters of neuroanatomically similar MRI images and data mining. As an initial step toward addressing the need for such user-friendly tools, INVIZIAN provides a highly unique means to interact with large quantities of electronic brain imaging archives in ways suitable for hypothesis generation and data mining. PMID:22536181
University Real Estate Development Database: A Database-Driven Internet Research Tool
ERIC Educational Resources Information Center
Wiewel, Wim; Kunst, Kara
2008-01-01
The University Real Estate Development Database is an Internet resource developed by the University of Baltimore for the Lincoln Institute of Land Policy, containing over six hundred cases of university expansion outside of traditional campus boundaries. The University Real Estate Development database is a searchable collection of real estate…
NASA Astrophysics Data System (ADS)
Hess, M. R.; Petrovic, V.; Kuester, F.
2017-08-01
Digital documentation of cultural heritage structures is increasingly common through the application of different imaging techniques. Many works have focused on the application of laser scanning and photogrammetry techniques for the acquisition of three-dimensional (3D) geometry detailing cultural heritage sites and structures. With an abundance of these 3D data assets, there must be a digital environment where these data can be visualized and analyzed. Presented here is a feedback-driven visualization framework that seamlessly enables interactive exploration and manipulation of massive point cloud data. The focus of this work is on the classification of different building materials with the goal of building more accurate as-built information models of historical structures. User-defined functions have been tested within the interactive point cloud visualization framework to evaluate automated and semi-automated classification of 3D point data. These functions include decisions based on observed color, laser intensity, normal vector or local surface geometry. Multiple case studies are presented here to demonstrate the flexibility and utility of the presented point cloud visualization framework to achieve classification objectives.
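Such user-defined classification rules over per-point attributes can be sketched in a few lines of NumPy; the attribute columns, thresholds and material names are all illustrative assumptions:

```python
import numpy as np

# Toy point cloud: columns = red, green, blue, laser intensity, normal_z.
points = np.array([
    [200, 190, 180, 0.90, 0.95],   # bright, near-horizontal surface
    [120,  90,  70, 0.40, 0.10],   # dark, near-vertical surface
    [205, 200, 195, 0.85, 0.05],   # bright, near-vertical surface
])

brightness = points[:, :3].mean(axis=1)
vertical = np.abs(points[:, 4]) < 0.3      # surface normal nearly horizontal

labels = np.full(len(points), "unknown", dtype=object)
labels[(brightness > 180) & ~vertical] = "plaster"
labels[(brightness <= 180) & vertical] = "brick"
labels[(brightness > 180) & vertical] = "stone"
print(labels)                              # ['plaster' 'brick' 'stone']
```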
Faust, Kevin; Xie, Quin; Han, Dominick; Goyle, Kartikay; Volynskaya, Zoya; Djuric, Ugljesa; Diamandis, Phedias
2018-05-16
There is growing interest in utilizing artificial intelligence, and particularly deep learning, for computer vision in histopathology. While accumulating studies highlight expert-level performance of convolutional neural networks (CNNs) on focused classification tasks, most studies rely on probability distribution scores with empirically defined cutoff values based on post-hoc analysis. More generalizable tools that allow humans to visualize histology-based deep learning inferences and decision making are scarce. Here, we leverage t-distributed Stochastic Neighbor Embedding (t-SNE) to reduce dimensionality and depict how CNNs organize histomorphologic information. Unique to our workflow, we develop a quantitative and transparent approach to visualizing classification decisions prior to softmax compression. By discretizing the relationships between classes on the t-SNE plot, we show we can super-impose randomly sampled regions of test images and use their distribution to render statistically-driven classifications. Therefore, in addition to providing intuitive outputs for human review, this visual approach can carry out automated and objective multi-class classifications similar to more traditional and less-transparent categorical probability distribution scores. Importantly, this novel classification approach is driven by a priori statistically defined cutoffs. It therefore serves as a generalizable classification and anomaly detection tool less reliant on post-hoc tuning. Routine incorporation of this convenient approach for quantitative visualization and error reduction in histopathology aims to accelerate early adoption of CNNs into generalized real-world applications where unanticipated and previously untrained classes are often encountered.
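The reduction step of this workflow can be sketched as follows: t-SNE embeds pre-softmax feature vectors in 2-D, where a new sample can be classified from the labels of its embedded neighbours; the features here are synthetic stand-ins for CNN activations:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for penultimate-layer CNN features of 300 image tiles, 3 classes.
features = np.vstack([rng.normal(loc=c * 4, size=(100, 128)) for c in range(3)])
classes = np.repeat([0, 1, 2], 100)

embedding = TSNE(n_components=2, perplexity=30,
                 random_state=0).fit_transform(features)

# Classify a tile by the majority class of its nearest embedded neighbours.
def classify(idx, k=15):
    d = np.linalg.norm(embedding - embedding[idx], axis=1)
    nearest = np.argsort(d)[1:k + 1]       # skip the tile itself
    return np.bincount(classes[nearest]).argmax()

print(classify(5), classify(150), classify(250))  # expected: 0 1 2
```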
Magrabi, Farah; Ong, Mei-Sing; Runciman, William; Coiera, Enrico
2010-01-01
To analyze patient safety incidents associated with computer use to develop the basis for a classification of problems reported by health professionals. Incidents submitted to a voluntary incident reporting database across one Australian state were retrieved and a subset (25%) was analyzed to identify 'natural categories' for classification. Two coders independently classified the remaining incidents into one or more categories. Free text descriptions were analyzed to identify contributing factors. Where available medical specialty, time of day and consequences were examined. Descriptive statistics; inter-rater reliability. A search of 42,616 incidents from 2003 to 2005 yielded 123 computer related incidents. After removing duplicate and unrelated incidents, 99 incidents describing 117 problems remained. A classification with 32 types of computer use problems was developed. Problems were grouped into information input (31%), transfer (20%), output (20%) and general technical (24%). Overall, 55% of problems were machine related and 45% were attributed to human-computer interaction. Delays in initiating and completing clinical tasks were a major consequence of machine related problems (70%) whereas rework was a major consequence of human-computer interaction problems (78%). While 38% (n=26) of the incidents were reported to have a noticeable consequence but no harm, 34% (n=23) had no noticeable consequence. Only 0.2% of all incidents reported were computer related. Further work is required to expand our classification using incident reports and other sources of information about healthcare IT problems. Evidence based user interface design must focus on the safe entry and retrieval of clinical information and support users in detecting and correcting errors and malfunctions.
Livingston, Kara A.; Chung, Mei; Sawicki, Caleigh M.; Lyle, Barbara J.; Wang, Ding Ding; Roberts, Susan B.; McKeown, Nicola M.
2016-01-01
Background Dietary fiber is a broad category of compounds historically defined as partially or completely indigestible plant-based carbohydrates and lignin with, more recently, the additional criteria that fibers incorporated into foods as additives should demonstrate functional human health outcomes to receive a fiber classification. Thousands of research studies have been published examining fibers and health outcomes. Objectives (1) Develop a database listing studies testing fiber and physiological health outcomes identified by experts at the Ninth Vahouny Conference; (2) Use evidence mapping methodology to summarize this body of literature. This paper summarizes the rationale, methodology, and resulting database. The database will help both scientists and policy-makers to evaluate evidence linking specific fibers with physiological health outcomes, and identify missing information. Methods To build this database, we conducted a systematic literature search for human intervention studies published in English from 1946 to May 2015. Our search strategy included a broad definition of fiber search terms, as well as search terms for nine physiological health outcomes identified at the Ninth Vahouny Fiber Symposium. Abstracts were screened using a priori defined eligibility criteria and a low threshold for inclusion to minimize the likelihood of rejecting articles of interest. Publications were then reviewed in full text, applying additional a priori defined exclusion criteria. The database was built and published on the Systematic Review Data Repository (SRDR™), a web-based, publicly available application. Conclusions A fiber database was created. This resource will reduce unnecessary replication of effort in conducting systematic reviews by serving both as a central database archiving PICO (population, intervention, comparator, outcome) data on published studies and as a searchable tool through which these data can be extracted and updated. PMID:27348733
Mujtaba, Ghulam; Shuib, Liyana; Raj, Ram Gopal; Rajandram, Retnagowri; Shaikh, Khairunisa
2018-07-01
Automatic text classification techniques are useful for classifying plaintext medical documents. This study aims to automatically predict the cause of death from free-text forensic autopsy reports by comparing various schemes for feature extraction, term weighting or feature value representation, text classification, and feature reduction. For the experiments, autopsy reports belonging to eight different causes of death were collected, preprocessed and converted into 43 master feature vectors using various schemes for feature extraction, representation, and reduction. Six different text classification techniques were applied to these 43 master feature vectors to construct a classification model that can predict the cause of death. Finally, classification model performance was evaluated using four performance measures, i.e., overall accuracy, macro precision, macro F-measure, and macro recall. From the experiments, it was found that unigram features obtained the highest performance compared to bigram, trigram, and hybrid-gram features. Among the feature representation schemes, term frequency and term frequency with inverse document frequency obtained similar and better results when compared with binary frequency and normalized term frequency with inverse document frequency. The chi-square feature reduction approach outperformed the Pearson correlation and information gain approaches. Finally, among the text classification algorithms, the support vector machine classifier outperformed random forest, Naive Bayes, k-nearest neighbor, decision tree, and ensemble-voted classifiers. Our results and comparisons hold practical importance and serve as references for future work. Moreover, the reported comparisons provide state-of-the-art baselines against which future proposals for automated text classification can be compared. Copyright © 2017 Elsevier Ltd and Faculty of Forensic and Legal Medicine. All rights reserved.
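As an illustration of the best-performing combination this abstract reports (unigram tf-idf features, chi-square feature selection, and a linear support vector machine), a hedged scikit-learn sketch; the toy reports, labels, and the k setting are invented:

```python
# Illustrative pipeline: unigram tf-idf features, chi-square feature selection,
# and a linear SVM, mirroring the best-performing combination reported above.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC

reports = ["blunt force trauma to the head ...", "asphyxia due to drowning ...",
           "extensive thermal burns ...", "gunshot wound to the chest ..."]
causes = ["trauma", "drowning", "burns", "gunshot"]   # toy cause-of-death labels

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 1))),   # unigram features
    ("select", SelectKBest(chi2, k="all")),           # in practice, e.g. k=1000
    ("svm", LinearSVC()),
])
model.fit(reports, causes)
print(model.predict(["deep burns over most of the body ..."]))
```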
The Power of Neuroimaging Biomarkers for Screening Frontotemporal Dementia
McMillan, Corey T.; Avants, Brian B.; Cook, Philip; Ungar, Lyle; Trojanowski, John Q.; Grossman, Murray
2014-01-01
Frontotemporal dementia (FTD) is a clinically and pathologically heterogeneous neurodegenerative disease that can result from either frontotemporal lobar degeneration (FTLD) or Alzheimer’s disease (AD) pathology. It is critical to establish statistically powerful biomarkers that can achieve substantial cost-savings and increase the feasibility of clinical trials. We assessed three broad categories of neuroimaging methods to screen underlying FTLD and AD pathology in a clinical FTD series: global measures (e.g., ventricular volume), anatomical volumes of interest (VOIs) (e.g., hippocampus) using a standard atlas, and data-driven VOIs using Eigenanatomy. We evaluated clinical FTD patients (N=93) with cerebrospinal fluid, gray matter (GM) MRI, and diffusion tensor imaging (DTI) to assess whether they had underlying FTLD or AD pathology. Linear regression was performed to identify the optimal VOIs for each method in a training dataset, and we then evaluated classification sensitivity and specificity in an independent test cohort. Power was evaluated by calculating the minimum sample sizes (mSS) required in the test classification analyses for each model. The data-driven VOI analysis using a multimodal combination of GM MRI and DTI achieved the greatest classification accuracy (89% sensitive; 89% specific) and required a lower minimum sample size (N=26) relative to anatomical VOI and global measures. We conclude that a data-driven VOI approach employing Eigenanatomy provides more accurate classification, benefits from increased statistical power in unseen datasets, and therefore provides a robust method for screening underlying pathology in FTD patients for entry into clinical trials. PMID:24687814
ERIC Educational Resources Information Center
International Federation of Library Associations and Institutions, London (England).
Five papers from the sessions of the International Federation of Library Associations and Institutions 1992 conference on classification, indexing, and cataloging are presented. Three papers deal with knowledge classification as it relates to database design, as it is practiced in India, and in a worldwide context. The remaining two papers focus…
TIM Barrel Protein Structure Classification Using Alignment Approach and Best Hit Strategy
NASA Astrophysics Data System (ADS)
Chu, Jia-Han; Lin, Chun Yuan; Chang, Cheng-Wen; Lee, Chihan; Yang, Yuh-Shyong; Tang, Chuan Yi
2007-11-01
The classification of protein structures is essential for their function determination in bioinformatics. It has been estimated that around 10% of all known enzymes have TIM barrel domains according to the Structural Classification of Proteins (SCOP) database. With its high sequence variation and diverse functionalities, the TIM barrel protein has become an attractive target for protein engineering and for evolutionary studies. Hence, in this paper, an alignment approach with the best hit strategy is proposed to classify TIM barrel protein structures at the superfamily and family levels of the SCOP. The approach is also used to classify at the class level of the Enzyme Nomenclature (ENZYME) database. Two test data sets, TIM40D and TIM95D, are used to evaluate this approach. The resulting classification has an overall prediction accuracy rate of 90.3% for the superfamily level in the SCOP, 89.5% for the family level in the SCOP and 70.1% for the class level in the ENZYME database. These results demonstrate that the alignment approach with the best hit strategy is a simple and viable method for TIM barrel protein structure classification, even when only amino acid sequence information is available.
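A hedged sketch of a best-hit classifier in the spirit described, assuming Biopython is available; the reference sequences and labels are toy values, and the scoring scheme is the aligner's default rather than the paper's:

```python
# Hypothetical sketch of a best-hit classifier: a query sequence inherits the
# label of its highest-scoring alignment in a labelled reference set.
from Bio import Align  # Biopython

reference = {  # toy labelled sequences standing in for the TIM40D/TIM95D sets
    "MKTAYIAKQR": "superfamily A",
    "GGHLVEALYL": "superfamily B",
}
query = "MKTAYIAKQL"

aligner = Align.PairwiseAligner()          # default global alignment scoring
best_hit = max(reference, key=lambda seq: aligner.score(query, seq))
print(reference[best_hit])                 # predicted classification
```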
Mycofier: a new machine learning-based classifier for fungal ITS sequences.
Delgado-Serrano, Luisa; Restrepo, Silvia; Bustos, Jose Ricardo; Zambrano, Maria Mercedes; Anzola, Juan Manuel
2016-08-11
The taxonomic and phylogenetic classification based on sequence analysis of the ITS1 genomic region has become a crucial component of fungal ecology and diversity studies. Currently, no accurate alignment-free classification tool exists for fungal ITS1 sequences in large environmental surveys. This study describes the development of a machine learning-based classifier for the taxonomic assignment of fungal ITS1 sequences at the genus level. A fungal ITS1 sequence database was built using curated data, and training and test sets were generated from it. A Naïve Bayesian classifier was built using features from the primary sequence, achieving 87% accuracy in classification at the genus level. The final model was based on a Naïve Bayes algorithm using ITS1 sequences from 510 fungal genera. This classifier, denoted Mycofier, provides classification accuracy similar to BLASTN, but the database used for the classification contains curated data, and the tool, being alignment-independent, is more efficient. It thus contributes to the field, given the lack of an accurate classification tool for large volumes of fungal ITS1 sequence data. The software and source code for Mycofier are freely available at https://github.com/ldelgado-serrano/mycofier.git.
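The abstract does not detail the primary-sequence features used; one common alignment-free choice is character k-mer counts, shown here feeding a multinomial Naïve Bayes model as a purely illustrative sketch:

```python
# Hypothetical alignment-free sketch: character k-mer counts from ITS1
# sequences feeding a multinomial Naive Bayes classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

sequences = ["ACGTACGTGG", "TTGACGGAAC", "ACGTACGAGG", "TTGACGGTAC"]  # toy ITS1 reads
genera = ["Fusarium", "Aspergillus", "Fusarium", "Aspergillus"]       # toy labels

model = Pipeline([
    ("kmers", CountVectorizer(analyzer="char", ngram_range=(5, 5))),  # 5-mer counts
    ("nb", MultinomialNB()),
])
model.fit(sequences, genera)
print(model.predict(["ACGTACGTAC"]))  # genus-level prediction for a new read
```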
NASA Astrophysics Data System (ADS)
Maas, A.; Alrajhi, M.; Alobeid, A.; Heipke, C.
2017-05-01
Updating topographic geospatial databases is often performed based on current remotely sensed images. To automatically extract object information (labels) from the images, supervised classifiers are employed. Decisions to be taken in this process concern the definition of the classes which should be recognised, the features to describe each class, and the training data necessary in the learning part of classification. With a view to large-scale topographic databases for fast-developing urban areas in the Kingdom of Saudi Arabia, we conducted a case study which investigated the following two questions: (a) which set of features is best suited for the classification?; (b) what is the added value of height information, e.g. derived from stereo imagery? Using stereoscopic GeoEye and Ikonos satellite data, we investigate these two questions based on our research on label-tolerant classification using logistic regression and partly incorrect training data. We show that between five and ten features can be recommended to obtain a stable solution, that height information consistently improves the overall classification accuracy by about 5%, and that label noise can be successfully modelled and thus only marginally influences the classification results.
PROTAX-Sound: A probabilistic framework for automated animal sound identification
de Camargo, Ulisses Moliterno; Somervuo, Panu; Ovaskainen, Otso
2017-01-01
Autonomous audio recording is a stimulating new field in bioacoustics, with great promise for conducting cost-effective species surveys. One major current challenge is the lack of reliable classifiers capable of multi-species identification. We present PROTAX-Sound, a statistical framework to perform probabilistic classification of animal sounds. PROTAX-Sound is based on a multinomial regression model, and it can utilize as predictors any kind of sound features or classifications produced by other existing algorithms. PROTAX-Sound combines audio and image processing techniques to scan environmental audio files. It identifies regions of interest (segments of an audio file that contain a vocalization to be classified), extracts acoustic features from them and compares them with samples in a reference database. The output of PROTAX-Sound is the probabilistic classification of each vocalization, including the possibility that it represents a species not present in the reference database. We demonstrate the performance of PROTAX-Sound by classifying audio from a species-rich case study of tropical birds. The best performing classifier achieved 68% classification accuracy for 200 bird species. PROTAX-Sound improves the classification power of current techniques by combining information from multiple classifiers in a manner that yields calibrated classification probabilities. PMID:28863178
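A minimal sketch of the core statistical idea, multinomial regression over acoustic features yielding per-class probabilities; the feature matrix and labels are random stand-ins, and PROTAX-Sound's calibration and unknown-species machinery is not reproduced:

```python
# Toy multinomial regression over acoustic features, giving per-species
# probabilities for each vocalization.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_train = rng.normal(size=(120, 8))     # acoustic features per vocalization
y_train = rng.integers(0, 4, size=120)  # species labels (class 0 = "unknown")

# The default lbfgs solver fits a multinomial (softmax) model for >2 classes.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(model.predict_proba(X_train[:3]))  # probabilistic classification
```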
Ground-based cloud classification by learning stable local binary patterns
NASA Astrophysics Data System (ADS)
Wang, Yu; Shi, Cunzhao; Wang, Chunheng; Xiao, Baihua
2018-07-01
Feature selection and extraction is the first step in implementing pattern classification, and the same is true for ground-based cloud classification. Histogram features based on local binary patterns (LBPs) are widely used to classify texture images. However, the conventional uniform LBP approach cannot capture all the dominant patterns in cloud texture images, resulting in low classification performance. In this study, a robust feature extraction method based on learning stable LBPs is proposed, using the averaged ranks of the occurrence frequencies of all rotation-invariant patterns defined in the LBPs of cloud images. The proposed method is validated on a ground-based cloud classification database comprising five cloud types. Experimental results demonstrate that the proposed method achieves significantly higher classification accuracy than the uniform LBP, local texture patterns (LTP), dominant LBP (DLBP), completed LBP (CLBP) and salient LBP (SaLBP) methods on this cloud image database and under different noise conditions. The performance of the proposed method is comparable with that of the popular deep convolutional neural network (DCNN) method, but with lower computational complexity. Furthermore, the proposed method also achieves superior performance on an independent test data set.
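For orientation, a sketch of the conventional uniform-LBP histogram baseline that the proposed stable-LBP learning improves upon; the image, neighbourhood size, and radius below are illustrative:

```python
# Baseline sketch (not the stable-LBP learning itself): uniform LBP
# histogram as a texture feature vector for a cloud image.
import numpy as np
from skimage.feature import local_binary_pattern

rng = np.random.default_rng(2)
image = rng.integers(0, 256, size=(128, 128)).astype(np.uint8)  # stand-in cloud image

P, R = 8, 1                                          # neighbours and radius
lbp = local_binary_pattern(image, P, R, method="uniform")
hist, _ = np.histogram(lbp, bins=np.arange(P + 3), density=True)  # P+2 uniform bins
print(hist)  # feature vector to feed a classifier
```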
A curated database of cyanobacterial strains relevant for modern taxonomy and phylogenetic studies.
Ramos, Vitor; Morais, João; Vasconcelos, Vitor M
2017-04-25
The dataset herein described lays the groundwork for an online database of relevant cyanobacterial strains, named CyanoType (http://lege.ciimar.up.pt/cyanotype). It is a database that includes categorized cyanobacterial strains useful for taxonomic, phylogenetic or genomic purposes, with associated information obtained by means of a literature-based curation. The dataset lists 371 strains and represents the first version of the database (CyanoType v.1). Information for each strain includes strain synonymy and/or co-identity, strain categorization, habitat, accession numbers for molecular data, taxonomy and nomenclature notes according to three different classification schemes, hierarchical automatic classification, phylogenetic placement according to a selection of relevant studies (including this), and important bibliographic references. The database will be updated periodically, namely by adding new strains meeting the criteria for inclusion and by revising and adding up-to-date metadata for strains already listed. A global 16S rDNA-based phylogeny is provided in order to assist users when choosing the appropriate strains for their studies. PMID:28440791
Reddy, T.B.K.; Thomas, Alex D.; Stamatis, Dimitri; Bertsch, Jon; Isbandi, Michelle; Jansson, Jakob; Mallajosyula, Jyothi; Pagani, Ioanna; Lobos, Elizabeth A.; Kyrpides, Nikos C.
2015-01-01
The Genomes OnLine Database (GOLD; http://www.genomesonline.org) is a comprehensive online resource to catalog and monitor genetic studies worldwide. GOLD provides up-to-date status on complete and ongoing sequencing projects along with a broad array of curated metadata. Here we report version 5 (v.5) of the database. The newly designed database schema and web user interface supports several new features including the implementation of a four level (meta)genome project classification system and a simplified intuitive web interface to access reports and launch search tools. The database currently hosts information for about 19 200 studies, 56 000 Biosamples, 56 000 sequencing projects and 39 400 analysis projects. More than just a catalog of worldwide genome projects, GOLD is a manually curated, quality-controlled metadata warehouse. The problems encountered in integrating disparate and varying quality data into GOLD are briefly highlighted. GOLD fully supports and follows the Genomic Standards Consortium (GSC) Minimum Information standards. PMID:25348402
Butcher, Jason T.; Stewart, Paul M.; Simon, Thomas P.
2003-01-01
Ninety-four sites were used to analyze the effects of two different classification strategies on the Benthic Community Index (BCI). The first, a priori classification, reflected the wetland status of the streams; the second, a posteriori classification, used a bio-environmental analysis to select classification variables. Both classifications were examined by measuring classification strength and testing differences in metric values with respect to group membership. The a priori (wetland) classification strength (83.3%) was greater than the a posteriori (bio-environmental) classification strength (76.8%). Both classifications found one metric that had significant differences between groups. The original index was modified to reflect the wetland classification by re-calibrating the scoring criteria for percent Crustacea and Mollusca. A proposed refinement to the original Benthic Community Index is suggested. This study shows the importance of using hypothesis-driven classifications, as well as exploratory statistical analysis, to evaluate alternative ways to reveal environmental variability in biological assessment tools.
SOM Classification of Martian TES Data
NASA Technical Reports Server (NTRS)
Hogan, R. C.; Roush, T. L.
2002-01-01
A classification scheme based on unsupervised self-organizing maps (SOM) is described. Results from its application to the ASU mineral spectral database are presented. Applications to the Martian Thermal Emission Spectrometer data are discussed. Additional information is contained in the original extended abstract.
Information extraction from Italian medical reports: An ontology-driven approach.
Viani, Natalia; Larizza, Cristiana; Tibollo, Valentina; Napolitano, Carlo; Priori, Silvia G; Bellazzi, Riccardo; Sacchi, Lucia
2018-03-01
In this work, we propose an ontology-driven approach to identify events and their attributes from episodes of care included in medical reports written in Italian. For this language, shared resources for clinical information extraction are not easily accessible. The corpus considered in this work includes 5432 non-annotated medical reports belonging to patients with rare arrhythmias. To guide the information extraction process, we built a domain-specific ontology that includes the events and the attributes to be extracted, with related regular expressions. The ontology and the annotation system were constructed on a development set, while the performance was evaluated on an independent test set. As a gold standard, we considered a manually curated hospital database named TRIAD, which stores most of the information written in reports. The proposed approach performs well on the considered Italian medical corpus, with a percentage of correct annotations above 90% for most considered clinical events. We also assessed the possibility to adapt the system to the analysis of another language (i.e., English), with promising results. Our annotation system relies on a domain ontology to extract and link information in clinical text. We developed an ontology that can be easily enriched and translated, and the system performs well on the considered task. In the future, it could be successfully used to automatically populate the TRIAD database. Copyright © 2017 Elsevier B.V. All rights reserved.
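A toy sketch of the ontology-driven pattern described: each event class carries a regular expression, and matches become annotations; the class names and patterns below are invented, not those of the authors' ontology:

```python
# Illustrative ontology-driven extraction: each event class in the ontology
# carries a regular expression; matches over the report become annotations.
import re

ontology = {  # hypothetical event classes and patterns
    "ECG_event": r"\bQTc?\s*(?:interval)?\s*(?:of)?\s*(\d{3})\s*ms\b",
    "Syncope_event": r"\bsyncop\w+\b",
}

report = "Patient reported syncope; ECG showed QTc interval of 470 ms."

annotations = [
    (event, m.group(0), m.span())
    for event, pattern in ontology.items()
    for m in re.finditer(pattern, report, flags=re.IGNORECASE)
]
print(annotations)  # (event class, matched text, character offsets)
```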
Ontology driven modeling for the knowledge of genetic susceptibility to disease.
Lin, Yu; Sakamoto, Norihiro
2009-05-12
For machine-assisted exploration of the relationships between genetic factors and complex diseases, a well-structured conceptual framework of the background knowledge is needed. However, because of the complexity of determining a genetic susceptibility factor, no formalization exists for the knowledge of genetic susceptibility to disease, which makes interoperability between systems impossible. Thus, the ontology modeling language OWL was used for formalization in this paper. After introducing the Semantic Web and the OWL language propagated by the W3C, we applied text mining technology combined with competency questions to specify the classes of the ontology. Then, an N-ary pattern was adopted to describe the relationships among these defined classes. Based on the former work on OGSF-DM (Ontology of Genetic Susceptibility Factors to Diabetes Mellitus), we formalized the definitions of "Genetic Susceptibility", "Genetic Susceptibility Factor" and other classes using the OWL-DL modeling language, and a reasoner automatically performed the classification of the class "Genetic Susceptibility Factor". Ontology-driven modeling is thus used to formalize the knowledge of genetic susceptibility to complex diseases. More importantly, once a class has been completely formalized in an ontology, OWL reasoning can automatically compute its classification; in our case, the class of "Genetic Susceptibility Factors". As more types of genetic susceptibility factors are obtained from laboratory research, our ontologies will need continual refinement, and many new classes must be taken into account to harmonize with the ontologies. Applying the ontologies to semantic web development remains future work.
Hailstone classifier based on Rough Set Theory
NASA Astrophysics Data System (ADS)
Wan, Huisong; Jiang, Shuming; Wei, Zhiqiang; Li, Jian; Li, Fengjiao
2017-09-01
Rough Set Theory was used for the construction of a hailstone classifier. First, a database of radar image features was constructed; this involved transforming the base data returned by the Doppler radar into a viewable bitmap format. Then, through image processing, color, texture, shape and other features were extracted and saved as the characteristic database, providing data support for the follow-up work. Second, using Rough Set Theory, a classifier was built to automatically classify the hailstone samples.
A Study of Hand Back Skin Texture Patterns for Personal Identification and Gender Classification
Xie, Jin; Zhang, Lei; You, Jane; Zhang, David; Qu, Xiaofeng
2012-01-01
Human hand back skin texture (HBST) is often consistent for a person and distinctive from person to person. In this paper, we study the HBST pattern recognition problem with applications to personal identification and gender classification. A specially designed system was developed to capture HBST images, and an HBST image database was established, consisting of 1,920 images from 80 persons (160 hands). An efficient texton-learning-based method is then presented to classify the HBST patterns. First, textons are learned in the space of filter bank responses from a set of training images using the l1-minimization-based sparse representation (SR) technique. Then, under the SR framework, we represent the feature vector at each pixel over the learned dictionary to construct a representation coefficient histogram. Finally, the coefficient histogram is used as the skin texture feature for classification. Experiments on personal identification and gender classification were performed using the established HBST database. The results show that HBST can be used to assist human identification and gender classification. PMID:23012512
ERIC Educational Resources Information Center
Ertel, Monica M.
1984-01-01
This discussion of current microcomputer technologies available to libraries focuses on software applications in four major classifications: communications (online database searching); word processing; administration; and database management systems. Specific examples of library applications are given and six references are cited. (EJS)
Gehrmann, Sebastian; Dernoncourt, Franck; Li, Yeran; Carlson, Eric T; Wu, Joy T; Welt, Jonathan; Foote, John; Moseley, Edward T; Grant, David W; Tyler, Patrick D; Celi, Leo A
2018-01-01
In secondary analysis of electronic health records, a crucial task consists in correctly identifying the patient cohort under investigation. In many cases, the most valuable and relevant information for an accurate classification of medical conditions exists only in clinical narratives. Therefore, it is necessary to use natural language processing (NLP) techniques to extract and evaluate these narratives. The most commonly used approach to this problem relies on extracting a number of clinician-defined medical concepts from text and using machine learning techniques to identify whether a particular patient has a certain condition. However, recent advances in deep learning and NLP enable models to learn a rich representation of (medical) language. Convolutional neural networks (CNNs) for text classification can augment the existing techniques by leveraging the representation of language to learn which phrases in a text are relevant for a given medical condition. In this work, we compare concept-extraction-based methods with CNNs and other commonly used models in NLP on ten phenotyping tasks using 1,610 discharge summaries from the MIMIC-III database. We show that CNNs outperform concept-extraction-based methods in almost all of the tasks, with an improvement of up to 26 percentage points in F1-score and up to 7 percentage points in area under the ROC curve (AUC). We additionally assess the interpretability of both approaches by presenting and evaluating methods that calculate and extract the most salient phrases for a prediction. The results indicate that CNNs are a valid alternative to existing approaches in patient phenotyping and cohort identification, and should be further investigated. Moreover, the deep learning approach presented in this paper can be used to assist clinicians during chart review or support the extraction of billing codes from text by identifying and highlighting relevant phrases for various medical conditions.
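A hedged sketch of a CNN text classifier of the kind compared in this work; vocabulary size, sequence length, and all hyperparameters are invented for illustration:

```python
# Illustrative CNN text classifier for phenotyping from tokenized narratives.
import tensorflow as tf

vocab_size, seq_len = 20000, 500                       # tokenized discharge summaries
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(seq_len,)),
    tf.keras.layers.Embedding(vocab_size, 100),
    tf.keras.layers.Conv1D(128, 5, activation="relu"),  # filters respond to 5-word phrases
    tf.keras.layers.GlobalMaxPooling1D(),               # keeps the most salient phrase response
    tf.keras.layers.Dense(1, activation="sigmoid"),     # probability the condition is present
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
model.summary()
```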
Marafino, Ben J; Boscardin, W John; Dudley, R Adams
2015-04-01
Sparsity is often a desirable property of statistical models, and various feature selection methods exist to yield sparser, more interpretable models. However, their application to biomedical text classification, particularly to mortality risk stratification among intensive care unit (ICU) patients, has not been thoroughly studied. Our goals were to develop and characterize sparse classifiers based on the free text of nursing notes in order to predict ICU mortality risk and to discover the text features most strongly associated with mortality. We selected nursing notes from the first 24 h of ICU admission for 25,826 adult ICU patients from the MIMIC-II database. We then developed a pair of stochastic gradient descent-based classifiers with elastic-net regularization. We also studied the performance-sparsity tradeoffs of both classifiers as their regularization parameters were varied. The best-performing classifier achieved a 10-fold cross-validated AUC of 0.897 under the log loss function and full L2 regularization, while full L1 regularization used just 0.00025% of candidate input features and resulted in an AUC of 0.889. Using the log loss (range of AUCs 0.889-0.897) yielded better performance compared to the hinge loss (0.850-0.876), but the latter yielded even sparser models. Most features selected by both classifiers appear clinically relevant and correspond to predictors already present in existing ICU mortality models. The sparser classifiers were also able to discover a number of informative, albeit nonclinical, features. The elastic-net-regularized classifiers perform reasonably well and are capable of reducing the number of features required by over a thousandfold, with only a modest impact on performance. Copyright © 2015 Elsevier Inc. All rights reserved.
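A minimal sketch of this setup, assuming scikit-learn: an SGD-trained logistic model with elastic-net regularization over bag-of-words features, where l1_ratio trades performance against sparsity; the toy notes and labels are invented:

```python
# Elastic-net-regularized, SGD-trained logistic model over bag-of-words
# features from nursing notes; l1_ratio controls the sparsity tradeoff.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

notes = ["pt stable overnight ...", "worsening resp distress ...",
         "family meeting held ...", "intubated, pressors started ..."]
died = [0, 1, 0, 1]  # toy ICU mortality labels

model = Pipeline([
    ("bow", CountVectorizer()),
    ("sgd", SGDClassifier(loss="log_loss", penalty="elasticnet",
                          l1_ratio=0.5, alpha=1e-4)),
])
model.fit(notes, died)
nonzero = (model.named_steps["sgd"].coef_ != 0).sum()
print(f"{nonzero} features with non-zero weight")  # sparsity of the fitted model
```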
URS DataBase: universe of RNA structures and their motifs.
Baulin, Eugene; Yacovlev, Victor; Khachko, Denis; Spirin, Sergei; Roytberg, Mikhail
2016-01-01
The Universe of RNA Structures DataBase (URSDB) stores information obtained from all RNA-containing PDB entries (2935 entries in October 2015). The content of the database is updated regularly. The database consists of 51 tables containing indexed data on various elements of the RNA structures. The database provides a web interface allowing the user to select a subset of structures with desired features and to obtain various statistical data for a selected subset of structures or for all structures. In particular, one can easily obtain statistics on geometric parameters of base pairs, on structural motifs (stems, loops, etc.) or on different types of pseudoknots. The user can also view and get information on an individual structure or its selected parts, e.g. RNA-protein hydrogen bonds. URSDB employs a new original definition of loops in RNA structures. That definition fits both pseudoknot-free and pseudoknotted secondary structures and coincides with the classical definition in the case of pseudoknot-free structures. To our knowledge, URSDB is the first database supporting searches based on topological classification of pseudoknots and on extended loop classification. Database URL: http://server3.lpm.org.ru/urs/. © The Author(s) 2016. Published by Oxford University Press. PMID:27242032
Classifying environmental pollutants: Part 3. External validation of the classification system.
Verhaar, H J; Solbé, J; Speksnijder, J; van Leeuwen, C J; Hermens, J L
2000-04-01
In order to validate a classification system for the prediction of the toxic effect concentrations of organic environmental pollutants to fish, all available fish acute toxicity data were retrieved from the ECETOC database, a database of quality-evaluated aquatic toxicity measurements created and maintained by the European Centre for the Ecotoxicology and Toxicology of Chemicals. The individual chemicals for which these data were available were classified according to the rulebase under consideration and predictions of effect concentrations or ranges of possible effect concentrations were generated. These predictions were compared to the actual toxicity data retrieved from the database. The results of this comparison show that generally, the classification system provides adequate predictions of either the aquatic toxicity (class 1) or the possible range of toxicity (other classes) of organic compounds. A slight underestimation of effect concentrations occurs for some highly water soluble, reactive chemicals with low log K(ow) values. On the other end of the scale, some compounds that are classified as belonging to a relatively toxic class appear to belong to the so-called baseline toxicity compounds. For some of these, additional classification rules are proposed. Furthermore, some groups of compounds cannot be classified, although they should be amenable to predictions. For these compounds additional research as to class membership and associated prediction rules is proposed.
Classification of Land Cover and Land Use Based on Convolutional Neural Networks
NASA Astrophysics Data System (ADS)
Yang, Chun; Rottensteiner, Franz; Heipke, Christian
2018-04-01
Land cover describes the physical material of the earth's surface, whereas land use describes the socio-economic function of a piece of land. Land use information is typically collected in geospatial databases. As such databases become outdated quickly, an automatic update process is required. This paper presents a new approach to determine land cover and to classify land use objects based on convolutional neural networks (CNNs). The input data are aerial images and derived data such as digital surface models. Firstly, we apply a CNN to determine the land cover for each pixel of the input image. We compare different CNN structures, all of them based on an encoder-decoder architecture for obtaining dense class predictions. Secondly, we propose a new CNN-based methodology for the prediction of the land use label of objects from a geospatial database. In this context, we present a strategy for generating image patches of identical size from the input data, which are then classified by a CNN. Again, we compare different CNN architectures. Our experiments show that an overall accuracy of up to 85.7% and 77.4% can be achieved for land cover and land use, respectively. The land cover classification also makes a positive contribution to the land use classification.
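A toy encoder-decoder CNN for dense per-pixel land cover prediction, sketched with Keras under invented channel, depth, and class counts; it is not the architecture evaluated in the paper:

```python
# Toy encoder-decoder CNN producing dense (per-pixel) class predictions.
import tensorflow as tf

n_classes = 6                                          # e.g. building, road, crop, ...
inputs = tf.keras.Input(shape=(256, 256, 4))           # image bands + surface model
x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = tf.keras.layers.MaxPooling2D()(x)                  # encoder: shrink, abstract
x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = tf.keras.layers.UpSampling2D()(x)                  # decoder: restore resolution
outputs = tf.keras.layers.Conv2D(n_classes, 1, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)                # dense land-cover labels
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```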
The impact of database quality on keystroke dynamics authentication
NASA Astrophysics Data System (ADS)
Panasiuk, Piotr; Rybnik, Mariusz; Saeed, Khalid; Rogowski, Marcin
2016-06-01
This paper concerns keystroke dynamics, also partially in the context of touchscreen devices. The authors concentrate on the impact of database quality and propose their algorithm to test database quality issues. The algorithm is used on their own…
Brorson, Stig
2011-04-01
The diagnosis and treatment of fractures of the proximal humerus have troubled patients and medical practitioners since antiquity. Preradiographic diagnosis relied on surface anatomy, pain localization, crepitus, and impaired function. During the nineteenth century, a more thorough understanding of the pathoanatomy and pathophysiology of proximal humeral fractures was obtained, and new methods of reduction and bandaging were developed. I reviewed nineteenth-century principles of (1) diagnosis, (2) classification, (3) reduction, (4) bandaging, and (5) concepts of displacement in fractures of the proximal humerus. A narrative review of nineteenth-century surgical texts is presented. Sources were identified by searching bibliographic databases, orthopaedic sourcebooks, textbooks in medical history, and a subsequent hand search. Substantial progress in understanding fractures of the proximal humerus is found in nineteenth-century textbooks. A rational approach to understanding fractures of the proximal humerus was made possible by an appreciation of the underlying functional anatomy and subsequent pathoanatomy. Thus, new principles of diagnosis, pathoanatomic classifications, modified methods of reduction, functional bandaging, and advanced concepts of displacement were proposed, challenging the classic management adhered to for more than 2000 years. The principles for modern pathoanatomic and pathophysiologic understanding of proximal humeral fractures and the principles for classification, nonsurgical treatment, and bandaging were established in the preradiographic era.
Data-driven advice for applying machine learning to bioinformatics problems
Olson, Randal S.; La Cava, William; Mustahsan, Zairah; Varik, Akshay; Moore, Jason H.
2017-01-01
As the bioinformatics field grows, it must keep pace not only with new data but with new algorithms. Here we contribute a thorough analysis of 13 state-of-the-art, commonly used machine learning algorithms on a set of 165 publicly available classification problems in order to provide data-driven algorithm recommendations to current researchers. We present a number of statistical and visual comparisons of algorithm performance and quantify the effect of model selection and algorithm tuning for each algorithm and dataset. The analysis culminates in the recommendation of five algorithms with hyperparameters that maximize classifier performance across the tested problems, as well as general guidelines for applying machine learning to supervised classification problems. PMID:29218881
A Data Augmentation Approach to Short Text Classification
ERIC Educational Resources Information Center
Rosario, Ryan Robert
2017-01-01
Text classification typically performs best with large training sets, but short texts are very common on the World Wide Web. Can we use resampling and data augmentation to construct larger texts using similar terms? Several current methods exist for working with short text that rely on using external data and contexts, or workarounds. Our focus is…
Rule-driven defect detection in CT images of hardwood logs
Erol Sarigul; A. Lynn Abbott; Daniel L. Schmoldt
2000-01-01
This paper deals with automated detection and identification of internal defects in hardwood logs using computed tomography (CT) images. We have developed a system that employs artificial neural networks to perform tentative classification of logs on a pixel-by-pixel basis. This approach achieves a high level of classification accuracy for several hardwood species (...
Tufts Health Sciences Database: Lessons, Issues, and Opportunities.
ERIC Educational Resources Information Center
Lee, Mary Y.; Albright, Susan A.; Alkasab, Tarik; Damassa, David A.; Wang, Paul J.; Eaton, Elizabeth K.
2003-01-01
Describes a seven-year experience with developing the Tufts Health Sciences Database, a database-driven information management system that combines the strengths of a digital library, content delivery tools, and curriculum management. Identifies major effects on teaching and learning. Also addresses issues of faculty development, copyright and…
Construction of databases: advances and significance in clinical research.
Long, Erping; Huang, Bingjie; Wang, Liming; Lin, Xiaoyu; Lin, Haotian
2015-12-01
Widely used in clinical research, the database is a new type of data-management automation technology and the most efficient tool for data management. In this article, we first explain some basic concepts, such as the definition, classification, and establishment of databases. Afterward, the workflow for establishing databases, inputting data, verifying data, and managing databases is presented. Meanwhile, by discussing the application of databases in clinical research, we illustrate the important role of databases in clinical research practice. Lastly, we introduce the reanalysis of randomized controlled trials (RCTs) and cloud computing techniques, showing the most recent advancements of databases in clinical research.
Improving Recall Using Database Management Systems: A Learning Strategy.
ERIC Educational Resources Information Center
Jonassen, David H.
1986-01-01
Describes the use of microcomputer database management systems to facilitate the instructional uses of learning strategies relating to information processing skills, especially recall. Two learning strategies, cross-classification matrixing and node acquisition and integration, are highlighted. (Author/LRW)
Ramanauskaite, Ausra; Juodzbalys, Gintaras
2016-01-01
To review and summarize the literature concerning peri-implantitis diagnostic parameters and to propose guidelines for peri-implantitis diagnosis. An electronic literature search was conducted of the MEDLINE (Ovid) and EMBASE databases for articles published between 2011 and 2016. Sequential screening at the title/abstract and full-text levels was performed. Systematic reviews/guidelines of consensus conferences proposing classification or suggesting diagnostic parameters for peri-implantitis in the English language were included. The review was registered in the PROSPERO system under code CRD42016033287. The search resulted in 10 articles that met the inclusion criteria. Four were papers from consensus conferences, two recommended diagnostic guidelines, three proposed classifications of peri-implantitis, and one suggested an index for implant success. The following parameters were suggested for peri-implantitis diagnosis: pain, mobility, bleeding on probing, probing depth, suppuration/exudate, and radiographic bone loss. In all of the papers, different definitions of peri-implantitis or implant success, as well as different thresholds for the above-mentioned clinical and radiographic parameters, were used. Based on current evidence, a rationale for the diagnosis of peri-implantitis and a classification based on consecutive evaluation of soft-tissue conditions and the amount of bone loss are suggested. Currently there is no single uniform definition of peri-implantitis or of the parameters that should be used. A rationale for the diagnosis and prognosis of peri-implantitis, as well as a classification of the disease, is proposed.
A Semi-supervised Heat Kernel Pagerank MBO Algorithm for Data Classification
2016-07-01
...financial predictions, etc., and is finding growing use in text mining studies. In this paper, we present an efficient algorithm for classification of high... video data, sets of images, hyperspectral data, medical data, text data, etc. Moreover, the framework provides a way to analyze data whose different... also be incorporated. For text classification, one can use tf-idf (term frequency-inverse document frequency) to form feature vectors for each document
Multiple Spectral-Spatial Classification Approach for Hyperspectral Data
NASA Technical Reports Server (NTRS)
Tarabalka, Yuliya; Benediktsson, Jon Atli; Chanussot, Jocelyn; Tilton, James C.
2010-01-01
A new multiple classifier approach for spectral-spatial classification of hyperspectral images is proposed. Several classifiers are used independently to classify an image. For every pixel, if all the classifiers have assigned this pixel to the same class, the pixel is kept as a marker, i.e., a seed of the spatial region, with the corresponding class label. We propose to use spectral-spatial classifiers at the preliminary step of the marker selection procedure, each of them combining the results of a pixel-wise classification and a segmentation map. Different segmentation methods based on dissimilar principles lead to different classification results. Furthermore, a minimum spanning forest is built, where each tree is rooted on a classification-driven marker and forms a region in the spectral-spatial classification map. Experimental results are presented for two hyperspectral airborne images. The proposed method significantly improves classification accuracies, when compared to previously proposed classification techniques.
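A toy sketch of the marker-selection step described above: a pixel becomes a marker only if every independent classifier assigns it the same label; the label maps here are random stand-ins:

```python
# Toy marker selection: keep a pixel as a seed only when all independent
# classifiers agree on its class label.
import numpy as np

rng = np.random.default_rng(3)
maps = rng.integers(0, 3, size=(3, 4, 4))     # three classifiers' 4x4 label maps

agree = np.all(maps == maps[0], axis=0)       # where every classifier agrees
markers = np.where(agree, maps[0], -1)        # -1 marks undecided pixels
print(markers)  # seeds for growing the minimum spanning forest
```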
Improved data retrieval from TreeBASE via taxonomic and linguistic data enrichment
Anwar, Nadia; Hunt, Ela
2009-01-01
Background TreeBASE, the only data repository for phylogenetic studies, is not being used effectively since it does not meet the taxonomic data retrieval requirements of the systematics community. We show, through an examination of the queries performed on TreeBASE, that data retrieval using taxon names is unsatisfactory. Results We report on a new wrapper supporting taxon queries on TreeBASE by utilising a Taxonomy and Classification Database (TCl-Db) we created. TCl-Db holds merged and consolidated taxonomic names from multiple data sources and can be used to translate hierarchical, vernacular and synonym queries into specific query terms in TreeBASE. The query expansion supported by TCl-Db shows very significant improvement in information retrieval quality. The wrapper can be accessed at the URL . The methodology we developed is scalable and can be applied to new data, as those become available in the future. Conclusion Significantly improved data retrieval quality is shown for all queries, and additional flexibility is achieved via user-driven taxonomy selection. PMID:19426482
Structural health monitoring feature design by genetic programming
NASA Astrophysics Data System (ADS)
Harvey, Dustin Y.; Todd, Michael D.
2014-09-01
Structural health monitoring (SHM) systems provide real-time damage and performance information for civil, aerospace, and other high-capital or life-safety critical structures. Conventional data processing involves pre-processing and extraction of low-dimensional features from in situ time series measurements. The features are then input to a statistical pattern recognition algorithm to perform the relevant classification or regression task necessary to facilitate decisions by the SHM system. Traditional design of signal processing and feature extraction algorithms can be an expensive and time-consuming process requiring extensive system knowledge and domain expertise. Genetic programming, a heuristic program search method from evolutionary computation, was recently adapted by the authors to perform automated, data-driven design of signal processing and feature extraction algorithms for statistical pattern recognition applications. The proposed method, called Autofead, is particularly suitable to handle the challenges inherent in algorithm design for SHM problems where the manifestation of damage in structural response measurements is often unclear or unknown. Autofead mines a training database of response measurements to discover information-rich features specific to the problem at hand. This study provides experimental validation on three SHM applications including ultrasonic damage detection, bearing damage classification for rotating machinery, and vibration-based structural health monitoring. Performance comparisons with common feature choices for each problem area are provided demonstrating the versatility of Autofead to produce significant algorithm improvements on a wide range of problems.
Value Driven Information Processing and Fusion
2016-03-01
The objective of the project is to develop a general framework for value-driven decentralized information processing, including: optimal data reduction in a network setting for decentralized inference with quantization constraints; interactive fusion that allows queries and... A consensus approach allows a decentralized approach to achieve the optimal error exponent of the centralized counterpart, a conclusion that is signifi...
Task-Driven Dictionary Learning Based on Mutual Information for Medical Image Classification.
Diamant, Idit; Klang, Eyal; Amitai, Michal; Konen, Eli; Goldberger, Jacob; Greenspan, Hayit
2017-06-01
We present a novel variant of the bag-of-visual-words (BoVW) method for automated medical image classification. Our approach improves the BoVW model by learning a task-driven dictionary of the most relevant visual words per task using a mutual information-based criterion. Additionally, we generate relevance maps to visualize and localize the decision of the automatic classification algorithm. These maps demonstrate how the algorithm works and show the spatial layout of the most relevant words. We applied our algorithm to three different tasks: chest x-ray pathology identification (of four pathologies: cardiomegaly, enlarged mediastinum, right consolidation, and left consolidation), liver lesion classification into four categories in computed tomography (CT) images, and classification of benign/malignant clusters of microcalcifications (MCs) in breast mammograms. Validation was conducted on three datasets: 443 chest x-rays, 118 portal-phase CT images of liver lesions, and 260 mammography MCs. The proposed method improves on the classical BoVW method for all tested applications. For chest x-ray, an area under the curve (AUC) of 0.876 was obtained for enlarged mediastinum identification, compared to 0.855 using classical BoVW (p-value 0.01). For MC classification, a significant improvement of 4% was achieved using our new approach (p-value 0.03). For liver lesion classification, improvements of 6% in sensitivity and 2% in specificity were obtained (p-value 0.001). We demonstrated that classification based on an informative selected set of words results in significant improvement. Our new BoVW approach shows promising results in clinically important domains. Additionally, it can discover relevant parts of images for the task at hand without explicit annotations for training data. This can provide computer-aided support for medical experts in challenging image analysis tasks.
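An illustrative BoVW sketch: local descriptors are quantized into visual words with k-means, each image becomes a word histogram, and mutual information ranks words by task relevance; the descriptors, labels, and dictionary size are invented, and the paper's task-driven dictionary learning itself is not reproduced:

```python
# Illustrative bag-of-visual-words pipeline with mutual-information ranking.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(4)
descriptors = rng.normal(size=(1000, 64))            # local patch descriptors
words = KMeans(n_clusters=50, n_init=10, random_state=0).fit(descriptors)

def bovw_histogram(image_descriptors):
    """Histogram of visual-word assignments for one image."""
    assignments = words.predict(image_descriptors)
    return np.bincount(assignments, minlength=50)

X = np.stack([bovw_histogram(rng.normal(size=(120, 64))) for _ in range(40)])
y = rng.integers(0, 2, size=40)                      # e.g. pathology present/absent
relevance = mutual_info_classif(X, y)                # task-driven word ranking
print(relevance.argsort()[::-1][:10])                # ten most relevant words
```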
ERIC Educational Resources Information Center
Lundquist, Carol; Frieder, Ophir; Holmes, David O.; Grossman, David
1999-01-01
Describes a scalable, parallel, relational database-driven information retrieval engine. To support portability across a wide range of execution environments, all algorithms adhere to the SQL-92 standard. By incorporating relevance feedback algorithms, accuracy is enhanced over prior database-driven information retrieval efforts. Presents…
Shift-invariant discrete wavelet transform analysis for retinal image classification.
Khademi, April; Krishnan, Sridhar
2007-12-01
This work involves retinal image classification, for which a novel analysis system was developed. From the compressed domain, the proposed scheme extracts textural features from wavelet coefficients, which describe the relative homogeneity of localized areas of the retinal images. Since the discrete wavelet transform (DWT) is shift-variant, a shift-invariant DWT was explored to ensure that a robust feature set was extracted. To combat the small database size, linear discriminant analysis classification was used with the leave-one-out method. 38 normal and 48 abnormal images (exudates, large drusen, fine drusen, choroidal neovascularization, central vein and artery occlusion, histoplasmosis, arteriosclerotic retinopathy, hemi-central retinal vein occlusion and more) were used, and a specificity of 79% and sensitivity of 85.4% were achieved (the average classification rate is 82.2%). The success of the system can be attributed to the highly robust feature set, which included translation, scale and semi-rotation invariant features. Additionally, this technique is database independent, since the features were specifically tuned to the pathologies of the human eye.
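A hedged sketch of shift-invariant wavelet features via the stationary (undecimated) wavelet transform in PyWavelets; subband energies stand in for the homogeneity-style texture features described, and the wavelet and decomposition level are illustrative:

```python
# Shift-invariant wavelet texture features: subband energies from the
# stationary wavelet transform of an image.
import numpy as np
import pywt

rng = np.random.default_rng(5)
image = rng.random((128, 128))                   # stand-in retinal image
coeffs = pywt.swt2(image, "db2", level=2)        # shift-invariant 2-D DWT

features = []
for cA, (cH, cV, cD) in coeffs:
    for band in (cH, cV, cD):
        features.append(np.mean(band ** 2))      # subband energy (homogeneity cue)
print(features)                                  # feature vector for a classifier
```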
Gender classification from video under challenging operating conditions
NASA Astrophysics Data System (ADS)
Mendoza-Schrock, Olga; Dong, Guozhu
2014-06-01
The literature is abundant with papers on gender classification research. However, the majority of such research is based on the assumption that there is enough resolution so that the subject's face can be resolved; hence the majority of the research is actually in the face recognition and facial feature area. A gap exists for gender classification under challenging operating conditions—different seasonal conditions, different clothing, etc.—and when the subject's face cannot be resolved due to lack of resolution. The Seasonal Weather and Gender (SWAG) Database is a novel database that contains subjects walking through a scene under operating conditions that span a calendar year. This paper exploits a subset of that database—the SWAG One dataset—using data mining techniques, traditional classifiers (e.g., Naïve Bayes, Support Vector Machine), and both traditional (Canny edge detection, etc.) and non-traditional (height/width ratios, etc.) feature extractors to achieve high correct gender classification rates (greater than 85%). Another novelty is the exploitation of frame differentials.
Rudi, Knut; Kleiberg, Gro H; Heiberg, Ragnhild; Rosnes, Jan T
2007-08-01
The aim of this work was to evaluate restriction fragment melting curve analysis (RFMCA) as a novel approach for rapid classification of bacteria during food production. RFMCA was evaluated for bacteria isolated from sous vide food products and raw materials used for sous vide production. We identified four major bacterial groups in the material analysed (cluster I-Streptococcus, cluster II-Carnobacterium/Bacillus, cluster III-Staphylococcus and cluster IV-Actinomycetales). The accuracy of RFMCA was evaluated by comparison with 16S rDNA sequencing. The strains satisfying the RFMCA quality filtering criteria (73%, n=57) for which both 16S rDNA sequence information and RFMCA data were available (n=45) gave identical group assignments with the two methods. RFMCA enabled rapid and accurate classification of bacteria that is database compatible. Potential applications of RFMCA in the food or pharmaceutical industry include the development of classification models for the bacteria expected in a given product, followed by building an RFMCA database as part of the product quality control.
Classification of ion mobility spectra by functional groups using neural networks
NASA Technical Reports Server (NTRS)
Bell, S.; Nazarov, E.; Wang, Y. F.; Eiceman, G. A.
1999-01-01
Neural networks were trained using whole ion mobility spectra from a standardized database of 3137 spectra for 204 chemicals at various concentrations. Performance of the network was measured by the success of classification into ten chemical classes. Eleven stages for evaluation of spectra and of spectral pre-processing were employed and minimums established for response thresholds and spectral purity. After optimization of the database, network, and pre-processing routines, the fraction of successful classifications by functional group was 0.91 throughout a range of concentrations. Network classification relied on a combination of features, including drift times, number of peaks, relative intensities, and other factors apparently including peak shape. The network was opportunistic, exploiting different features within different chemical classes. Application of neural networks in a two-tier design where chemicals were first identified by class and then individually eliminated all but one false positive out of 161 test spectra. These findings establish that ion mobility spectra, even with low resolution instrumentation, contain sufficient detail to permit the development of automated identification systems.
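A minimal sketch of the first tier (whole-spectrum classification into functional-group classes) using a generic multilayer perceptron; the spectra and labels are random stand-ins, and the network is not the one used in the study:

```python
# Tier 1 of a two-tier design: classify whole spectra into chemical classes;
# a per-class second-tier model could then identify individual compounds.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(6)
spectra = rng.random(size=(400, 200))        # whole ion mobility spectra (drift bins)
chem_class = rng.integers(0, 10, size=400)   # ten functional-group classes

tier1 = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
tier1.fit(spectra, chem_class)
print(tier1.predict(spectra[:5]))            # predicted chemical classes
```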
Chung, Cecilia P; Rohan, Patricia; Krishnaswami, Shanthi; McPheeters, Melissa L
2013-12-30
To review the evidence supporting the validity of billing, procedural, or diagnosis code, or pharmacy claim-based algorithms used to identify patients with rheumatoid arthritis (RA) in administrative and claims databases. We searched the MEDLINE database from 1991 to September 2012 using controlled vocabulary and key terms related to RA, and the reference lists of included studies were searched. Two investigators independently assessed the full text of studies against pre-determined inclusion criteria and extracted the data. Data collected included participant and algorithm characteristics. Nine studies reported validation of computer algorithms based on International Classification of Diseases (ICD) codes with or without free text, medication use, laboratory data, and the need for a diagnosis by a rheumatologist. These studies yielded positive predictive values (PPVs) ranging from 34% to 97% for identifying patients with RA. Higher PPVs were obtained with the use of at least two ICD and/or procedure codes (ICD-9 code 714 and others), the requirement of a prescription for a medication used to treat RA, or the requirement of a rheumatologist's participation in patient care. For example, the PPV increased from 66% to 97% when the use of disease-modifying antirheumatic drugs and the presence of a positive rheumatoid factor were required. There have been substantial efforts to propose and validate algorithms to identify patients with RA in automated databases. Algorithms that include more than one code and incorporate medications or laboratory data and/or require a diagnosis by a rheumatologist may increase the PPV. Copyright © 2013 Elsevier Ltd. All rights reserved.
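A minimal sketch of the kind of claims-based case-finding rule reviewed above (at least two ICD-9 714.x codes plus a disease-modifying antirheumatic drug prescription); the table layout, column names, and drug list are hypothetical.

# Sketch: a claims-based RA case-finding rule of the kind reviewed here:
# at least two ICD-9 714.x diagnosis codes plus a DMARD prescription.
# Column and drug names are hypothetical placeholders.
import pandas as pd

claims = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3, 3],
    "icd9":       ["714.0", "714.0", "714.0", "714.2", "714.0", "250.0"],
})
rx = pd.DataFrame({
    "patient_id": [1, 3],
    "drug":       ["methotrexate", "hydroxychloroquine"],
})
DMARDS = {"methotrexate", "hydroxychloroquine", "sulfasalazine"}

ra_codes = claims[claims["icd9"].str.startswith("714")]
two_codes = ra_codes.groupby("patient_id").size() >= 2
on_dmard = set(rx[rx["drug"].isin(DMARDS)]["patient_id"])

cases = [pid for pid, ok in two_codes.items() if ok and pid in on_dmard]
print("algorithm-identified RA cases:", cases)   # -> [1, 3]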
ACLAME: a CLAssification of Mobile genetic Elements, update 2010.
Leplae, Raphaël; Lima-Mendez, Gipsi; Toussaint, Ariane
2010-01-01
The ACLAME database is dedicated to the collection, analysis and classification of sequenced mobile genetic elements (MGEs, in particular phages and plasmids). In addition to providing information on the MGEs content, classifications are available at various levels of organization. At the gene/protein level, families group similar sequences that are expected to share the same function. Families of four or more proteins are manually assigned with a functional annotation using the GeneOntology and the locally developed ontology MeGO dedicated to MGEs. At the genome level, evolutionary cohesive modules group sets of protein families shared among MGEs. At the population level, networks display the reticulate evolutionary relationships among MGEs. To increase the coverage of the phage sequence space, ACLAME version 0.4 incorporates 760 high-quality predicted prophages selected from the Prophinder database. Most of the data can be downloaded from the freely accessible ACLAME web site (http://aclame.ulb.ac.be). The BLAST interface for querying the database has been extended and numerous tools for in-depth analysis of the results have been added.
The Impact of Data-Based Science Instruction on Standardized Test Performance
NASA Astrophysics Data System (ADS)
Herrington, Tia W.
Increased teacher accountability efforts have resulted in the use of data to improve student achievement. This study addressed teachers' inconsistent use of data-driven instruction in middle school science. Evidence of the impact of data-based instruction on student achievement and on school and district practices has been well documented by researchers. In science, less information has been available on teachers' use of data for classroom instruction. Drawing on data-driven decision-making theory, the purpose of this study was to examine whether data-based instruction affected performance on the science Criterion-Referenced Competency Test (CRCT) and to explore the factors that impeded its use by a purposeful sample of 12 science teachers at a data-driven school. The research questions addressed in this study included understanding: (a) the association between student performance on the science portion of the CRCT and professional development in data-driven instruction, (b) middle school science teachers' perception of the usefulness of data, and (c) the factors that hindered the use of data for science instruction. This study employed a mixed-methods sequential explanatory design. Data collected included 8th-grade CRCT data, survey responses, and individual teacher interviews. A chi-square test revealed no improvement in the CRCT scores following the implementation of professional development on data-driven instruction (χ²(1) = 0.183, p = .67). Results from surveys and interviews revealed that teachers used data to inform their instruction but identified time as the major hindrance to its use. Implications for social change include the development of lesson plans that will empower science teachers to deliver data-based instruction and students to achieve identified academic goals.
Gupta, Priyanka; Schomburg, John; Krishna, Suprita; Adejoro, Oluwakayode; Wang, Qi; Marsh, Benjamin; Nguyen, Andrew; Genere, Juan Reyes; Self, Patrick; Lund, Erik; Konety, Badrinath R
2017-01-01
To examine the Manufacturer and User Facility Device Experience (MAUDE) database to capture adverse events experienced with the Da Vinci Surgical System, and to design a standardized classification system to categorize the complications and machine failures associated with the device. Overall, 1,057,000 Da Vinci procedures were performed in the United States between 2009 and 2012. Currently, no system exists for classifying and comparing device-related errors and complications with which to evaluate adverse events associated with the Da Vinci Surgical System. The MAUDE database was queried for event reports related to the Da Vinci Surgical System between 2009 and 2012. A classification system was developed and tested among 14 robotic surgeons to associate a level of severity with each event and its relationship to the Da Vinci Surgical System. Events were then classified according to this system and examined using chi-square analysis. Two thousand eight hundred thirty-seven events were identified, of which 34% were obstetrics and gynecology (Ob/Gyn); 19%, urology; 11%, other; and 36%, not specified. Our classification system had moderate agreement, with a kappa score of 0.52. Using our classification system, we identified 75% of the events as mild, 18% as moderate, 4% as severe, and 3% as life-threatening or resulting in death. Seventy-seven percent were classified as definitely related to the device, 15% as possibly related, and 8% as not related. Urology procedures were associated with more severe events than Ob/Gyn procedures (38% vs 26%, p < 0.0001). Energy instruments were associated with less severe events compared with the surgical system (8% vs 87%, p < 0.0001). Events that were definitely associated with the device tended to be less severe (81% vs 19%, p < 0.0001). Our classification system is a valid tool with moderate inter-rater agreement that can be used to better understand device-related adverse events. The majority of robot-related events were mild but associated with the device.
A Dynamic Human Health Risk Assessment System
Prasad, Umesh; Singh, Gurmit; Pant, A. B.
2012-01-01
An online human health risk assessment system (OHHRAS) has been designed and developed as a prototype database-driven system and made available for the population of India through a website, www.healthriskindia.in. OHHRAS provides three utilities: health survey, health status, and bio-calculators. The first utility, health survey, operates on a dynamically developed database and returns the desired output based on the input criteria entered into the system; the second utility, health status, generates health status reports from a dynamic questionnaire and the selected answers, using multiple matches set per the advice of medical experts; and the third utility, bio-calculators, serves scientists and researchers as an online statistical analysis tool that improves accuracy and saves the user's time. The whole database-driven website has been designed and developed using software tools (mainly PHP, MySQL, Dreamweaver, and C++) and made publicly available through www.healthriskindia.in; it is useful for researchers, academia, students, and the general public across all sectors. PMID:22778520
Designing a data portal for synthesis modeling
NASA Astrophysics Data System (ADS)
Holmes, M. A.
2006-12-01
Processing of field and model data in multi-disciplinary integrated science studies is a vital part of synthesis modeling. Collection and storage techniques for field data vary greatly between the participating scientific disciplines due to the nature of the data being collected, whether in situ, remotely sensed, or recorded by automated data-logging equipment. Spreadsheets, personal databases, text files, and binary files are used in the initial storage and processing of the raw data. In order to be useful to scientists, engineers, and modelers, the data need to be stored in a format that is easily identifiable, accessible, and transparent to a variety of computing environments. The Model Operations and Synthesis (MOAS) database and associated web portal were created to provide such capabilities. The industry-standard relational database comprises spatial and temporal data tables, shape files, and supporting metadata accessible over the network, through a menu-driven web-based portal or spatially through ArcSDE connections from the user's local GIS desktop software. A separate server provides public access to spatial data and model output in the form of attributed shape files through an ArcIMS web-based graphical user interface.
Development of a global land cover characteristics database and IGBP DISCover from 1 km AVHRR data
Loveland, Thomas R.; Reed, B.C.; Brown, Jesslyn F.; Ohlen, D.O.; Zhu, Z.; Yang, L.; Merchant, J.W.
2000-01-01
Researchers from the U.S. Geological Survey, the University of Nebraska-Lincoln, and the European Commission's Joint Research Centre (Ispra, Italy) produced a 1 km resolution global land cover characteristics database for use in a wide range of continental- to global-scale environmental studies. This database provides a unique view of the broad patterns of the biogeographical and ecoclimatic diversity of the global land surface and presents a detailed interpretation of the extent of human development. The project was carried out as an International Geosphere-Biosphere Programme, Data and Information Systems (IGBP-DIS) initiative. The IGBP DISCover global land cover product is an integral component of the global land cover database. DISCover includes 17 general land cover classes defined to meet the needs of IGBP core science projects. A formal accuracy assessment of the DISCover data layer will be completed in 1998. The 1 km global land cover database was developed through a continent-by-continent unsupervised classification of 1 km monthly Advanced Very High Resolution Radiometer (AVHRR) Normalized Difference Vegetation Index (NDVI) composites covering 1992-1993. Extensive post-classification stratification was necessary to resolve spectral/temporal confusion between disparate land cover types. The complete global database consists of 961 seasonal land cover regions that capture patterns of land cover, seasonality, and relative primary productivity. The seasonal land cover regions were aggregated to produce seven separate land cover data sets used for global environmental modelling and assessment. The data sets include IGBP DISCover, U.S. Geological Survey Anderson System, Simple Biosphere Model, Simple Biosphere Model 2, Biosphere-Atmosphere Transfer Scheme, Olson Ecosystems, and Running Global Remote Sensing Land Cover. The database also includes all digital sources that were used in the classification. The complete database can be obtained from the website: http://edcwww.cr.usgs.gov/landdaac/glcc/glcc.html.
NASA Astrophysics Data System (ADS)
Galitsky, Boris; Kovalerchuk, Boris
2006-04-01
We develop a software system, Text Scanner for Emotional Distress (TSED), to help detect email messages suspected of coming from people under strong emotional distress. It has been confirmed by multiple studies that terrorist attackers have experienced substantial emotional distress at some point before committing a terrorist attack. Therefore, if an individual in emotional distress can be detected on the basis of email texts, preventive measures can be taken. The proposed detection machinery is based on the extraction and classification of emotional profiles from emails. An emotional profile is a formal representation of a sequence of emotional states through a textual discourse, where communicative actions are attached to these emotional states. The issues of extracting emotional profiles from text and reasoning about them are discussed and illustrated. We then develop an inductive machine learning and reasoning framework to relate an emotional profile to the class "Emotional distress" or "No emotional distress", given a training dataset where the class is assigned by an expert. TSED's machine learning is evaluated using a database of structured customer complaints.
Automatic comparison of striation marks and automatic classification of shoe prints
NASA Astrophysics Data System (ADS)
Geradts, Zeno J.; Keijzer, Jan; Keereweer, Isaac
1995-09-01
A database for toolmarks (named TRAX) and a database for footwear outsole designs (named REBEZO) have been developed on a PC. The databases are filled with video images and administrative data about the toolmarks and the footwear designs. An algorithm for the automatic comparison of the digitized striation patterns has been developed for TRAX. The algorithm appears to work well for deep and complete striation marks and will be implemented in TRAX. For REBEZO, initial efforts have been made toward the automatic classification of outsole patterns. The algorithm first segments the shoe profile; Fourier features are then computed for the separate elements and classified with a neural network. In future developments, information on invariant moments of the shape and on rotation angle will be included in the neural network.
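A brief sketch of Fourier features of the kind described for REBEZO, computed from a closed contour treated as complex boundary points; the contour here is a synthetic circle, and the normalization choices are assumptions rather than the implemented algorithm.

# Sketch: translation- and scale-tolerant Fourier features for a segmented
# outsole element, of the kind fed to a neural network classifier.
# The contour is a synthetic circle; real elements would come from REBEZO.
import numpy as np

t = np.linspace(0, 2 * np.pi, 64, endpoint=False)
contour = 5.0 * np.exp(1j * t)            # boundary as complex points x+iy

coeffs = np.fft.fft(contour)
mags = np.abs(coeffs[1:9])                # drop DC term (removes translation)
features = mags / mags[0]                 # normalize by first harmonic (scale)
print(np.round(features, 3))              # descriptor vector for the element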
Zhao, Min; Chen, Yanming; Qu, Dacheng; Qu, Hong
2015-01-01
The substrates of a transporter are not only useful for inferring the function of the transporter, but also important for discovering compound-compound interactions and for reconstructing metabolic pathways. Though plenty of data has accumulated with the development of new technologies such as in vitro transporter assays, the search for substrates of transporters is far from complete. In this article, we introduce METSP, a maximum-entropy classifier devoted to retrieving transporter-substrate pairs (TSPs) from semistructured text. Based on the high-quality annotation in UniProt, METSP achieves high precision and recall in cross-validation experiments. When METSP is applied to 182,829 human transporter annotation sentences in UniProt, it identifies 3942 sentences with transporter and compound information. Finally, 1547 high-confidence human TSPs are identified for further manual curation, among which 58.37% are pairs with novel substrates not annotated in public transporter databases. METSP is the first efficient tool to extract TSPs from semistructured annotation text in UniProt. This tool can help to determine the precise substrates and drugs of transporters, thus facilitating drug-target prediction, metabolic network reconstruction, and literature classification.
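Since the multinomial logistic model is the standard formulation of a maximum-entropy classifier, a minimal stand-in for METSP's classification step can be sketched as follows; the training sentences and labels are invented, not UniProt annotations.

# Sketch: a maximum-entropy (logistic regression) sentence classifier in
# the spirit of METSP; the training sentences below are invented examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = [
    "Transports glucose across the plasma membrane.",
    "Mediates uptake of serotonin in a sodium-dependent manner.",
    "May be involved in cell adhesion.",
    "Interacts with cytoskeletal proteins.",
]
labels = [1, 1, 0, 0]   # 1 = contains transporter-substrate information

maxent = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                       LogisticRegression(max_iter=1000))
maxent.fit(sentences, labels)
print(maxent.predict(["Catalyzes the transport of citrate."]))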
An updated version of NPIDB includes new classifications of DNA–protein complexes and their families
Zanegina, Olga; Kirsanov, Dmitriy; Baulin, Eugene; Karyagina, Anna; Alexeevski, Andrei; Spirin, Sergey
2016-01-01
The recent upgrade of the nucleic acid–protein interaction database (NPIDB, http://npidb.belozersky.msu.ru/) includes a newly elaborated classification of complexes of protein domains with double-stranded DNA and a classification of families of related complexes. Our classifications are based on the contacting structural elements of both the DNA (the major groove, the minor groove, and the backbone) and the protein (helices, beta-strands, and unstructured segments). We took into account both hydrogen bonds and hydrophobic interactions. The analyzed material contains 1942 structures of protein domains from 748 PDB entries. We have identified 97 interaction modes of individual protein domain–DNA complexes and 17 DNA–protein interaction classes of protein domain families. We analyzed the sources of diversity of DNA–protein interaction modes in different complexes of one protein domain family. The observed interaction mode is sometimes influenced by artifacts of crystallization or by diversity in secondary structure assignment. The interaction classes of domain families are more stable and thus carry more biological meaning than a classification of single complexes. Integration of the classification into NPIDB allows the user to browse the database according to the interacting structural elements of the DNA and protein molecules. For each family, we present average DNA shape parameters in the contact zones with domains of the family. PMID:26656949
On-line classification of pollutants in water using wireless portable electronic noses.
Herrero, José Luis; Lozano, Jesús; Santos, José Pedro; Suárez, José Ignacio
2016-06-01
A portable electronic nose with database connection for on-line classification of pollutants in water is presented in this paper. It is a hand-held, lightweight, powered instrument with wireless communications, capable of standalone operation. A network of similar devices can be configured for distributed measurements. It uses four resistive microsensors and headspace sampling to extract the volatile compounds from glass vials. The measurement and control program was developed in LabVIEW using the database connectivity toolkit to send the sensor data to a server for training and classification with Artificial Neural Networks (ANNs). The use of a server instead of the e-nose's microprocessor increases the memory capacity and computing power of the classifier and allows external users to perform data classification. To address this challenge, this paper also proposes a web-based framework (based on RESTful web services, Asynchronous JavaScript and XML, and JavaScript Object Notation) that allows remote users to train ANNs and request classification values regardless of the user's location and the type of device used. Results show that the proposed prototype can discriminate the samples measured (blank water, acetone, toluene, ammonia, formaldehyde, hydrogen peroxide, ethanol, benzene, dichloromethane, acetic acid, xylene, and dimethylacetamide) with a 94% classification success rate. Copyright © 2016 Elsevier Ltd. All rights reserved.
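A sketch of how an e-nose node might hand classification off to the server over a RESTful/JSON interface, as the framework above describes; the endpoint URL, payload schema, and response fields are hypothetical.

# Sketch: a portable e-nose node submitting sensor readings to a server-side
# ANN for classification over a RESTful/JSON interface. The endpoint URL and
# payload schema are hypothetical placeholders.
import json
import urllib.request

reading = {"device_id": "enose-03",
           "resistances_kohm": [12.4, 8.9, 15.1, 10.2]}  # four microsensors

req = urllib.request.Request(
    "http://example.org/api/classify",        # hypothetical server endpoint
    data=json.dumps(reading).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:      # server runs the trained ANN
    result = json.load(resp)
print(result)    # e.g. {"class": "toluene", "confidence": 0.94}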
Data-driven heterogeneity in mathematical learning disabilities based on the triple code model.
Peake, Christian; Jiménez, Juan E; Rodríguez, Cristina
2017-12-01
Many classifications of heterogeneity in mathematical learning disabilities (MLD) have been proposed over the past four decades; however, little empirical research was conducted until recently, and none of the classifications are derived from Triple Code Model (TCM) postulates. The TCM describes MLD as a heterogeneous disorder with two distinguishable profiles: a representational subtype and a verbal subtype. A sample of elementary school 3rd to 6th graders was divided into two age cohorts (3rd-4th grades and 5th-6th grades). Using data-driven strategies based on the cognitive classification variables predicted by the TCM, our sample of children with MLD clustered as expected: a group with representational deficits and a group with number-fact retrieval deficits. In the younger group, a spatial subtype also emerged, while in both cohorts a non-specific cluster was produced whose profile could not be explained by this theoretical approach. Copyright © 2017 Elsevier Ltd. All rights reserved.
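A minimal sketch of the data-driven subtyping strategy, assuming z-scored cognitive variables of the kind the TCM predicts; the variable names, group means, and synthetic scores are illustrative only.

# Sketch: data-driven subtyping of MLD profiles by clustering scores on
# cognitive variables suggested by the Triple Code Model. Variable names
# and the synthetic scores are illustrative stand-ins.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Columns: [magnitude comparison, number-fact retrieval, spatial task]
representational = rng.normal([-1.5, 0.0, 0.0], 0.4, size=(40, 3))
verbal_retrieval = rng.normal([0.0, -1.5, 0.0], 0.4, size=(40, 3))
scores = np.vstack([representational, verbal_retrieval])

z = StandardScaler().fit_transform(scores)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(z)
print("cluster centroids (z-scores):")
print(np.round([z[clusters == k].mean(axis=0) for k in (0, 1)], 2))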
Data-driven classification of bipolar I disorder from longitudinal course of mood.
Cochran, A L; McInnis, M G; Forger, D B
2016-10-11
The Diagnostic and Statistical Manual of Mental Disorders (DSM) classification of bipolar disorder defines categories that reflect common understanding of mood symptoms rather than scientific evidence. This work aimed to determine whether bipolar I disorder can be objectively classified from longitudinal mood data and whether the resulting classes have clinical associations. Bayesian nonparametric hierarchical models with latent classes and patient-specific models of mood were fit to data from Longitudinal Interval Follow-up Evaluations (LIFE) of bipolar I patients (N=209), and the classes were tested for clinical associations. No classes are justified using the time course of DSM-IV mood states, whereas three classes are justified using the course of subsyndromal mood symptoms. The classes differed in attempted suicides (P=0.017), disability status (P=0.012), and chronicity of affective symptoms (P=0.009). Thus, bipolar I disorder can be objectively classified from mood course, and individuals in the resulting classes share clinical features. Data-driven classification from mood course could be used to enrich sample populations for pharmacological and etiological studies.
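As a greatly simplified stand-in for the paper's Bayesian nonparametric hierarchical models, the sketch below uses a truncated Dirichlet-process mixture to let the data decide how many mood-course classes are supported; the per-patient summaries and the 0.05 weight cutoff are assumptions.

# Sketch: letting the data choose the number of latent classes with a
# truncated Dirichlet-process mixture, a much simpler stand-in for the
# paper's hierarchical models. All data here are synthetic.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(3)
# Hypothetical per-patient summaries of subsyndromal mood course:
# [fraction of weeks with depressive symptoms, fraction with manic symptoms]
group_a = rng.normal([0.55, 0.05], 0.05, size=(70, 2))
group_b = rng.normal([0.20, 0.20], 0.05, size=(70, 2))
group_c = rng.normal([0.05, 0.02], 0.02, size=(69, 2))
X = np.vstack([group_a, group_b, group_c])

dpmm = BayesianGaussianMixture(
    n_components=10,                                  # truncation level
    weight_concentration_prior_type="dirichlet_process",
    random_state=0).fit(X)
kept = dpmm.weights_ > 0.05                           # assumed cutoff
print("classes supported by the data:", kept.sum())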
Duz, Marco; Marshall, John F; Parkin, Tim
2017-06-29
The use of electronic medical records (EMRs) offers opportunities for clinical epidemiological research. With large EMR databases, automated analysis processes are necessary but require thorough validation before they can be routinely used. The aim of this study was to validate a computer-assisted technique using commercially available content analysis software (SimStat-WordStat v.6 (SS/WS), Provalis Research) for mining free-text EMRs. The dataset used for the validation process included life-long EMRs from 335 patients (17,563 rows of data), selected at random from a larger dataset (141,543 patients, ~2.6 million rows of data) obtained from 10 equine veterinary practices in the United Kingdom. The ability of the computer-assisted technique to detect rows of data (cases) of colic, renal failure, right dorsal colitis, and non-steroidal anti-inflammatory drug (NSAID) use in the population was compared with manual classification. The first step of the computer-assisted analysis process was the definition of inclusion dictionaries to identify cases, consisting of terms identifying a condition of interest. Words in the inclusion dictionaries were selected from the list of all words in the dataset obtained in SS/WS. The second step consisted of defining an exclusion dictionary, including combinations of words to remove cases erroneously classified by the inclusion dictionary alone. The third step was the definition of a reinclusion dictionary to reinclude cases that had been erroneously removed by the exclusion dictionary. Finally, cases obtained by the exclusion dictionary were removed from the cases obtained by the inclusion dictionary, and cases from the reinclusion dictionary were subsequently reincluded, using R v3.0.2 (R Foundation for Statistical Computing, Vienna, Austria). Manual analysis was performed as a separate process by a single experienced clinician reading through the dataset once and classifying each row of data based on interpretation of the free-text notes. Validation was performed by comparing the computer-assisted method with manual analysis, which was used as the gold standard. Sensitivity, specificity, negative predictive values (NPVs), positive predictive values (PPVs), and F values of the computer-assisted process were calculated against the manual classification. The lowest sensitivity, specificity, PPV, NPV, and F values were 99.82% (1128/1130), 99.88% (16410/16429), 94.6% (223/239), 100.00% (16410/16412), and 99.0% (100×2×0.983×0.998/[0.983+0.998]), respectively. The computer-assisted process required a few seconds to run, although an estimated 30 h were required for dictionary creation; manual classification required approximately 80 man-hours. The critical step in this work is the creation of accurate and inclusive dictionaries to ensure that no potential cases are missed. It is significantly easier to remove false-positive terms from an SS/WS-selected subset of a large database than to search the original database for potential false negatives. The benefits of using this method are proportional to the size of the dataset to be analyzed. ©Marco Duz, John F Marshall, Tim Parkin. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 29.06.2017.
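The three-dictionary selection logic can be expressed as set algebra over rows of free text, as in the following sketch; the terms and example rows are invented, not entries from the equine EMR dataset.

# Sketch: the inclusion/exclusion/reinclusion case-selection logic
# described above, as set algebra over rows of free text.
inclusion   = {"colic", "colicky"}
exclusion   = {"no signs of colic", "rule out colic"}
reinclusion = {"no signs of colic initially"}

rows = {
    1: "mild colic overnight",
    2: "examined; no signs of colic",
    3: "no signs of colic initially, colicky by evening",
    4: "routine dental work",
}

def hits(phrases, text):
    return any(p in text for p in phrases)

included   = {i for i, t in rows.items() if hits(inclusion, t)}
excluded   = {i for i, t in rows.items() if hits(exclusion, t)}
reincluded = {i for i, t in rows.items() if hits(reinclusion, t)}

cases = (included - excluded) | reincluded
print(sorted(cases))   # -> [1, 3]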
Con-Text: Text Detection for Fine-grained Object Classification.
Karaoglu, Sezer; Tao, Ran; van Gemert, Jan C; Gevers, Theo
2017-05-24
This work focuses on fine-grained object classification using recognized scene text in natural images. While the state of the art relies on visual cues only, this paper is the first work to propose combining textual and visual cues. Another novelty is the textual cue extraction: unlike state-of-the-art text detection methods, we focus more on the background than on the text regions. Once text regions are detected, they are further processed by two methods to perform text recognition, i.e., the ABBYY commercial OCR engine and a state-of-the-art character recognition algorithm. Then, to perform textual cue encoding, bigrams and trigrams are formed between the recognized characters under the proposed spatial pairwise constraints. Finally, the extracted visual and textual cues are combined for fine-grained classification. The proposed method is validated on four publicly available datasets: ICDAR03, ICDAR13, Con-Text, and Flickr-logo. We improve the state-of-the-art end-to-end character recognition by a large margin of 15% on ICDAR03. We show that textual cues are useful in addition to visual cues for fine-grained classification, and that textual cues are also useful for logo retrieval. Adding textual cues outperforms visual-only and textual-only approaches in fine-grained classification (70.7% vs 60.3%) and logo retrieval (57.4% vs 54.8%).
Hypothesis-driven classification of materials using nuclear magnetic resonance relaxometry
DOE Office of Scientific and Technical Information (OSTI.GOV)
Espy, Michelle A.; Matlashov, Andrei N.; Schultz, Larry J.
Technologies related to identification of a substance in an optimized manner are provided. A reference group of known materials is identified. Each known material has known values for several classification parameters. The classification parameters comprise at least one of T1, T2, T1ρ, a relative nuclear susceptibility (RNS) of the substance, and an x-ray linear attenuation coefficient (LAC) of the substance. A measurement sequence is optimized based on at least one of a measurement cost of each of the classification parameters and an initial probability of each of the known materials in the reference group.
Monitoring aquatic resources for regional assessments requires an accurate and comprehensive inventory of the resource and a useful classification of ecosystem similarities. Our research effort to create an electronic database and work with various ways to classify coastal wetlands...
Inter-Coder Agreement in One-to-Many Classification: Fuzzy Kappa
Kirilenko, Andrei P.; Stepchenkova, Svetlana
2016-01-01
Content analysis involves classification of textual, visual, or audio data. The inter-coder agreement is estimated by having two or more coders classify the same data units, with subsequent comparison of their results. The existing methods of agreement estimation, e.g., Cohen's kappa, require that coders place each unit of content into one and only one category (one-to-one coding) from the pre-established set of categories. However, in certain data domains (e.g., maps, photographs, databases of texts and images), this requirement seems overly restrictive. The restriction could be lifted, provided that there is a measure to calculate the inter-coder agreement in the one-to-many protocol. Building on the existing approaches to one-to-many coding in geography and biomedicine, such a measure, fuzzy kappa, an extension of Cohen's kappa, is proposed. It is argued that the measure is especially compatible with data from domains where the holistic reasoning of human coders is utilized in order to describe the data and assess the meaning of communication. PMID:26933956
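For reference, the sketch below computes standard Cohen's kappa for one-to-one coding and then a simple mean set-overlap score for one-to-many coding; the overlap average is an illustrative proxy only, not the fuzzy kappa statistic derived in the paper.

# Sketch: Cohen's kappa for one-to-one coding, plus a per-unit set-overlap
# score for one-to-many coding. The mean Jaccard overlap is an illustrative
# proxy, not the fuzzy kappa formula proposed in the paper.
from collections import Counter

def cohen_kappa(a, b):
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in ca) / n ** 2      # chance agreement
    return (po - pe) / (1 - pe)

print(cohen_kappa(list("AABBC"), list("AABCC")))      # -> about 0.71

# One-to-many: each coder may assign a *set* of categories per unit.
coder1 = [{"A"}, {"A", "B"}, {"C"}]
coder2 = [{"A"}, {"B"},      {"B", "C"}]
overlap = [len(x & y) / len(x | y) for x, y in zip(coder1, coder2)]
print(sum(overlap) / len(overlap))                    # mean Jaccard overlap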
Ong, Mei-Sing; Runciman, William; Coiera, Enrico
2010-01-01
Objective To analyze patient safety incidents associated with computer use in order to develop the basis for a classification of problems reported by health professionals. Design Incidents submitted to a voluntary incident reporting database across one Australian state were retrieved, and a subset (25%) was analyzed to identify 'natural categories' for classification. Two coders independently classified the remaining incidents into one or more categories. Free-text descriptions were analyzed to identify contributing factors. Where available, medical specialty, time of day, and consequences were examined. Measurements Descriptive statistics; inter-rater reliability. Results A search of 42 616 incidents from 2003 to 2005 yielded 123 computer-related incidents. After removing duplicate and unrelated incidents, 99 incidents describing 117 problems remained. A classification with 32 types of computer use problems was developed. Problems were grouped into information input (31%), transfer (20%), output (20%), and general technical (24%). Overall, 55% of problems were machine-related and 45% were attributed to human–computer interaction. Delays in initiating and completing clinical tasks were a major consequence of machine-related problems (70%), whereas rework was a major consequence of human–computer interaction problems (78%). While 38% (n=26) of the incidents were reported to have a noticeable consequence but no harm, 34% (n=23) had no noticeable consequence. Conclusion Only 0.2% of all incidents reported were computer-related. Further work is required to expand our classification using incident reports and other sources of information about healthcare IT problems. Evidence-based user interface design must focus on the safe entry and retrieval of clinical information and support users in detecting and correcting errors and malfunctions. PMID:20962128
DOE Office of Scientific and Technical Information (OSTI.GOV)
Reddy, Tatiparthi B. K.; Thomas, Alex D.; Stamatis, Dimitri
The Genomes OnLine Database (GOLD; http://www.genomesonline.org) is a comprehensive online resource to catalog and monitor genetic studies worldwide. GOLD provides up-to-date status on complete and ongoing sequencing projects along with a broad array of curated metadata. Within this paper, we report version 5 (v.5) of the database. The newly designed database schema and web user interface support several new features, including the implementation of a four-level (meta)genome project classification system and a simplified, intuitive web interface to access reports and launch search tools. The database currently hosts information for about 19 200 studies, 56 000 Biosamples, 56 000 sequencing projects, and 39 400 analysis projects. More than just a catalog of worldwide genome projects, GOLD is a manually curated, quality-controlled metadata warehouse. The problems encountered in integrating disparate data of varying quality into GOLD are briefly highlighted. Lastly, GOLD fully supports and follows the Genomic Standards Consortium (GSC) Minimum Information standards.
Toward Computational Cumulative Biology by Combining Models of Biological Datasets
Faisal, Ali; Peltonen, Jaakko; Georgii, Elisabeth; Rung, Johan; Kaski, Samuel
2014-01-01
A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine designed for relating a researcher's experimental dataset to earlier work in the field. The search is (i) data-driven to enable new findings, going beyond the state of the art of keyword searches in annotations, (ii) modeling-driven, to include both biological knowledge and insights learned from data, and (iii) scalable, as it is accomplished without building one unified grand model of all data. Assuming each dataset has been modeled beforehand, by the researchers or automatically by database managers, we apply a rapidly computable and optimizable combination model to decompose a new dataset into contributions from earlier relevant models. By using the data-driven decomposition, we identify a network of interrelated datasets from a large annotated human gene expression atlas. While tissue type and disease were major driving forces for determining relevant datasets, the found relationships were richer, and the model-based search was more accurate than the keyword search; moreover, it recovered biologically meaningful relationships that are not straightforwardly visible from annotations—for instance, between cells in different developmental stages such as thymocytes and T-cells. Data-driven links and citations matched to a large extent; the data-driven links even uncovered corrections to the publication data, as two of the most linked datasets were not highly cited and turned out to have wrong publication entries in the database. PMID:25427176
Characterization of Particles Created By Laser-Driven Hydrothermal Processing
2016-06-01
Particles created by laser-driven hydrothermal processing, an innovative technique used for the ablation of submerged materials, were characterized. Two naturally occurring materials, obsidian and tektite, were used as targets for this technique. Keywords: laser-driven hydrothermal processing, characterization, obsidian, tektite, natural glass.
Li, Wei; Cao, Peng; Zhao, Dazhe; Wang, Junbo
2016-01-01
Computer aided detection (CAD) systems can assist radiologists by offering a second opinion on early diagnosis of lung cancer. Classification and feature representation play critical roles in false-positive reduction (FPR) in lung nodule CAD. We design a deep convolutional neural networks method for nodule classification, which has an advantage of autolearning representation and strong generalization ability. A specified network structure for nodule images is proposed to solve the recognition of three types of nodules, that is, solid, semisolid, and ground glass opacity (GGO). Deep convolutional neural networks are trained by 62,492 regions-of-interest (ROIs) samples including 40,772 nodules and 21,720 nonnodules from the Lung Image Database Consortium (LIDC) database. Experimental results demonstrate the effectiveness of the proposed method in terms of sensitivity and overall accuracy and that it consistently outperforms the competing methods.
Managing the Big Data Avalanche in Astronomy - Data Mining the Galaxy Zoo Classification Database
NASA Astrophysics Data System (ADS)
Borne, Kirk D.
2014-01-01
We will summarize a variety of data mining experiments that have been applied to the Galaxy Zoo database of galaxy classifications, which were provided by volunteer citizen scientists. The goal of these exercises is to learn new and improved classification rules for diverse populations of galaxies, which can then be applied to much larger sky surveys of the future, such as the LSST (Large Synoptic Survey Telescope), which is proposed to obtain detailed photometric data for approximately 20 billion galaxies. The massive Big Data that astronomy projects will generate in the future demands greater application of data mining and data science algorithms, as well as greater training of astronomy students in the skills of data mining and data science. The project described here has involved several graduate and undergraduate research assistants at George Mason University.
Classification System and Information Services in the Library of SAO RAS
NASA Astrophysics Data System (ADS)
Shvedova, G. S.
The classification system used at SAO RAS is described. It includes both special determinants from the UDC (Universal Decimal Classification) and newer tables with astronomical terms from the Library-Bibliographical Classification (LBC). The classification tables are continually modified, and new astronomical terms are introduced. At present, information services for the scientists are provided with the help of the abstract journal Astronomy, Astronomy and Astrophysics Abstracts, and the catalogues and card indexes of the library. Based on our classification system and The Astronomy Thesaurus compiled by R.M. Shobbrook and R.R. Shobbrook, the development of a database for the library has been started, which allows prompt service to the observatory's staff members.
Development of an Integrated Biospecimen Database among the Regional Biobanks in Korea.
Park, Hyun Sang; Cho, Hune; Kim, Hwa Sun
2016-04-01
This study developed an integrated database for 15 regional biobanks that provides large quantities of high-quality bio-data to researchers, to be used for the prevention of disease, the development of personalized medicines, and genetics studies. We collected the raw data managed independently by the 15 regional biobanks for database modeling, and analyzed and defined the metadata of the items. We also built a three-level (high, middle, and low) classification system for classifying the item concepts based on the metadata. To give the items clear meanings, clinical items were defined using the Systematized Nomenclature of Medicine Clinical Terms, and specimen items were defined using the Logical Observation Identifiers Names and Codes. To optimize database performance, we set up a multi-column index based on the classification system and the international standard codes. As a result of subdividing the 7,197,252 raw data items collected, we refined the metadata into 1,796 clinical items and 1,792 specimen items. The classification system consists of 15 high, 163 middle, and 3,588 low class items. International standard codes were linked to 69.9% of the clinical items and 71.7% of the specimen items. The database consists of 18 tables implemented in MySQL Server 5.6. In the performance evaluation, the multi-column index reduced query time by as much as a factor of nine. The database developed was based on an international standard terminology system, providing an infrastructure that can integrate the 7,197,252 raw data items managed by the 15 regional biobanks. In particular, it resolved the inevitable interoperability issues in the exchange of information among the biobanks and provided a solution to the synonym problem, which arises when the same concept is expressed in a variety of ways.
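The multi-column index optimization can be demonstrated with any relational engine; in the sketch below, SQLite stands in for MySQL Server 5.6, and the table and column names are hypothetical.

# Sketch: a multi-column index over the classification hierarchy plus a
# standard code, mirroring the optimization described above.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE specimen_item (
    id INTEGER PRIMARY KEY,
    class_high TEXT, class_middle TEXT, class_low TEXT,
    loinc_code TEXT, value REAL)""")
db.executemany(
    "INSERT INTO specimen_item VALUES (?,?,?,?,?,?)",
    [(1, "chemistry", "lipids", "ldl", "13457-7", 3.1),
     (2, "chemistry", "lipids", "hdl", "2085-9", 1.2)])

# Composite index covering the hierarchy levels and the standard code.
db.execute("""CREATE INDEX idx_class_code
              ON specimen_item (class_high, class_middle, class_low,
                                loinc_code)""")
plan = db.execute("""EXPLAIN QUERY PLAN
    SELECT value FROM specimen_item
    WHERE class_high='chemistry' AND class_middle='lipids'
      AND class_low='ldl' AND loinc_code='13457-7'""").fetchall()
print(plan)   # the plan shows the query being served by idx_class_code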
Algorithmic Classification of Five Characteristic Types of Paraphasias.
Fergadiotis, Gerasimos; Gorman, Kyle; Bedrick, Steven
2016-12-01
This study was intended to evaluate a series of algorithms developed to perform automatic classification of paraphasic errors (formal, semantic, mixed, neologistic, and unrelated errors). We analyzed 7,111 paraphasias from the Moss Aphasia Psycholinguistics Project Database (Mirman et al., 2010) and evaluated the classification accuracy of 3 automated tools. First, we used frequency norms from the SUBTLEXus database (Brysbaert & New, 2009) to differentiate nonword errors and real-word productions. Then we implemented a phonological-similarity algorithm to identify phonologically related real-word errors. Last, we assessed the performance of a semantic-similarity criterion that was based on word2vec (Mikolov, Yih, & Zweig, 2013). Overall, the algorithmic classification replicated human scoring for the major categories of paraphasias studied with high accuracy. The tool that was based on the SUBTLEXus frequency norms was more than 97% accurate in making lexicality judgments. The phonological-similarity criterion was approximately 91% accurate, and the overall classification accuracy of the semantic classifier ranged from 86% to 90%. Overall, the results highlight the potential of tools from the field of natural language processing for the development of highly reliable, cost-effective diagnostic tools suitable for collecting high-quality measurement data for research and clinical purposes.
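A hedged sketch of the three-step logic for a single paraphasia, with a toy frequency list standing in for the SUBTLEXus norms, difflib string similarity standing in for the phonological-similarity algorithm, and two-dimensional toy vectors standing in for word2vec; all thresholds are invented.

# Sketch: three-step classification of one paraphasia. The frequency list,
# vectors, and thresholds are tiny illustrative stand-ins.
from difflib import SequenceMatcher

KNOWN_WORDS = {"cat": 1342, "dog": 1856, "hat": 910}   # toy frequency norms
VEC = {"cat": (0.9, 0.1), "dog": (0.8, 0.3), "hat": (0.1, 0.9)}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) * sum(b * b for b in v)) ** 0.5
    return dot / norm

def classify(target, response):
    if response not in KNOWN_WORDS:
        return "neologistic"                 # fails the lexicality check
    phon = SequenceMatcher(None, target, response).ratio()
    sem = cosine(VEC[target], VEC[response])
    if phon >= 0.5 and sem >= 0.9:
        return "mixed"
    if phon >= 0.5:
        return "formal"
    if sem >= 0.9:
        return "semantic"
    return "unrelated"

print(classify("cat", "dog"))   # -> semantic
print(classify("cat", "hat"))   # -> formal
print(classify("cat", "gat"))   # -> neologistic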
NASA Astrophysics Data System (ADS)
Manteiga, M.; Carricajo, I.; Rodríguez, A.; Dafonte, C.; Arcay, B.
2009-02-01
Astrophysics is evolving toward a more rational use of costly observational data by intelligently exploiting large terrestrial and space astronomical databases. In this paper, we present a study showing the suitability of an expert system for classifying stellar spectra in the Morgan and Keenan (MK) system. Using the formalism of artificial intelligence for the development of such a system, we propose a rule base that contains classification criteria and confidence grades, all integrated in an inference engine that emulates human reasoning by means of a hierarchical decision-rule tree that also considers the uncertainty factors associated with the rules. Our main objective is to illustrate the formulation and development of such a system for an astrophysical classification problem. An extensive spectral database of MK standard spectra has been collected and used as a reference to determine the spectral indexes that are suitable for classification in the MK system. It is shown that by considering 30 spectral indexes and associating them with uncertainty factors, we can accurately diagnose the MK type of a particular spectrum. The system was evaluated against the NOAO-INDO-US spectral catalog.
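A minimal sketch of rules that pair spectral-index criteria with confidence grades and combine evidence across rules; the index names, thresholds, grades, and the combination formula are invented for illustration.

# Sketch: rules pairing spectral-index criteria with confidence grades,
# with evidence combined incrementally. All values are invented.
RULES = [
    # (index name, predicate on its value, implied MK class, confidence)
    ("CaII_K_depth", lambda v: v > 0.6, "K", 0.7),
    ("TiO_band",     lambda v: v > 0.3, "M", 0.9),
    ("H_delta",      lambda v: v > 0.5, "A", 0.8),
]

def classify(indexes):
    confidence = {}
    for name, fires, mk_class, cf in RULES:
        if name in indexes and fires(indexes[name]):
            prev = confidence.get(mk_class, 0.0)
            confidence[mk_class] = prev + cf * (1 - prev)  # combine evidence
    return max(confidence.items(), key=lambda kv: kv[1]) if confidence else None

spectrum = {"CaII_K_depth": 0.75, "TiO_band": 0.05, "H_delta": 0.1}
print(classify(spectrum))   # -> ('K', 0.7)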
Variability in Standard Outcomes of Posterior Lumbar Fusion Determined by National Databases.
Joseph, Jacob R; Smith, Brandon W; Park, Paul
2017-01-01
National databases are used with increasing frequency in spine surgery literature to evaluate patient outcomes. The differences between individual databases in relationship to outcomes of lumbar fusion are not known. We evaluated the variability in standard outcomes of posterior lumbar fusion between the University HealthSystem Consortium (UHC) database and the Healthcare Cost and Utilization Project National Inpatient Sample (NIS). NIS and UHC databases were queried for all posterior lumbar fusions (International Classification of Diseases, Ninth Revision code 81.07) performed in 2012. Patient demographics, comorbidities (including obesity), length of stay (LOS), in-hospital mortality, and complications such as urinary tract infection, deep venous thrombosis, pulmonary embolism, myocardial infarction, durotomy, and surgical site infection were collected using specific International Classification of Diseases, Ninth Revision codes. Analysis included 21,470 patients from the NIS database and 14,898 patients from the UHC database. Demographic data were not significantly different between databases. Obesity was more prevalent in UHC (P = 0.001). Mean LOS was 3.8 days in NIS and 4.55 in UHC (P < 0.0001). Complications were significantly higher in UHC, including urinary tract infection, deep venous thrombosis, pulmonary embolism, myocardial infarction, surgical site infection, and durotomy. In-hospital mortality was similar between databases. NIS and UHC databases had similar demographic patient populations undergoing posterior lumbar fusion. However, the UHC database reported significantly higher complication rate and longer LOS. This difference may reflect academic institutions treating higher-risk patients; however, a definitive reason for the variability between databases is unknown. The inability to precisely determine the basis of the variability between databases highlights the limitations of using administrative databases for spinal outcome analysis. Copyright © 2016 Elsevier Inc. All rights reserved.
Al-Nasheri, Ahmed; Muhammad, Ghulam; Alsulaiman, Mansour; Ali, Zulfiqar; Mesallam, Tamer A; Farahat, Mohamed; Malki, Khalid H; Bencherif, Mohamed A
2017-01-01
Automatic voice-pathology detection and classification systems may help clinicians to detect the existence of any voice pathologies and the type of pathology from which patients suffer in the early stages. The main aim of this paper is to investigate Multidimensional Voice Program (MDVP) parameters to automatically detect and classify the voice pathologies in multiple databases, and then to find out which parameters performed well in these two processes. Samples of the sustained vowel /a/ of normal and pathological voices were extracted from three different databases, which have three voice pathologies in common. The selected databases in this study represent three distinct languages: (1) the Arabic voice pathology database; (2) the Massachusetts Eye and Ear Infirmary database (English database); and (3) the Saarbruecken Voice Database (German database). A computerized speech lab program was used to extract MDVP parameters as features, and an acoustical analysis was performed. The Fisher discrimination ratio was applied to rank the parameters. A t test was performed to highlight any significant differences in the means of the normal and pathological samples. The experimental results demonstrate a clear difference in the performance of the MDVP parameters using these databases. The highly ranked parameters also differed from one database to another. The best accuracies were obtained by using the three highest ranked MDVP parameters arranged according to the Fisher discrimination ratio: these accuracies were 99.68%, 88.21%, and 72.53% for the Saarbruecken Voice Database, the Massachusetts Eye and Ear Infirmary database, and the Arabic voice pathology database, respectively. Copyright © 2017 The Voice Foundation. Published by Elsevier Inc. All rights reserved.
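The Fisher discrimination ratio ranking can be sketched in a few lines; the parameter names and the synthetic normal/pathological samples below are stand-ins for MDVP measurements.

# Sketch: ranking acoustic parameters by the Fisher discrimination ratio
# (between-class separation over within-class spread). Data are synthetic
# stand-ins for MDVP measurements of normal vs pathological voices.
import numpy as np

rng = np.random.default_rng(4)
normal = rng.normal([0.5, 1.0, 2.0], [0.1, 0.5, 0.4], size=(50, 3))
pathological = rng.normal([0.9, 1.1, 2.0], [0.1, 0.5, 0.4], size=(50, 3))
names = ["Jitt", "Shim", "NHR"]          # illustrative parameter names

def fisher_ratio(a, b):
    return (a.mean(0) - b.mean(0)) ** 2 / (a.var(0) + b.var(0))

for name, f in sorted(zip(names, fisher_ratio(normal, pathological)),
                      key=lambda kv: -kv[1]):
    print(f"{name}: {f:.2f}")            # highest-ranked parameter first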
Optimized extreme learning machine for urban land cover classification using hyperspectral imagery
NASA Astrophysics Data System (ADS)
Su, Hongjun; Tian, Shufang; Cai, Yue; Sheng, Yehua; Chen, Chen; Najafian, Maryam
2017-12-01
This work presents a new urban land cover classification framework using a firefly algorithm (FA) optimized extreme learning machine (ELM). FA is adopted to optimize the regularization coefficient C and the Gaussian kernel parameter σ for kernel ELM. Additionally, the effectiveness of spectral features derived from an FA-based band selection algorithm is studied for the proposed classification task. Three hyperspectral databases recorded by different sensors, namely HYDICE, HyMap, and AVIRIS, were used. Our study shows that the proposed method outperforms traditional classification algorithms such as SVM and reduces the computational cost significantly.
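A basic (non-kernel) ELM can be sketched as a random hidden layer whose output weights are solved in closed form with regularization C; here a small grid over C stands in for the firefly optimization, and the data are synthetic.

# Sketch: a basic extreme learning machine: random hidden layer, output
# weights solved in closed form with regularization C. A small grid search
# over C stands in for the firefly optimization used in the paper.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 10))                    # stand-in spectral pixels
y = (X[:, 0] + X[:, 1] > 0).astype(float)         # toy two-class labels

def elm_fit_predict(X, y, C, hidden=100):
    W = rng.normal(size=(X.shape[1], hidden))     # fixed random weights
    b = rng.normal(size=hidden)
    H = np.tanh(X @ W + b)                        # hidden-layer activations
    beta = np.linalg.solve(H.T @ H + np.eye(hidden) / C, H.T @ y)
    return (H @ beta > 0.5).astype(float)

for C in (0.1, 1.0, 10.0):                        # stand-in for FA search
    acc = (elm_fit_predict(X, y, C) == y).mean()
    print(f"C={C}: training accuracy {acc:.2f}")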
Ihmaid, Saleh K; Ahmed, Hany E A; Zayed, Mohamed F; Abadleh, Mohammed M
2016-01-30
The main step in a successful drug discovery pipeline is the identification of small potent compounds that selectively bind to the target of interest with high affinity. However, there is still a shortage of efficient and accurate computational methods capable of studying and hence predicting compound selectivity properties. In this work, we propose an affordable machine learning method to perform compound selectivity classification and prediction. For this purpose, we collected compounds with reported activity and built a selectivity database of 153 cathepsin K and S inhibitors that are of medicinal interest. This database comprises three compound sets: two selective ones (K/S and S/K) and one non-selective one (KS). We subjected this database to the selectivity classification tool 'Emergent Self-Organizing Maps' to explore its capability to differentiate inhibitors selective for one cathepsin over the other. The method exhibited good clustering performance for selective ligands, with accuracy up to 100%. Among the possibilities, BAPs and MACCS molecular structural fingerprints were used for the classification. The results demonstrate the method's ability to support structure-selectivity relationship interpretation, and selectivity markers were identified for the design of further novel inhibitors with high activity and target selectivity.
ERIC Educational Resources Information Center
Sargent, John
The Office of Technology Policy analyzed Bureau of Labor Statistics' growth projections for the core occupational classifications of IT (information technology) workers to assess future demand in the United States. Classifications studied were computer engineers, systems analysts, computer programmers, database administrators, computer support…
A relational database in neurosurgery.
Sicurello, F; Marchetti, M R; Cazzaniga, P
1995-01-01
This paper describes teh automatic procedure for a clinical record management in a Neurosurgery ward. The automated record allows the storage, querying and effective management of clinical data. This is useful during the patient stay and also for data processing and analysis aiming at clinical research and statistical studies. The clinical record is problem-oriented. It contains a minimum data set regarding every patient and a data set which is defined by a classification nomenclature (using an inner protocol). The main parts of the clinical record are the following tables: PERSONAL DATA: contains the fields relating to personal and admission data of the patient. The compilation of some fields is compulsory because they serve as input for the automated discharge letter. This table is used as an identifier for patient retrieval. composed of five different tables according to the kind of data. They are: familiar anamnesis, physiological anamnesis, past and next pathology anamnesis, and trauma anamnesis. GENERAL OBJECTIVITY: contains the general physical information of a patient. The field hold default values, which quickens the compilation and assures the recording of normal values. NEUROLOGICAL EXAMINATION: contains information about the neurological status of the patient. Also in this table, ther are default values in the fields. COMA: contains standardized ata and classifications. The multiple choices are automated and driven and belong to homogeneous classes. SURGICAL OPERATIONS: the information recording is made defining the general kind of operation and then defining the peculiar kind of operation. INSTRUMENTAL EXAMINATIONS: some examination results are recorded in a free structure, while other ones (TAC, etc.) follow codified structure. In order to identify a pathology by means of TAC, it is enough to record three values corresponding to three variables. THis classification fully describes a lot of neurosurgical pathologies. DISCHARGE: contains conclusions, therapies, result, and hospital course. Medical language is closer to the natural one and presents some abiguities. In order to solve this problem, a classification nomenclature was used for diagnosis definition. DISCHARGE LETTER: the document given to the patient when he is discharged. It extracts data from the previously described modules and contains standard headings. The information stored int he database is structured (e.g., diagnosis, name, surname, etc.) and access to this data takes place when the user wants to search the database, using particular queries where the identifying data of a patient is put as conditions for the research (SELECT age, name WHERE diagnosis="TRAUMA"). Logical operators and relational algebra of the relational DBMS allows more complex queries ((diagnosis="TRAUMA" AND age="19") OR sex="M"). The queries are deterministic, because data management uses a classification nomenclature. Data retrieval takes place through a matching, and the DBMS answers directly to the queries. The information retrieval speed depends upon the kind of system that is used; in our case retrieval time is low because the accesses to disk are few even for big databases. In medicine, clinical records can have a hierarchical structure and/or a relational one. Nevertheless, the hierarchical model presents a disadvantage: it is not very flexible because it is linked to a pre-defined structure; as a matter of fact, the definition of path is established in the beginning and not during the execution. 
Thus, a better representation of the system at a logical level requires a relational DBMS which exploits the relationships between entities both vertically and horizontally. That is why the developers adopted a mixed strategy which exploits the advantages of both models and which is provided by M Technology with SQL language (M/SQL). For the future, it is important to have at one's disposal multimedia technologies, which integrate different kinds of information (alp
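To make the query style above concrete, here is a minimal runnable sketch using Python's sqlite3; the table and column names (personal_data, diagnosis, etc.) are hypothetical stand-ins, not the system's actual M/SQL schema.

```python
# A minimal sketch (not the authors' code) of the deterministic query style
# described in the abstract, using SQLite; schema and data are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE personal_data (name TEXT, age INTEGER, sex TEXT, diagnosis TEXT)")
conn.executemany(
    "INSERT INTO personal_data VALUES (?, ?, ?, ?)",
    [("Rossi", 19, "M", "TRAUMA"), ("Bianchi", 45, "F", "TUMOR")],
)

# Simple retrieval keyed on a classified diagnosis term:
rows = conn.execute(
    "SELECT age, name FROM personal_data WHERE diagnosis = ?", ("TRAUMA",)
).fetchall()
print(rows)

# Logical operators allow more complex, still deterministic queries:
rows = conn.execute(
    "SELECT name FROM personal_data "
    "WHERE (diagnosis = 'TRAUMA' AND age = 19) OR sex = 'M'"
).fetchall()
print(rows)
```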
Texture for script identification.
Busch, Andrew; Boles, Wageeh W; Sridharan, Sridha
2005-11-01
The problem of determining the script and language of a document image has a number of important applications in the field of document analysis, such as indexing and sorting of large collections of such images, or as a precursor to optical character recognition (OCR). In this paper, we investigate the use of texture as a tool for determining the script of a document image, based on the observation that text has a distinct visual texture. An experimental evaluation of a number of commonly used texture features is conducted on a newly created script database, providing a qualitative measure of which features are most appropriate for this task. Strategies for improving classification results in situations with limited training data and multiple font types are also proposed.
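As one concrete example of the texture feature families such evaluations typically cover, the sketch below computes grey-level co-occurrence statistics with scikit-image; the distances, angles, and property subset are assumptions for illustration, not the paper's exact configuration.

```python
# A sketch of one common texture feature family (grey-level co-occurrence
# matrices) applied to a text-block image; parameters are illustrative only.
import numpy as np
from skimage.feature import graycomatrix, graycoprops  # skimage >= 0.19

def glcm_features(block):
    """Extract a small GLCM feature vector from an 8-bit image block."""
    glcm = graycomatrix(block, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    return np.hstack([graycoprops(glcm, p).ravel()
                      for p in ("contrast", "homogeneity", "energy", "correlation")])

block = (np.random.rand(64, 64) * 255).astype(np.uint8)  # stand-in for a text block
print(glcm_features(block))
```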
Study of an External Neutron Source for an Accelerator-Driven System using the PHITS Code
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sugawara, Takanori; Iwasaki, Tomohiko; Chiba, Takashi
A code system for the Accelerator Driven System (ADS) has been under development for analyzing dynamic behaviors of a subcritical core coupled with an accelerator. This code system, named DSE (Dynamics calculation code system for a Subcritical system with an External neutron source), consists of an accelerator part and a reactor part. The accelerator part employs a database, calculated using PHITS, for investigating accelerator-related effects such as changes of beam energy, beam diameter, void generation, and target level. This analysis method using the database may introduce some errors into dynamics calculations, since the neutron source data derived from the database carries errors from the fitting or interpolation procedures. In this study, the effects of various events are investigated to confirm that the method based on the database is appropriate.
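A toy illustration of the error source named above: tabulating a source term on a coarse grid and interpolating between entries deviates from the underlying curve. The function and grid are invented; this is not the DSE/PHITS workflow itself.

```python
# Toy illustration (not the DSE code) of how tabulating a source term and
# interpolating between entries introduces error relative to the true curve.
import numpy as np

true = lambda e: np.exp(-0.5 * e) * e**2        # stand-in "yield vs beam energy"
table_e = np.linspace(0.5, 10.0, 8)             # coarse database grid
table_y = true(table_e)

query_e = np.linspace(0.5, 10.0, 200)
approx = np.interp(query_e, table_e, table_y)   # linear interpolation of the table
rel_err = np.abs(approx - true(query_e)) / true(query_e)
print(f"max relative interpolation error: {rel_err.max():.2%}")
```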
Data-driven indexing mechanism for the recognition of polyhedral objects
NASA Astrophysics Data System (ADS)
McLean, Stewart; Horan, Peter; Caelli, Terry M.
1992-02-01
This paper is concerned with the problem of searching large model databases. To date, most object recognition systems have concentrated on the problem of matching using simple searching algorithms. This is quite acceptable when the number of object models is small. However, in the future, general purpose computer vision systems will be required to recognize hundreds or perhaps thousands of objects and, in such circumstances, efficient searching algorithms will be needed. The problem of searching a large model database is one which must be addressed if future computer vision systems are to be at all effective. In this paper we present a method we call data-driven feature-indexed hypothesis generation as one solution to the problem of searching large model databases.
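A minimal sketch of what feature indexing buys: an inverted index from quantised features to models replaces a linear scan of the model database with direct lookups and vote counting. The feature encoding below is hypothetical.

```python
# A minimal sketch of feature-indexed hypothesis generation: an inverted index
# from quantised features to models avoids a linear scan of the database.
from collections import defaultdict, Counter

index = defaultdict(set)  # feature key -> ids of models containing that feature

def add_model(model_id, features):
    for f in features:
        index[f].add(model_id)

# Index two polyhedral models by (vertex degree, rounded dihedral angle) pairs.
add_model("cube", {(3, 90)})
add_model("triangular_prism", {(3, 60), (3, 90)})

def hypotheses(scene_features):
    votes = Counter()
    for f in scene_features:
        for m in index.get(f, ()):
            votes[m] += 1
    return votes.most_common()  # candidate models, best-supported first

print(hypotheses({(3, 90)}))
```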
Transportation-markings database : traffic control devices. Part I 2, Volume 3, additional studies
DOT National Transportation Integrated Search
1998-01-01
The Database (Part I 1, 2, 3, 4) of TRANSPORTATION-MARKINGS: A STUDY IN COMMUNICATION MONOGRAPH SERIES draws together the several dimensions of T-M. It shares this drawing-together function with the General Classification (Part H). But, paradox...
DOT National Transportation Integrated Search
2001-01-01
The Database (Parts I 1, 2, 3, 4 of TRANSPORTATION-MARKINGS: A STUDY IN COMMUNICATION MONOGRAPH SERIES) draws together the several dimensions of T-M. It shares this drawing-together function with the General Classification (Part H). But, paradoxically...
The COG database: a tool for genome-scale analysis of protein functions and evolution
Tatusov, Roman L.; Galperin, Michael Y.; Natale, Darren A.; Koonin, Eugene V.
2000-01-01
Rational classification of proteins encoded in sequenced genomes is critical for making the genome sequences maximally useful for functional and evolutionary studies. The database of Clusters of Orthologous Groups of proteins (COGs) is an attempt at a phylogenetic classification of the proteins encoded in 21 complete genomes of bacteria, archaea and eukaryotes (http://www.ncbi.nlm.nih.gov/COG). The COGs were constructed by applying the criterion of consistency of genome-specific best hits to the results of an exhaustive comparison of all protein sequences from these genomes. The database comprises 2091 COGs that include 56–83% of the gene products from each of the complete bacterial and archaeal genomes and ~35% of those from the yeast Saccharomyces cerevisiae genome. The COG database is accompanied by the COGNITOR program that is used to fit new proteins into the COGs and can be applied to functional and phylogenetic annotation of newly sequenced genomes. PMID:10592175
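The consistency criterion can be pictured as requiring best hits to close a triangle across three genomes; the toy sketch below checks that condition on invented hit data (real COG construction works on all-against-all sequence comparison results).

```python
# A toy sketch of the consistency criterion: a protein triple seeds a cluster
# when genome-specific best hits agree around the triangle. The hits are made up.
best_hit = {  # (protein, target_genome) -> best-hit protein in that genome
    ("ecoA", "B"): "bsuA", ("ecoA", "C"): "mjaA",
    ("bsuA", "A"): "ecoA", ("bsuA", "C"): "mjaA",
    ("mjaA", "A"): "ecoA", ("mjaA", "B"): "bsuA",
}

def consistent_triangle(pa, pb, pc):
    ga, gb, gc = "A", "B", "C"
    return (best_hit.get((pa, gb)) == pb and best_hit.get((pb, gc)) == pc
            and best_hit.get((pc, ga)) == pa)

print(consistent_triangle("ecoA", "bsuA", "mjaA"))  # True -> cluster seed
```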
Classification of time series patterns from complex dynamic systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schryver, J.C.; Rao, N.
1998-07-01
An increasing availability of high-performance computing and data storage media at decreasing cost is making possible the proliferation of large-scale numerical databases and data warehouses. Numeric warehousing enterprises on the order of hundreds of gigabytes to terabytes are a reality in many fields such as finance, retail sales, process systems monitoring, biomedical monitoring, surveillance and transportation. Large-scale databases are becoming more accessible to larger user communities through the internet, web-based applications and database connectivity. Consequently, most researchers now have access to a variety of massive datasets. This trend will probably only continue to grow over the next several years. Unfortunately, the availability of integrated tools to explore, analyze and understand the data warehoused in these archives is lagging far behind the ability to gain access to the same data. In particular, locating and identifying patterns of interest in numerical time series data is an increasingly important problem for which there are few available techniques. Temporal pattern recognition poses many interesting problems in classification, segmentation, prediction, diagnosis and anomaly detection. This research focuses on the problem of classification or characterization of numerical time series data. Highway vehicles and their drivers are examples of complex dynamic systems (CDS) which are being used by transportation agencies for field testing to generate large-scale time series datasets. Tools for effective analysis of numerical time series in databases generated by highway vehicle systems are not yet available, or have not been adapted to the target problem domain. However, analysis tools from similar domains may be adapted to the problem of classification of numerical time series data.
Jones, James V.; Karl, Susan M.; Labay, Keith A.; Shew, Nora B.; Granitto, Matthew; Hayes, Timothy S.; Mauk, Jeffrey L.; Schmidt, Jeanine M.; Todd, Erin; Wang, Bronwen; Werdon, Melanie B.; Yager, Douglas B.
2015-01-01
This study has used a data-driven, geographic information system (GIS)-based method for evaluating the mineral resource potential across the large region of the CYPA. This method systematically and simultaneously analyzes geoscience data from multiple geospatially referenced datasets and uses individual subwatersheds (12-digit hydrologic unit codes or HUCs) as the spatial unit of classification. The final map output indicates an estimated potential (high, medium, low) for a given mineral deposit group and indicates the certainty (high, medium, low) of that estimate for any given subwatershed (HUC). Accompanying tables describe the data layers used in each analysis, the values assigned for specific analysis parameters, and the relative weighting of each data layer that contributes to the estimated potential and certainty determinations. Core datasets used include the U.S. Geological Survey (USGS) Alaska Geochemical Database (AGDB2), the Alaska Division of Geologic and Geophysical Surveys Web-based geochemical database, data from an anticipated USGS geologic map of Alaska, and the USGS Alaska Resource Data File. Map plates accompanying this report illustrate the mineral prospectivity for the six deposit groups across the CYPA and estimates of mineral resource potential. There are numerous areas, some of them large, rated with high potential for one or more of the selected deposit groups within the CYPA.
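A schematic of the scoring logic, with invented weights, thresholds, and HUC values: each subwatershed accumulates weighted evidence across the geospatial layers, and the total is binned into the low/medium/high potential classes.

```python
# A schematic of data-driven mineral-potential scoring per subwatershed (HUC).
# Layer weights, thresholds, and evidence values are illustrative only.
layer_weights = {"geochemistry": 0.5, "geology": 0.3, "known_occurrences": 0.2}

hucs = {  # HUC id -> per-layer evidence scores scaled to [0, 1]
    "190402031201": {"geochemistry": 0.9, "geology": 0.7, "known_occurrences": 1.0},
    "190402031305": {"geochemistry": 0.2, "geology": 0.4, "known_occurrences": 0.0},
}

def potential(evidence):
    score = sum(layer_weights[k] * v for k, v in evidence.items())
    return "high" if score >= 0.66 else "medium" if score >= 0.33 else "low"

for huc, ev in hucs.items():
    print(huc, potential(ev))
```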
ERIC Educational Resources Information Center
Shedd, Louis; Katsinas, Stephen; Bray, Nathaniel
2018-01-01
This article categorizes institutions under both the 2015 Carnegie Basic Classification system and the mission-driven classification system, and further analyzes both by the presence of a collective bargaining agreement. The goal of this article was to use the presentation of data on revenue, employment numbers, salary outlays, and the presence or…
Hierarchical classification method and its application in shape representation
NASA Astrophysics Data System (ADS)
Ireton, M. A.; Oakley, John P.; Xydeas, Costas S.
1992-04-01
In this paper we describe a technique for performing shape-based content retrieval of images from a large database. In order to be able to formulate such user-generated queries about visual objects, we have developed a hierarchical classification technique. This hierarchical classification technique enables similarity matching between objects, with the position in the hierarchy signifying the level of generality to be used in the query. The classification technique is unsupervised, robust, and general; it can be applied to any suitable parameter set. To establish the potential of this classifier for aiding visual querying, we have applied it to the classification of the 2-D outlines of leaves.
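The general mechanism can be sketched with standard agglomerative clustering, where the level at which the dendrogram is cut plays the role of query generality; the shape parameter vectors below are synthetic and the linkage choice is an assumption, not the paper's specific classifier.

```python
# A sketch of hierarchical classification over shape parameter vectors, with
# the dendrogram cut level standing in for query generality. Synthetic data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

outlines = np.random.rand(20, 8)          # 20 shapes x 8 contour parameters
tree = linkage(outlines, method="average")

broad = fcluster(tree, t=3, criterion="maxclust")    # general query: few classes
narrow = fcluster(tree, t=10, criterion="maxclust")  # specific query: many classes
print(broad, narrow, sep="\n")
```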
Forster, Samuel C; Browne, Hilary P; Kumar, Nitin; Hunt, Martin; Denise, Hubert; Mitchell, Alex; Finn, Robert D; Lawley, Trevor D
2016-01-04
The Human Pan-Microbe Communities (HPMC) database (http://www.hpmcd.org/) provides a manually curated, searchable, metagenomic resource to facilitate investigation of human gastrointestinal microbiota. Over the past decade, the application of metagenome sequencing to elucidate the microbial composition and functional capacity present in the human microbiome has revolutionized many concepts in our basic biology. When sufficient high quality reference genomes are available, whole genome metagenomic sequencing can provide direct biological insights and high-resolution classification. The HPMC database provides species level, standardized phylogenetic classification of over 1800 human gastrointestinal metagenomic samples. This is achieved by combining a manually curated list of bacterial genomes from human faecal samples with over 21000 additional reference genomes representing bacteria, viruses, archaea and fungi with manually curated species classification and enhanced sample metadata annotation. A user-friendly, web-based interface provides the ability to search for (i) microbial groups associated with health or disease state, (ii) health or disease states and community structure associated with a microbial group, (iii) the enrichment of a microbial gene or sequence and (iv) enrichment of a functional annotation. The HPMC database enables detailed analysis of human microbial communities and supports research from basic microbiology and immunology to therapeutic development in human health and disease. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Low, Gary Kim-Kuan; Ogston, Simon A; Yong, Mun-Hin; Gan, Seng-Chiew; Chee, Hui-Yee
2018-06-01
Since the introduction of the 2009 WHO dengue case classification, no literature was found regarding its effect on dengue deaths. This study evaluated the effect of the 2009 WHO dengue case classification on the dengue case fatality rate. Various databases were used to search for relevant articles since 1995. Studies included were cohort and cross-sectional studies of patients with dengue infection that reported the number of deaths or the case fatality rate. The Joanna Briggs Institute appraisal checklist was used to evaluate the risk of bias of the full texts. The studies were grouped according to the classification adopted: WHO 1997 and WHO 2009. Meta-regression was employed using a logistic transformation (log-odds) of the case fatality rate, yielding adjusted case fatality rates and odds ratios on the explanatory variables. A total of 77 studies were included in the meta-regression analysis. The case fatality rate for all studies combined was 1.14%, with a 95% confidence interval (CI) of 0.82-1.58%. The combined (unadjusted) case fatality rate for the 69 studies which adopted the WHO 1997 dengue case classification was 1.09% (95% CI 0.77-1.55%), and for the eight studies with WHO 2009 it was 1.62% (95% CI 0.64-4.02%). The unadjusted and adjusted odds ratios of case fatality using the WHO 2009 dengue case classification were 1.49 (95% CI: 0.52, 4.24) and 0.83 (95% CI: 0.26, 2.63) respectively, compared to the WHO 1997 classification; neither was statistically significant. There was an apparent increase in the trend of case fatality rate over the years 1992-2016. The WHO 2009 dengue case classification might have no effect on the case fatality rate, although the adjusted results indicated a lower rate. Future studies are required to update the meta-regression analysis and confirm the findings. Copyright © 2018 Elsevier B.V. All rights reserved.
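The logistic transformation at the heart of the meta-regression is simple to state: rates are analysed as log-odds and back-transformed. The sketch below, with invented study counts, shows the transformation and how an odds ratio arises as a difference of log-odds; the actual analysis also adjusted for covariates.

```python
# The core transformation of the meta-regression: case fatality rates are
# analysed on the log-odds scale. Study counts below are invented.
import numpy as np

deaths, cases = np.array([12, 3, 40]), np.array([900, 450, 2800])
p = deaths / cases
log_odds = np.log(p / (1 - p))                 # logistic transformation

pooled = log_odds.mean()                       # unweighted pool, for illustration
pooled_cfr = 1 / (1 + np.exp(-pooled))         # back-transform to a rate
print(f"pooled CFR ~ {pooled_cfr:.2%}")

# An odds ratio between two groups is the exponential of a log-odds difference:
odds_ratio = np.exp(log_odds[2] - log_odds[:2].mean())
print(f"odds ratio ~ {odds_ratio:.2f}")
```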
Difficulties in diagnosing Marfan syndrome using current FBN1 databases.
Groth, Kristian A; Gaustadnes, Mette; Thorsen, Kasper; Østergaard, John R; Jensen, Uffe Birk; Gravholt, Claus H; Andersen, Niels H
2016-01-01
The diagnostic criteria of Marfan syndrome (MFS) highlight the importance of a FBN1 mutation test in diagnosing MFS. As genetic sequencing becomes better, cheaper, and more accessible, the expected increase in the number of genetic tests will become evident, resulting in numerous genetic variants that need to be evaluated for disease-causing effects based on database information. The aim of this study was to evaluate genetic variants in four databases and review the relevant literature. We assessed background data on 23 common variants registered in ESP6500 and classified as causing MFS in the Human Gene Mutation Database (HGMD). We evaluated data in four variant databases (HGMD, UMD-FBN1, ClinVar, and UniProt) according to the diagnostic criteria for MFS and compared the results with the classification of each variant in the four databases. None of the 23 variants was clearly associated with MFS, even though all classifications in the databases stated otherwise. A genetic diagnosis of MFS cannot reliably be based on current variant databases because they contain incorrectly interpreted conclusions on variants. Variants must be evaluated by time-consuming review of the background material in the databases and by combining these data with expert knowledge on MFS. This is a major problem because we expect even more genetic test results in the near future as a result of the reduced cost and process time for next-generation sequencing.Genet Med 18 1, 98-102.
Cubilla, Antonio L; Velazquez, Elsa F; Amin, Mahul B; Epstein, Jonathan; Berney, Daniel M; Corbishley, Cathy M
2018-05-01
The International Society of Urological Pathology (ISUP) held an expert-driven penile cancer conference in Boston in March 2015, which focused on the new World Health Organisation (WHO) classification of penile cancer: human papillomavirus (HPV)-related tumours and histological grading. The conference was preceded by an online survey of the ISUP members, and the results were used to initiate discussions. Because of the rarity of penile tumours, this was not a consensus but an expert-driven conference aimed at assisting pathologists who do not see these tumours on a regular basis. After a justification for the novel separation of penile squamous cell carcinomas into HPV-related and non-HPV-related-carcinomas, the histological classification of penile carcinoma was proposed; this system was also accepted subsequently by the WHO for subtyping of penile carcinomas (2016). A description of HPV-related neoplasms, which may be recognised by their histological features, was presented, and p16 was recommended as a surrogate indicator of HPV. A three-tier grading system was recommended for penile squamous carcinomas; this was also adopted by the WHO (2016). Many of the distinctive histological subtypes of squamous cell carcinoma of the penis are associated with distinct grades, based on the squamous cell carcinoma subtype histological features. © 2017 John Wiley & Sons Ltd.
Cohen, J. I.; Kimura, H.; Nakamura, S.; Ko, Y.-H.; Jaffe, E. S.
2009-01-01
Background: Recently novel Epstein–Barr virus (EBV) lymphoproliferative diseases (LPDs) have been identified in non-immunocompromised hosts, both in Asia and Western countries. These include aggressive T-cell and NK-cell LPDs often subsumed under the heading of chronic active Epstein–Barr virus (CAEBV) infection and EBV-driven B-cell LPDs mainly affecting the elderly. Design: To better define the pathogenesis, classification, and treatment of these disorders, participants from Asia, The Americas, Europe, and Australia presented clinical and experimental data at an international meeting. Results: The term systemic EBV-positive T-cell LPD, as adopted by the WHO classification, is preferred as a pathological classification over CAEBV (the favored clinical term) for those cases that are clonal. The disease has an aggressive clinical course, but may arise in the background of CAEBV. Hydroa vacciniforme (HV) and HV-like lymphoma represent a spectrum of clonal EBV-positive T-cell LPDs, which have a more protracted clinical course; spontaneous regression may occur in adult life. Severe mosquito bite allergy is a related syndrome usually of NK cell origin. Immune senescence in the elderly is associated with both reactive and neoplastic EBV-driven LPDs, including EBV-positive diffuse large B-cell lymphomas. Conclusion: The participants proposed an international consortium to facilitate further clinical and biological studies of novel EBV-driven LPDs. PMID:19515747
Approach for Text Classification Based on the Similarity Measurement between Normal Cloud Models
Dai, Jin; Liu, Xin
2014-01-01
The similarity between objects is a core research area of data mining. In order to reduce the interference of the uncertainty of natural language, a similarity measurement between normal cloud models is adopted for text classification research. On this basis, a novel text classifier based on cloud concept jumping up (CCJU-TC) is proposed. It can efficiently accomplish the conversion between qualitative concepts and quantitative data. Through the conversion from a text set to a text information table based on the VSM model, the qualitative text concepts extracted from the same category are jumped up into a whole category concept. According to the cloud similarity between the test text and each category concept, the test text is assigned to the most similar category. A comparison among different text classifiers on different feature selection sets fully shows that not only does CCJU-TC adapt well to different text features, but its classification performance is also better than that of the traditional classifiers. PMID:24711737
Outcome Measurement in the Treatment of Spasmodic Dysphonia: A Systematic Review of the Literature.
Rumbach, Anna; Aiken, Patrick; Novakovic, Daniel
2018-04-11
The aim of this review was to systematically identify all available studies reporting outcomes measures to assess treatment outcomes for people with spasmodic dysphonia (SD). Full-text journal articles were identified through searches of PubMed, Embase, CINAHL, and Cochrane databases and hand searching of journals. A total of 4,714 articles were retrieved from searching databases; 1,165 were duplicates. Titles and abstracts of 3,549 were screened, with 171 being selected for full-text review. During full-text review, 101 articles were deemed suitable for inclusion. An additional 24 articles were identified as suitable for inclusion through a hand search of reference lists. Data were extracted from 125 studies. A total of 220 outcome measures were identified. Considered in reference to the World Health Organization International Classification of Functioning, Disability and Health (ICF), the majority of outcomes were measured at a Body Function level (n = 212, 96%). Outcomes that explored communication and participation in everyday life and attitudes toward communication (ie, activity and participation domains) were infrequent (n = 8; 4%). Quality of life, a construct not measured within the ICF, was also captured by four outcome measures. No instruments evaluating communication partners' perspectives or burden/disability were identified. The outcome measures used in SD treatment studies are many and varied. The outcome measures identified predominately measure constructs within the Body Functions component of the ICF. In order to facilitate data synthesis across trials, the development of a core outcome set is recommended. Crown Copyright © 2018. Published by Elsevier Inc. All rights reserved.
Classification of hepatocellular carcinoma stages from free-text clinical and radiology reports
Yim, Wen-wai; Kwan, Sharon W; Johnson, Guy; Yetisgen, Meliha
2017-01-01
Cancer stage information is important for clinical research. However, it is not always explicitly noted in electronic medical records. In this paper, we present our work on automatic classification of hepatocellular carcinoma (HCC) stages from free-text clinical and radiology notes. To accomplish this, we defined 11 stage parameters used in the three HCC staging systems: American Joint Committee on Cancer (AJCC), Barcelona Clinic Liver Cancer (BCLC), and Cancer of the Liver Italian Program (CLIP). After aggregating stage parameters to the patient level, the final stage classifications were achieved using an expert-created decision logic. Each stage parameter relevant for staging was extracted using several classification methods, e.g., sentence classification and automatic information structuring, to identify and normalize text as cancer stage parameter values. Stage parameter extraction for the test set performed at 0.81 F1. Cancer stage prediction for the AJCC, BCLC, and CLIP stage classifications performed at 0.55, 0.50, and 0.43 F1, respectively.
NASA Astrophysics Data System (ADS)
Ceballos, G. A.; Hernández, L. F.
2015-04-01
Objective. The classical ERP-based speller, or P300 Speller, is one of the most commonly used paradigms in the field of Brain Computer Interfaces (BCI). Several alterations to the visual stimuli presentation system have been developed to avoid unfavorable effects elicited by adjacent stimuli. However, there has been little, if any, regard to useful information contained in responses to adjacent stimuli about spatial location of target symbols. This paper aims to demonstrate that combining the classification of non-target adjacent stimuli with standard classification (target versus non-target) significantly improves classical ERP-based speller efficiency. Approach. Four SWLDA classifiers were trained and combined with the standard classifier: the lower row, upper row, right column and left column classifiers. This new feature extraction procedure and the classification method were carried out on three open databases: the UAM P300 database (Universidad Autonoma Metropolitana, Mexico), BCI competition II (dataset IIb) and BCI competition III (dataset II). Main results. The inclusion of the classification of non-target adjacent stimuli improves target classification in the classical row/column paradigm. A gain in mean single trial classification of 9.6% and an overall improvement of 25% in simulated spelling speed was achieved. Significance. We have provided further evidence that the ERPs produced by adjacent stimuli present discriminable features, which could provide additional information about the spatial location of intended symbols. This work promotes the searching of information on the peripheral stimulation responses to improve the performance of emerging visual ERP-based spellers.
Manosroi, Jiradej; Sainakham, Mathukorn; Manosroi, Worapaka; Manosroi, Aranya
2012-05-07
ETHNOPHARMACOLOGICAL RELEVANCE: Traditional medicines have long been used by the Thai people. Several medicinal recipes prepared from a mixture of plants are often used by traditional medicinal practitioners for the treatment of many diseases including cancer. The recipes collected from Thai medicinal text books were recorded in the MANOSROI II database. Anticancer recipes were searched and selected by a computer program using the recipe indication keywords Ma-reng and San, which mean cancer in Thai, from the database for anticancer activity investigation. To investigate the anti-cancer activities of the Thai medicinal plant recipes selected from the "MANOSROI II" database. Anti-proliferative and apoptotic activities of extracts from 121 recipes selected from 56,137 recipes in the Thai medicinal plant recipe "MANOSROI II" database were investigated in two cancer cell lines, human mouth epidermal carcinoma (KB) and human colon adenocarcinoma (HT-29), using the sulforhodamine B (SRB) assay and the acridine orange (AO) and ethidium bromide (EB) staining technique, respectively. In the SRB assay, recipes NE028 and S003 gave the highest anti-proliferative activity on KB and HT-29, with IC(50) values of 2.48±0.24 and 6.92±0.49 μg/ml, respectively. In the AO/EB staining assay, recipes S016 and NE028 exhibited the highest apoptotic induction in the KB and HT-29 cell lines, respectively. This study has demonstrated that the three Thai medicinal plant recipes selected from the "MANOSROI II" database (NE028, S003 and S016) gave active anti-cancer activities according to the NCI classification and can be further developed for anti-cancer treatment. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.
Chen, Chuyun; Hong, Jiaming; Zhou, Weilin; Lin, Guohua; Wang, Zhengfei; Zhang, Qufei; Lu, Cuina; Lu, Lihong
2017-07-12
To construct a knowledge platform of acupuncture ancient books based on data mining technology, and to provide a retrieval service for users. The Oracle 10g database was applied and JAVA was selected as the development language; based on the standard library and ancient books database established by manual entry, a variety of data mining technologies, including word segmentation, part-of-speech tagging, dependency analysis, rule extraction, similarity calculation, ambiguity analysis, and supervised classification, were applied to achieve automatic text extraction from the ancient books; finally, through association mining and decision analysis, comprehensive and intelligent analysis of disease and symptom, meridians, acupoints, and rules of acupuncture and moxibustion in acupuncture ancient books was realized, and a retrieval service was provided for users through a browser/server (B/S) structure. The platform realized full-text retrieval, word frequency analysis and association analysis; when diseases or acupoints were searched, the frequencies of meridians, acupoints (diseases) and techniques were presented from high to low; meanwhile, the support degree and confidence coefficient between disease and acupoints (special acupoints), between acupoints within a prescription, and between disease or acupoints and technique were presented. This platform of acupuncture ancient books based on data mining technology can be used as a reference for the selection of disease, meridian and acupoint in clinical treatment and in the education of acupuncture and moxibustion.
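The support degree and confidence coefficient the platform reports are the standard association-mining quantities; the sketch below computes them over a toy set of invented acupuncture prescriptions.

```python
# Standard association-mining quantities (support and confidence) computed
# over invented prescriptions, illustrating what the platform reports.
prescriptions = [
    {"headache", "LI4", "GV20"},
    {"headache", "LI4", "TE5"},
    {"headache", "GV20"},
    {"insomnia", "HT7"},
]

def support(itemset):
    return sum(itemset <= p for p in prescriptions) / len(prescriptions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

print(support({"headache", "LI4"}))        # 0.5
print(confidence({"headache"}, {"GV20"}))  # 2/3
```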
Knowledge Discovery from Databases: An Introductory Review.
ERIC Educational Resources Information Center
Vickery, Brian
1997-01-01
Introduces new procedures being used to extract knowledge from databases and discusses rationales for developing knowledge discovery methods. Methods are described for such techniques as classification, clustering, and the detection of deviations from pre-established norms. Examines potential uses of knowledge discovery in the information field.…
CSE database: extended annotations and new recommendations for ECG software testing.
Smíšek, Radovan; Maršánová, Lucie; Němcová, Andrea; Vítek, Martin; Kozumplík, Jiří; Nováková, Marie
2017-08-01
Nowadays, cardiovascular diseases represent the most common cause of death in western countries. Among various examination techniques, electrocardiography (ECG) is still a highly valuable tool used for the diagnosis of many cardiovascular disorders. In order to diagnose a person based on ECG, cardiologists can use automatic diagnostic algorithms. Research in this area is still necessary. In order to compare various algorithms correctly, it is necessary to test them on standard annotated databases, such as the Common Standards for Quantitative Electrocardiography (CSE) database. According to Scopus, the CSE database is the second most cited standard database. There were two main objectives in this work. First, new diagnoses were added to the CSE database, which extended its original annotations. Second, new recommendations for diagnostic software quality estimation were established. The ECG recordings were diagnosed by five new cardiologists independently, and in total, 59 different diagnoses were found. Such a large number of diagnoses is unique, even in terms of standard databases. Based on the cardiologists' diagnoses, a four-round consensus (4R consensus) was established. Such a 4R consensus means a correct final diagnosis, which should ideally be the output of any tested classification software. The accuracy of the cardiologists' diagnoses compared with the 4R consensus was the basis for the establishment of accuracy recommendations. The accuracy was determined in terms of sensitivity = 79.20-86.81%, positive predictive value = 79.10-87.11%, and the Jaccard coefficient = 72.21-81.14%, respectively. Within these ranges, the accuracy of the software is comparable with the accuracy of cardiologists. The accuracy quantification of the correct classification is unique. Diagnostic software developers can objectively evaluate the success of their algorithm and promote its further development. The annotations and recommendations proposed in this work will allow for faster development and testing of classification software. As a result, this might facilitate cardiologists' work and lead to faster diagnoses and earlier treatment.
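The three reported agreement measures relate simply when diagnoses are treated as label sets; here is a sketch for a single recording (the study itself aggregates over many recordings and cardiologists, so this is illustrative only).

```python
# How sensitivity, positive predictive value, and the Jaccard coefficient
# relate for one recording's diagnosis sets. The labels below are invented.
def agreement(predicted, consensus):
    tp = len(predicted & consensus)
    sens = tp / len(consensus)             # sensitivity (recall)
    ppv = tp / len(predicted)              # positive predictive value
    jac = tp / len(predicted | consensus)  # Jaccard coefficient
    return sens, ppv, jac

consensus = {"sinus rhythm", "LVH", "old MI"}
software = {"sinus rhythm", "LVH", "LBBB"}
print(agreement(software, consensus))      # (0.667, 0.667, 0.5)
```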
Detection and Evaluation of Cheating on College Exams Using Supervised Classification
ERIC Educational Resources Information Center
Cavalcanti, Elmano Ramalho; Pires, Carlos Eduardo; Cavalcanti, Elmano Pontes; Pires, Vládia Freire
2012-01-01
Text mining has been used for various purposes, such as document classification and extraction of domain-specific information from text. In this paper we present a study in which text mining methodology and algorithms were properly employed for academic dishonesty (cheating) detection and evaluation on open-ended college exams, based on document…
ERIC Educational Resources Information Center
Beghtol, Clare
1986-01-01
Explicates a definition and theory of "aboutness" and aboutness analysis developed by text linguist van Dijk; explores implications of text linguistics for bibliographic classification theory; suggests the elements that a theory of the cognitive process of classifying documents needs to encompass; and delineates how people identify…
Comparisons and Selections of Features and Classifiers for Short Text Classification
NASA Astrophysics Data System (ADS)
Wang, Ye; Zhou, Zhi; Jin, Shan; Liu, Debin; Lu, Mi
2017-10-01
Short text is considerably different from traditional long text documents due to its shortness and conciseness, which somehow hinders the applications of conventional machine learning and data mining algorithms in short text classification. According to traditional artificial intelligence methods, we divide short text classification into three steps, namely preprocessing, feature selection and classifier comparison. In this paper, we have illustrated step-by-step how we approach our goals. Specifically, in feature selection, we compared the performance and robustness of the four methods of one-hot encoding, tf-idf weighting, word2vec and paragraph2vec, and in the classification part, we deliberately chose and compared Naive Bayes, Logistic Regression, Support Vector Machine, K-nearest Neighbor and Decision Tree as our classifiers. Then, we compared and analysed the classifiers horizontally with each other and vertically with feature selections. Regarding the datasets, we crawled more than 400,000 short text files from Shanghai and Shenzhen Stock Exchanges and manually labeled them into two classes, the big and the small. There are eight labels in the big class, and 59 labels in the small class.
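The pipeline shape described (preprocessing, then features, then classifier comparison) condenses naturally into scikit-learn; the snippet below is an illustration on placeholder English texts, not the authors' Chinese stock-exchange corpus or their exact parameterisation.

```python
# A condensed sketch of the tf-idf feature stage and classifier comparison
# described in the abstract, on placeholder data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

texts = ["profit warning issued", "dividend declared", "merger announced",
         "profit rises", "warning on earnings", "dividend increased"] * 10
labels = [0, 1, 1, 1, 0, 1] * 10

X = TfidfVectorizer().fit_transform(texts)  # tf-idf weighting
for clf in (MultinomialNB(), LogisticRegression(max_iter=1000), LinearSVC()):
    score = cross_val_score(clf, X, labels, cv=3).mean()
    print(type(clf).__name__, round(score, 3))
```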
Academic Journal Embargoes and Full Text Databases.
ERIC Educational Resources Information Center
Brooks, Sam
2003-01-01
Documents the reasons for embargoes of academic journals in full text databases (i.e., publisher-imposed delays on the availability of full text content) and provides insight regarding common misconceptions. Tables present data on selected journals covering a cross-section of subjects and publishers and comparing two full text business databases.…
Negative Example Selection for Protein Function Prediction: The NoGO Database
Youngs, Noah; Penfold-Brown, Duncan; Bonneau, Richard; Shasha, Dennis
2014-01-01
Negative examples – genes that are known not to carry out a given protein function – are rarely recorded in genome and proteome annotation databases, such as the Gene Ontology database. Negative examples are required, however, for several of the most powerful machine learning methods for integrative protein function prediction. Most protein function prediction efforts have relied on a variety of heuristics for the choice of negative examples. Determining the accuracy of methods for negative example prediction is itself a non-trivial task, given that the Open World Assumption as applied to gene annotations rules out many traditional validation metrics. We present a rigorous comparison of these heuristics, utilizing a temporal holdout, and a novel evaluation strategy for negative examples. We add to this comparison several algorithms adapted from Positive-Unlabeled learning scenarios in text-classification, which are the current state of the art methods for generating negative examples in low-density annotation contexts. Lastly, we present two novel algorithms of our own construction, one based on empirical conditional probability, and the other using topic modeling applied to genes and annotations. We demonstrate that our algorithms achieve significantly fewer incorrect negative example predictions than the current state of the art, using multiple benchmarks covering multiple organisms. Our methods may be applied to generate negative examples for any type of method that deals with protein function, and to this end we provide a database of negative examples in several well-studied organisms, for general use (The NoGO database, available at: bonneaulab.bio.nyu.edu/nogo.html). PMID:24922051
Classification of Palmprint Using Principal Line
NASA Astrophysics Data System (ADS)
Prasad, Munaga V. N. K.; Kumar, M. K. Pramod; Sharma, Kuldeep
In this paper, a new classification scheme for palmprints is proposed. Palmprint is one of the reliable physiological characteristics that can be used to authenticate an individual. Palmprint classification provides an important indexing mechanism in a very large palmprint database. Here, the palmprint database is initially categorized into two groups, a right-hand group and a left-hand group. Each group is then further classified based on the distance traveled by the principal line, i.e., the heart line. During pre-processing, a rectangular Region of Interest (ROI) in which only the heart line is present is extracted. The ROI is then divided into 6 regions, and the palmprint is classified according to the regions the heart line traverses. Consequently, our scheme allows 64 categories for each group, forming a total of 128 possible categories. The technique proposed in this paper includes only 15 such categories, and it classifies not more than 20.96% of the images into a single category.
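The arithmetic behind the 64 categories: presence or absence of the heart line in each of the 6 regions gives one bit per region, hence 2^6 = 64 indices per hand group and 128 overall. A sketch with an assumed region numbering:

```python
# Why 6 regions give 64 categories: one bit per region yields a 6-bit index.
# The region numbering and traversal pattern below are assumptions.
def category_index(regions_crossed, n_regions=6):
    idx = 0
    for r in regions_crossed:     # each r in 0..n_regions-1
        idx |= 1 << r
    return idx                    # 0..63

print(category_index([0, 1, 2]))  # heart line crossing regions 0-2 -> index 7
print(2 ** 6)                     # 64 categories per hand group, 128 in total
```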
Automated Patent Categorization and Guided Patent Search using IPC as Inspired by MeSH and PubMed.
Eisinger, Daniel; Tsatsaronis, George; Bundschus, Markus; Wieneke, Ulrich; Schroeder, Michael
2013-04-15
Document search on PubMed, the pre-eminent database for biomedical literature, relies on the annotation of its documents with relevant terms from the Medical Subject Headings ontology (MeSH) for improving recall through query expansion. Patent documents are another important information source, though they are considerably less accessible. One option to expand patent search beyond pure keywords is the inclusion of classification information: since every patent is assigned at least one class code, it should be possible for these assignments to be automatically used in a similar way as the MeSH annotations in PubMed. In order to develop a system for this task, it is necessary to have a good understanding of the properties of both classification systems. This report describes our comparative analysis of MeSH and the main patent classification system, the International Patent Classification (IPC). We investigate the hierarchical structures as well as the properties of the terms/classes respectively, and we compare the assignment of IPC codes to patents with the annotation of PubMed documents with MeSH terms. Our analysis shows a strong structural similarity of the hierarchies, but significant differences of terms and annotations. The low number of IPC class assignments and the lack of occurrences of class labels in patent texts imply that current patent search is severely limited. To overcome these limits, we evaluate a method for the automated assignment of additional classes to patent documents, and we propose a system for guided patent search based on the use of class co-occurrence information and external resources. PMID:23734562
Moretti, Marta; Alves, Ines; Maxwell, Gregor
2012-02-01
This article presents the outcome of a systematic literature review exploring the applicability of the International Classification of Functioning, Disability, and Health (ICF) and its Children and Youth version (ICF-CY) at various levels and in processes within the education systems in different countries. A systematic database search using selected search terms has been used. The selection of studies was then refined further using four protocols: inclusion and exclusion protocols at abstract and full text and extraction levels along with a quality protocol. Studies exploring the direct relationship between education and the ICF/ICF-CY were sought.As expected, the results show a strong presence of studies from English-speaking countries, namely from Europe and North America. The articles were mainly published in noneducational journals. The most used ICF/ICF-CY components are activity and participation, participation, and environmental factors. From the analysis of the papers included, the results show that the ICF/ICF-CY is currently used as a research tool, theoretical framework, and tool for implementing educational processes. The ICF/ICF-CY can provide a useful language to the education field where there is currently a lot of disparity in theoretical, praxis, and research issues. Although the systematic literature review does not report a high incidence of the use of the ICF/ICF-CY in education, the results show that the ICF/ICF-CY model and classification have potential to be applied in education systems.
An etiologic classification of autism spectrum disorders.
Gabis, Lidia V; Pomeroy, John
2014-05-01
Autism spectrum disorders (ASD) represent a common phenotype related to multiple etiologies, such as genetic, brain injury (e.g., prematurity), environmental (e.g., viral, toxic), multiple or unknown causes. To devise a clinical classification of children diagnosed with ASD according to etiologic workup. Children diagnosed with ASD (n = 436) from two databases were divided into groups of symptomatic cryptogenic or idiopathic, and variables within each database and diagnostic category were compared. By analyzing the two separate databases, 5.4% of the children were classified as symptomatic, 27% as cryptogenic and 67.75% as idiopathic. Among other findings, the entire symptomatic group demonstrated language delays, but almost none showed evidence for regression. Our results indicate similarities between the idiopathic and cryptogenic subgroups in most of the examined variables, and mutual differences from the symptomatic subgroup. The similarities between the first two subgroups support prior evidence that most perinatal factors and minor physical anomalies do not contribute to the development of core symptoms of autism. Differences in gender and clinical and diagnostic features were found when etiology was used to create subtypes of ASD. This classification could have heuristic importance in the search for an autism gene(s).
YTPdb: a wiki database of yeast membrane transporters.
Brohée, Sylvain; Barriot, Roland; Moreau, Yves; André, Bruno
2010-10-01
Membrane transporters constitute one of the largest functional categories of proteins in all organisms. In the yeast Saccharomyces cerevisiae, this represents about 300 proteins (approximately 5% of the proteome). We here present the Yeast Transport Protein database (YTPdb), a user-friendly collaborative resource dedicated to the precise classification and annotation of yeast transporters. YTPdb exploits an evolution of the MediaWiki web engine used for popular collaborative databases like Wikipedia, allowing every registered user to edit the data in a user-friendly manner. Proteins in YTPdb are classified on the basis of functional criteria such as subcellular location or their substrate compounds. These classifications are hierarchical, allowing queries to be performed at various levels, from highly specific (e.g. ammonium as a substrate or the vacuole as a location) to broader (e.g. cation as a substrate or inner membranes as location). Other resources accessible for each transporter via YTPdb include post-translational modifications, Km values, a permanently updated bibliography, and a hierarchical classification into families. The YTPdb concept can be extrapolated to other organisms and could even be applied to other functional categories of proteins. YTPdb is accessible at http://homes.esat.kuleuven.be/ytpdb/. Copyright © 2010 Elsevier B.V. All rights reserved.
Sidek, Khairul; Khali, Ibrahim
2012-01-01
In this paper, a person identification mechanism implemented with a Cardioid-based graph using the electrocardiogram (ECG) is presented. The Cardioid-based graph has given reasonably good classification accuracy in differentiating between individuals. However, the current feature extraction method using Euclidean distance can be further improved by using the Mahalanobis distance measurement, producing extracted coefficients that take into account the correlations of the data set. Identification is then done by applying these extracted features to a Radial Basis Function Network. A total of 30 ECG recordings from the MIT-BIH Normal Sinus Rhythm database (NSRDB) and the MIT-BIH Arrhythmia database (MITDB) were used for development and evaluation purposes. Our experimental results suggest that the proposed feature extraction method significantly increased the classification performance for subjects in both databases, with accuracy rising from 97.50% to 99.80% in NSRDB and from 96.50% to 99.40% in MITDB. High sensitivity, specificity and positive predictive values of 99.17%, 99.91% and 99.23% for NSRDB and 99.30%, 99.90% and 99.40% for MITDB also validate the proposed method. This result also indicates that the right feature extraction technique plays a vital role in determining the persistency of the classification accuracy for a Cardioid-based person identification mechanism.
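The substitution argued for above is easy to see in code: the Mahalanobis distance whitens by the feature covariance, so correlated features are not double-counted as they are under Euclidean distance. Synthetic feature vectors stand in for the Cardioid coefficients here.

```python
# Euclidean vs Mahalanobis distance on correlated features; the synthetic
# 2-D features below stand in for Cardioid-graph coefficients.
import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis

rng = np.random.default_rng(0)
features = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=200)
VI = np.linalg.inv(np.cov(features.T))  # inverse covariance of the data set

u, v = features[0], features[1]
print(euclidean(u, v), mahalanobis(u, v, VI))
```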
This study assessed how landcover classification affects associations between landscape characteristics and Lyme disease rate. Landscape variables were derived from the National Land Cover Database (NLCD), including native classes (e.g., deciduous forest, developed low intensity)...
ERIC Educational Resources Information Center
McConky, Katie Theresa
2013-01-01
This work covers topics in event coreference and event classification from spoken conversation. Event coreference is the process of identifying descriptions of the same event across sentences, documents, or structured databases. Existing event coreference work focuses on sentence similarity models or feature based similarity models requiring slot…
The Nonprofit Program Classification System: Increasing Understanding of the Nonprofit Sector.
ERIC Educational Resources Information Center
Romeo, Sheryl; Lampkin, Linda; Twombly, Eric
2001-01-01
The Nonprofit Program Classification System being developed by the National Center for Charitable Statistics (NCCS) provides a way to enrich the information available on nonprofits and utilize the newly available NCCS/PRI National Nonprofit Organization database from the IRS Forms 990 filed annually by charities. It provides a method to organize…
2009-10-01
parameters for a large number of species. These authors provide many sample calculations with the JCZS database incorporated in CHEETAH 2.0.
Information analysis of a spatial database for ecological land classification
NASA Technical Reports Server (NTRS)
Davis, Frank W.; Dozier, Jeff
1990-01-01
An ecological land classification was developed for a complex region in southern California using geographic information system techniques of map overlay and contingency table analysis. Land classes were identified by mutual information analysis of vegetation pattern in relation to other mapped environmental variables. The analysis was weakened by map errors, especially errors in the digital elevation data. Nevertheless, the resulting land classification was ecologically reasonable and performed well when tested with higher quality data from the region.
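The core of mutual information analysis between two co-registered maps can be computed directly from the pixel label pairs; the class labels below are invented stand-ins for the vegetation and environmental layers.

```python
# Mutual information between a vegetation map and an environmental layer:
# how much knowing one class reduces uncertainty about the other.
# The pixel labels below are invented stand-ins.
from sklearn.metrics import mutual_info_score

vegetation = ["chaparral", "chaparral", "oak", "oak", "grass", "grass"]
elevation_zone = ["low", "low", "mid", "mid", "low", "high"]

print(mutual_info_score(vegetation, elevation_zone))  # in nats
```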
Thematic accuracy of the National Land Cover Database (NLCD) 2001 land cover for Alaska
Selkowitz, D.J.; Stehman, S.V.
2011-01-01
The National Land Cover Database (NLCD) 2001 Alaska land cover classification is the first 30-m resolution land cover product available covering the entire state of Alaska. The accuracy assessment of the NLCD 2001 Alaska land cover classification employed a geographically stratified three-stage sampling design to select the reference sample of pixels. Reference land cover class labels were determined via fixed wing aircraft, as the high resolution imagery used for determining the reference land cover classification in the conterminous U.S. was not available for most of Alaska. Overall thematic accuracy for the Alaska NLCD was 76.2% (s.e. 2.8%) at Level II (12 classes evaluated) and 83.9% (s.e. 2.1%) at Level I (6 classes evaluated) when agreement was defined as a match between the map class and either the primary or alternate reference class label. When agreement was defined as a match between the map class and primary reference label only, overall accuracy was 59.4% at Level II and 69.3% at Level I. The majority of classification errors occurred at Level I of the classification hierarchy (i.e., misclassifications were generally to a different Level I class, not to a Level II class within the same Level I class). Classification accuracy was higher for more abundant land cover classes and for pixels located in the interior of homogeneous land cover patches. © 2011.
A neural network approach to cloud classification
NASA Technical Reports Server (NTRS)
Lee, Jonathan; Weger, Ronald C.; Sengupta, Sailes K.; Welch, Ronald M.
1990-01-01
It is shown that, using high-spatial-resolution data, very high cloud classification accuracies can be obtained with a neural network approach. A texture-based neural network classifier using only single-channel visible Landsat MSS imagery achieves an overall cloud identification accuracy of 93 percent. Cirrus can be distinguished from boundary layer cloudiness with an accuracy of 96 percent, without the use of an infrared channel. Stratocumulus is retrieved with an accuracy of 92 percent, cumulus at 90 percent. The use of the neural network does not improve cirrus classification accuracy. Rather, its main effect is in the improved separation between stratocumulus and cumulus cloudiness. While most cloud classification algorithms rely on linear parametric schemes, the present study is based on a nonlinear, nonparametric four-layer neural network approach. A three-layer neural network architecture, the nonparametric K-nearest neighbor approach, and the linear stepwise discriminant analysis procedure are compared. A key finding is that significantly higher accuracies are attained with the nonparametric approaches using only 20 percent of the database as training data, compared with 67 percent of the database in the linear approach.
Pattern Classifications Using Grover's and Ventura's Algorithms in a Two-qubits System
NASA Astrophysics Data System (ADS)
Singh, Manu Pratap; Radhey, Kishori; Rajput, B. S.
2018-03-01
Carrying out the classification of patterns in a two-qubit system by separately using Grover's and Ventura's algorithms on different possible superpositions, it has been shown that the exclusion superposition and the phase-invariance superposition are the most suitable search states obtained from two-pattern start-states and one-pattern start-states, respectively, for the simultaneous classification of patterns. The higher effectiveness of Grover's algorithm for large search states has been verified, but the higher effectiveness of Ventura's algorithm for smaller databases is contradicted in two-qubit systems, and it has been demonstrated that unknown patterns (not present in the concerned database) are classified more efficiently than known ones (present in the database) in both algorithms. It has also been demonstrated that the different states of the Singh-Rajput MES obtained from the corresponding self-single-pattern start-states are the most suitable search states for the classification of the patterns |00>, |01>, |10> and |11>, respectively, on the second iteration of Grover's method or the first operation of Ventura's algorithm.
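For orientation, one Grover iteration on a two-qubit register can be simulated in a few lines of linear algebra; this generic example amplifies |11> from the uniform superposition and is not the specific Singh-Rajput search states or Ventura's variant discussed above.

```python
# A numpy simulation of one Grover iteration on two qubits, amplifying |11>
# from the uniform superposition; a generic illustration only.
import numpy as np

n = 4                                     # 2 qubits -> 4 basis states
state = np.full(n, 1 / np.sqrt(n))        # uniform superposition
oracle = np.diag([1, 1, 1, -1])           # flips the phase of |11>
diffusion = 2 * np.full((n, n), 1 / n) - np.eye(n)  # inversion about the mean

state = diffusion @ (oracle @ state)
print(np.abs(state) ** 2)                 # |11> probability amplified to 1.0
```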
Kılıç, Sarah S; Kılıç, Suat; Crippen, Meghan M; Varughese, Denny; Eloy, Jean Anderson; Baredes, Soly; Mahmoud, Omar M; Park, Richard Chan Woo
2018-04-01
Few studies have examined the frequency and survival implications of clinicopathologic stage discrepancy in oral cavity squamous cell carcinoma (SCC). Oral cavity SCC cases with full pathologic staging information were identified in the National Cancer Database (NCDB). Clinical and pathologic stages were compared. Multivariate logistic regressions were performed to identify factors associated with stage discrepancy. There were 9110 cases identified, of which 67.3% of the cases were stage concordant, 19.9% were upstaged, and 12.8% were downstaged. The N classification discordance (28.5%) was more common than T classification discordance (27.6%). In cases of T classification discordance, downstaging is more common than upstaging (15.4% vs 12.1% of cases), but in cases of N classification discordance, the reverse is true; upstaging is much more common than downstaging (20.1 vs 8.4% of cases). Clinicopathologic stage discrepancy in oral cavity SCC is a common phenomenon that is associated with a number of clinical factors and has survival implications. © 2018 Wiley Periodicals, Inc.
NASA Astrophysics Data System (ADS)
Strausberger, Donald J.
Several Radar Target Identification (RTI) techniques have been developed at The Ohio State University in recent years. Using the ElectroScience Laboratory compact range, a large database of coherent RCS measurements has been constructed for several types of targets (aircraft, ships, and ground vehicles) at a variety of polarizations, aspect angles, and frequency bands. This extensive database has been used to analyze the performance of several different classification algorithms through computer simulations. In order to optimize classification performance, it was concluded that the radar frequency range should lie in the Rayleigh-resonance region, where the wavelength is on the order of or larger than the target size. For aircraft and ships with general dimensions on the order of 10 to 100 meters, it is apparent that the High Frequency (HF) band provides optimal classification performance. Since existing HF radars are currently used for the detection and tracking of aircraft and ships of these dimensions, it is natural to further investigate the possibility of using these existing radars as the measurement devices in a radar target classification system.
Thyroid Nodule Classification in Ultrasound Images by Fine-Tuning Deep Convolutional Neural Network.
Chi, Jianning; Walia, Ekta; Babyn, Paul; Wang, Jimmy; Groot, Gary; Eramian, Mark
2017-08-01
With many thyroid nodules being incidentally detected, it is important to identify as many malignant nodules as possible while excluding those that are highly likely to be benign from fine needle aspiration (FNA) biopsies or surgeries. This paper presents a computer-aided diagnosis (CAD) system for classifying thyroid nodules in ultrasound images. We use a deep learning approach to extract features from thyroid ultrasound images. Ultrasound images are pre-processed to calibrate their scale and remove artifacts. A pre-trained GoogLeNet model is then fine-tuned using the pre-processed image samples, which leads to superior feature extraction. The extracted features of the thyroid ultrasound images are sent to a Cost-sensitive Random Forest classifier to classify the images into "malignant" and "benign" cases. The experimental results show that the proposed fine-tuned GoogLeNet model achieves excellent classification performance, attaining 98.29% classification accuracy, 99.10% sensitivity, and 93.90% specificity for the images in an open access database (Pedraza et al. 16), and 96.34% classification accuracy, 86% sensitivity, and 99% specificity for the images in our local health region database.
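The pipeline described above, fine-tuning a pretrained CNN and handing its features to a cost-sensitive classifier, can be sketched as follows. This is a minimal illustration, not the authors' code: torchvision's GoogLeNet stands in for the paper's model, and the class weights, labels, and training-loop placeholders are hypothetical.

```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.ensemble import RandomForestClassifier

# ImageNet-pretrained GoogLeNet with the head replaced for benign/malignant.
net = models.googlenet(weights="IMAGENET1K_V1")
net.fc = nn.Linear(net.fc.in_features, 2)
# ... fine-tune `net` on the pre-processed ultrasound images here ...

def extract_features(batch: torch.Tensor) -> torch.Tensor:
    """Penultimate-layer (1024-d) activations used as feature vectors."""
    net.eval()
    head, net.fc = net.fc, nn.Identity()   # temporarily expose features
    with torch.no_grad():
        feats = net(batch)
    net.fc = head
    return feats

# Cost-sensitive random forest: the 5x weight on the malignant class is a
# hypothetical stand-in for the paper's cost settings.
forest = RandomForestClassifier(n_estimators=500,
                                class_weight={0: 1.0, 1: 5.0}, random_state=0)
# forest.fit(extract_features(train_images).numpy(), train_labels)
# predictions = forest.predict(extract_features(test_images).numpy())
```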
ClassyFire: automated chemical classification with a comprehensive, computable taxonomy.
Djoumbou Feunang, Yannick; Eisner, Roman; Knox, Craig; Chepelev, Leonid; Hastings, Janna; Owen, Gareth; Fahy, Eoin; Steinbeck, Christoph; Subramanian, Shankar; Bolton, Evan; Greiner, Russell; Wishart, David S
2016-01-01
Scientists have long been driven by the desire to describe, organize, classify, and compare objects using taxonomies and/or ontologies. In contrast to biology, geology, and many other scientific disciplines, the world of chemistry still lacks a standardized chemical ontology or taxonomy. Several attempts at chemical classification have been made, but they have mostly been limited to either manual or semi-automated proof-of-principle applications. This is regrettable, as comprehensive chemical classification and description tools could not only improve our understanding of chemistry but also improve the linkage between chemistry and many other fields. For instance, the chemical classification of a compound could help predict its metabolic fate in humans, its druggability, or the potential hazards associated with it, among others. However, the sheer number (tens of millions of compounds) and complexity of chemical structures are such that any manual classification effort would prove to be near impossible. We have developed a comprehensive, flexible, and computable, purely structure-based chemical taxonomy (ChemOnt), along with a computer program (ClassyFire) that uses only chemical structures and structural features to automatically assign all known chemical compounds to a taxonomy consisting of >4800 different categories. This new chemical taxonomy consists of up to 11 different levels (Kingdom, SuperClass, Class, SubClass, etc.) with each of the categories defined by unambiguous, computable structural rules. Furthermore, each category is named using a consensus-based nomenclature and described (in English) based on the characteristic common structural properties of the compounds it contains. The ClassyFire webserver is freely accessible at http://classyfire.wishartlab.com/. Moreover, a Ruby API version is available at https://bitbucket.org/wishartlab/classyfire_api, which provides programmatic access to the ClassyFire server and database. ClassyFire has been used to annotate over 77 million compounds and has already been integrated into other software packages to automatically generate textual descriptions for, and/or infer biological properties of, over 100,000 compounds. Additional examples and applications are provided in this paper. ClassyFire, in combination with ChemOnt (ClassyFire's comprehensive chemical taxonomy), now allows chemists and cheminformaticians to perform large-scale, rapid, and automated chemical classification. Moreover, a freely accessible API allows easy access to more than 77 million "ClassyFire" classified compounds. The results can be used to help annotate well studied, as well as lesser-known, compounds. In addition, these chemical classifications can be used as input for data integration and many other cheminformatics-related tasks.
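For programmatic use, a classification lookup can be as simple as one HTTP request. The sketch below queries the public server for a precomputed classification by InChIKey; the entities/<inchikey>.json endpoint and the response fields accessed are assumptions about the web API and should be verified against the current documentation.

```python
import requests

def classify(inchikey: str) -> dict:
    # Assumed endpoint: precomputed ClassyFire entity lookup by InChIKey.
    url = f"http://classyfire.wishartlab.com/entities/{inchikey}.json"
    resp = requests.get(url, headers={"Accept": "application/json"}, timeout=30)
    resp.raise_for_status()
    return resp.json()

# Example: caffeine's InChIKey; field names below are assumed.
result = classify("RYYVLZVUVIJVGH-UHFFFAOYSA-N")
for level in ("kingdom", "superclass", "class", "subclass"):
    node = result.get(level)
    if node:
        print(level, "->", node.get("name"))
```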
Information Gain Based Dimensionality Selection for Classifying Text Documents
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dumidu Wijayasekara; Milos Manic; Miles McQueen
2013-06-01
Selecting the optimal dimensions for various knowledge extraction applications is an essential component of data mining. Dimensionality selection techniques are utilized in classification applications to increase the classification accuracy and reduce the computational complexity. In text classification, where the dimensionality of the dataset is extremely high, dimensionality selection is even more important. This paper presents a novel, genetic algorithm based methodology for dimensionality selection in text mining applications that utilizes information gain. The presented methodology uses the information gain of each dimension to change the mutation probability of chromosomes dynamically. Since the information gain is calculated a priori, the computational complexity is not affected. The presented method was tested on a specific text classification problem and compared with conventional genetic algorithm based dimensionality selection. The results show an improvement of 3% in the true positives and 1.6% in the true negatives over conventional dimensionality selection methods.
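The core mechanism, mapping each dimension's precomputed information gain to a per-gene mutation probability, can be sketched as below. The exact mapping and all parameters are assumptions for illustration, and mutual information is used as a stand-in for the paper's information gain computation.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def ig_mutation_probs(X, y, base_rate=0.02):
    """Map each dimension's information gain to a mutation probability:
    low-gain dimensions are more likely to be toggled, high-gain kept."""
    gain = mutual_info_classif(X, y)            # proxy for information gain
    scaled = gain / (gain.max() + 1e-12)
    return base_rate * (1.0 + (1.0 - scaled))   # in [base_rate, 2*base_rate]

def mutate(chromosome, probs, rng):
    """Bit-flip mutation with a per-gene, gain-dependent probability."""
    flips = rng.random(chromosome.shape) < probs
    return np.where(flips, 1 - chromosome, chromosome)

rng = np.random.default_rng(0)
X = rng.random((200, 1000)); y = rng.integers(0, 2, 200)   # stand-in data
probs = ig_mutation_probs(X, y)                            # computed once
population = rng.integers(0, 2, (50, X.shape[1]))          # 50 chromosomes
population = np.array([mutate(c, probs, rng) for c in population])
```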
Tasneem, Asba; Aberle, Laura; Ananth, Hari; Chakraborty, Swati; Chiswell, Karen; McCourt, Brian J.; Pietrobon, Ricardo
2012-01-01
Background The ClinicalTrials.gov registry provides information regarding characteristics of past, current, and planned clinical studies to patients, clinicians, and researchers; in addition, registry data are available for bulk download. However, issues related to data structure, nomenclature, and changes in data collection over time present challenges to the aggregate analysis and interpretation of these data in general and to the analysis of trials according to clinical specialty in particular. Improving usability of these data could enhance the utility of ClinicalTrials.gov as a research resource. Methods/Principal Results The purpose of our project was twofold. First, we sought to extend the usability of ClinicalTrials.gov for research purposes by developing a database for aggregate analysis of ClinicalTrials.gov (AACT) that contains data from the 96,346 clinical trials registered as of September 27, 2010. Second, we developed and validated a methodology for annotating studies by clinical specialty, using a custom taxonomy employing Medical Subject Heading (MeSH) terms applied by an NLM algorithm, as well as MeSH terms and other disease condition terms provided by study sponsors. Clinical specialists reviewed and annotated MeSH and non-MeSH disease condition terms, and an algorithm was created to classify studies into clinical specialties based on both MeSH and non-MeSH annotations. False positives and false negatives were evaluated by comparing algorithmic classification with manual classification for three specialties. Conclusions/Significance The resulting AACT database features study design attributes parsed into discrete fields, integrated metadata, and an integrated MeSH thesaurus, and is available for download as Oracle extracts (.dmp file and text format). This publicly-accessible dataset will facilitate analysis of studies and permit detailed characterization and analysis of the U.S. clinical trials enterprise as a whole. In addition, the methodology we present for creating specialty datasets may facilitate other efforts to analyze studies by specialty groups. PMID:22438982
Full Text Psychology Journals Available from Popular Library Databases
ERIC Educational Resources Information Center
Joswick, Kathleen E.
2006-01-01
The author identified 433 core journals in psychology and investigated their full text availability in popular databases. While 62 percent of the studied journals were available in at least one database, access from individual databases ranged from 1.4 percent to 38.1 percent of the titles. The full text of influential psychology journals is not…
Managing the world’s largest and complex freshwater ecosystem, the Laurentian Great Lakes, requires a spatially hierarchical basin-wide database of ecological and socioeconomic information that are comparable across the region. To meet such a need, we developed a hierarchi...
CIFAR10-DVS: An Event-Stream Dataset for Object Classification
Li, Hongmin; Liu, Hanchao; Ji, Xiangyang; Li, Guoqi; Shi, Luping
2017-01-01
Neuromorphic vision research requires high-quality and appropriately challenging event-stream datasets to support continuous improvement of algorithms and methods. However, creating event-stream datasets is a time-consuming task, since they need to be recorded using neuromorphic cameras. Currently, there are limited event-stream datasets available. In this work, by utilizing the popular computer vision dataset CIFAR-10, we converted 10,000 frame-based images into 10,000 event streams using a dynamic vision sensor (DVS), providing an event-stream dataset of intermediate difficulty in 10 different classes, named "CIFAR10-DVS." The conversion to an event-stream dataset was implemented by a repeated closed-loop smooth (RCLS) movement of the frame-based images. Unlike conversion by moving the camera, the image movement is more realistic with respect to practical applications. The repeated closed-loop image movement generates rich local intensity changes in continuous time, which are quantized by each pixel of the DVS camera to generate events. Furthermore, a performance benchmark in event-driven object classification is provided based on state-of-the-art classification algorithms. This work provides a large event-stream dataset and an initial benchmark for comparison, which may boost algorithm development in event-driven pattern recognition and object classification. PMID:28611582
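Event-stream data of this kind is often reduced to frames before a conventional classifier is applied. A minimal sketch, assuming a (t, x, y, polarity) event layout and a 128x128 sensor resolution (both assumptions about the data format):

```python
import numpy as np

def events_to_frames(events, sensor=(128, 128), window_us=10_000):
    """events: array of shape (N, 4), columns (t [us], x, y, polarity).
    Accumulate signed event counts into fixed-duration frames."""
    t0, t1 = events[:, 0].min(), events[:, 0].max()
    n_frames = int((t1 - t0) // window_us) + 1
    frames = np.zeros((n_frames, *sensor), dtype=np.float32)
    for t, x, y, p in events:
        k = int((t - t0) // window_us)
        frames[k, int(y), int(x)] += 1.0 if p > 0 else -1.0  # signed counts
    return frames

# Three toy events spanning two 10 ms windows.
events = np.array([[0, 5, 7, 1], [4000, 5, 8, 0], [12000, 6, 7, 1]], float)
frames = events_to_frames(events)
print(frames.shape)  # (2, 128, 128)
```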
Detection and Rectification of Distorted Fingerprints.
Si, Xuanbin; Feng, Jianjiang; Zhou, Jie; Luo, Yuxuan
2015-03-01
Elastic distortion of fingerprints is one of the major causes of false non-matches. While this problem affects all fingerprint recognition applications, it is especially dangerous in negative recognition applications, such as watchlist and deduplication applications. In such applications, malicious users may purposely distort their fingerprints to evade identification. In this paper, we propose novel algorithms to detect and rectify skin distortion based on a single fingerprint image. Distortion detection is viewed as a two-class classification problem, for which the registered ridge orientation map and period map of a fingerprint are used as the feature vector and an SVM classifier is trained to perform the classification task. Distortion rectification (or equivalently distortion field estimation) is viewed as a regression problem, where the input is a distorted fingerprint and the output is the distortion field. To solve this problem, a database (called the reference database) of various distorted reference fingerprints and corresponding distortion fields is built in the offline stage; then, in the online stage, the nearest neighbor of the input fingerprint is found in the reference database and the corresponding distortion field is used to transform the input fingerprint into a normal one. Promising results have been obtained on three databases containing many distorted fingerprints, namely FVC2004 DB1, the Tsinghua Distorted Fingerprint database, and the NIST SD27 latent fingerprint database.
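A schematic of the two-stage design described above: an SVM detector over registered orientation/period-map features, plus nearest-neighbour lookup of a distortion field in the offline reference database. Feature dimensions and data are illustrative placeholders, not the paper's representation.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Stage 1: distortion detection as two-class classification.
X_train = rng.random((400, 512))          # orientation + period map features
y_train = rng.integers(0, 2, 400)         # 0 = normal, 1 = distorted
detector = SVC(kernel="rbf").fit(X_train, y_train)

# Stage 2 (offline): reference database of distorted prints and their fields.
ref_feats = rng.random((1000, 512))       # features of distorted references
ref_fields = rng.random((1000, 2, 64))    # corresponding distortion fields
index = NearestNeighbors(n_neighbors=1).fit(ref_feats)

# Stage 2 (online): rectify an input flagged as distorted.
query = rng.random((1, 512))
if detector.predict(query)[0] == 1:
    _, nn = index.kneighbors(query)
    field = ref_fields[nn[0, 0]]          # use this field to unwarp the print
```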
Newland, Pamela; Newland, John M; Hendricks-Ferguson, Verna L; Smith, Judith M; Oliver, Brant J
2018-06-01
The purpose of this article was to demonstrate the feasibility of using common data elements (CDEs) to search for information on pediatric patients with multiple sclerosis (MS) and to provide recommendations for future quality improvement and research in the use of CDEs for pediatric MS symptom management strategies. The St. Louis Children's Hospital (SLCH), Washington University (WU) pediatrics data network was evaluated for use of CDEs identified from a database to identify variables in pediatric MS, including the key clinical features of the disease course of MS. The algorithms used were based on International Classification of Diseases, Ninth/Tenth Revision, codes and text keywords to identify pediatric patients with MS from a de-identified database. Data from a coordinating center of the SLCH/WU pediatrics data network, which houses inpatient and outpatient records of patients (N = 498 000), were identified, and detailed information regarding the clinical course of MS was located in the text of the medical records, including medications, presence of oligoclonal bands, year of diagnosis, and diagnosis code. There were 466 pediatric patients with MS, a few of whom also had a comorbid diagnosis of anxiety and depression. The SLCH/WU pediatrics data network is one of the largest detailed databases in the United States, with the ability to query and validate clinical data for research on MS. Nurses and other healthcare professionals working with pediatric MS patients will benefit from having common disease identifiers for quality improvement, research, and practice. The increased knowledge from big data in the SLCH/WU pediatrics data network has the potential to provide information for intervention and decision making that can be personalized to the pediatric MS patient.
The Protein-DNA Interface database
Norambuena, Tomás; Melo, Francisco
2010-01-01
The Protein-DNA Interface database (PDIdb) is a repository containing relevant structural information of Protein-DNA complexes solved by X-ray crystallography and available at the Protein Data Bank. The database includes a simple functional classification of the protein-DNA complexes that consists of three hierarchical levels: Class, Type and Subtype. This classification has been defined and manually curated by humans based on the information gathered from several sources that include PDB, PubMed, CATH, SCOP and COPS. The current version of the database contains only structures with resolution of 2.5 Å or higher, accounting for a total of 922 entries. The major aim of this database is to contribute to the understanding of the main rules that underlie the molecular recognition process between DNA and proteins. To this end, the database is focused on each specific atomic interface rather than on the separated binding partners. Therefore, each entry in this database consists of a single and independent protein-DNA interface. We hope that PDIdb will be useful to many researchers working in fields such as the prediction of transcription factor binding sites in DNA, the study of specificity determinants that mediate enzyme recognition events, engineering and design of new DNA binding proteins with distinct binding specificity and affinity, among others. Finally, due to its friendly and easy-to-use web interface, we hope that PDIdb will also serve educational and teaching purposes. PMID:20482798
Asynchronous Data Retrieval from an Object-Oriented Database
NASA Astrophysics Data System (ADS)
Gilbert, Jonathan P.; Bic, Lubomir
We present an object-oriented semantic database model which, similar to other object-oriented systems, combines the virtues of four concepts: the functional data model, a property inheritance hierarchy, abstract data types and message-driven computation. The main emphasis is on the last of these four concepts. We describe generic procedures that permit queries to be processed in a purely message-driven manner. A database is represented as a network of nodes and directed arcs, in which each node is a logical processing element, capable of communicating with other nodes by exchanging messages. This eliminates the need for shared memory and for centralized control during query processing. Hence, the model is suitable for implementation on a multiprocessor computer architecture, consisting of large numbers of loosely coupled processing elements.
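A toy sketch of this message-driven scheme: nodes communicate only via messages placed on a queue (standing in for asynchronous delivery), and a path query propagates through the network with no shared memory or central control. The graph and query are invented for illustration.

```python
from collections import deque

class Node:
    def __init__(self, name):
        self.name = name
        self.arcs = {}                     # label -> list of target Nodes

    def receive(self, path, results, queue):
        """Consume one path step; forward the rest or report a match."""
        if not path:
            results.append(self.name)
            return
        label, rest = path[0], path[1:]
        for target in self.arcs.get(label, []):
            queue.append((target, rest))   # message to a neighbouring node

def query(start, path):
    results, queue = [], deque([(start, path)])
    while queue:                           # simulated asynchronous delivery
        node, remaining = queue.popleft()
        node.receive(remaining, results, queue)
    return results

dept = Node("CS"); alice = Node("alice"); paper = Node("paper42")
dept.arcs["member"] = [alice]; alice.arcs["authored"] = [paper]
print(query(dept, ["member", "authored"]))   # ['paper42']
```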
Learning From Short Text Streams With Topic Drifts.
Li, Peipei; He, Lu; Wang, Haiyan; Hu, Xuegang; Zhang, Yuhong; Li, Lei; Wu, Xindong
2017-09-18
Short text streams such as search snippets and microblogs have become popular on the Web with the emergence of social media. Unlike traditional normal text streams, these data present the characteristics of short length, weak signal, high volume, high velocity, topic drift, etc. Short text stream classification is hence a very challenging and significant task. However, this challenge has received little attention from the research community. Therefore, a new feature extension approach is proposed for short text stream classification with the help of a large-scale semantic network obtained from a Web corpus. It is built on an incremental ensemble classification model for efficiency. First, more semantic contexts based on the senses of terms in short texts are introduced to make up for the data sparsity using the open semantic network, in which all terms are disambiguated by their semantics to reduce the noise impact. Second, a concept cluster-based topic drift detection method is proposed to effectively track hidden topic drifts. Finally, extensive studies demonstrate that, as compared to several well-known concept drift detection methods for data streams, our approach can detect topic drifts effectively, and it enables handling short text streams effectively while maintaining efficiency as compared to several state-of-the-art short text classification approaches.
Visualizing the semantic content of large text databases using text maps
NASA Technical Reports Server (NTRS)
Combs, Nathan
1993-01-01
A methodology for generating text map representations of the semantic content of text databases is presented. Text maps provide a graphical metaphor for conceptualizing and visualizing the contents and data interrelationships of large text databases. Described are a set of experiments conducted against the TIPSTER corpora of Wall Street Journal articles. These experiments provide an introduction to current work in the representation and visualization of documents by way of their semantic content.
Simple-random-sampling-based multiclass text classification algorithm.
Liu, Wuying; Wang, Lin; Yi, Mianzhu
2014-01-01
Multiclass text classification (MTC) is a challenging issue, and the corresponding MTC algorithms can be used in many applications. The space-time overhead of these algorithms is a serious concern in the era of big data. Through an investigation of the token frequency distribution in a Chinese web document collection, this paper reexamines the power law and proposes a simple-random-sampling-based MTC (SRSMTC) algorithm. Supported by a token-level memory to store labeled documents, the SRSMTC algorithm uses a text retrieval approach to solve text classification problems. The experimental results on the TanCorp data set show that the SRSMTC algorithm can achieve state-of-the-art performance at greatly reduced space-time requirements.
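A minimal sketch of the sampling-plus-retrieval idea, not the authors' SRSMTC implementation: a simple random sample of labelled documents populates a token-level memory, and a test document is assigned the label that dominates among the classes retrieved via its tokens. Data and parameters are illustrative.

```python
import random
from collections import Counter

def build_memory(labelled_docs, sample_size, seed=0):
    """labelled_docs: list of (tokens, label); keep a simple random sample."""
    rng = random.Random(seed)
    sample = rng.sample(labelled_docs, min(sample_size, len(labelled_docs)))
    memory = {}                               # token -> Counter of labels
    for tokens, label in sample:
        for tok in set(tokens):
            memory.setdefault(tok, Counter())[label] += 1
    return memory

def classify(tokens, memory):
    votes = Counter()
    for tok in set(tokens):                   # retrieval over the token memory
        votes.update(memory.get(tok, {}))
    return votes.most_common(1)[0][0] if votes else None

docs = [(["cheap", "pills"], "spam"), (["meeting", "agenda"], "ham"),
        (["free", "pills"], "spam"), (["project", "agenda"], "ham")]
memory = build_memory(docs, sample_size=4)
print(classify(["free", "cheap", "pills"], memory))   # spam
```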
Sarker, Abeed; Gonzalez, Graciela
2015-02-01
Automatic detection of adverse drug reaction (ADR) mentions from text has recently received significant interest in pharmacovigilance research. Current research focuses on various sources of text-based information, including social media, where enormous amounts of user-posted data are available and have the potential for use in pharmacovigilance if collected and filtered accurately. The aims of this study are: (i) to explore natural language processing (NLP) approaches for generating useful features from text, and utilizing them in optimized machine learning algorithms for automatic classification of ADR assertive text segments; (ii) to present two data sets that we prepared for the task of ADR detection from user-posted internet data; and (iii) to investigate if combining training data from distinct corpora can improve automatic classification accuracies. One of our three data sets contains annotated sentences from clinical reports, and the two other data sets, built in-house, consist of annotated posts from social media. Our text classification approach relies on generating a large set of features, representing semantic properties (e.g., sentiment, polarity, and topic), from short text nuggets. Importantly, using our expanded feature sets, we combine training data from different corpora in attempts to boost classification accuracies. Our feature-rich classification approach performs significantly better than previously published approaches, with ADR class F-scores of 0.812 (previously reported best: 0.770), 0.538 and 0.678 for the three data sets. Combining training data from multiple compatible corpora further improves the ADR F-scores for the in-house data sets to 0.597 (an improvement of 5.9 units) and 0.704 (an improvement of 2.6 units), respectively. Our research results indicate that using advanced NLP techniques for generating information-rich features from text can significantly improve classification accuracies over existing benchmarks. Our experiments illustrate the benefits of incorporating various semantic features such as topics, concepts, sentiments, and polarities. Finally, we show that integration of information from compatible corpora can significantly improve classification performance. This form of multi-corpus training may be particularly useful in cases where data sets are heavily imbalanced (e.g., social media data), and may reduce the time and costs associated with the annotation of data in the future. Copyright © 2014 The Authors. Published by Elsevier Inc. All rights reserved.
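The multi-corpus recipe, pooling training sentences from different corpora and augmenting n-gram features with semantic ones, can be illustrated roughly as follows. The corpora, the polarity lexicon, and the linear SVM are toy stand-ins for the paper's feature set and learner.

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

NEGATIVE = {"pain", "nausea", "worse", "dizzy"}     # toy polarity lexicon

def polarity_feature(texts):
    """One extra column: count of negative-polarity words per text."""
    counts = [sum(w in NEGATIVE for w in t.lower().split()) for t in texts]
    return csr_matrix(np.array(counts, dtype=float).reshape(-1, 1))

clinical = [("severe nausea after dosing", 1), ("tolerated the drug well", 0)]
social = [("this med makes me so dizzy", 1), ("started the med today", 0)]
texts, labels = zip(*(clinical + social))           # pooled multi-corpus data

vec = TfidfVectorizer(ngram_range=(1, 2))
X = hstack([vec.fit_transform(texts), polarity_feature(texts)])
clf = LinearSVC().fit(X, labels)

test = ["feeling dizzy and nauseous since the new med"]
X_test = hstack([vec.transform(test), polarity_feature(test)])
print(clf.predict(X_test))                          # label 1 = ADR-assertive
```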
Analysis and Classification of Voice Pathologies Using Glottal Signal Parameters.
Forero M, Leonardo A; Kohler, Manoela; Vellasco, Marley M B R; Cataldo, Edson
2016-09-01
The classification of voice diseases has many applications in health, in the treatment of diseases, and in the design of new medical equipment for helping doctors diagnose pathologies related to the voice. This work uses the parameters of the glottal signal to help identify two types of voice disorders related to pathologies of the vocal folds: nodule and unilateral paralysis. The parameters of the glottal signal are obtained through a known inverse filtering method, and they are used as inputs to an Artificial Neural Network, a Support Vector Machine, and a Hidden Markov Model in order to classify the voice signals into three different groups, and to compare the results: speakers with a nodule on the vocal folds; speakers with unilateral paralysis of the vocal folds; and speakers with normal voices, that is, without a nodule or unilateral paralysis of the vocal folds. The database is composed of 248 voice recordings (signals of vowel production) containing samples corresponding to the three groups mentioned. In this study, a larger database was used for classification than in similar studies, and the classification rate, reaching 97.2%, is superior to those of other studies. Copyright © 2016 The Voice Foundation. Published by Elsevier Inc. All rights reserved.
Cheng, Ching-Wu; Leu, Sou-Sen; Cheng, Ying-Mei; Wu, Tsung-Chih; Lin, Chen-Chung
2012-09-01
Construction accident research involves the systematic sorting, classification, and encoding of comprehensive databases of injuries and fatalities. The present study explores the causes and distribution of occupational accidents in the Taiwan construction industry by analyzing such a database using the data mining method known as classification and regression trees (CART). Utilizing a database of 1542 accident cases from the period 2000-2009, the study seeks to establish potential cause-and-effect relationships regarding serious occupational accidents in the industry. The results show that occurrence rules for falls and collapses, in both public and private construction projects, serve as key factors in predicting occupational injuries. The results provide a framework for improving the safety practices and training programs that are essential to protecting construction workers from occasional or unexpected accidents. Copyright © 2011 Elsevier Ltd. All rights reserved.
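As a rough illustration of a CART-style analysis over coded accident records, scikit-learn's decision tree can surface human-readable occurrence rules. The feature columns and the toy records below are invented, not the study's database.

```python
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.preprocessing import OrdinalEncoder

records = [   # (accident type, project type, trade) -> 1 = fatal/serious
    (["fall", "public", "roofing"], 1),
    (["collapse", "private", "structural"], 1),
    (["struck-by", "private", "electrical"], 0),
    (["fall", "private", "painting"], 1),
    (["struck-by", "public", "paving"], 0),
]
X_raw, y = zip(*records)
enc = OrdinalEncoder()                       # encode categorical columns
X = enc.fit_transform(X_raw)

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)
# Print the learned splits as if-then occurrence rules.
print(export_text(tree, feature_names=["accident", "project", "trade"]))
```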
From ClinicalTrials.gov trial registry to an analysis-ready database of clinical trial results.
Cepeda, M Soledad; Lobanov, Victor; Berlin, Jesse A
2013-04-01
The ClinicalTrials.gov web site provides a convenient interface to look up study results, but it does not allow downloading data in a format that can be readily used for quantitative analyses. Our objective was to develop a system that automatically downloads study results from ClinicalTrials.gov and provides an interface to retrieve study results in a spreadsheet format ready for analysis. Sherlock® identifies studies by intervention, population, or outcome of interest and in seconds creates an analytic database of study results ready for analyses. The outcome classification algorithms used in Sherlock were validated against a classification by an expert. Having a database ready for analysis that can be updated automatically dramatically extends the utility of the ClinicalTrials.gov trial registry. It increases the speed of comparative research, reduces the need for manual extraction of data, and permits answering a vast array of questions.
AC Current Driven Dynamic Vortex State in YBa2Cu3O7-x (Postprint)
2012-02-01
Coexisting steady states of driven vortex motion with different characteristics: a quasi-static disordered glassy state in the sample interior and a dynamic state of plastic motion near the edges. Finite-element...
Bidra, Avinash S; Jacob, Rhonda F; Taylor, Thomas D
2012-04-01
Maxillectomy defects are complex and involve a number of anatomic structures. Several maxillectomy defect classifications have been proposed, with no universal acceptance among surgeons and prosthodontists. Established criteria for describing the maxillectomy defect are lacking. This systematic review aimed to evaluate the classification systems in the available literature, to provide a critical appraisal, and to identify the criteria necessary for a universal description of maxillectomy and midfacial defects. An electronic search of the English language literature covering the period from 1974 to June 2011 was performed by using the PubMed, Scopus, and Cochrane databases with predetermined inclusion criteria. Key terms included in the search were maxillectomy classification, maxillary resection classification, maxillary removal classification, maxillary reconstruction classification, midfacial defect classification, and midfacial reconstruction classification. This was supplemented by a manual search of selected journals. After application of predetermined exclusion criteria, the final list of articles was reviewed in depth to provide a critical appraisal and identify criteria for a universal description of a maxillectomy defect. The electronic database search yielded 261 titles. Systematic application of inclusion and exclusion criteria resulted in the identification of 14 maxillectomy and midfacial defect classification systems. From these articles, 6 different criteria were identified as necessary for a universal description of a maxillectomy defect. Multiple deficiencies were noted in each classification system. Though most articles described the superior-inferior extent of the defect, only a small number described the anterior-posterior and medial-lateral extent of the defect. Few articles listed dental status and soft palate involvement when describing maxillectomy defects. No classification system has accurately described the maxillectomy defect based on criteria that satisfy both surgical and prosthodontic needs. The 6 criteria identified in this systematic review for a universal description of a maxillectomy defect are: 1) dental status; 2) oroantral/nasal communication status; 3) soft palate and other contiguous structure involvement; 4) superior-inferior extent; 5) anterior-posterior extent; and 6) medial-lateral extent of the defect. A criteria-based description appears more objective and amenable to universal use than a classification-based description. Copyright © 2012 The Editorial Council of the Journal of Prosthetic Dentistry. Published by Mosby, Inc. All rights reserved.
Renard, Bernhard Y.; Xu, Buote; Kirchner, Marc; Zickmann, Franziska; Winter, Dominic; Korten, Simone; Brattig, Norbert W.; Tzur, Amit; Hamprecht, Fred A.; Steen, Hanno
2012-01-01
Currently, the reliable identification of peptides and proteins is only feasible when thoroughly annotated sequence databases are available. Although sequencing capacities continue to grow, many organisms remain without reliable, fully annotated reference genomes required for proteomic analyses. Standard database search algorithms fail to identify peptides that are not exactly contained in a protein database. De novo searches are generally hindered by their restricted reliability, and current error-tolerant search strategies are limited by global, heuristic tradeoffs between database and spectral information. We propose a Bayesian information criterion-driven error-tolerant peptide search (BICEPS) and offer an open source implementation based on this statistical criterion to automatically balance the information of each single spectrum and the database, while limiting the run time. We show that BICEPS performs as well as current database search algorithms when such algorithms are applied to sequenced organisms, whereas BICEPS only uses a remotely related organism database. For instance, we use a chicken instead of a human database corresponding to an evolutionary distance of more than 300 million years (International Chicken Genome Sequencing Consortium (2004) Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432, 695–716). We demonstrate the successful application to cross-species proteomics with a 33% increase in the number of identified proteins for a filarial nematode sample of Litomosoides sigmodontis. PMID:22493179
ERIC Educational Resources Information Center
Amershi, Saleema; Conati, Cristina
2009-01-01
In this paper, we present a data-based user modeling framework that uses both unsupervised and supervised classification to build student models for exploratory learning environments. We apply the framework to build student models for two different learning environments and using two different data sources (logged interface and eye-tracking data).…
ERIC Educational Resources Information Center
Härtinger, Stefan; Clarke, Nigel
2016-01-01
Developing skills for searching the patent literature is an essential element of chemical information literacy programs at the university level. The present article creates awareness of patents as a rich source of chemical information. Patent classification is introduced as a key-component in comprehensive search strategies. The free Espacenet…
Ferreira Junior, José Raniery; Oliveira, Marcelo Costa; de Azevedo-Marques, Paulo Mazzoncini
2016-12-01
Lung cancer is the leading cause of cancer-related deaths in the world, and its main manifestation is pulmonary nodules. Detection and classification of pulmonary nodules are challenging tasks that must be done by qualified specialists, but image interpretation errors make those tasks difficult. In order to aid radiologists with those hard tasks, it is important to integrate computer-based tools with the lesion detection, pathology diagnosis, and image interpretation processes. However, computer-aided diagnosis research faces the problem of not having enough shared medical reference data for the development, testing, and evaluation of computational methods for diagnosis. In order to minimize this problem, this paper presents a public nonrelational, document-oriented, cloud-based database of pulmonary nodules characterized by 3D texture attributes, identified by experienced radiologists and classified according to nine different subjective characteristics by the same specialists. Our goal with the development of this database is to improve computer-aided lung cancer diagnosis and pulmonary nodule detection and classification research through the deployment of this database in a cloud Database as a Service framework. Pulmonary nodule data were provided by the Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI), image descriptors were acquired by a volumetric texture analysis, and the database schema was developed using a document-oriented Not only Structured Query Language (NoSQL) approach. The proposed database now contains 379 exams, 838 nodules, and 8237 images, of which 4029 are CT scans and 4208 are manually segmented nodules, and it is hosted in a MongoDB instance on a cloud infrastructure.
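A document-oriented record of the kind such a database might hold, written through pymongo. Every field name, value, and the connection URI below are hypothetical, shown only to illustrate the NoSQL schema style.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder URI
collection = client["lung_cad"]["nodules"]          # hypothetical names

nodule_doc = {
    "exam_id": "LIDC-0001",          # hypothetical identifiers
    "nodule_id": 1,
    "images": ["ct_0001_042.dcm"],   # CT slices containing the nodule
    "segmentation": "seg_0001_1.png",
    "texture_3d": {                  # volumetric texture attributes
        "energy": 0.12, "entropy": 4.7, "contrast": 38.5,
    },
    "radiologist_ratings": {         # subjective characteristics (1-5)
        "subtlety": 4, "spiculation": 2, "malignancy": 3,
    },
}
result = collection.insert_one(nodule_doc)
print("stored nodule with _id", result.inserted_id)
```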
Text Extraction from Scene Images by Character Appearance and Structure Modeling
Yi, Chucai; Tian, Yingli
2012-01-01
In this paper, we propose a novel algorithm to detect text information from natural scene images. Scene text classification and detection are still open research topics. Our proposed algorithm is able to model both character appearance and structure to generate representative and discriminative text descriptors. The contributions of this paper include three aspects: 1) a new character appearance model by a structure correlation algorithm which extracts discriminative appearance features from detected interest points of character samples; 2) a new text descriptor based on structons and correlatons, which model character structure by structure differences among character samples and structure component co-occurrence; and 3) a new text region localization method by combining color decomposition, character contour refinement, and string line alignment to localize character candidates and refine detected text regions. We perform three groups of experiments to evaluate the effectiveness of our proposed algorithm, including text classification, text detection, and character identification. The evaluation results on benchmark datasets demonstrate that our algorithm achieves the state-of-the-art performance on scene text classification and detection, and significantly outperforms the existing algorithms for character identification. PMID:23316111
Divorcing Strain Classification from Species Names.
Baltrus, David A
2016-06-01
Confusion about strain classification and nomenclature permeates modern microbiology. Although taxonomists have traditionally acted as gatekeepers of order, the numbers of, and speed at which, new strains are identified has outpaced the opportunity for professional classification for many lineages. Furthermore, the growth of bioinformatics and database-fueled investigations have placed metadata curation in the hands of researchers with little taxonomic experience. Here I describe practical challenges facing modern microbial taxonomy, provide an overview of complexities of classification for environmentally ubiquitous taxa like Pseudomonas syringae, and emphasize that classification can be independent of nomenclature. A move toward implementation of relational classification schemes based on inherent properties of whole genomes could provide sorely needed continuity in how strains are referenced across manuscripts and data sets. Copyright © 2016 Elsevier Ltd. All rights reserved.
Statistical text classifier to detect specific type of medical incidents.
Wong, Zoie Shui-Yee; Akiyama, Masanori
2013-01-01
WHO Patient Safety has put a focus on increasing the coherence and expressiveness of patient safety classification with the foundation of the International Classification for Patient Safety (ICPS). Text classification and statistical approaches have been shown to be successful in identifying safety problems in the aviation industry using incident text information. It has been challenging to comprehend the taxonomy of medical incidents in a structured manner. Independent reporting mechanisms for patient safety incidents have been established in the UK, Canada, Australia, Japan, Hong Kong, etc. This research demonstrates the potential to construct statistical text classifiers to detect specific types of medical incidents using incident text data. An illustrative example of classifying look-alike sound-alike (LASA) medication incidents using structured text from 227 advisories related to medication errors from Global Patient Safety Alerts (GPSA) is shown in this poster presentation. The classifier was built using a logistic regression model. The ROC curve and the AUC value indicated that this is a satisfactorily good model.
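A minimal sketch of such a classifier: logistic regression over TF-IDF n-grams, evaluated by ROC/AUC, assuming invented advisory snippets in place of the GPSA data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

texts = [
    "hydroxyzine dispensed instead of hydralazine",      # LASA
    "celebrex confused with celexa at order entry",      # LASA
    "dose omitted during shift change",                  # not LASA
    "iv line disconnected overnight",                    # not LASA
] * 25                                                   # pad the toy corpus
labels = [1, 1, 0, 0] * 25

X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, random_state=0)
vec = TfidfVectorizer(ngram_range=(1, 2))
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X_tr), y_tr)
scores = clf.predict_proba(vec.transform(X_te))[:, 1]    # P(LASA)
print("AUC:", roc_auc_score(y_te, scores))
```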
Menon, K Venugopal; Kumar, Dinesh; Thomas, Tessamma
2014-02-01
Study Design Preliminary evaluation of a new tool. Objective To ascertain whether the newly developed content-based image retrieval (CBIR) software can be used successfully to retrieve images of similar cases of adolescent idiopathic scoliosis (AIS) from a database to help plan treatment without adhering to a classification scheme. Methods Sixty-two operated cases of AIS were entered into the newly developed CBIR database. Five new cases of different curve patterns were used as query images. The images were fed into the CBIR database, which retrieved similar images from the existing cases. These were analyzed by a senior surgeon for conformity to the query image. Results Within the limits of variability set for the query system, all the resultant images conformed to the query image. One case had no similar match in the series. The other four retrieved several images that matched the query. No matching case was left out of the series. The postoperative images were then analyzed to check for surgical strategies. Broad guidelines for treatment could be derived from the results. More precise query settings, inclusion of bending films, and a larger database will enhance accurate retrieval and better decision making. Conclusion The CBIR system is an effective tool for accurate documentation and retrieval of scoliosis images. Broad guidelines for surgical strategies can be derived from the postoperative images of the existing cases without adhering to any classification scheme.
Application Analysis and Decision with Dynamic Analysis
2014-12-01
pushes the application file and the JSON file containing the metadata from the database. When the 2 files are in place, the consumer thread starts... human analysts and stores it in a database. It would then use some of these data to generate a risk score for the application. However, static analysis... and store them in the primary A2D database for future analysis.
TrSDB: a proteome database of transcription factors
Hermoso, Antoni; Aguilar, Daniel; Aviles, Francesc X.; Querol, Enrique
2004-01-01
TrSDB—TranScout Database—(http://ibb.uab.es/trsdb) is a proteome database of eukaryotic transcription factors based upon predicted motifs by TranScout and data sources such as InterPro and Gene Ontology Annotation. Nine eukaryotic proteomes are included in the current version. Extensive and diverse information for each database entry, different analyses considering TranScout classification and similarity relationships are offered for research on transcription factors or gene expression. PMID:14681387
The Design and Product of National 1:1000000 Cartographic Data of Topographic Map
NASA Astrophysics Data System (ADS)
Wang, Guizhi
2016-06-01
The National Administration of Surveying, Mapping and Geoinformation launched the project of dynamic updating of the national fundamental geographic information database in 2012. Under this project, the 1:50000 database is updated once a year, and the 1:250000 database is downsized and updated in linkage with it. In 2014, using the latest results of the 1:250000 database, the 1:1000000 digital line graph database was comprehensively updated; at the same time, cartographic data of topographic maps and digital elevation model data were generated. This article mainly introduces the national 1:1000000 cartographic data of topographic maps, including feature content, database structure, Database-driven Mapping technology, workflow, and so on.
Update on terrestrial ecological classification in the highlands of West Virginia
James P. Vanderhorst
2010-01-01
The West Virginia Natural Heritage Program (WVNHP) maintains databases on the biological diversity of the state, including species and natural communities, to help focus conservation efforts by agencies and organizations. Information on terrestrial communities (also called vegetation, or habitat, depending on user or audience focus) is maintained in two databases. The...
The Extent of Consumer Product Involvement in Paediatric Injuries
Catchpoole, Jesani; Walker, Sue; Vallmuur, Kirsten
2016-01-01
A challenge in utilising health sector injury data for Product Safety purposes is that clinically coded data have limited ability to inform regulators about product involvement in injury events, given data entry is bound by a predefined set of codes. Text narratives collected in emergency departments can potentially address this limitation by providing relevant product information with additional accompanying context. This study aims to identify and quantify consumer product involvement in paediatric injuries recorded in emergency department-based injury surveillance data. A total of 7743 paediatric injuries were randomly selected from Queensland Injury Surveillance Unit database and associated text narratives were manually reviewed to determine product involvement in the injury event. A Product Involvement Factor classification system was used to categorise these injury cases. Overall, 44% of all reviewed cases were associated with consumer products, with proximity factor (25%) being identified as the most common involvement of a product in an injury event. Only 6% were established as being directly due to the product. The study highlights the importance of utilising injury data to inform product safety initiatives where text narratives can be used to identify the type and involvement of products in injury cases. PMID:27399744
The history of the CATH structural classification of protein domains.
Sillitoe, Ian; Dawson, Natalie; Thornton, Janet; Orengo, Christine
2015-12-01
This article presents a historical review of the protein structure classification database CATH. Together with the SCOP database, CATH remains comprehensive and reasonably up-to-date with the now more than 100,000 protein structures in the PDB. We review the expansion of the CATH and SCOP resources to capture predicted domain structures in the genome sequence data and to provide information on the likely functions of proteins mediated by their constituent domains. The establishment of comprehensive function annotation resources has also meant that domain families can be functionally annotated allowing insights into functional divergence and evolution within protein families. Copyright © 2015 The Authors. Published by Elsevier B.V. All rights reserved.
Application of kernel functions for accurate similarity search in large chemical databases.
Wang, Xiaohong; Huan, Jun; Smalter, Aaron; Lushington, Gerald H
2010-04-29
Similarity search in chemical structure databases is an important problem with many applications in chemical genomics, drug design, and efficient chemical probe screening, among others. It is widely believed that structure-based methods provide an efficient way to perform such queries. Recently, various graph kernel functions have been designed to capture the intrinsic similarity of graphs. Though successful in constructing accurate predictive and classification models, graph kernel functions cannot be applied to large chemical compound databases due to their high computational complexity and the difficulties in indexing similarity search for large databases. To bridge graph kernel functions and similarity search in chemical databases, we applied a novel kernel-based similarity measurement, developed by our team, to measure the similarity of graph-represented chemicals. In our method, we utilize a hash table to support a new graph kernel function definition, efficient storage, and fast search. We have applied our method, named G-hash, to large chemical databases. Our results show that the G-hash method achieves state-of-the-art performance for k-nearest neighbor (k-NN) classification. Moreover, the similarity measurement and the index structure are scalable to large chemical databases, with smaller indexing size and faster query processing time as compared to state-of-the-art indexing methods such as Daylight fingerprints, C-tree and GraphGrep. Efficient similarity query processing for large chemical databases is challenging, since we need to balance running time efficiency and similarity search accuracy. Our previous similarity search method, G-hash, provides a new way to perform similarity search in chemical databases. An experimental study validates the utility of G-hash on chemical databases.
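The indexing idea can be illustrated generically: reduce each graph to hashed local substructures, build an inverted index over the hashes, and answer k-NN queries with a histogram-intersection style similarity. This is a reconstruction under stated assumptions, not the authors' G-hash implementation.

```python
from collections import Counter, defaultdict

def node_signatures(graph):
    """graph: dict node -> (label, neighbours). Hash each node together with
    the sorted labels of its neighbourhood (a 1-step local substructure)."""
    sigs = Counter()
    for node, (label, nbrs) in graph.items():
        nbr_labels = sorted(graph[n][0] for n in nbrs)
        sigs[hash((label, tuple(nbr_labels)))] += 1
    return sigs

def similarity(a, b):
    """Histogram-intersection style kernel over hashed signatures."""
    return sum(min(a[h], b[h]) for h in a if h in b)

# Toy database: two labelled molecules indexed by their signature hashes.
db = {"mol1": node_signatures({0: ("C", [1]), 1: ("O", [0])}),
      "mol2": node_signatures({0: ("C", [1]), 1: ("N", [0])})}
index = defaultdict(set)
for name, sigs in db.items():
    for h in sigs:
        index[h].add(name)

# Query: fetch candidates via the inverted index, rank by similarity.
query = node_signatures({0: ("C", [1]), 1: ("O", [0])})
candidates = set().union(*(index[h] for h in query if h in index))
best = max(candidates, key=lambda n: similarity(query, db[n]))
print(best)   # mol1
```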
CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database
Jia, Baofeng; Raphenya, Amogelang R.; Alcock, Brian; Waglechner, Nicholas; Guo, Peiyao; Tsang, Kara K.; Lago, Briony A.; Dave, Biren M.; Pereira, Sheldon; Sharma, Arjun N.; Doshi, Sachin; Courtot, Mélanie; Lo, Raymond; Williams, Laura E.; Frye, Jonathan G.; Elsayegh, Tariq; Sardar, Daim; Westman, Erin L.; Pawlowski, Andrew C.; Johnson, Timothy A.; Brinkman, Fiona S.L.; Wright, Gerard D.; McArthur, Andrew G.
2017-01-01
The Comprehensive Antibiotic Resistance Database (CARD; http://arpcard.mcmaster.ca) is a manually curated resource containing high quality reference data on the molecular basis of antimicrobial resistance (AMR), with an emphasis on the genes, proteins and mutations involved in AMR. CARD is ontologically structured, model centric, and spans the breadth of AMR drug classes and resistance mechanisms, including intrinsic, mutation-driven and acquired resistance. It is built upon the Antibiotic Resistance Ontology (ARO), a custom built, interconnected and hierarchical controlled vocabulary allowing advanced data sharing and organization. Its design allows the development of novel genome analysis tools, such as the Resistance Gene Identifier (RGI) for resistome prediction from raw genome sequence. Recent improvements include extensive curation of additional reference sequences and mutations, development of a unique Model Ontology and accompanying AMR detection models to power sequence analysis, new visualization tools, and expansion of the RGI for detection of emergent AMR threats. CARD curation is updated monthly based on an interplay of manual literature curation, computational text mining, and genome analysis. PMID:27789705
DOE Office of Scientific and Technical Information (OSTI.GOV)
Masci, Frank J.; Grillmair, Carl J.; Cutri, Roc M.
2014-07-01
We describe a methodology to classify periodic variable stars identified using photometric time-series measurements constructed from the Wide-field Infrared Survey Explorer (WISE) full-mission single-exposure Source Databases. This will assist in the future construction of a WISE Variable Source Database that assigns variables to specific science classes as constrained by the WISE observing cadence with statistically meaningful classification probabilities. We have analyzed the WISE light curves of 8273 variable stars identified in previous optical variability surveys (MACHO, GCVS, and ASAS) and show that Fourier decomposition techniques can be extended into the mid-IR to assist with their classification. Combined with other periodicmore » light-curve features, this sample is then used to train a machine-learned classifier based on the random forest (RF) method. Consistent with previous classification studies of variable stars in general, the RF machine-learned classifier is superior to other methods in terms of accuracy, robustness against outliers, and relative immunity to features that carry little or redundant class information. For the three most common classes identified by WISE: Algols, RR Lyrae, and W Ursae Majoris type variables, we obtain classification efficiencies of 80.7%, 82.7%, and 84.5% respectively using cross-validation analyses, with 95% confidence intervals of approximately ±2%. These accuracies are achieved at purity (or reliability) levels of 88.5%, 96.2%, and 87.8% respectively, similar to that achieved in previous automated classification studies of periodic variable stars.« less
Bag-of-visual-ngrams for histopathology image classification
NASA Astrophysics Data System (ADS)
López-Monroy, A. Pastor; Montes-y-Gómez, Manuel; Escalante, Hugo Jair; Cruz-Roa, Angel; González, Fabio A.
2013-11-01
This paper describes an extension of the Bag-of-Visual-Words (BoVW) representation for image categorization (IC) of histophatology images. This representation is one of the most used approaches in several high-level computer vision tasks. However, the BoVW representation has an important limitation: the disregarding of spatial information among visual words. This information may be useful to capture discriminative visual-patterns in specific computer vision tasks. In order to overcome this problem we propose the use of visual n-grams. N-grams based-representations are very popular in the field of natural language processing (NLP), in particular within text mining and information retrieval. We propose building a codebook of n-grams and then representing images by histograms of visual n-grams. We evaluate our proposal in the challenging task of classifying histopathology images. The novelty of our proposal lies in the fact that we use n-grams as attributes for a classification model (together with visual-words, i.e., 1-grams). This is common practice within NLP, although, to the best of our knowledge, this idea has not been explored yet within computer vision. We report experimental results in a database of histopathology images where our proposed method outperforms the traditional BoVWs formulation.
ERIC Educational Resources Information Center
Cotton, P. L.
1987-01-01
Defines two types of online databases: source, referring to those intended to be complete in themselves, whether full-text or abstracts; and bibliographic, meaning those that are not complete. Predictions are made about the future growth rate of these two types of databases, as well as full-text versus abstract databases. (EM)
Classification of proteins with shared motifs and internal repeats in the ECOD database
Kinch, Lisa N.; Liao, Yuxing
2016-01-01
Abstract Proteins and their domains evolve by a set of events commonly including the duplication and divergence of small motifs. The presence of short repetitive regions in domains has generally constituted a difficult case for structural domain classifications and their hierarchies. We developed the Evolutionary Classification Of protein Domains (ECOD) in part to implement a new schema for the classification of these types of proteins. Here we document the ways in which ECOD classifies proteins with small internal repeats, widespread functional motifs, and assemblies of small domain‐like fragments in its evolutionary schema. We illustrate the ways in which the structural genomics project impacted the classification and characterization of new structural domains and sequence families over the decade. PMID:26833690
Hastings, Janna; de Matos, Paula; Dekker, Adriano; Ennis, Marcus; Harsha, Bhavana; Kale, Namrata; Muthukrishnan, Venkatesh; Owen, Gareth; Turner, Steve; Williams, Mark; Steinbeck, Christoph
2013-01-01
ChEBI (http://www.ebi.ac.uk/chebi) is a database and ontology of chemical entities of biological interest. Over the past few years, ChEBI has continued to grow steadily in content, and has added several new features. In addition to incorporating all user-requested compounds, our annotation efforts have emphasized immunology, natural products and metabolites in many species. All database entries are now 'is_a' classified within the ontology, meaning that all of the chemicals are available to semantic reasoning tools that harness the classification hierarchy. We have completely aligned the ontology with the Open Biomedical Ontologies (OBO) Foundry-recommended upper level Basic Formal Ontology. Furthermore, we have aligned our chemical classification with the classification of chemical-involving processes in the Gene Ontology (GO), and as a result of this effort, the majority of chemical-involving processes in GO are now defined in terms of the ChEBI entities that participate in them. This effort necessitated incorporating many additional biologically relevant compounds. We have incorporated additional data types including reference citations, and the species and component for metabolites. Finally, our website and web services have had several enhancements, most notably the provision of a dynamic new interactive graph-based ontology visualization.
Barbeyron, Tristan; Brillet-Guéguen, Loraine; Carré, Wilfrid; Carrière, Cathelène; Caron, Christophe; Czjzek, Mirjam; Hoebeke, Mark; Michel, Gurvan
2016-01-01
Sulfatases cleave sulfate groups from various molecules and constitute a biologically and industrially important group of enzymes. However, the number of sulfatases whose substrate has been characterized is limited in comparison to the huge diversity of sulfated compounds, yielding functional annotations of sulfatases particularly prone to flaws and misinterpretations. In the context of the explosion of genomic data, a classification system allowing a better prediction of substrate specificity and for setting the limit of functional annotations is urgently needed for sulfatases. Here, after an overview on the diversity of sulfated compounds and on the known sulfatases, we propose a classification database, SulfAtlas (http://abims.sb-roscoff.fr/sulfatlas/), based on sequence homology and composed of four families of sulfatases. The formylglycine-dependent sulfatases, which constitute the largest family, are also divided by phylogenetic approach into 73 subfamilies, each subfamily corresponding to either a known specificity or to an uncharacterized substrate. SulfAtlas summarizes information about the different families of sulfatases. Within a family a web page displays the list of its subfamilies (when they exist) and the list of EC numbers. The family or subfamily page shows some descriptors and a table with all the UniProt accession numbers linked to the databases UniProt, ExplorEnz, and PDB. PMID:27749924
A stylistic classification of Russian-language texts based on the random walk model
NASA Astrophysics Data System (ADS)
Kramarenko, A. A.; Nekrasov, K. A.; Filimonov, V. V.; Zhivoderov, A. A.; Amieva, A. A.
2017-09-01
A formal approach to text analysis is suggested that is based on the random walk model. The frequencies and reciprocal positions of the vowel letters are matched up by a process of quasi-particle migration. Statistically significant difference in the migration parameters for the texts of different functional styles is found. Thus, a possibility of classification of texts using the suggested method is demonstrated. Five groups of the texts are singled out that can be distinguished from one another by the parameters of the quasi-particle migration process.
A Higher Level Classification of All Living Organisms
Ruggiero, Michael A.; Gordon, Dennis P.; Orrell, Thomas M.; Bailly, Nicolas; Bourgoin, Thierry; Brusca, Richard C.; Cavalier-Smith, Thomas; Guiry, Michael D.; Kirk, Paul M.
2015-01-01
We present a consensus classification of life to embrace the more than 1.6 million species already provided by more than 3,000 taxonomists’ expert opinions in a unified and coherent, hierarchically ranked system known as the Catalogue of Life (CoL). The intent of this collaborative effort is to provide a hierarchical classification serving not only the needs of the CoL’s database providers but also the diverse public-domain user community, most of whom are familiar with the Linnaean conceptual system of ordering taxon relationships. This classification is neither phylogenetic nor evolutionary but instead represents a consensus view that accommodates taxonomic choices and practical compromises among diverse expert opinions, public usages, and conflicting evidence about the boundaries between taxa and the ranks of major taxa, including kingdoms. Certain key issues, some not fully resolved, are addressed in particular. Beyond its immediate use as a management tool for the CoL and ITIS (Integrated Taxonomic Information System), it is immediately valuable as a reference for taxonomic and biodiversity research, as a tool for societal communication, and as a classificatory “backbone” for biodiversity databases, museum collections, libraries, and textbooks. Such a modern comprehensive hierarchy has not previously existed at this level of specificity. PMID:25923521
Evolutionary fuzzy ARTMAP neural networks for classification of semiconductor defects.
Tan, Shing Chiang; Watada, Junzo; Ibrahim, Zuwairie; Khalid, Marzuki
2015-05-01
Wafer defect detection using an intelligent system is an approach of quality improvement in semiconductor manufacturing that aims to enhance its process stability, increase production capacity, and improve yields. Occasionally, only few records that indicate defective units are available and they are classified as a minority group in a large database. Such a situation leads to an imbalanced data set problem, wherein it engenders a great challenge to deal with by applying machine-learning techniques for obtaining effective solution. In addition, the database may comprise overlapping samples of different classes. This paper introduces two models of evolutionary fuzzy ARTMAP (FAM) neural networks to deal with the imbalanced data set problems in a semiconductor manufacturing operations. In particular, both the FAM models and hybrid genetic algorithms are integrated in the proposed evolutionary artificial neural networks (EANNs) to classify an imbalanced data set. In addition, one of the proposed EANNs incorporates a facility to learn overlapping samples of different classes from the imbalanced data environment. The classification results of the proposed evolutionary FAM neural networks are presented, compared, and analyzed using several classification metrics. The outcomes positively indicate the effectiveness of the proposed networks in handling classification problems with imbalanced data sets.
NASA Astrophysics Data System (ADS)
Hussain, M.; Chen, D.
2014-11-01
Buildings, the basic unit of an urban landscape, host most of its socio-economic activities and play an important role in the creation of urban land-use patterns. The spatial arrangement of different building types creates varied urban land-use clusters which can provide an insight to understand the relationships between social, economic, and living spaces. The classification of such urban clusters can help in policy-making and resource management. In many countries including the UK no national-level cadastral database containing information on individual building types exists in public domain. In this paper, we present a framework for inferring functional types of buildings based on the analysis of their form (e.g. geometrical properties, such as area and perimeter, layout) and spatial relationship from large topographic and address-based GIS database. Machine learning algorithms along with exploratory spatial analysis techniques are used to create the classification rules. The classification is extended to two further levels based on the functions (use) of buildings derived from address-based data. The developed methodology was applied to the Manchester metropolitan area using the Ordnance Survey's MasterMap®, a large-scale topographic and address-based data available for the UK.
SCOPE - Stellar Classification Online Public Exploration
NASA Astrophysics Data System (ADS)
Harenberg, Steven
2010-01-01
The Astronomical Photographic Data Archive (APDA) has been established to be the primary North American archive for the collections of astronomical photographic plates. Located at the Pisgah Astronomical Research Institute (PARI) in Rosman, NC, the archive contains hundreds of thousands stellar spectra, many of which have never before been classified. To help classify the vast number of stars, the public is invited to participate in a distributed computing online environment called Stellar Classification Online - Public Exploration (SCOPE). Through a website, the participants will have a tutorial on stellar spectra and practice classifying. After practice, the participants classify spectra on photographic plates uploaded online from APDA. These classifications will be recorded in a database where the results from many users will be statistically analyzed. Stars with known spectral types will be included to test the reliability of classifications. The process of building the database of stars from APDA, which the citizen scientist will be able to classify, includes: scanning the photographic plates, orienting the plate to correct for the change in right ascension/declination using Aladin, stellar HD catalog identification using Simbad, marking the boundaries for each spectrum, and setting up the image for use on the website. We will describe the details of this process.
Pathological speech signal analysis and classification using empirical mode decomposition.
Kaleem, Muhammad; Ghoraani, Behnaz; Guergachi, Aziz; Krishnan, Sridhar
2013-07-01
Automated classification of normal and pathological speech signals can provide an objective and accurate mechanism for pathological speech diagnosis, and is an active area of research. A large part of this research is based on analysis of acoustic measures extracted from sustained vowels. However, sustained vowels do not reflect real-world attributes of voice as effectively as continuous speech, which can take into account important attributes of speech such as rapid voice onset and termination, changes in voice frequency and amplitude, and sudden discontinuities in speech. This paper presents a methodology based on empirical mode decomposition (EMD) for classification of continuous normal and pathological speech signals obtained from a well-known database. EMD is used to decompose randomly chosen portions of speech signals into intrinsic mode functions, which are then analyzed to extract meaningful temporal and spectral features, including true instantaneous features which can capture discriminative information in signals hidden at local time-scales. A total of six features are extracted, and a linear classifier is used with the feature vector to classify continuous speech portions obtained from a database consisting of 51 normal and 161 pathological speakers. A classification accuracy of 95.7 % is obtained, thus demonstrating the effectiveness of the methodology.
Arrhythmia Evaluation in Wearable ECG Devices
Sadrawi, Muammar; Lin, Chien-Hung; Hsieh, Yita; Kuo, Chia-Chun; Chien, Jen Chien; Haraikawa, Koichi; Abbod, Maysam F.; Shieh, Jiann-Shing
2017-01-01
This study evaluates four databases from PhysioNet: The American Heart Association database (AHADB), Creighton University Ventricular Tachyarrhythmia database (CUDB), MIT-BIH Arrhythmia database (MITDB), and MIT-BIH Noise Stress Test database (NSTDB). The ANSI/AAMI EC57:2012 is used for the evaluation of the algorithms for the supraventricular ectopic beat (SVEB), ventricular ectopic beat (VEB), atrial fibrillation (AF), and ventricular fibrillation (VF) via the evaluation of the sensitivity, positive predictivity and false positive rate. Sample entropy, fast Fourier transform (FFT), and multilayer perceptron neural network with backpropagation training algorithm are selected for the integrated detection algorithms. For this study, the result for SVEB has some improvements compared to a previous study that also utilized ANSI/AAMI EC57. In further, VEB sensitivity and positive predictivity gross evaluations have greater than 80%, except for the positive predictivity of the NSTDB database. For AF gross evaluation of MITDB database, the results show very good classification, excluding the episode sensitivity. In advanced, for VF gross evaluation, the episode sensitivity and positive predictivity for the AHADB, MITDB, and CUDB, have greater than 80%, except for MITDB episode positive predictivity, which is 75%. The achieved results show that the proposed integrated SVEB, VEB, AF, and VF detection algorithm has an accurate classification according to ANSI/AAMI EC57:2012. In conclusion, the proposed integrated detection algorithm can achieve good accuracy in comparison with other previous studies. Furthermore, more advanced algorithms and hardware devices should be performed in future for arrhythmia detection and evaluation. PMID:29068369
Research on aviation unsafe incidents classification with improved TF-IDF algorithm
NASA Astrophysics Data System (ADS)
Wang, Yanhua; Zhang, Zhiyuan; Huo, Weigang
2016-05-01
The text content of Aviation Safety Confidential Reports contains a large number of valuable information. Term frequency-inverse document frequency algorithm is commonly used in text analysis, but it does not take into account the sequential relationship of the words in the text and its role in semantic expression. According to the seven category labels of civil aviation unsafe incidents, aiming at solving the problems of TF-IDF algorithm, this paper improved TF-IDF algorithm based on co-occurrence network; established feature words extraction and words sequential relations for classified incidents. Aviation domain lexicon was used to improve the accuracy rate of classification. Feature words network model was designed for multi-documents unsafe incidents classification, and it was used in the experiment. Finally, the classification accuracy of improved algorithm was verified by the experiments.
Emotion models for textual emotion classification
NASA Astrophysics Data System (ADS)
Bruna, O.; Avetisyan, H.; Holub, J.
2016-11-01
This paper deals with textual emotion classification which gained attention in recent years. Emotion classification is used in user experience, product evaluation, national security, and tutoring applications. It attempts to detect the emotional content in the input text and based on different approaches establish what kind of emotional content is present, if any. Textual emotion classification is the most difficult to handle, since it relies mainly on linguistic resources and it introduces many challenges to assignment of text to emotion represented by a proper model. A crucial part of each emotion detector is emotion model. Focus of this paper is to introduce emotion models used for classification. Categorical and dimensional models of emotion are explained and some more advanced approaches are mentioned.
NASA Astrophysics Data System (ADS)
Coopersmith, Evan Joseph
The techniques and information employed for decision-making vary with the spatial and temporal scope of the assessment required. In modern agriculture, the farm owner or manager makes decisions on a day-to-day or even hour-to-hour basis for dozens of fields scattered over as much as a fifty-mile radius from some central location. Following precipitation events, land begins to dry. Land-owners and managers often trace serpentine paths of 150+ miles every morning to inspect the conditions of their various parcels. His or her objective lies in appropriate resource usage -- is a given tract of land dry enough to be workable at this moment or would he or she be better served waiting patiently? Longer-term, these owners and managers decide upon which seeds will grow most effectively and which crops will make their operations profitable. At even longer temporal scales, decisions are made regarding which fields must be acquired and sold and what types of equipment will be necessary in future operations. This work develops and validates algorithms for these shorter-term decisions, along with models of national climate patterns and climate changes to enable longer-term operational planning. A test site at the University of Illinois South Farms (Urbana, IL, USA) served as the primary location to validate machine learning algorithms, employing public sources of precipitation and potential evapotranspiration to model the wetting/drying process. In expanding such local decision support tools to locations on a national scale, one must recognize the heterogeneity of hydroclimatic and soil characteristics throughout the United States. Machine learning algorithms modeling the wetting/drying process must address this variability, and yet it is wholly impractical to construct a separate algorithm for every conceivable location. For this reason, a national hydrological classification system is presented, allowing clusters of hydroclimatic similarity to emerge naturally from annual regime curve data and facilitate the development of cluster-specific algorithms. Given the desire to enable intelligent decision-making at any location, this classification system is developed in a manner that will allow for classification anywhere in the U.S., even in an ungauged basin. Daily time series data from 428 catchments in the MOPEX database are analyzed to produce an empirical classification tree, partitioning the United States into regions of hydroclimatic similarity. In constructing a classification tree based upon 55 years of data, it is important to recognize the non-stationary nature of climate data. The shifts in climatic regimes will cause certain locations to shift their ultimate position within the classification tree, requiring decision-makers to alter land usage, farming practices, and equipment needs, and algorithms to adjust accordingly. This work adapts the classification model to address the issue of regime shifts over larger temporal scales and suggests how land-usage and farming protocol may vary from hydroclimatic shifts in decades to come. Finally, the generalizability of the hydroclimatic classification system is tested with a physically-based soil moisture model calibrated at several locations throughout the continental United States. The soil moisture model is calibrated at a given site and then applied with the same parameters at other sites within and outside the same hydroclimatic class. 
The model's performance deteriorates minimally if the calibration and validation location are within the same hydroclimatic class, but deteriorates significantly if the calibration and validates sites are located in different hydroclimatic classes. These soil moisture estimates at the field scale are then further refined by the introduction of LiDAR elevation data, distinguishing faster-drying peaks and ridges from slower-drying valleys. The inclusion of LiDAR enabled multiple locations within the same field to be predicted accurately despite non-identical topography. This cross-application of parametric calibrations and LiDAR-driven disaggregation facilitates decision-support at locations without proximally-located soil moisture sensors.
Applications of Support Vector Machines In Chemo And Bioinformatics
NASA Astrophysics Data System (ADS)
Jayaraman, V. K.; Sundararajan, V.
2010-10-01
Conventional linear & nonlinear tools for classification, regression & data driven modeling are being replaced on a rapid scale by newer techniques & tools based on artificial intelligence and machine learning. While the linear techniques are not applicable for inherently nonlinear problems, newer methods serve as attractive alternatives for solving real life problems. Support Vector Machine (SVM) classifiers are a set of universal feed-forward network based classification algorithms that have been formulated from statistical learning theory and structural risk minimization principle. SVM regression closely follows the classification methodology. In this work recent applications of SVM in Chemo & Bioinformatics will be described with suitable illustrative examples.
[Construction and application of special analysis database of geoherbs based on 3S technology].
Guo, Lan-ping; Huang, Lu-qi; Lv, Dong-mei; Shao, Ai-juan; Wang, Jian
2007-09-01
In this paper,the structures, data sources, data codes of "the spacial analysis database of geoherbs" based 3S technology are introduced, and the essential functions of the database, such as data management, remote sensing, spacial interpolation, spacial statistics, spacial analysis and developing are described. At last, two examples for database usage are given, the one is classification and calculating of NDVI index of remote sensing image in geoherbal area of Atractylodes lancea, the other one is adaptation analysis of A. lancea. These indicate that "the spacial analysis database of geoherbs" has bright prospect in spacial analysis of geoherbs.
Modality-Driven Classification and Visualization of Ensemble Variance
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bensema, Kevin; Gosink, Luke; Obermaier, Harald
Paper for the IEEE Visualization Conference Advances in computational power now enable domain scientists to address conceptual and parametric uncertainty by running simulations multiple times in order to sufficiently sample the uncertain input space.
Agile convolutional neural network for pulmonary nodule classification using CT images.
Zhao, Xinzhuo; Liu, Liyao; Qi, Shouliang; Teng, Yueyang; Li, Jianhua; Qian, Wei
2018-04-01
To distinguish benign from malignant pulmonary nodules using CT images is critical for their precise diagnosis and treatment. A new Agile convolutional neural network (CNN) framework is proposed to conquer the challenges of a small-scale medical image database and the small size of the nodules, and it improves the performance of pulmonary nodule classification using CT images. A hybrid CNN of LeNet and AlexNet is constructed through combining the layer settings of LeNet and the parameter settings of AlexNet. A dataset with 743 CT image nodule samples is built up based on the 1018 CT scans of LIDC to train and evaluate the Agile CNN model. Through adjusting the parameters of the kernel size, learning rate, and other factors, the effect of these parameters on the performance of the CNN model is investigated, and an optimized setting of the CNN is obtained finally. After finely optimizing the settings of the CNN, the estimation accuracy and the area under the curve can reach 0.822 and 0.877, respectively. The accuracy of the CNN is significantly dependent on the kernel size, learning rate, training batch size, dropout, and weight initializations. The best performance is achieved when the kernel size is set to [Formula: see text], the learning rate is 0.005, the batch size is 32, and dropout and Gaussian initialization are used. This competitive performance demonstrates that our proposed CNN framework and the optimization strategy of the CNN parameters are suitable for pulmonary nodule classification characterized by small medical datasets and small targets. The classification model might help diagnose and treat pulmonary nodules effectively.
Using FDA reports to inform a classification for health information technology safety problems
Ong, Mei-Sing; Runciman, William; Coiera, Enrico
2011-01-01
Objective To expand an emerging classification for problems with health information technology (HIT) using reports submitted to the US Food and Drug Administration Manufacturer and User Facility Device Experience (MAUDE) database. Design HIT events submitted to MAUDE were retrieved using a standardized search strategy. Using an emerging classification with 32 categories of HIT problems, a subset of relevant events were iteratively analyzed to identify new categories. Two coders then independently classified the remaining events into one or more categories. Free-text descriptions were analyzed to identify the consequences of events. Measurements Descriptive statistics by number of reported problems per category and by consequence; inter-rater reliability analysis using the κ statistic for the major categories and consequences. Results A search of 899 768 reports from January 2008 to July 2010 yielded 1100 reports about HIT. After removing duplicate and unrelated reports, 678 reports describing 436 events remained. The authors identified four new categories to describe problems with software functionality, system configuration, interface with devices, and network configuration; the authors' classification with 32 categories of HIT problems was expanded by the addition of these four categories. Examination of the 436 events revealed 712 problems, 96% were machine-related, and 4% were problems at the human–computer interface. Almost half (46%) of the events related to hazardous circumstances. Of the 46 events (11%) associated with patient harm, four deaths were linked to HIT problems (0.9% of 436 events). Conclusions Only 0.1% of the MAUDE reports searched were related to HIT. Nevertheless, Food and Drug Administration reports did prove to be a useful new source of information about the nature of software problems and their safety implications with potential to inform strategies for safe design and implementation. PMID:21903979
Multi-label literature classification based on the Gene Ontology graph.
Jin, Bo; Muller, Brian; Zhai, Chengxiang; Lu, Xinghua
2008-12-08
The Gene Ontology is a controlled vocabulary for representing knowledge related to genes and proteins in a computable form. The current effort of manually annotating proteins with the Gene Ontology is outpaced by the rate of accumulation of biomedical knowledge in literature, which urges the development of text mining approaches to facilitate the process by automatically extracting the Gene Ontology annotation from literature. The task is usually cast as a text classification problem, and contemporary methods are confronted with unbalanced training data and the difficulties associated with multi-label classification. In this research, we investigated the methods of enhancing automatic multi-label classification of biomedical literature by utilizing the structure of the Gene Ontology graph. We have studied three graph-based multi-label classification algorithms, including a novel stochastic algorithm and two top-down hierarchical classification methods for multi-label literature classification. We systematically evaluated and compared these graph-based classification algorithms to a conventional flat multi-label algorithm. The results indicate that, through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods can significantly improve predictions of the Gene Ontology terms implied by the analyzed text. Furthermore, the graph-based multi-label classifiers are capable of suggesting Gene Ontology annotations (to curators) that are closely related to the true annotations even if they fail to predict the true ones directly. A software package implementing the studied algorithms is available for the research community. Through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods have better potential than the conventional flat multi-label classification approach to facilitate protein annotation based on the literature.
Stoker, Jason M.; Cochrane, Mark A.; Roy, David P.
2013-01-01
With the acquisition of lidar data for over 30 percent of the US, it is now possible to assess the three-dimensional distribution of features at the national scale. This paper integrates over 350 billion lidar points from 28 disparate datasets into a national-scale database and evaluates if height above ground is an important variable in the context of other nationalscale layers, such as the US Geological Survey National Land Cover Database and the US Environmental Protection Agency ecoregions maps. While the results were not homoscedastic and the available data did not allow for a complete height census in any of the classes, it does appear that where lidar data were used, there were detectable differences in heights among many of these national classification schemes. This study supports the hypothesis that there were real, detectable differences in heights in certain national-scale classification schemes, despite height not being a variable used in any of the classification routines.
Odronitz, Florian; Kollmar, Martin
2006-11-29
Annotation of protein sequences of eukaryotic organisms is crucial for the understanding of their function in the cell. Manual annotation is still by far the most accurate way to correctly predict genes. The classification of protein sequences, their phylogenetic relation and the assignment of function involves information from various sources. This often leads to a collection of heterogeneous data, which is hard to track. Cytoskeletal and motor proteins consist of large and diverse superfamilies comprising up to several dozen members per organism. Up to date there is no integrated tool available to assist in the manual large-scale comparative genomic analysis of protein families. Pfarao (Protein Family Application for Retrieval, Analysis and Organisation) is a database driven online working environment for the analysis of manually annotated protein sequences and their relationship. Currently, the system can store and interrelate a wide range of information about protein sequences, species, phylogenetic relations and sequencing projects as well as links to literature and domain predictions. Sequences can be imported from multiple sequence alignments that are generated during the annotation process. A web interface allows to conveniently browse the database and to compile tabular and graphical summaries of its content. We implemented a protein sequence-centric web application to store, organize, interrelate, and present heterogeneous data that is generated in manual genome annotation and comparative genomics. The application has been developed for the analysis of cytoskeletal and motor proteins (CyMoBase) but can easily be adapted for any protein.
Database citation in full text biomedical articles.
Kafkas, Şenay; Kim, Jee-Hyub; McEntyre, Johanna R
2013-01-01
Molecular biology and literature databases represent essential infrastructure for life science research. Effective integration of these data resources requires that there are structured cross-references at the level of individual articles and biological records. Here, we describe the current patterns of how database entries are cited in research articles, based on analysis of the full text Open Access articles available from Europe PMC. Focusing on citation of entries in the European Nucleotide Archive (ENA), UniProt and Protein Data Bank, Europe (PDBe), we demonstrate that text mining doubles the number of structured annotations of database record citations supplied in journal articles by publishers. Many thousands of new literature-database relationships are found by text mining, since these relationships are also not present in the set of articles cited by database records. We recommend that structured annotation of database records in articles is extended to other databases, such as ArrayExpress and Pfam, entries from which are also cited widely in the literature. The very high precision and high-throughput of this text-mining pipeline makes this activity possible both accurately and at low cost, which will allow the development of new integrated data services.
Database Citation in Full Text Biomedical Articles
Kafkas, Şenay; Kim, Jee-Hyub; McEntyre, Johanna R.
2013-01-01
Molecular biology and literature databases represent essential infrastructure for life science research. Effective integration of these data resources requires that there are structured cross-references at the level of individual articles and biological records. Here, we describe the current patterns of how database entries are cited in research articles, based on analysis of the full text Open Access articles available from Europe PMC. Focusing on citation of entries in the European Nucleotide Archive (ENA), UniProt and Protein Data Bank, Europe (PDBe), we demonstrate that text mining doubles the number of structured annotations of database record citations supplied in journal articles by publishers. Many thousands of new literature-database relationships are found by text mining, since these relationships are also not present in the set of articles cited by database records. We recommend that structured annotation of database records in articles is extended to other databases, such as ArrayExpress and Pfam, entries from which are also cited widely in the literature. The very high precision and high-throughput of this text-mining pipeline makes this activity possible both accurately and at low cost, which will allow the development of new integrated data services. PMID:23734176
USER'S GUIDE FOR GLOED VERSION 1.0 - THE GLOBAL EMISSIONS DATABASE
The document is a user's guide for the EPA-developed, powerful software package, Global Emissions Database (GloED). GloED is a user-friendly, menu-driven tool for storing and retrieving emissions factors and activity data on a country-specific basis. Data can be selected from dat...
The BioMart community portal: an innovative alternative to large, centralized data repositories
USDA-ARS?s Scientific Manuscript database
The BioMart Community Portal (www.biomart.org) is a community-driven effort to provide a unified interface to biomedical databases that are distributed worldwide. The portal provides access to numerous database projects supported by 30 scientific organizations. It includes over 800 different biologi...
ERIC Educational Resources Information Center
Williamson, Ben
2015-01-01
This article examines the emergence of "digital governance" in public education in England. Drawing on and combining concepts from software studies, policy and political studies, it identifies some specific approaches to digital governance facilitated by network-based communications and database-driven information processing software…
On exploration of medical database of Crohn's disease
NASA Astrophysics Data System (ADS)
Manerowska, Anna; Dadalski, Maciej; Socha, Piotr; Mulawka, Jan
2010-09-01
The primary objective of this article is to find a new, more effective method of diagnosis of Crohn's disease. Having created the database on this disease we wanted to find the most suitable classification models. We used the algorithms with their implementations stored in R environment. Having carried out the investigations we have reached results interesting for clinical practice.
ERIC Educational Resources Information Center
Micco, Mary; Popp, Rich
Techniques for building a world-wide information infrastructure by reverse engineering existing databases to link them in a hierarchical system of subject clusters to create an integrated database are explored. The controlled vocabulary of the Library of Congress Subject Headings is used to ensure consistency and group similar items. Each database…
Jiang, Guoqian; Wang, Chen; Zhu, Qian; Chute, Christopher G
2013-01-01
Knowledge-driven text mining is becoming an important research area for identifying pharmacogenomics target genes. However, few of such studies have been focused on the pharmacogenomics targets of adverse drug events (ADEs). The objective of the present study is to build a framework of knowledge integration and discovery that aims to support pharmacogenomics target predication of ADEs. We integrate a semantically annotated literature corpus Semantic MEDLINE with a semantically coded ADE knowledgebase known as ADEpedia using a semantic web based framework. We developed a knowledge discovery approach combining a network analysis of a protein-protein interaction (PPI) network and a gene functional classification approach. We performed a case study of drug-induced long QT syndrome for demonstrating the usefulness of the framework in predicting potential pharmacogenomics targets of ADEs.
Extracting biomedical events from pairs of text entities
2015-01-01
Background Huge amounts of electronic biomedical documents, such as molecular biology reports or genomic papers are generated daily. Nowadays, these documents are mainly available in the form of unstructured free texts, which require heavy processing for their registration into organized databases. This organization is instrumental for information retrieval, enabling to answer the advanced queries of researchers and practitioners in biology, medicine, and related fields. Hence, the massive data flow calls for efficient automatic methods of text-mining that extract high-level information, such as biomedical events, from biomedical text. The usual computational tools of Natural Language Processing cannot be readily applied to extract these biomedical events, due to the peculiarities of the domain. Indeed, biomedical documents contain highly domain-specific jargon and syntax. These documents also describe distinctive dependencies, making text-mining in molecular biology a specific discipline. Results We address biomedical event extraction as the classification of pairs of text entities into the classes corresponding to event types. The candidate pairs of text entities are recursively provided to a multiclass classifier relying on Support Vector Machines. This recursive process extracts events involving other events as arguments. Compared to joint models based on Markov Random Fields, our model simplifies inference and hence requires shorter training and prediction times along with lower memory capacity. Compared to usual pipeline approaches, our model passes over a complex intermediate problem, while making a more extensive usage of sophisticated joint features between text entities. Our method focuses on the core event extraction of the Genia task of BioNLP challenges yielding the best result reported so far on the 2013 edition. PMID:26201478
Keyless Entry: Building a Text Database Using OCR Technology.
ERIC Educational Resources Information Center
Grotophorst, Clyde W.
1989-01-01
Discusses the use of optical character recognition (OCR) technology to produce an ASCII text database. A tutorial on digital scanning and OCR is provided, and a systems integration project which used the Calera CDP-3000XF scanner and text retrieval software to construct a database of dissertations at George Mason University is described. (four…
A Tutorial in Creating Web-Enabled Databases with Inmagic DB/TextWorks through ODBC.
ERIC Educational Resources Information Center
Breeding, Marshall
2000-01-01
Explains how to create Web-enabled databases. Highlights include Inmagic's DB/Text WebPublisher product called DB/TextWorks; ODBC (Open Database Connectivity) drivers; Perl programming language; HTML coding; Structured Query Language (SQL); Common Gateway Interface (CGI) programming; and examples of HTML pages and Perl scripts. (LRW)
Comparative Analysis of Document level Text Classification Algorithms using R
NASA Astrophysics Data System (ADS)
Syamala, Maganti; Nalini, N. J., Dr; Maguluri, Lakshamanaphaneendra; Ragupathy, R., Dr.
2017-08-01
From the past few decades there has been tremendous volumes of data available in Internet either in structured or unstructured form. Also, there is an exponential growth of information on Internet, so there is an emergent need of text classifiers. Text mining is an interdisciplinary field which draws attention on information retrieval, data mining, machine learning, statistics and computational linguistics. And to handle this situation, a wide range of supervised learning algorithms has been introduced. Among all these K-Nearest Neighbor(KNN) is efficient and simplest classifier in text classification family. But KNN suffers from imbalanced class distribution and noisy term features. So, to cope up with this challenge we use document based centroid dimensionality reduction(CentroidDR) using R Programming. By combining these two text classification techniques, KNN and Centroid classifiers, we propose a scalable and effective flat classifier, called MCenKNN which works well substantially better than CenKNN.
Valent, Francesca; Clagnan, Elena; Zanier, Loris
2014-01-01
to assess whether Naïve Bayes Classification could be used to classify injury causes from the Emergency Room (ER) database, because in the Friuli Venezia Giulia Region (Northern Italy) the electronic ER data have never been used to study the epidemiology of injuries, because the proportion of generic "accidental" causes is much higher than that of injuries with a specific cause. application of the Naïve Bayes Classification method to the regional ER database. sensitivity, specificity, positive and negative predictive values, agreement, and the kappa statistic were calculated for the train dataset and the distribution of causes of injury for the test dataset. on 22.248 records with known cause, the classifications assigned by the model agreed moderately (kappa =0.53) with those assigned by ER personnel. The model was then used on 76.660 unclassified cases. Although sensitivity and positive predictive value of the method were generally poor, mainly due to limitations in the ER data, it allowed to estimate for the first time the frequency of specific injury causes in the Region. the model was useful to provide the "big picture" of non-fatal injuries in the Region. To improve the collection of injury data at the ER, the options available for injury classification in the ER software are being revised to make categories exhaustive and mutually exclusive.
Classification of a large microarray data set: Algorithm comparison and analysis of drug signatures
Natsoulis, Georges; El Ghaoui, Laurent; Lanckriet, Gert R.G.; Tolley, Alexander M.; Leroy, Fabrice; Dunlea, Shane; Eynon, Barrett P.; Pearson, Cecelia I.; Tugendreich, Stuart; Jarnagin, Kurt
2005-01-01
A large gene expression database has been produced that characterizes the gene expression and physiological effects of hundreds of approved and withdrawn drugs, toxicants, and biochemical standards in various organs of live rats. In order to derive useful biological knowledge from this large database, a variety of supervised classification algorithms were compared using a 597-microarray subset of the data. Our studies show that several types of linear classifiers based on Support Vector Machines (SVMs) and Logistic Regression can be used to derive readily interpretable drug signatures with high classification performance. Both methods can be tuned to produce classifiers of drug treatments in the form of short, weighted gene lists which upon analysis reveal that some of the signature genes have a positive contribution (act as “rewards” for the class-of-interest) while others have a negative contribution (act as “penalties”) to the classification decision. The combination of reward and penalty genes enhances performance by keeping the number of false positive treatments low. The results of these algorithms are combined with feature selection techniques that further reduce the length of the drug signatures, an important step towards the development of useful diagnostic biomarkers and low-cost assays. Multiple signatures with no genes in common can be generated for the same classification end-point. Comparison of these gene lists identifies biological processes characteristic of a given class. PMID:15867433
Lean waste classification model to support the sustainable operational practice
NASA Astrophysics Data System (ADS)
Sutrisno, A.; Vanany, I.; Gunawan, I.; Asjad, M.
2018-04-01
Driven by growing pressure for a more sustainable operational practice, improvement on the classification of non-value added (waste) is one of the prerequisites to realize sustainability of a firm. While the use of the 7 (seven) types of the Ohno model now becoming a versatile tool to reveal the lean waste occurrence. In many recent investigations, the use of the Seven Waste model of Ohno is insufficient to cope with the types of waste occurred in industrial practices at various application levels. Intended to a narrowing down this limitation, this paper presented an improved waste classification model based on survey to recent studies discussing on waste at various operational stages. Implications on the waste classification model to the body of knowledge and industrial practices are provided.
Stellar Classification Online - Public Exploration
NASA Astrophysics Data System (ADS)
Castelaz, Michael W.; Bedell, W.; Barker, T.; Cline, J.; Owen, L.
2009-01-01
The Michigan Objective Prism Blue Survey (e.g. Sowell et al 2007, AJ, 134, 1089) photographic plates located in the Astronomical Photographic Data Archive at the Pisgah Astronomical Research Institute hold hundreds of thousands of stellar spectra, many of which have not been classified before. The public is invited to participate in a distributed computing online environment to classify the stars on the objective prism plates. The online environment is called Stellar Classification Online - Public Exploration (SCOPE). Through a website, SCOPE participants are given a tutorial on stellar spectra and their classification, and given the chance to practice their skills at classification. After practice, participants register, login, and select stars for classification from scans of the objective prism plates. Their classifications are recorded in a database where the accumulation of classifications of the same star by many users will be statistically analyzed. The project includes stars with known spectral types to help test the reliability of classifications. The SCOPE webpage and the use of results will be described.
Building a common pipeline for rule-based document classification.
Patterson, Olga V; Ginter, Thomas; DuVall, Scott L
2013-01-01
Instance-based classification of clinical text is a widely used natural language processing task employed as a step for patient classification, document retrieval, or information extraction. Rule-based approaches rely on concept identification and context analysis in order to determine the appropriate class. We propose a five-step process that enables even small research teams to develop simple but powerful rule-based NLP systems by taking advantage of a common UIMA AS based pipeline for classification. Our proposed methodology coupled with the general-purpose solution provides researchers with access to the data locked in clinical text in cases of limited human resources and compact timelines.
A classification of errors in lay comprehension of medical documents.
Keselman, Alla; Smith, Catherine Arnott
2012-12-01
Emphasis on participatory medicine requires that patients and consumers participate in tasks traditionally reserved for healthcare providers. This includes reading and comprehending medical documents, often but not necessarily in the context of interacting with Personal Health Records (PHRs). Research suggests that while giving patients access to medical documents has many benefits (e.g., improved patient-provider communication), lay people often have difficulty understanding medical information. Informatics can address the problem by developing tools that support comprehension; this requires in-depth understanding of the nature and causes of errors that lay people make when comprehending clinical documents. The objective of this study was to develop a classification scheme of comprehension errors, based on lay individuals' retellings of two documents containing clinical text: a description of a clinical trial and a typical office visit note. While not comprehensive, the scheme can serve as a foundation of further development of a taxonomy of patients' comprehension errors. Eighty participants, all healthy volunteers, read and retold two medical documents. A data-driven content analysis procedure was used to extract and classify retelling errors. The resulting hierarchical classification scheme contains nine categories and 23 subcategories. The most common error made by the participants involved incorrectly recalling brand names of medications. Other common errors included misunderstanding clinical concepts, misreporting the objective of a clinical research study and physician's findings during a patient's visit, and confusing and misspelling clinical terms. A combination of informatics support and health education is likely to improve the accuracy of lay comprehension of medical documents. Published by Elsevier Inc.
NASA Astrophysics Data System (ADS)
Verma, Surendra P.; Rivera-Gómez, M. Abdelaly; Díaz-González, Lorena; Quiroz-Ruiz, Alfredo
2016-12-01
A new multidimensional classification scheme consistent with the chemical classification of the International Union of Geological Sciences (IUGS) is proposed for the nomenclature of High-Mg altered rocks. Our procedure is based on an extensive database of major element (SiO2, TiO2, Al2O3, Fe2O3t, MnO, MgO, CaO, Na2O, K2O, and P2O5) compositions of a total of 33,868 (920 High-Mg and 32,948 "Common") relatively fresh igneous rock samples. The database consisting of these multinormally distributed samples in terms of their isometric log-ratios was used to propose a set of 11 discriminant functions and 6 diagrams to facilitate High-Mg rock classification. The multinormality required by linear discriminant and canonical analysis was ascertained by a new computer program DOMuDaF. One multidimensional function can distinguish the High-Mg and Common igneous rocks with high percent success values of about 86.4% and 98.9%, respectively. Similarly, from 10 discriminant functions the High-Mg rocks can also be classified as one of the four rock types (komatiite, meimechite, picrite, and boninite), with high success values of about 88%-100%. Satisfactory functioning of this new classification scheme was confirmed by seven independent tests. Five further case studies involving application to highly altered rocks illustrate the usefulness of our proposal. A computer program HMgClaMSys was written to efficiently apply the proposed classification scheme, which will be available for online processing of igneous rock compositional data. Monte Carlo simulation modeling and mass-balance computations confirmed the robustness of our classification with respect to analytical errors and postemplacement compositional changes.
Porras-Alfaro, Andrea; Liu, Kuan-Liang; Kuske, Cheryl R; Xie, Gary
2014-02-01
We compared the classification accuracy of two sections of the fungal internal transcribed spacer (ITS) region, individually and combined, and the 5' section (about 600 bp) of the large-subunit rRNA (LSU), using a naive Bayesian classifier and BLASTN. A hand-curated ITS-LSU training set of 1,091 sequences and a larger training set of 8,967 ITS region sequences were used. Of the factors evaluated, database composition and quality had the largest effect on classification accuracy, followed by fragment size and use of a bootstrap cutoff to improve classification confidence. The naive Bayesian classifier and BLASTN gave similar results at higher taxonomic levels, but the classifier was faster and more accurate at the genus level when a bootstrap cutoff was used. All of the ITS and LSU sections performed well (>97.7% accuracy) at higher taxonomic ranks from kingdom to family, and differences between them were small at the genus level (within 0.66 to 1.23%). When full-length sequence sections were used, the LSU outperformed the ITS1 and ITS2 fragments at the genus level, but the ITS1 and ITS2 showed higher accuracy when smaller fragment sizes of the same length and a 50% bootstrap cutoff were used. In a comparison using the larger ITS training set, ITS1 and ITS2 had very similar accuracy classification for fragments between 100 and 200 bp. Collectively, the results show that any of the ITS or LSU sections we tested provided comparable classification accuracy to the genus level and underscore the need for larger and more diverse classification training sets.
Database Management Systems: New Homes for Migrating Bibliographic Records.
ERIC Educational Resources Information Center
Brooks, Terrence A.; Bierbaum, Esther G.
1987-01-01
Assesses bibliographic databases as part of visionary text systems such as hypertext and scholars' workstations. Downloading is discussed in terms of the capability to search records and to maintain unique bibliographic descriptions, and relational database management systems, file managers, and text databases are reviewed as possible hosts for…
Short text sentiment classification based on feature extension and ensemble classifier
NASA Astrophysics Data System (ADS)
Liu, Yang; Zhu, Xie
2018-05-01
With the rapid development of Internet social media, mining the emotional tendencies of short texts from the Internet in order to acquire useful information has attracted the attention of researchers. Current approaches can be broadly attributed to rule-based classification and statistical machine learning methods. Although micro-blog sentiment analysis has made good progress, shortcomings remain, such as accuracy that is not yet high enough and a classification effect that depends strongly on the available features. Considering the characteristics of Chinese short texts, such as limited information, sparse features, and diverse expressions, this paper expands the original text by mining related semantic information from comments, forwarding and other related information. First, Word2vec is used to compute word similarity and extend the feature words. Then an ensemble classifier composed of SVM, KNN and HMM is used to analyze the sentiment of short micro-blog texts. The experimental results show that the proposed method makes good use of comment and forwarding information to extend the original features. Compared with the traditional method, the accuracy, recall and F1 value obtained by this method are all improved.
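A rough sketch of the two stages (Word2vec-based feature extension, then an ensemble vote) follows. The tokenized toy corpus and model settings are assumptions, and a soft-voting SVM/KNN pair stands in for the paper's SVM/KNN/HMM ensemble.

```python
# Sketch: extend each short text with its most similar words, then ensemble-classify.
from gensim.models import Word2Vec
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

corpus = [["phone", "battery", "bad"], ["movie", "great", "fun"],
          ["battery", "awful"], ["great", "story"]]          # toy tokenized posts
labels = ["neg", "pos", "neg", "pos"]

w2v = Word2Vec(corpus, vector_size=50, min_count=1, seed=1)

def extend(tokens, topn=2):
    """Feature extension: append the topn most similar words to each token."""
    out = list(tokens)
    for t in tokens:
        if t in w2v.wv:
            out += [w for w, _ in w2v.wv.most_similar(t, topn=topn)]
    return " ".join(out)

X = [extend(doc) for doc in corpus]
ens = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier([("svm", SVC(probability=True)),
                      ("knn", KNeighborsClassifier(n_neighbors=3))],
                     voting="soft"),
)
ens.fit(X, labels)
print(ens.predict([extend(["battery", "bad"])]))             # -> ['neg']
```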
NASA Astrophysics Data System (ADS)
Song, Xiaoning; Feng, Zhen-Hua; Hu, Guosheng; Yang, Xibei; Yang, Jingyu; Qi, Yunsong
2015-09-01
This paper proposes a progressive sparse representation-based classification algorithm using local discrete cosine transform (DCT) evaluation to perform face recognition. Specifically, the sum of the contributions of all training samples of each subject is first taken as the contribution of this subject, then the redundant subject with the smallest contribution to the test sample is iteratively eliminated. Second, the progressive method aims at representing the test sample as a linear combination of all the remaining training samples, by which the representation capability of each training sample is exploited to determine the optimal "nearest neighbors" for the test sample. Third, the transformed DCT evaluation is constructed to measure the similarity between the test sample and each local training sample using cosine distance metrics in the DCT domain. The final goal of the proposed method is to determine an optimal weighted sum of nearest neighbors that are obtained under the local correlative degree evaluation, which is approximately equal to the test sample, and we can use this weighted linear combination to perform robust classification. Experimental results conducted on the ORL database of faces (created by the Olivetti Research Laboratory in Cambridge), the FERET face database (managed by the Defense Advanced Research Projects Agency and the National Institute of Standards and Technology), AR face database (created by Aleix Martinez and Robert Benavente in the Computer Vision Center at U.A.B), and USPS handwritten digit database (gathered at the Center of Excellence in Document Analysis and Recognition at SUNY Buffalo) demonstrate the effectiveness of the proposed method.
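The foundation of the method is sparse representation-based classification (SRC): code the test sample over all training samples and assign the class whose coefficients reconstruct it best. The sketch below shows only that baseline step, with a lasso solver and synthetic data as assumptions; the progressive subject elimination and the DCT-domain cosine evaluation described above are omitted.

```python
# Sketch: basic SRC via class-restricted reconstruction residuals.
import numpy as np
from sklearn.linear_model import Lasso

def src_predict(X_train, y_train, x_test, alpha=0.01):
    # Treat each training sample as one column of the dictionary.
    lasso = Lasso(alpha=alpha, max_iter=50_000, fit_intercept=False)
    lasso.fit(X_train.T, x_test)                 # solve x_test ~ X_train.T @ coef
    coef = lasso.coef_
    residuals = {}
    for c in np.unique(y_train):
        mask = (y_train == c)
        recon = X_train[mask].T @ coef[mask]     # reconstruction from class c only
        residuals[c] = np.linalg.norm(x_test - recon)
    return min(residuals, key=residuals.get)     # smallest residual wins

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 64))              # 20 training "faces", 64-dim features
y_train = np.repeat(np.arange(4), 5)             # 4 subjects x 5 images each
x_test = X_train[3] + 0.05 * rng.normal(size=64) # noisy copy of a subject-0 image
print(src_predict(X_train, y_train, x_test))     # -> 0
```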
LBP and SIFT based facial expression recognition
NASA Astrophysics Data System (ADS)
Sumer, Omer; Gunes, Ece O.
2015-02-01
This study compares the performance of local binary patterns (LBP) and the scale invariant feature transform (SIFT) with support vector machines (SVM) in automatic classification of discrete facial expressions. Facial expression recognition is a multiclass classification problem, and seven classes are classified: happiness, anger, sadness, disgust, surprise, fear and contempt. Using SIFT feature vectors and a linear SVM, 93.1% mean accuracy is achieved on the CK+ database. On the other hand, the performance of the LBP-based classifier with a linear SVM is reported on SFEW using the strictly person independent (SPI) protocol. Seven-class mean accuracy on SFEW is 59.76%. Experiments on both databases showed that LBP features can be used in a fairly descriptive way if a good localization of facial points and partitioning strategy are followed.
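A hedged sketch of the LBP-with-SVM side of the pipeline: uniform LBP histograms over a grid of image patches, concatenated and fed to a linear SVM. The grid size, LBP parameters and random stand-in images are assumptions.

```python
# Sketch: grid-of-patches uniform-LBP histograms + linear SVM.
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import LinearSVC

P, R = 8, 1  # neighbours and radius; 'uniform' LBP gives P + 2 = 10 codes

def lbp_histogram_grid(img, grid=(4, 4)):
    lbp = local_binary_pattern(img, P, R, method="uniform")
    gh, gw = img.shape[0] // grid[0], img.shape[1] // grid[1]
    feats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            cell = lbp[i * gh:(i + 1) * gh, j * gw:(j + 1) * gw]
            hist, _ = np.histogram(cell, bins=P + 2, range=(0, P + 2), density=True)
            feats.append(hist)
    return np.concatenate(feats)                 # one descriptor per face image

rng = np.random.default_rng(0)
imgs = rng.integers(0, 256, size=(14, 64, 64), dtype=np.uint8)  # stand-in faces
X = np.array([lbp_histogram_grid(im) for im in imgs])
y = np.arange(14) % 7                            # 7 expression classes (toy labels)
clf = LinearSVC().fit(X, y)
print(clf.predict(X[:2]))
```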
Program for Critical Technologies in Breast Oncology
1999-07-01
…the tissues, and in an ethical manner that respects the patients' rights. The Program for Critical Technologies in Breast Oncology helps address all of… …closer to clinical utility.
Fuel Type Classification and Fuel Loading in Central Interior, Korea: Uiseong-Gun
Myoung Soo Won; Kyo Sang Koo; Myung Bo Lee; Si Young Lee
2006-01-01
The objective of this study is the classification of fuel type and the calculation of fuel loading to assess forest fire hazard by fuel characteristics at Uiseong-gun, Gyeongbuk, located in the central interior of Korea. A database of eight factors, such as forest type and topography, was constructed using ArcGIS 9.1 GIS software. An on-site survey was conducted for investigating...
NASA Technical Reports Server (NTRS)
Schutt, J.; Fessler, B.; Cassidy, W. A.
1993-01-01
This technical report is an update to LPI Technical Report 89-02, which contained data and information that was current to May 1987. Since that time approximately 4000 new meteorites have been collected, mapped, and characterized, mainly from the numerous ice fields in the Allan Hills-David Glacier region, from the Pecora Escarpment and Moulton Escarpment in the Thiel Mountains-Patuxent region, the Wisconsin Range region, and from the Beardmore region. Meteorite location maps for ice fields from these regions have been produced and are available. This report includes explanatory texts for the maps of new areas and provides information on updates of maps of the areas covered in LPI Technical Report 89-02. Sketch maps and description of locales that have been searched and have yielded single or few meteorites are also included. The meteorite listings for all the ice fields have been updated to include any classification changes and new meteorites recovered from ice fields in the Allan Hills-David Glacier region since 1987. The text has been reorganized and minor errors in the original report have been corrected. Computing capabilities have improved immensely since the early days of this project. Current software and hardware allow easy access to data over computer networks. With various commercial software packages, the data can be used many different ways, including database creation, statistics, and mapping. The databases, explanatory texts, and the plotter files used to produce the meteorite location maps are available through a computer network. Information on how to access AMLAMP data, its formats, and ways it can be used are given in the User's Guide to AMLAMP Data section. Meteorite location maps and thematic maps may be ordered from the Lunar and Planetary Institute. Ordering information is given in Appendix A.
Pereira, Florbela; Latino, Diogo A. R. S.; Gaudêncio, Susana P.
2014-01-01
The comprehensive information of small molecules and their biological activities in the PubChem database allows chemoinformatic researchers to access and make use of large-scale biological activity data to improve the precision of drug profiling. A Quantitative Structure–Activity Relationship approach, for classification, was used for the prediction of active/inactive compounds relatively to overall biological activity, antitumor and antibiotic activities using a data set of 1804 compounds from PubChem. Using the best classification models for antibiotic and antitumor activities a data set of marine and microbial natural products from the AntiMarin database were screened—57 and 16 new lead compounds for antibiotic and antitumor drug design were proposed, respectively. All compounds proposed by our approach are classified as non-antibiotic and non-antitumor compounds in the AntiMarin database. Recently several of the lead-like compounds proposed by us were reported as being active in the literature. PMID:24473174
Li, Haiquan; Dai, Xinbin; Zhao, Xuechun
2008-05-01
Membrane transport proteins play a crucial role in the import and export of ions, small molecules or macromolecules across biological membranes. Currently, there are a limited number of published computational tools which enable the systematic discovery and categorization of transporters prior to costly experimental validation. To approach this problem, we utilized a nearest neighbor method which seamlessly integrates homologous search and topological analysis into a machine-learning framework. Our approach satisfactorily distinguished 484 transporter families in the Transporter Classification Database, a curated and representative database for transporters. A five-fold cross-validation on the database achieved a positive classification rate of 72.3% on average. Furthermore, this method successfully detected transporters in seven model and four non-model organisms, ranging from archaean to mammalian species. A preliminary literature-based validation has cross-validated 65.8% of our predictions on the 11 organisms, including 55.9% of our predictions overlapping with 83.6% of the predicted transporters in TransportDB.
NASA Astrophysics Data System (ADS)
Van Hirtum, A.; Berckmans, D.
2003-09-01
A natural acoustic indicator of animal welfare is the appearance (or absence) of coughing in the animal habitat. A sound database of 5319 individual sounds, including 2034 coughs and containing both animal vocalizations and background noises, was collected from six healthy piglets. Each of the test animals was repeatedly placed in a laboratory installation where coughing was induced by nebulization of citric acid. A two-class classification into 'cough' or 'other' was performed by applying a distance function to a fast Fourier spectral sound analysis. This resulted in a positive cough recognition of 92%. For the whole sound database, however, there was a misclassification of 21%. As spectral information up to 10000 Hz is available, an improved overall classification on the same database is obtained by applying the distance function to nine frequency ranges and combining the achieved distance values in fuzzy rules. For each frequency range, the clustering threshold is determined by fuzzy c-means clustering.
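The first stage can be outlined as band-energy features plus a template distance. In the sketch below the sampling rate, band edges and templates are assumptions, and the fuzzy c-means clustering and fuzzy-rule combination of the nine band distances are omitted.

```python
# Sketch: nine-band FFT energy profiles compared to 'cough' and 'other' templates.
import numpy as np

FS = 22050                                   # Hz, assumed sampling rate
EDGES = np.linspace(0, 10_000, 10)           # nine bands up to 10 kHz

def band_energies(signal):
    spec = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1 / FS)
    e = np.array([spec[(freqs >= lo) & (freqs < hi)].sum()
                  for lo, hi in zip(EDGES[:-1], EDGES[1:])])
    return e / e.sum()                       # normalised energy profile

def classify(signal, cough_tpl, other_tpl):
    e = band_energies(signal)
    return ("cough" if np.linalg.norm(e - cough_tpl) < np.linalg.norm(e - other_tpl)
            else "other")

rng = np.random.default_rng(1)
cough_tpl = band_energies(rng.normal(size=FS // 2))  # stand-in templates
other_tpl = band_energies(np.sin(2 * np.pi * 440 * np.arange(FS // 2) / FS))
print(classify(rng.normal(size=FS // 2), cough_tpl, other_tpl))  # -> 'cough'
```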
National Land Cover Database 2011 (NLCD 2011) is the most recent national land cover product created by the Multi-Resolution Land Characteristics (MRLC) Consortium. NLCD 2011 provides, for the first time, the capability to assess wall-to-wall, spatially explicit, national land cover changes and trends across the United States from 2001 to 2011. As with the two previous NLCD land cover products, NLCD 2011 keeps the same 16-class land cover classification scheme that has been applied consistently across the United States at a spatial resolution of 30 meters. NLCD 2011 is based primarily on a decision-tree classification of circa 2011 Landsat satellite data. This dataset is associated with the following publication: Homer, C., J. Dewitz, L. Yang, S. Jin, P. Danielson, G. Xian, J. Coulston, N. Herold, J. Wickham, and K. Megown. Completion of the 2011 National Land Cover Database for the Conterminous United States – Representing a Decade of Land Cover Change Information. PHOTOGRAMMETRIC ENGINEERING AND REMOTE SENSING. American Society for Photogrammetry and Remote Sensing, Bethesda, MD, USA, 81(0): 345-354, (2015).
A Novel Feature Selection Technique for Text Classification Using Naïve Bayes.
Dey Sarkar, Subhajit; Goswami, Saptarsi; Agarwal, Aman; Aktar, Javed
2014-01-01
With the proliferation of unstructured data, text classification or text categorization has found many applications in topic classification, sentiment analysis, authorship identification, spam detection, and so on. There are many classification algorithms available. Naïve Bayes remains one of the oldest and most popular classifiers. On one hand, the implementation of naïve Bayes is simple; on the other hand, it also requires a smaller amount of training data. From the literature review, it is found that naïve Bayes performs poorly compared to other classifiers in text classification. As a result, this makes the naïve Bayes classifier unusable in spite of the simplicity and intuitiveness of the model. In this paper, we propose a two-step feature selection method based on univariate feature selection followed by feature clustering, where we use the univariate feature selection method to reduce the search space and then apply clustering to select relatively independent feature sets. We demonstrate the effectiveness of our method by a thorough evaluation and comparison over 13 datasets. The performance improvement thus achieved makes naïve Bayes comparable or superior to other classifiers. The proposed algorithm is shown to outperform other traditional methods like greedy search based wrappers or CFS.
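A compact sketch of the two-step idea follows, under assumptions (chi-squared as the univariate criterion, k-means over feature columns, a toy corpus); the paper's exact criteria may differ.

```python
# Sketch: univariate ranking, then feature clustering, then naive Bayes.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap pills offer", "meeting agenda today", "offer cheap meds",
        "project meeting notes", "cheap offer now", "agenda for project"]
y = np.array([1, 0, 1, 0, 1, 0])                 # 1 = spam (toy labels)

X = CountVectorizer().fit_transform(docs)
sel = SelectKBest(chi2, k=6).fit(X, y)           # step 1: univariate ranking
X_sel = sel.transform(X).toarray()

# Step 2: cluster the surviving feature columns; keep one representative per
# cluster (first member here for brevity; ranking within clusters would be
# closer to the paper) to obtain relatively independent features.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_sel.T)
keep = [np.where(km.labels_ == c)[0][0] for c in range(3)]

clf = MultinomialNB().fit(X_sel[:, keep], y)
print(clf.predict(X_sel[:2, keep]))
```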
Fesharaki, Nooshin Jafari; Pourghassem, Hossein
2013-07-01
Due to the daily mass production and the widespread variation of medical X-ray images, it is necessary to classify these images for searching and retrieval purposes, especially for content-based medical image retrieval systems. In this paper, a medical X-ray image hierarchical classification structure based on a novel merging and splitting scheme and using shape and texture features is proposed. In the first level of the proposed structure, to improve the classification performance, classes that are similar in shape content are grouped, based on merging measures and shape features, into general overlapped classes. In the next levels of this structure, the overlapped classes are split into smaller classes based on the classification performance of a combination of shape and texture features, or texture features only. This procedure continues through the last levels until all the classes are formed separately. Moreover, to optimize the feature vector in the proposed structure, we use an orthogonal forward selection algorithm according to the Mahalanobis class separability measure for feature selection and reduction. In other words, according to the complexity and inter-class distance of each class, a sub-space of the feature space is selected in each level, and then a supervised merging and splitting scheme is applied to form the hierarchical classification. The proposed structure is evaluated on a database consisting of 2158 medical X-ray images of 18 classes (the IMAGECLEF 2005 database), and an accuracy rate of 93.6% in the last level of the hierarchical structure is obtained for the 18-class classification problem.
Yi, Chucai; Tian, Yingli
2012-09-01
In this paper, we propose a novel framework to extract text regions from scene images with complex backgrounds and multiple text appearances. This framework consists of three main steps: boundary clustering (BC), stroke segmentation, and string fragment classification. In BC, we propose a new bigram-color-uniformity-based method to model both text and attachment surface, and cluster edge pixels based on color pairs and spatial positions into boundary layers. Then, stroke segmentation is performed at each boundary layer by color assignment to extract character candidates. We propose two algorithms to combine the structural analysis of text stroke with color assignment and filter out background interferences. Further, we design a robust string fragment classification based on Gabor-based text features. The features are obtained from feature maps of gradient, stroke distribution, and stroke width. The proposed framework of text localization is evaluated on scene images, born-digital images, broadcast video images, and images of handheld objects captured by blind persons. Experimental results on respective datasets demonstrate that the framework outperforms state-of-the-art localization algorithms.
Younghak Shin; Balasingham, Ilangko
2017-07-01
Colonoscopy is a standard method for screening polyps by highly trained physicians. Polyps missed during colonoscopy are a potential risk factor for colorectal cancer. In this study, we investigate an automatic polyp classification framework. We aim to compare two different approaches: a hand-crafted feature method and a convolutional neural network (CNN) based deep learning method. Combined shape and color features are used for hand-crafted feature extraction, and a support vector machine (SVM) is adopted for classification. For the CNN approach, a deep learning framework with three convolution and pooling layers is used for classification. The proposed framework is evaluated using three public polyp databases. From the experimental results, we show that the CNN based deep learning framework achieves better classification performance than the hand-crafted feature based methods, with over 90% classification accuracy, sensitivity, specificity and precision.
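A minimal sketch of a "three convolution and pooling" network for binary polyp classification is shown below in Keras; the input size, filter counts and training settings are assumptions rather than the authors' exact architecture.

```python
# Sketch: small three-block CNN for polyp vs non-polyp classification.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Conv2D(32, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # polyp probability
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy", tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall()])
model.summary()
# Hypothetical training call (train_images/train_labels are placeholders):
# model.fit(train_images, train_labels, validation_split=0.1, epochs=20)
```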
A fuzzy hill-climbing algorithm for the development of a compact associative classifier
NASA Astrophysics Data System (ADS)
Mitra, Soumyaroop; Lam, Sarah S.
2012-02-01
Classification, a data mining technique, has widespread applications including medical diagnosis, targeted marketing, and others. Knowledge discovery from databases in the form of association rules is one of the important data mining tasks. An integrated approach, classification based on association rules, has drawn the attention of the data mining community over the last decade. While attention has mainly been focused on increasing classifier accuracy, little effort has been devoted to building interpretable and less complex models. This paper discusses the development of a compact associative classification model using a hill-climbing approach and fuzzy sets. The proposed methodology builds the rule-base by selecting rules that contribute to increasing training accuracy, thus balancing classification accuracy against the number of classification association rules. The results indicate that the proposed associative classification model can achieve competitive accuracies on benchmark datasets with continuous attributes and lends better interpretability when compared with other rule-based systems.
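The hill-climbing selection itself is simple to sketch: starting from an empty rule-base, keep a candidate rule only when it raises training accuracy. The rule encoding and toy data below are assumptions, and the fuzzy-set handling of continuous attributes is omitted.

```python
# Sketch: greedy hill-climbing selection of association rules for classification.
def matches(rule, record):
    antecedent, _ = rule
    return all(record.get(attr) == val for attr, val in antecedent)

def predict(rule_base, record, default):
    for rule in rule_base:                       # first matching rule wins
        if matches(rule, record):
            return rule[1]
    return default

def accuracy(rule_base, data, default):
    return sum(predict(rule_base, r, default) == c for r, c in data) / len(data)

def hill_climb(candidates, data, default):
    rule_base, best = [], accuracy([], data, default)
    for rule in candidates:                      # candidates pre-sorted by confidence
        score = accuracy(rule_base + [rule], data, default)
        if score > best:                         # keep only rules that improve accuracy
            rule_base.append(rule)
            best = score
    return rule_base                             # compact by construction

data = [({"outlook": "sunny"}, "no"), ({"outlook": "rain"}, "yes"),
        ({"outlook": "sunny"}, "no"), ({"outlook": "overcast"}, "yes")]
cands = [((("outlook", "sunny"),), "no"), ((("outlook", "rain"),), "yes")]
rb = hill_climb(cands, data, default="yes")
print(rb, accuracy(rb, data, "yes"))             # one rule suffices here
```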
Classification of parotidectomies: a proposal of the European Salivary Gland Society.
Quer, M; Guntinas-Lichius, O; Marchal, F; Vander Poorten, V; Chevalier, D; León, X; Eisele, D; Dulguerov, P
2016-10-01
The objective of this study is to provide a comprehensive classification system for parotidectomy operations. Data sources include Medline publications, the authors' experience, and a consensus round table at the Third European Salivary Gland Society (ESGS) Meeting. The Medline database was searched with the terms "parotidectomy" and "definition". The various definitions of parotidectomy procedures and parotid gland subdivisions were extracted, previous classification systems were re-examined, and a new classification was proposed by consensus. The ESGS proposes to subdivide the parotid parenchyma into five levels: I (lateral superior), II (lateral inferior), III (deep inferior), IV (deep superior), V (accessory). A new classification is proposed in which the type of resection is divided into formal parotidectomy with facial nerve dissection and extracapsular dissection. Parotidectomies are further classified according to the levels removed, as well as the extra-parotid structures ablated. A new classification of parotidectomy procedures is proposed.
Significance of clustering and classification applications in digital and physical libraries
NASA Astrophysics Data System (ADS)
Triantafyllou, Ioannis; Koulouris, Alexandros; Zervos, Spiros; Dendrinos, Markos; Giannakopoulos, Georgios
2015-02-01
Applications of clustering and classification techniques can prove very significant in both digital and physical (paper-based) libraries. The most essential application, document classification and clustering, is crucial for the content that is produced and maintained in digital libraries, repositories, databases, social media, blogs etc., based on various tags and ontology elements, transcending the traditional library-oriented classification schemes. Other applications with a very useful and beneficial role in the new digital library environment involve document routing, summarization and query expansion. Paper-based libraries can benefit as well, since classification combined with advanced material characterization techniques such as FTIR (Fourier Transform InfraRed spectroscopy) can be vital for the study and prevention of material deterioration. An improved two-level self-organizing clustering architecture is proposed in order to enhance the discrimination capacity of the learning space prior to classification, yielding promising results when applied to the above mentioned library tasks.
Rice, Thomas W; Rusch, Valerie W; Ishwaran, Hemant; Blackstone, Eugene H
2010-08-15
Previous American Joint Committee on Cancer/International Union Against Cancer (AJCC/UICC) stage groupings for esophageal cancer have not been data driven or harmonized with stomach cancer. At the request of the AJCC, worldwide data from 3 continents were assembled to develop data-driven, harmonized esophageal staging for the seventh edition of the AJCC/UICC cancer staging manuals. All-cause mortality among 4627 patients with esophageal and esophagogastric junction cancer who underwent surgery alone (no preoperative or postoperative adjuvant therapy) was analyzed by using novel random forest methodology to produce stage groups for which survival was monotonically decreasing, distinctive, and homogeneous. For lymph node-negative pN0M0 cancers, risk-adjusted 5-year survival was dominated by pathologic tumor classification (pT) but was modulated by histopathologic cell type, histologic grade, and location. For lymph node-positive, pN+M0 cancers, the number of cancer-positive lymph nodes (a new pN classification) dominated survival. Resulting stage groupings departed from a simple, logical arrangement of TNM. Stage groupings for stage I and II adenocarcinoma were based on pT, pN, and histologic grade; and groupings for squamous cell carcinoma were based on pT, pN, histologic grade, and location. Stage III was similar for histopathologic cell types and was based only on pT and pN. Stage 0 and stage IV, by definition, were categorized as tumor in situ (Tis) (high-grade dysplasia) and pM1, respectively. The prognosis for patients with esophageal and esophagogastric junction cancer depends on the complex interplay of TNM classifications as well as nonanatomic factors, including histopathologic cell type, histologic grade, and cancer location. These features were incorporated into a data-driven staging of these cancers for the seventh edition of the AJCC/UICC cancer staging manuals. Copyright (c) 2010 American Cancer Society.
Global Ground Motion Prediction Equations Program
Task 2: Compile and Critically Review GMPEs. Task 3: Select or Derive a Global Set of GMPEs. Task 5: Build a Database of… Task 6: Design the Specifications to Compile a Global Database of Soil Classification.
Software Classifications: Trends in Literacy Software Publication and Marketing.
ERIC Educational Resources Information Center
Balajthy, Ernest
First in a continuing series of reports on trends in marketing and publication of software for literacy education, a study explored the development of a database to track the trends and reported on trends seen in 1995. The final version of the 1995 database consisted of 1011 software titles, 165 of which had been published in 1995 and 846…
Using Computational Text Classification for Qualitative Research and Evaluation in Extension
ERIC Educational Resources Information Center
Smith, Justin G.; Tissing, Reid
2018-01-01
This article introduces a process for computational text classification that can be used in a variety of qualitative research and evaluation settings. The process leverages supervised machine learning based on an implementation of a multinomial Bayesian classifier. Applied to a community of inquiry framework, the algorithm was used to identify…
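In outline, such a pipeline is a supervised bag-of-words classifier over coded responses. A minimal sketch follows, with invented responses and code labels, assuming a TF-IDF representation in front of the multinomial Bayesian classifier.

```python
# Sketch: multinomial naive Bayes assigning qualitative codes to free text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

responses = ["the workshop helped me plan my garden",
             "I still have questions about soil testing",
             "great hands-on demonstration",
             "unclear instructions, need more help"]
codes = ["positive", "needs-support", "positive", "needs-support"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(responses, codes)
print(model.predict(["more help with soil please"]))  # -> ['needs-support']
```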
Neighborhood Structural Similarity Mapping for the Classification of Masses in Mammograms.
Rabidas, Rinku; Midya, Abhishek; Chakraborty, Jayasree
2018-05-01
In this paper, two novel feature extraction methods, using neighborhood structural similarity (NSS), are proposed for the characterization of mammographic masses as benign or malignant. Since the gray-level distribution of pixels differs between benign and malignant masses, with more regular and homogeneous patterns visible in benign masses, the proposed method exploits the similarity between neighboring regions of masses by designing two new features, namely NSS-I and NSS-II, which capture global similarity at different scales. Complementary to these global features, uniform local binary patterns are computed to enhance the classification efficiency when combined with the proposed features. The performance of the features is evaluated using images from the mini-mammographic image analysis society (mini-MIAS) and digital database for screening mammography (DDSM) databases, where a tenfold cross-validation technique is incorporated with Fisher linear discriminant analysis after selecting the optimal set of features using the stepwise logistic regression method. The best area under the receiver operating characteristic curve of 0.98 is achieved with the mini-MIAS database, while the same for the DDSM database is 0.93.
Protein structure database search and evolutionary classification.
Yang, Jinn-Moon; Tung, Chi-Hua
2006-01-01
As more protein structures become available and structural genomics efforts provide structural models in a genome-wide strategy, there is a growing need for fast and accurate methods for discovering homologous proteins and evolutionary classifications of newly determined structures. We have developed 3D-BLAST, in part, to address these issues. 3D-BLAST is as fast as BLAST and calculates the statistical significance (E-value) of an alignment to indicate the reliability of the prediction. Using this method, we first identified 23 states of the structural alphabet that represent pattern profiles of the backbone fragments and then used them to represent protein structure databases as structural alphabet sequence databases (SADB). Our method enhanced BLAST as a search method, using a new structural alphabet substitution matrix (SASM) to find the longest common substructures with high-scoring structured segment pairs from an SADB database. Using personal computers with Intel Pentium4 (2.8 GHz) processors, our method searched more than 10 000 protein structures in 1.3 s and achieved a good agreement with search results from detailed structure alignment methods. [3D-BLAST is available at http://3d-blast.life.nctu.edu.tw].
DOE R&D Accomplishments Database
Chandonia, John-Marc; Hon, Gary; Walker, Nigel S.; Lo Conte, Loredana; Koehl, Patrice; Levitt, Michael; Brenner, Steven E.
2003-09-15
The ASTRAL compendium provides several databases and tools to aid in the analysis of protein structures, particularly through the use of their sequences. Partially derived from the SCOP database of protein structure domains, it includes sequences for each domain and other resources useful for studying these sequences and domain structures. The current release of ASTRAL contains 54,745 domains, more than three times as many as the initial release four years ago. ASTRAL has undergone major transformations in the past two years. In addition to several complete updates each year, ASTRAL is now updated on a weekly basis with preliminary classifications of domains from newly released PDB structures. These classifications are available as a stand-alone database, as well as available integrated into other ASTRAL databases such as representative subsets. To enhance the utility of ASTRAL to structural biologists, all SCOP domains are now made available as PDB-style coordinate files as well as sequences. In addition to sequences and representative subsets based on SCOP domains, sequences and subsets based on PDB chains are newly included in ASTRAL. Several search tools have been added to ASTRAL to facilitate retrieval of data by individual users and automated methods.
Using complex networks for text classification: Discriminating informative and imaginative documents
NASA Astrophysics Data System (ADS)
de Arruda, Henrique F.; Costa, Luciano da F.; Amancio, Diego R.
2016-01-01
Statistical methods have been widely employed in recent years to grasp many language properties. The application of such techniques has allowed improvements in several linguistic applications, such as machine translation and document classification. In the latter, many approaches have emphasised the semantic content of texts, as is the case for bag-of-words language models. These approaches have certainly yielded reasonable performance. However, some potential features, such as the structural organization of texts, have been used in only a few studies. In this context, we probe how features derived from textual structure analysis can be effectively employed in a classification task. More specifically, we performed a supervised classification aiming at discriminating informative from imaginative documents. Using a networked model that describes the local topological/dynamical properties of function words, we achieved an accuracy rate of up to 95%, which is much higher than similar networked approaches. A systematic analysis of feature relevance revealed that symmetry and accessibility measurements are among the most prominent network measurements. Our results suggest that these measurements could be used in related language applications, as they play a complementary role in characterising texts.
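The feature-extraction step can be approximated by building a word-adjacency graph and reading off global topological measurements. The sketch below uses standard networkx measurements (clustering, assortativity, efficiency) as stand-ins; the paper's symmetry and accessibility measures are not implemented here.

```python
# Sketch: word-adjacency network of a document, summarised by global
# topological measurements that can feed any standard classifier.
import networkx as nx
import numpy as np

def network_features(tokens):
    G = nx.Graph()
    G.add_edges_from(zip(tokens, tokens[1:]))    # link each word to its successor
    return np.array([
        nx.average_clustering(G),
        nx.degree_assortativity_coefficient(G),
        nx.global_efficiency(G),
    ])

doc = "the cat sat on the mat and the cat slept".split()
print(network_features(doc))   # one fixed-length feature vector per document
```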
AutoFACT: An Automatic Functional Annotation and Classification Tool
Koski, Liisa B; Gray, Michael W; Lang, B Franz; Burger, Gertraud
2005-01-01
Background Assignment of function to new molecular sequence data is an essential step in genomics projects. The usual process involves similarity searches of a given sequence against one or more databases, an arduous process for large datasets. Results We present AutoFACT, a fully automated and customizable annotation tool that assigns biologically informative functions to a sequence. Key features of this tool are that it (1) analyzes nucleotide and protein sequence data; (2) determines the most informative functional description by combining multiple BLAST reports from several user-selected databases; (3) assigns putative metabolic pathways, functional classes, enzyme classes, GeneOntology terms and locus names; and (4) generates output in HTML, text and GFF formats for the user's convenience. We have compared AutoFACT to four well-established annotation pipelines. The error rate of functional annotation is estimated to be only between 1–2%. Comparison of AutoFACT to the traditional top-BLAST-hit annotation method shows that our procedure increases the number of functionally informative annotations by approximately 50%. Conclusion AutoFACT will serve as a useful annotation tool for smaller sequencing groups lacking dedicated bioinformatics staff. It is implemented in PERL and runs on LINUX/UNIX platforms. AutoFACT is available at . PMID:15960857
Mujtaba, Ghulam; Shuib, Liyana; Raj, Ram Gopal; Rajandram, Retnagowri; Shaikh, Khairunisa; Al-Garadi, Mohammed Ali
2018-06-01
Text categorization has been used extensively in recent years to classify plain-text clinical reports. This study employs text categorization techniques for the classification of open narrative forensic autopsy reports. One of the key steps in text classification is document representation. In document representation, a clinical report is transformed into a format that is suitable for classification. The traditional document representation technique for text categorization is the bag-of-words (BoW) technique. In this study, the traditional BoW technique is ineffective in classifying forensic autopsy reports because it merely extracts frequent but not necessarily discriminative features from clinical reports. Moreover, this technique fails to capture word inversion, as well as word-level synonymy and polysemy, when classifying autopsy reports. Hence, the BoW technique suffers from low accuracy and low robustness unless it is improved with contextual and application-specific information. To overcome the aforementioned limitations of the BoW technique, this research aims to develop an effective conceptual graph-based document representation (CGDR) technique to classify 1500 forensic autopsy reports from four (4) manners of death (MoD) and sixteen (16) causes of death (CoD). Term-based and Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) based conceptual features were extracted and represented through graphs. These features were then used to train a two-level text classifier. The first level classifier was responsible for predicting MoD. In addition, the second level classifier was responsible for predicting CoD using the proposed conceptual graph-based document representation technique. To demonstrate the significance of the proposed technique, its results were compared with those of six (6) state-of-the-art document representation techniques. Lastly, this study compared the effects of one-level classification and two-level classification on the experimental results. The experimental results indicated that the CGDR technique achieved 12% to 15% improvement in accuracy compared with fully automated document representation baseline techniques. Moreover, two-level classification obtained better results compared with one-level classification. The promising results of the proposed conceptual graph-based document representation technique suggest that pathologists can adopt the proposed system as a basis for second opinion, thereby supporting them in effectively determining CoD. Copyright © 2018 Elsevier Inc. All rights reserved.
Kuhn, Jens H.; Bao, Yiming; Bavari, Sina; Becker, Stephan; Bradfute, Steven; Brister, J. Rodney; Bukreyev, Alexander A.; Caì, Yíngyún; Chandran, Kartik; Davey, Robert A.; Dolnik, Olga; Dye, John M.; Enterlein, Sven; Gonzalez, Jean-Paul; Formenty, Pierre; Freiberg, Alexander N.; Hensley, Lisa E.; Honko, Anna N.; Ignatyev, Georgy M.; Jahrling, Peter B.; Johnson, Karl M.; Klenk, Hans-Dieter; Kobinger, Gary; Lackemeyer, Matthew G.; Leroy, Eric M.; Lever, Mark S.; Lofts, Loreen L.; Mühlberger, Elke; Netesov, Sergey V.; Olinger, Gene G.; Palacios, Gustavo; Patterson, Jean L.; Paweska, Janusz T.; Pitt, Louise; Radoshitzky, Sheli R.; Ryabchikova, Elena I.; Saphire, Erica Ollmann; Shestopalov, Aleksandr M.; Smither, Sophie J.; Sullivan, Nancy J.; Swanepoel, Robert; Takada, Ayato; Towner, Jonathan S.; van der Groen, Guido; Volchkov, Viktor E.; Wahl-Jensen, Victoria; Warren, Travis K.; Warfield, Kelly L.; Weidmann, Manfred; Nichol, Stuart T.
2013-01-01
The International Committee on Taxonomy of Viruses (ICTV) organizes the classification of viruses into taxa, but is not responsible for the nomenclature for taxa members. International experts groups, such as the ICTV Study Groups, recommend the classification and naming of viruses and their strains, variants, and isolates. The ICTV Filoviridae Study Group has recently introduced an updated classification and nomenclature for filoviruses. Subsequently, and together with numerous other filovirus experts, a consistent nomenclature for their natural genetic variants and isolates was developed that aims at simplifying the retrieval of sequence data from electronic databases. This is a first important step toward a viral genome annotation standard as sought by the US National Center for Biotechnology Information (NCBI). Here, this work is extended to include filoviruses obtained in the laboratory by artificial selection through passage in laboratory hosts. The previously developed template for natural filovirus genetic variant naming (
Personalizing Sample Databases with Facebook Information to Increase Intrinsic Motivation
ERIC Educational Resources Information Center
Marzo, Asier; Ardaiz, Oscar; Sanz de Acedo, María Teresa; Sanz de Acedo, María Luisa
2017-01-01
Motivation is fundamental for students to achieve successful and complete learning. Motivation can be extrinsic, i.e., driven by external rewards, or intrinsic, i.e., driven by internal factors. Intrinsic motivation is the most effective and must be inspired by the task at hand. Here, a novel strategy is presented to increase intrinsic motivation…
Text Processing Differences among Readers.
ERIC Educational Resources Information Center
Garner, Ruth
Explanations for differences in reading proficiency should be constructed around an atlas of reading-related individual differences in cognition. Such an atlas should include well-documented "bottom-up", text-driven reading strategies and less thoroughly investigated "top-down", schema-driven reading strategies. Research…
Lu, Yingjie
2013-01-01
To facilitate patient involvement in online health communities and help patients obtain the informative and emotional support they need, a topic identification approach is proposed in this paper for automatically identifying the topics of health-related messages in an online health community, thus assisting patients in efficiently reaching the messages most relevant to their queries. A feature-based classification framework is presented for automatic topic identification. We first collected messages related to some predefined topics in an online health community. We then combined three different types of features, n-gram-based features, domain-specific features and sentiment features, to build four feature sets for health-related text representation. Finally, three different text classification techniques, C4.5, Naïve Bayes and SVM, were adopted to evaluate our topic classification model. By comparing different feature sets and different classification techniques, we found that n-gram-based features, domain-specific features and sentiment features were all effective in distinguishing different types of health-related topics. In addition, a feature reduction technique based on information gain was also effective in improving the topic classification performance. In terms of classification techniques, SVM outperformed C4.5 and Naïve Bayes significantly. The experimental results demonstrate that the proposed approach can identify the topics of online health-related messages efficiently.
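A schematic version of the setup: combine n-gram features with lexicon-based sentiment counts via a feature union, then compare classifiers. The lexicon and messages are invented, and a decision tree stands in for C4.5.

```python
# Sketch: fused n-gram + sentiment-count features, compared across classifiers.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

POS, NEG = {"relieved", "better", "thanks"}, {"worried", "pain", "worse"}

class SentimentCounts(BaseEstimator, TransformerMixin):
    """Counts of positive/negative lexicon words per message."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.array([[sum(w in POS for w in t.lower().split()),
                          sum(w in NEG for w in t.lower().split())] for t in X])

features = FeatureUnion([("ngrams", TfidfVectorizer(ngram_range=(1, 2))),
                         ("senti", SentimentCounts())])

msgs = ["worried about the pain today", "feeling better thanks everyone",
        "is this pain a side effect", "so relieved and better now"]
topics = ["symptom", "emotion", "symptom", "emotion"]

for name, clf in [("SVM", SVC()), ("NaiveBayes", MultinomialNB()),
                  ("Tree (C4.5 stand-in)", DecisionTreeClassifier())]:
    acc = cross_val_score(make_pipeline(features, clf), msgs, topics, cv=2).mean()
    print(name, acc)
```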
Decoding "us" and "them": Neural representations of generalized group concepts.
Cikara, Mina; Van Bavel, Jay J; Ingbretsen, Zachary A; Lau, Tatiana
2017-05-01
Humans form social coalitions in every society on earth, yet we know very little about how the general concepts us and them are represented in the brain. Evolutionary psychologists have argued that the human capacity for group affiliation is a byproduct of adaptations that evolved for tracking coalitions in general. These theories suggest that humans possess a common neural code for the concepts in-group and out-group, regardless of the category by which group boundaries are instantiated. The authors used multivoxel pattern analysis to identify the neural substrates of generalized group concept representations. They trained a classifier to encode how people represented the most basic instantiation of a specific social group (i.e., arbitrary teams created in the lab with no history of interaction or associated stereotypes) and tested how well the neural data decoded membership along an objectively orthogonal, real-world category (i.e., political parties). The dorsal anterior cingulate cortex/middle cingulate cortex and anterior insula were associated with representing groups across multiple social categories. Restricting the analyses to these regions in a separate sample of participants performing an explicit categorization task, the authors replicated cross-categorization classification in anterior insula. Classification accuracy across categories was driven predominantly by the correct categorization of in-group targets, consistent with theories indicating in-group preference is more central than out-group derogation to group perception and cognition. These findings highlight the extent to which social group concepts rely on domain-general circuitry associated with encoding stimuli's functional significance. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
Data model and relational database design for the New England Water-Use Data System (NEWUDS)
Tessler, Steven
2001-01-01
The New England Water-Use Data System (NEWUDS) is a database for the storage and retrieval of water-use data. NEWUDS can handle data covering many facets of water use, including (1) tracking various types of water-use activities (withdrawals, returns, transfers, distributions, consumptive-use, wastewater collection, and treatment); (2) the description, classification and location of places and organizations involved in water-use activities; (3) details about measured or estimated volumes of water associated with water-use activities; and (4) information about data sources and water resources associated with water use. In NEWUDS, each water transaction occurs unidirectionally between two site objects, and the sites and conveyances form a water network. The core entities in the NEWUDS model are site, conveyance, transaction/rate, location, and owner. Other important entities include water resources (used for withdrawals and returns), data sources, and aliases. Multiple water-exchange estimates can be stored for individual transactions based on different methods or data sources. Storage of user-defined details is accommodated for several of the main entities. Numerous tables containing classification terms facilitate detailed descriptions of data items and can be used for routine or custom data summarization. NEWUDS handles single-user and aggregate-user water-use data, can be used for large or small water-network projects, and is available as a stand-alone Microsoft Access database structure. Users can customize and extend the database, link it to other databases, or implement the design in other relational database applications.
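The core entities translate naturally into relational tables. Below is a toy rendering in Python's built-in sqlite3 (the real system is a Microsoft Access structure; the column choices are illustrative assumptions, not the NEWUDS schema).

```python
# Sketch: core NEWUDS-style entities (site, conveyance, transaction/rate)
# as relational tables; each transaction rides one conveyance between two sites.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE site (
    site_id INTEGER PRIMARY KEY, name TEXT, location_id INTEGER, owner_id INTEGER);
CREATE TABLE conveyance (
    conveyance_id INTEGER PRIMARY KEY,
    from_site INTEGER REFERENCES site(site_id),
    to_site   INTEGER REFERENCES site(site_id));
CREATE TABLE transaction_rate (
    tx_id INTEGER PRIMARY KEY,
    conveyance_id INTEGER REFERENCES conveyance(conveyance_id),
    volume_mgd REAL,               -- estimated volume, million gallons/day
    method TEXT, data_source TEXT  -- supports multiple estimates per transaction
);
""")
con.execute("INSERT INTO site VALUES (1, 'Well A', NULL, NULL)")
con.execute("INSERT INTO site VALUES (2, 'Treatment plant', NULL, NULL)")
con.execute("INSERT INTO conveyance VALUES (1, 1, 2)")
con.execute("INSERT INTO transaction_rate VALUES (1, 1, 0.75, 'metered', 'state report')")
print(con.execute("SELECT * FROM transaction_rate").fetchall())
```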
PlantTribes: a gene and gene family resource for comparative genomics in plants
Wall, P. Kerr; Leebens-Mack, Jim; Müller, Kai F.; Field, Dawn; Altman, Naomi S.; dePamphilis, Claude W.
2008-01-01
The PlantTribes database (http://fgp.huck.psu.edu/tribe.html) is a plant gene family database based on the inferred proteomes of five sequenced plant species: Arabidopsis thaliana, Carica papaya, Medicago truncatula, Oryza sativa and Populus trichocarpa. We used the graph-based clustering algorithm MCL [Van Dongen (Technical Report INS-R0010 2000) and Enright et al. (Nucleic Acids Res. 2002; 30: 1575–1584)] to classify all of these species’ protein-coding genes into putative gene families, called tribes, using three clustering stringencies (low, medium and high). For all tribes, we have generated protein and DNA alignments and maximum-likelihood phylogenetic trees. A parallel database of microarray experimental results is linked to the genes, which lets researchers identify groups of related genes and their expression patterns. Unified nomenclatures were developed, and tribes can be related to traditional gene families and conserved domain identifiers. SuperTribes, constructed through a second iteration of MCL clustering, connect distant, but potentially related gene clusters. The global classification of nearly 200 000 plant proteins was used as a scaffold for sorting ∼4 million additional cDNA sequences from over 200 plant species. All data and analyses are accessible through a flexible interface allowing users to explore the classification, to place query sequences within the classification, and to download results for further study. PMID:18073194
Ruiz, Elena; Ramalle-Gómara, Enrique; Quiñones, Carmen; Rabasa, Pilar; Pisón, Carlos
2015-05-01
To analyse the validity of diagnosis of aplastic anaemia (AA) by International Classification of Diseases codes in hospital discharge data (MBDS) and the mortality registry (MR) of La Rioja, in order to detect cases to be included in the Spanish National Rare Diseases Registry. International Classification of Diseases (ICD) codes were used to detect AA cases during the period 2007-2012 from two administrative databases: the MBDS and the MR of La Rioja (Spain). Medical records of the population selected by merging both databases were used to confirm true AA cases. The annual mean incidence rate of AA was calculated using confirmed incident cases. By merging both databases, 62 hypothetical AA incident patients were detected during the period 2007-2012. The medical records of 89% of them could be reviewed, and these confirmed that only 15% of the patients actually suffered AA. The annual mean AA incidence in La Rioja was 4.17 per million inhabitants (6.23 per million for males; 2.10 per million for females). The MBDS and the MR are not in themselves sufficient to ascertain AA cases in La Rioja, and medical records should be reviewed to confirm true AA cases for inclusion in the Spanish National Rare Diseases Registry. © 2014 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
de Medeiros, Ana Claudia Torres; da Nóbrega, Maria Miriam Lima; Rodrigues, Rosalina Aparecida Partezani; Fernandes, Maria das Graças Melo
2013-01-01
To develop nursing diagnosis statements for the elderly based on the Activities of Living Model and on the International Classification for Nursing Practice. This was a descriptive and exploratory study, carried out in two stages: 1) collection of terms and concepts considered clinically and culturally relevant for nursing care delivered to the elderly, in order to develop a database of terms, and 2) development of nursing diagnosis statements for the elderly in primary health care, based on the guidelines of the International Council of Nurses and on the database of terms for nursing practice involving the elderly. A total of 414 terms were identified and submitted to the content validation process, with the participation of ten nursing experts, which resulted in 263 validated terms. These terms were submitted to cross-mapping with the terms of the International Classification for Nursing Practice, resulting in the identification of 115 listed terms and 148 non-listed terms, which constituted the database of terms, from which 127 nursing diagnosis statements were prepared and classified according to the factors that affect the performance of the elderly's activities of living: 69 under biological factors, 19 under psychological, 31 under sociocultural, five under environmental, and three under political-economic factors. After clinical validation, these statements can serve as a guide for nursing consultations with elderly patients, without ignoring clinical experience, critical thinking and decision-making.
Preparing College Students To Search Full-Text Databases: Is Instruction Necessary?
ERIC Educational Resources Information Center
Riley, Cheryl; Wales, Barbara
Full-text databases allow Central Missouri State University's clients to access some of the serials that libraries have had to cancel due to escalating subscription costs; EbscoHost, the subject of this study, is one such database. The database is available free to all Missouri residents. A survey was designed consisting of 21 questions intended…
ERIC Educational Resources Information Center
Bell, Steven J.
2003-01-01
Discusses full-text databases and whether existing aggregator databases are meeting user needs. Topics include the need for better search interfaces; concepts of quality research and information retrieval; information overload; full text in electronic journal collections versus aggregator databases; underrepresentation of certain disciplines; and…
Protein Bioinformatics Databases and Resources
Chen, Chuming; Huang, Hongzhan; Wu, Cathy H.
2017-01-01
Many publicly available data repositories and resources have been developed to support protein related information management, data-driven hypothesis generation and biological knowledge discovery. To help researchers quickly find the appropriate protein related informatics resources, we present a comprehensive review (with categorization and description) of major protein bioinformatics databases in this chapter. We also discuss the challenges and opportunities for developing next-generation protein bioinformatics databases and resources to support data integration and data analytics in the Big Data era. PMID:28150231
NASA Astrophysics Data System (ADS)
Land, Walker H., Jr.; Lewis, Michael; Sadik, Omowunmi; Wong, Lut; Wanekaya, Adam; Gonzalez, Richard J.; Balan, Arun
2004-04-01
This paper extends the classification approaches described in reference [1] in the following way: (1.) developing and evaluating a new method for evolving organophosphate nerve agent Support Vector Machine (SVM) classifiers using Evolutionary Programming, (2.) conducting research experiments using a larger database of organophosphate nerve agents, and (3.) upgrading the architecture to an object-based grid system for evaluating the classification of EP derived SVMs. Due to the increased threats of chemical and biological weapons of mass destruction (WMD) by international terrorist organizations, a significant effort is underway to develop tools that can be used to detect and effectively combat biochemical warfare. This paper reports the integration of multi-array sensors with Support Vector Machines (SVMs) for the detection of organophosphates nerve agents using a grid computing system called Legion. Grid computing is the use of large collections of heterogeneous, distributed resources (including machines, databases, devices, and users) to support large-scale computations and wide-area data access. Finally, preliminary results using EP derived support vector machines designed to operate on distributed systems have provided accurate classification results. In addition, distributed training time architectures are 50 times faster when compared to standard iterative training time methods.
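Evolving SVM classifiers with evolutionary programming can be sketched as Gaussian mutation of hyperparameters with survival by cross-validated fitness. The population size, mutation scale and synthetic stand-in data below are assumptions; the multi-array sensor inputs and grid-distributed evaluation are not reproduced.

```python
# Sketch: EP loop evolving (C, gamma) for an SVM by cross-validated accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=120, n_features=12, random_state=0)
rng = np.random.default_rng(0)

def fitness(ind):
    C, gamma = np.exp(ind)                        # search in log-space
    return cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()

pop = rng.normal(0.0, 1.0, size=(10, 2))          # individuals = [log C, log gamma]
for generation in range(15):
    children = pop + rng.normal(0.0, 0.3, size=pop.shape)  # EP mutation only
    union = np.vstack([pop, children])
    scores = np.array([fitness(ind) for ind in union])
    pop = union[np.argsort(scores)[-10:]]         # keep the 10 fittest
best = pop[-1]
print("best C=%.3f gamma=%.4f acc=%.3f" % (*np.exp(best), fitness(best)))
```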
NASA Astrophysics Data System (ADS)
Liu, Yang; Wang, Jiang; Cai, Lihui; Chen, Yingyuan; Qin, Yingmei
2018-03-01
As a pattern of cross-frequency coupling (CFC), phase-amplitude coupling (PAC) describes the interaction between the phase and amplitude of distinct frequency bands of the same signal and has been shown to be closely related to the brain's cognitive and memory activities. This work utilized PAC and a support vector machine (SVM) classifier to identify epileptic seizures from electroencephalogram (EEG) data. Entropy-based modulation index (MI) matrices are used to express the strength of PAC, from which we extracted features as the input for the classifier. Based on the Bonn database, which contains five datasets of EEG segments obtained from healthy volunteers and epileptic subjects, a 100% classification accuracy is achieved for identifying seizure ictal from healthy data, and an accuracy of 97.67% is reached in the classification of ictal EEG signals from inter-ictal EEGs. Based on the CHB-MIT database, a group of continuously recorded epileptic EEGs from scalp electrodes, a 97.50% classification accuracy is obtained, and a rising trend in the MI value is found 6 s before seizure onset. The classification performance in this work is effective, and PAC can be considered a useful tool for detecting and predicting epileptic seizures and providing a reference for clinical diagnosis.
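The entropy-based modulation index can be sketched along the lines of Tort's definition: bin the high-band amplitude envelope by the low-band phase and measure how far that distribution departs from uniformity. Band edges, bin count and the synthetic signal below are assumptions.

```python
# Sketch: entropy-based modulation index (MI) for phase-amplitude coupling.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

FS = 256  # Hz, assumed sampling rate

def bandpass(x, lo, hi):
    b, a = butter(4, [lo / (FS / 2), hi / (FS / 2)], btype="band")
    return filtfilt(b, a, x)

def modulation_index(x, phase_band=(4, 8), amp_band=(30, 45), n_bins=18):
    phase = np.angle(hilbert(bandpass(x, *phase_band)))   # low-band phase
    amp = np.abs(hilbert(bandpass(x, *amp_band)))         # high-band envelope
    bins = np.digitize(phase, np.linspace(-np.pi, np.pi, n_bins + 1)) - 1
    mean_amp = np.array([amp[bins == k].mean() for k in range(n_bins)])
    p = mean_amp / mean_amp.sum()
    entropy = -np.sum(p * np.log(p))
    return (np.log(n_bins) - entropy) / np.log(n_bins)    # 0 = no coupling

t = np.arange(0, 8, 1 / FS)
theta = np.sin(2 * np.pi * 6 * t)
coupled = (1 + theta) * np.sin(2 * np.pi * 38 * t) + theta  # gamma riding theta
print(modulation_index(coupled))                            # clearly > 0
# MI values over many phase/amplitude band pairs form the matrix fed to the SVM.
```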
Histopathological Image Classification using Discriminative Feature-oriented Dictionary Learning
Vu, Tiep Huu; Mousavi, Hojjat Seyed; Monga, Vishal; Rao, Ganesh; Rao, UK Arvind
2016-01-01
In histopathological image analysis, feature extraction for classification is a challenging task due to the diversity of histology features suitable for each problem as well as presence of rich geometrical structures. In this paper, we propose an automatic feature discovery framework via learning class-specific dictionaries and present a low-complexity method for classification and disease grading in histopathology. Essentially, our Discriminative Feature-oriented Dictionary Learning (DFDL) method learns class-specific dictionaries such that under a sparsity constraint, the learned dictionaries allow representing a new image sample parsimoniously via the dictionary corresponding to the class identity of the sample. At the same time, the dictionary is designed to be poorly capable of representing samples from other classes. Experiments on three challenging real-world image databases: 1) histopathological images of intraductal breast lesions, 2) mammalian kidney, lung and spleen images provided by the Animal Diagnostics Lab (ADL) at Pennsylvania State University, and 3) brain tumor images from The Cancer Genome Atlas (TCGA) database, reveal the merits of our proposal over state-of-the-art alternatives. Moreover, we demonstrate that DFDL exhibits a more graceful decay in classification accuracy against the number of training images which is highly desirable in practice where generous training is often not available. PMID:26513781
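The classification rule reduces to "smallest class-restricted reconstruction error". The sketch below uses sklearn's generic dictionary learner per class as a stand-in for the DFDL objective (which additionally penalises cross-class representability), with synthetic data as an assumption.

```python
# Sketch: one sparse dictionary per class; classify by reconstruction residual.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import MiniBatchDictionaryLearning

X, y = make_blobs(n_samples=200, n_features=40, centers=2, random_state=0)

dicts = {}
for c in np.unique(y):
    dl = MiniBatchDictionaryLearning(n_components=8, alpha=1.0, random_state=0)
    dicts[c] = dl.fit(X[y == c])                 # class-specific dictionary

def predict(x):
    errs = {}
    for c, dl in dicts.items():
        code = dl.transform(x.reshape(1, -1))    # sparse code under dictionary c
        recon = code @ dl.components_
        errs[c] = np.linalg.norm(x - recon)
    return min(errs, key=errs.get)               # best-reconstructing class wins

print(predict(X[0]), y[0])
```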
Manifold Regularized Experimental Design for Active Learning.
Zhang, Lining; Shum, Hubert P H; Shao, Ling
2016-12-02
Various machine learning and data mining tasks in classification require abundant data samples to be labeled for training. Conventional active learning methods aim at labeling the most informative samples to reduce the labeling effort of the user. Many previous studies in active learning select one sample after another in a greedy manner. However, this is not very effective because the classification model has to be retrained for each newly labeled sample. Moreover, many popular active learning approaches utilize the most uncertain samples by leveraging the classification hyperplane of the classifier, which is not appropriate since the classification hyperplane is inaccurate when the training data are small-sized. The problem of insufficient training data in real-world systems limits the potential applications of these approaches. This paper presents a novel method of active learning called manifold regularized experimental design (MRED), which can label multiple informative samples at one time for training. In addition, MRED gives an explicit geometric explanation of the selected samples to be labeled by the user. Different from existing active learning methods, our method avoids the intrinsic problems caused by insufficiently labeled samples in real-world applications. Various experiments on synthetic datasets, the Yale face database and the Corel image database have been carried out to show how MRED outperforms existing methods.
Falk, Joakim; Björvell, Catrin
2012-01-01
The Swedish health care system stands before an implementation of standardized language. The first classification of nursing diagnoses translated into Swedish, the NANDA classification, was released in January 2011. The aim of the present study was to examine whether use of the NANDA classification affected nursing students’ choice of nursing interventions. Thirty-three nursing students in a clinical setting were divided into two groups. The intervention group had access to the NANDA classification textbook, while the comparison group did not. In total, 78 nursing assessments were performed and 218 nursing interventions initiated. The principal findings show that there were no statistically significant differences between the groups regarding the number, quality or category of nursing interventions when using the NANDA classification compared with free-text nursing diagnoses. PMID:24199065
Taylor, Jonathan Christopher; Fenner, John Wesley
2017-11-29
Semi-quantification methods are well established in the clinic for assisted reporting of (123I)Ioflupane images. Arguably, these are limited diagnostic tools. Recent research has demonstrated the potential for improved classification performance offered by machine learning algorithms. A direct comparison between methods is required to establish whether a move towards widespread clinical adoption of machine learning algorithms is justified. This study compared three machine learning algorithms with a range of semi-quantification methods, using the Parkinson's Progression Markers Initiative (PPMI) research database and a locally derived clinical database for validation. The machine learning algorithms were based on support vector machine classifiers with three different sets of features: (1) voxel intensities; (2) principal components of image voxel intensities; and (3) striatal binding ratios (SBRs) from the putamen and caudate. The semi-quantification methods were based on SBRs from both putamina, with and without consideration of the caudates. Normal limits for the SBRs were defined through four different methods: (1) the minimum of age-matched controls; (2) the mean minus 1/1.5/2 standard deviations from age-matched controls; (3) linear regression of normal patient data against age (minus 1/1.5/2 standard errors); and (4) selection of the optimum operating point on the receiver operating characteristic curve from normal and abnormal training data. Each machine learning and semi-quantification technique was evaluated with stratified, nested 10-fold cross-validation, repeated 10 times. The mean accuracy of the semi-quantitative methods for classification of local data into Parkinsonian and non-Parkinsonian groups varied from 0.78 to 0.87, contrasting with 0.89 to 0.95 for classifying PPMI data into healthy controls and Parkinson's disease groups. The machine learning algorithms gave mean accuracies between 0.88 and 0.92 for local data and between 0.95 and 0.97 for PPMI data. Classification performance was lower for the local database than for the research database for both semi-quantitative and machine learning algorithms. However, for both databases, the machine learning methods generated equal or higher mean accuracies (with lower variance) than any of the semi-quantification approaches. The gain in performance from using machine learning algorithms over semi-quantification was relatively small and may be insufficient, when considered in isolation, to offer significant advantages in the clinical context.
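The stratified, nested 10-fold cross-validation repeated 10 times can be sketched as follows (Python/scikit-learn; the hyperparameter grid is an illustrative assumption):

    from sklearn.svm import SVC
    from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                         RepeatedStratifiedKFold, cross_val_score)

    def nested_cv_accuracy(X, y):
        inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
        outer = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)
        model = GridSearchCV(SVC(kernel="rbf"),
                             {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
                             cv=inner)
        # Inner loop tunes C/gamma; outer loop estimates generalization accuracy.
        return cross_val_score(model, X, y, cv=outer).mean()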
Maroun, Rana; Maunoury, Franck; Benjamin, Laure; Nachbaur, Gaëlle; Durand-Zaleski, Isabelle
2016-01-01
The aim of this study was to assess the economic burden of hospitalisations for metastatic renal cell carcinoma (mRCC), to describe the patterns of prescribing expensive drugs and to explore the impact of geographic and socio-demographic factors on the use of these drugs. We performed a retrospective analysis of the French national hospitals database. Hospital stays for mRCC between 2008 and 2013 were identified by combining the 10th revision of the International Classification of Diseases (ICD-10) code for renal cell carcinoma (C64) with codes for metastases (C77 to C79). Incident cases were identified among all hospital stays and followed until December 2013. Descriptive analyses were performed with a focus on hospital stays and patient characteristics. Costs were assessed from the perspective of the French National Health Insurance and were obtained from official diagnosis-related group tariffs for public and private hospitals. A total of 15,752 adult patients were hospitalised for mRCC, corresponding to 102,613 hospital stays. Of those patients, 68% were men, and the median age at first hospitalisation was 69 years [min-max: 18-102]. Over the study period, the hospital mortality rate reached 37%. The annual cost of managing mRCC at hospital varied between €28M in 2008 and €42M in 2012 and was mainly driven by inpatient costs. The mean annual per capita cost of hospital management of mRCC varied across the study period from €8,993 (SD: €8,906) in 2008 to €10,216 (SD: €10,527) in 2012. Analysis of the determinants of prescribing expensive drugs at hospital did not show social or territorial differences in the use of these drugs. This study is the first to investigate the in-hospital economic burden of mRCC in France. Results showed that in-hospital costs of managing mRCC are mainly driven by expensive drugs and inpatient costs.
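A minimal sketch of the cohort-selection step, assuming a hypothetical stays table with one diagnosis code per row:

    import pandas as pd

    def mrcc_stay_ids(stays: pd.DataFrame) -> set:
        """Stays carrying both the C64 code and any metastasis code C77-C79."""
        codes = stays["icd10"].str.upper()
        rcc = set(stays.loc[codes.str.startswith("C64"), "stay_id"])
        mets = set(stays.loc[codes.str.match(r"C7[789]"), "stay_id"])
        return rcc & mets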
PDF text classification to leverage information extraction from publication reports.
Bui, Duy Duc An; Del Fiol, Guilherme; Jonnalagadda, Siddhartha
2016-06-01
Data extraction from original study reports is a time-consuming, error-prone process in systematic review development. Information extraction (IE) systems have the potential to assist humans in the extraction task; however, the majority of IE systems were not designed to work on Portable Document Format (PDF) documents, an important and common extraction source for systematic reviews. In a PDF document, narrative content is often mixed with publication metadata or semi-structured text, which adds challenges for the underlying natural language processing algorithm. Our goal is to categorize PDF texts for strategic use by IE systems. We used an open-source tool to extract raw texts from a PDF document and developed a text classification algorithm that follows a multi-pass sieve framework to automatically classify PDF text snippets (for brevity, texts) into TITLE, ABSTRACT, BODYTEXT, SEMISTRUCTURE, and METADATA categories. To validate the algorithm, we developed a gold standard of PDF reports that were included in the development of previous systematic reviews by the Cochrane Collaboration. In a two-step procedure, we evaluated (1) classification performance, compared with machine learning classifiers, and (2) the effects of the algorithm on an IE system that extracts clinical outcome mentions. The multi-pass sieve algorithm achieved an accuracy of 92.6%, which was 9.7% (p<0.001) higher than the best-performing machine learning classifier, which used a logistic regression algorithm. F-measure improvements were observed in the classification of TITLE (+15.6%), ABSTRACT (+54.2%), BODYTEXT (+3.7%), SEMISTRUCTURE (+34%), and METADATA (+14.2%). In addition, use of the algorithm to filter semi-structured texts and publication metadata improved the performance of the outcome extraction system (F-measure +4.1%, p=0.002). It also reduced the number of sentences to be processed by 44.9% (p<0.001), which corresponds to a processing time reduction of 50% (p=0.005). The rule-based multi-pass sieve framework can be used effectively to categorize texts extracted from PDF documents. Text classification is an important prerequisite step to leverage information extraction from PDF documents. Copyright © 2016 Elsevier Inc. All rights reserved.
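A minimal sketch of a multi-pass sieve of this kind, with purely illustrative rules (the paper's actual sieve tiers are not reproduced here): high-precision rules fire first, and everything unmatched falls through to the default category:

    import re

    def classify_snippet(text: str, page: int, y_top: float) -> str:
        if page == 1 and y_top < 0.15 and len(text.split()) < 25:
            return "TITLE"                       # short text at the top of page 1
        if re.match(r"(?i)\s*abstract\b", text):
            return "ABSTRACT"
        if re.search(r"(doi:|©|\bvol\.|\bpp\.)", text, re.I):
            return "METADATA"
        if text.count("\t") > 2 or re.search(r"(\s{3,}\S+){3,}", text):
            return "SEMISTRUCTURE"               # columnar / table-like spacing
        return "BODYTEXT"                        # default sieve tier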
Feature-Based Morphometry: Discovering Group-related Anatomical Patterns
Toews, Matthew; Wells, William; Collins, D. Louis; Arbel, Tal
2015-01-01
This paper presents feature-based morphometry (FBM), a new, fully data-driven technique for discovering patterns of group-related anatomical structure in volumetric imagery. In contrast to most morphometry methods, which assume one-to-one correspondence between subjects, FBM explicitly aims to identify distinctive anatomical patterns that may only be present in subsets of subjects, due to disease or anatomical variability. The image is modeled as a collage of generic, localized image features that need not be present in all subjects. Scale-space theory is applied to analyze image features at the characteristic scale of the underlying anatomical structures, instead of at arbitrary scales such as global or voxel-level. A probabilistic model describes features in terms of their appearance, geometry, and relationship to subject groups, and is automatically learned from a set of subject images and group labels. Features resulting from learning correspond to group-related anatomical structures that can potentially be used as image biomarkers of disease or as a basis for computer-aided diagnosis. The relationship between features and groups is quantified by the likelihood of feature occurrence within a specific group vs. the rest of the population, and feature significance is quantified in terms of the false discovery rate. Experiments validate FBM clinically in the analysis of normal control (NC) and Alzheimer's disease (AD) brain images using the freely available OASIS database. FBM automatically identifies known structural differences between NC and AD subjects in a fully data-driven fashion, and an equal error classification rate of 0.80 is achieved for subjects aged 60-80 years exhibiting mild AD (CDR=1). PMID:19853047
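A simplified stand-in for the feature-group scoring this describes: a smoothed group likelihood ratio per feature, with significance controlled by the Benjamini-Hochberg false discovery rate (the binary occurrence matrix and the exact test are assumptions, not FBM's full probabilistic model):

    import numpy as np
    from scipy.stats import fisher_exact

    def feature_scores(occurs, labels):
        """occurs: (n_subjects, n_features) binary occurrence matrix;
        labels: 1 for the target group, 0 for the rest."""
        n1, n0 = int((labels == 1).sum()), int((labels == 0).sum())
        ratios, pvals = [], []
        for f in occurs.T:
            k1, k0 = int(f[labels == 1].sum()), int(f[labels == 0].sum())
            ratios.append(((k1 + 1) / (n1 + 2)) / ((k0 + 1) / (n0 + 2)))  # smoothed LR
            pvals.append(fisher_exact([[k1, n1 - k1], [k0, n0 - k0]])[1])
        return np.array(ratios), np.array(pvals)

    def bh_significant(pvals, q=0.05):
        """Benjamini-Hochberg: indices of features significant at FDR q."""
        order = np.argsort(pvals)
        ranked = pvals[order]
        below = np.nonzero(ranked <= q * np.arange(1, len(pvals) + 1) / len(pvals))[0]
        return order[:below[-1] + 1] if below.size else order[:0]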
Rimland, Joseph M; Abraha, Iosief; Luchetta, Maria Laura; Cozzolino, Francesco; Orso, Massimiliano; Cherubini, Antonio; Dell'Aquila, Giuseppina; Chiatti, Carlos; Ambrosio, Giuseppe; Montedori, Alessandro
2016-06-01
Healthcare databases are useful sources for investigating the epidemiology of chronic obstructive pulmonary disease (COPD), assessing longitudinal outcomes in patients with COPD, and developing disease management strategies. However, in order to constitute a reliable source for research, healthcare databases need to be validated. The aim of this protocol is to perform the first systematic review of studies reporting the validation of codes related to COPD diagnoses in healthcare databases. MEDLINE, EMBASE, Web of Science and the Cochrane Library databases will be searched using appropriate search strategies. Studies that evaluated the validity of COPD codes (such as the International Classification of Diseases, 9th and 10th Revision systems; the Read codes system; or the International Classification of Primary Care) in healthcare databases will be included. Inclusion criteria will be: (1) the presence of a reference standard case definition for COPD; (2) the presence of at least one test measure (eg, sensitivity, positive predictive values, etc); and (3) the use of a healthcare database (including administrative claims databases, electronic healthcare databases or COPD registries) as a data source. Pairs of reviewers will independently abstract data using standardised forms and will assess quality using a checklist based on the Standards for Reporting of Diagnostic Accuracy (STARD) criteria. This systematic review protocol has been produced in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses Protocol (PRISMA-P) 2015 statement. Ethics approval is not required. Results of this study will be submitted to a peer-reviewed journal for publication. The results from this systematic review will be used for outcome research on COPD and will serve as a guide to identify appropriate case definitions of COPD, and reference standards, for researchers involved in validating healthcare databases. CRD42015029204. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://www.bmj.com/company/products-services/rights-and-licensing/
Ma, Xu; Cheng, Yongmei; Hao, Shuai
2016-12-10
Automatic classification of terrain surfaces from an aerial image is essential for an autonomous unmanned aerial vehicle (UAV) landing at an unprepared site using vision. Diverse terrain surfaces may show similar spectral properties due to illumination and noise, which can easily degrade classification performance. To address this issue, a multi-stage classification algorithm based on low-rank recovery and multi-feature fusion sparse representation is proposed. First, color moments and Gabor texture features are extracted from training data and stacked as column vectors of a dictionary. Then we perform low-rank matrix recovery on the dictionary using augmented Lagrange multipliers and construct a multi-stage terrain classifier. Experimental results on an aerial map database that we prepared verify the classification accuracy and robustness of the proposed method.
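The low-rank recovery step can be sketched with a standard inexact augmented Lagrange multiplier scheme for robust PCA, decomposing the stacked feature dictionary D into a low-rank part L and a sparse error part S; the parameters follow common defaults rather than the paper's settings:

    import numpy as np

    def rpca_ialm(D, max_iter=200, tol=1e-7):
        m, n = D.shape
        lam = 1.0 / np.sqrt(max(m, n))
        Y = D / max(np.linalg.norm(D, 2), np.abs(D).max() / lam)
        mu = 1.25 / np.linalg.norm(D, 2)
        L = np.zeros_like(D); S = np.zeros_like(D)
        for _ in range(max_iter):
            # Singular value thresholding -> low-rank part
            U, s, Vt = np.linalg.svd(D - S + Y / mu, full_matrices=False)
            L = U @ np.diag(np.maximum(s - 1 / mu, 0)) @ Vt
            # Soft shrinkage -> sparse error part
            R = D - L + Y / mu
            S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0)
            Y += mu * (D - L - S)
            mu *= 1.5
            if np.linalg.norm(D - L - S) / np.linalg.norm(D) < tol:
                break
        return L, S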
NASA Astrophysics Data System (ADS)
Nomura, Yukihiro; Lu, Jianming; Sekiya, Hiroo; Yahagi, Takashi
This paper presents a speech enhancement method that classifies frequency bands as speech-dominant or noise-dominant. In our system, a new scheme for this classification is proposed, based on the standard deviation of the observed signal's spectrum in each band. We introduce two oversubtraction factors, one for speech-dominant and one for noise-dominant bands, and spectral subtraction is carried out after the classification. The proposed method is tested on several noise types from the Noisex-92 database. From investigation of segmental SNR, the Itakura-Saito distance measure, inspection of spectrograms, and listening tests, the proposed system is shown to be effective in reducing background noise. Moreover, speech enhanced by our system exhibits less musical noise and distortion than that of conventional systems.
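A minimal sketch of spectral subtraction with a per-band oversubtraction factor chosen by a simplified speech/noise dominance test; the thresholds and factors are illustrative assumptions, not the paper's values:

    import numpy as np

    def enhance(frames_mag, noise_mag, alpha_speech=2.0, alpha_noise=4.0,
                floor=0.02, std_thresh=1.5):
        """frames_mag: (n_frames, n_bins) magnitude spectra of noisy speech;
        noise_mag: (n_bins,) noise magnitude estimate from silent frames."""
        out = np.empty_like(frames_mag)
        band_std = frames_mag.std(axis=0)               # spread per frequency band
        speech_dom = band_std > std_thresh * noise_mag  # high variability -> speech
        alpha = np.where(speech_dom, alpha_speech, alpha_noise)
        for t, mag in enumerate(frames_mag):
            sub = mag - alpha * noise_mag               # oversubtract the noise floor
            out[t] = np.maximum(sub, floor * mag)       # spectral floor curbs musical noise
        return out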
PACSY, a relational database management system for protein structure and chemical shift analysis.
Lee, Woonghee; Yu, Wookyung; Kim, Suhkmann; Chang, Iksoo; Lee, Weontae; Markley, John L
2012-10-01
PACSY (Protein structure And Chemical Shift NMR spectroscopY) is a relational database management system that integrates information from the Protein Data Bank, the Biological Magnetic Resonance Data Bank, and the Structural Classification of Proteins database. PACSY provides three-dimensional coordinates and chemical shifts of atoms along with derived information such as torsion angles, solvent accessible surface areas, and hydrophobicity scales. PACSY consists of six relational table types linked to one another for coherence by key identification numbers. Database queries are enabled by advanced search functions supported by an RDBMS server such as MySQL or PostgreSQL. PACSY enables users to search for combinations of information from different database sources in support of their research. Two software packages, PACSY Maker for database creation and PACSY Analyzer for database analysis, are available from http://pacsy.nmrfam.wisc.edu.
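The kind of combined query such a design enables can be sketched as follows; the table and column names here are hypothetical, not PACSY's actual schema:

    import sqlite3  # stand-in for a MySQL/PostgreSQL connection

    conn = sqlite3.connect("pacsy_demo.db")
    rows = conn.execute("""
        SELECT c.residue_no, c.atom_name, c.shift_ppm, t.phi, t.psi
        FROM chemical_shift AS c
        JOIN torsion_angle  AS t
          ON t.protein_id = c.protein_id AND t.residue_no = c.residue_no
        WHERE c.protein_id = ? AND c.atom_name = 'CA'
    """, ("P00001",)).fetchall()
    # rows pair each CA chemical shift with the backbone torsion angles,
    # joined across table types by the shared key identification numbers.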
Woon, Yuan-Liang; Lee, Keng-Yee; Mohd Anuar, Siti Fatimah Zahra; Goh, Pik-Pin; Lim, Teck-Onn
2018-04-20
Hospitalization due to dengue illness is an important measure of dengue morbidity. However, few studies are based on administrative databases, because the validity of the diagnosis codes is unknown. We validated the International Classification of Diseases, 10th revision (ICD-10) diagnosis coding for dengue infections in the Malaysian Ministry of Health's (MOH) hospital discharge database. This validation study involved retrospective review of available hospital discharge records and hand-searching of medical records for the years 2010 and 2013. We randomly selected 3219 hospital discharge records coded with dengue and non-dengue infections as their discharge diagnoses from the national hospital discharge database. We then randomly sampled 216 and 144 records for patients with and without codes for dengue, respectively, in keeping with their relative frequency in the MOH database, for chart review. The ICD codes for dengue were validated against a laboratory-based diagnostic standard (NS1 or IgM). The ICD-10-CM codes for dengue had a sensitivity of 94%, a modest specificity of 83%, a positive predictive value of 87% and a negative predictive value of 92%. These results were stable between 2010 and 2013. However, specificity decreased substantially when patients presented with bleeding or a low platelet count. The diagnostic performance of the ICD codes for dengue in the MOH's hospital discharge database is adequate for use in health services research on dengue.
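For reference, the reported validation measures derive from a 2x2 table of ICD coding against the laboratory standard; a minimal sketch (the counts would come from the chart review, not shown here):

    def diagnostic_measures(tp, fp, fn, tn):
        return {
            "sensitivity": tp / (tp + fn),   # coded dengue among true dengue
            "specificity": tn / (tn + fp),
            "ppv": tp / (tp + fp),           # true dengue among coded dengue
            "npv": tn / (tn + fn),
        }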
Harwood, Valerie J.; Whitlock, John; Withington, Victoria
2000-01-01
The antibiotic resistance patterns of fecal streptococci and fecal coliforms isolated from domestic wastewater and animal feces were determined using a battery of antibiotics (amoxicillin, ampicillin, cephalothin, chlortetracycline, oxytetracycline, tetracycline, erythromycin, streptomycin, and vancomycin) at four concentrations each. The sources of animal feces included wild birds, cattle, chickens, dogs, pigs, and raccoons. Antibiotic resistance patterns of fecal streptococci and fecal coliforms from known sources were grouped into two separate databases, and discriminant analysis of these patterns was used to establish the relationship between the antibiotic resistance patterns and the bacterial source. The fecal streptococcus and fecal coliform databases classified isolates from known sources with similar accuracies. The average rate of correct classification for the fecal streptococcus database was 62.3%, and that for the fecal coliform database was 63.9%. The sources of fecal streptococci and fecal coliforms isolated from surface waters were identified by discriminant analysis of their antibiotic resistance patterns. Both databases identified the source of indicator bacteria isolated from surface waters directly impacted by septic tank discharges as human. At sample sites selected for relatively low anthropogenic impact, the dominant sources of indicator bacteria were identified as various animals. The antibiotic resistance analysis technique promises to be a useful tool in assessing sources of fecal contamination in subtropical waters, such as those in Florida. PMID:10966379
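A minimal sketch of this source-classification step, using linear discriminant analysis over binary resistance patterns (one value per antibiotic-concentration pair); cross-validated accuracy stands in for the reported average rate of correct classification:

    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_score

    # X: (n_isolates, n_antibiotics * n_concentrations) 0/1 resistance patterns
    # y: known source label per isolate ("human", "cattle", "dog", ...)
    def average_correct_classification(X, y):
        lda = LinearDiscriminantAnalysis()
        return cross_val_score(lda, X, y, cv=10).mean()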
Park, Hyun-Seok
2012-12-01
Whereas a vast amount of new information on bioinformatics is made available to the public through patents, only a small set of patents is cited in academic papers. A detailed analysis of registered bioinformatics patents, using the existing patent search system, can provide valuable information links between science and technology. However, it is extremely difficult to select keywords that capture bioinformatics patents, which reflect the convergence of several underlying technologies; no single word, or even several words, is sufficient to identify such patents. The analysis of patent subclasses can provide valuable information. In this paper, I present a preliminary study of the current status of bioinformatics patents and their International Patent Classification (IPC) groups registered in the Korea Intellectual Property Rights Information Service (KIPRIS) database.
Classification of Traffic Related Short Texts to Analyse Road Problems in Urban Areas
NASA Astrophysics Data System (ADS)
Saldana-Perez, A. M. M.; Moreno-Ibarra, M.; Torres-Ruiz, M.
2017-09-01
Volunteered Geographic Information (VGI) can be used to understand urban dynamics. In this work on the classification of traffic-related short texts to analyze road problems in urban areas, VGI data analysis is performed over social media publications in order to classify traffic events in big cities that modify the movement of vehicles and people through the roads, such as car accidents, congestion and closures. The classification of traffic events described in short texts is done by applying a supervised machine learning algorithm. In this approach, users are considered sensors that describe their surroundings and provide their geographic position on the social network. The posts are treated by a text mining process and classified into five groups. Finally, the classified events are grouped in a data corpus and geo-visualized in the study area to detect the places with the most vehicular problems.
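A minimal sketch of such a supervised short-text classifier, using a TF-IDF representation and a linear SVM; the five class names and the classifier choice are illustrative assumptions:

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    CLASSES = ["accident", "congestion", "closure", "roadworks", "other"]

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                          LinearSVC())
    # model.fit(posts_train, labels_train)
    # events = model.predict(posts_new)  # then geocode and map each event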
Enhanced DIII-D Data Management Through a Relational Database
NASA Astrophysics Data System (ADS)
Burruss, J. R.; Peng, Q.; Schachter, J.; Schissel, D. P.; Terpstra, T. B.
2000-10-01
A relational database is being used to serve data about DIII-D experiments. The database is optimized for queries across multiple shots, allowing for rapid data mining by SQL-literate researchers. The relational database relates different experiments and datasets, thus providing a big picture of DIII-D operations. Users are encouraged to add their own tables to the database. Summary physics quantities about DIII-D discharges are collected and stored in the database automatically. Meta-data about code runs, MDSplus usage, and visualization tool usage are collected, stored in the database, and later analyzed to improve computing. Documentation on the database may be accessed through programming languages such as C, Java, and IDL, or through ODBC compliant applications such as Excel and Access. A database-driven web page also provides a convenient means for viewing database quantities through the World Wide Web. Demonstrations will be given at the poster.
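A sketch of the kind of cross-shot query this enables; the table and column names are hypothetical, not the actual DIII-D schema:

    import sqlite3  # stand-in for an ODBC/SQL server connection

    conn = sqlite3.connect("d3d_demo.db")
    shots = conn.execute("""
        SELECT shot, max_beta, avg_density
        FROM shot_summary
        WHERE date BETWEEN '2000-01-01' AND '2000-06-30'
          AND max_beta > 0.02
        ORDER BY max_beta DESC
    """).fetchall()
    # One query spans many discharges, the data-mining pattern the
    # multi-shot summary tables are optimized for.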
Research on computer virus database management system
NASA Astrophysics Data System (ADS)
Qi, Guoquan
2011-12-01
The growing proliferation of computer viruses has become a serious threat to network information security and a focus of security research. New viruses continually emerge, the total number of viruses keeps growing, and virus classification becomes increasingly complex. Virus naming cannot be unified because agencies capture samples at different times. Although each agency has its own virus database, communication between agencies is lacking, virus information is often incomplete, and sample information may be scarce. This paper introduces the current state of virus database construction at home and abroad, analyzes how to standardize and completely describe virus characteristics, and then presents a design scheme for a computer virus database offering information integrity, storage security and manageability.
Development of a land-cover characteristics database for the conterminous U.S.
Loveland, Thomas R.; Merchant, J.W.; Ohlen, D.O.; Brown, Jesslyn F.
1991-01-01
Information regarding the characteristics and spatial distribution of the Earth's land cover is critical to global environmental research. A prototype land-cover database for the conterminous United States designed for use in a variety of global modelling, monitoring, mapping, and analytical endeavors has been created. The resultant database contains multiple layers, including the source AVHRR data, the ancillary data layers, the land-cover regions defined by the research, and translation tables linking the regions to other land classification schema (for example, UNESCO, USGS Anderson System). The land-cover characteristics database can be analyzed, transformed, or aggregated by users to meet a broad spectrum of requirements. -from Authors
ERIC Educational Resources Information Center
Fagan, Judy Condit
2001-01-01
Discusses the need for libraries to routinely redesign their Web sites, and presents a case study that describes how a Perl-driven database at Southern Illinois University's library improved Web site organization and patron access, simplified revisions, and allowed staff unfamiliar with HTML to update content. (Contains 56 references.) (Author/LRW)
Case and Model Driven Dynamic Template Linking
2005-06-01
store the trips in a PostgreSQL database (www.postgresql.org) and the values stored in this database could be re-used to provide values for similar trips... [A feature-comparison table follows in the source, covering Preferences (yes vs. limited support), Print Form, Close Form, Quit (via the window "X" only) and Show User Action History, ahead of Section 6.5, DAML Ontologies.]