semi-automatic functional annotation: Topics by Science.gov

Sample records for semi-automatic functional annotation

A semi-automatic annotation tool for cooking video

NASA Astrophysics Data System (ADS)

Bianco, Simone; Ciocca, Gianluigi; Napoletano, Paolo; Schettini, Raimondo; Margherita, Roberto; Marini, Gianluca; Gianforme, Giorgio; Pantaleo, Giuseppe

2013-03-01

In order to create a cooking assistant application to guide the users in the preparation of the dishes relevant to their profile diets and food preferences, it is necessary to accurately annotate the video recipes, identifying and tracking the foods of the cook. These videos present particular annotation challenges such as frequent occlusions, food appearance changes, etc. Manually annotate the videos is a time-consuming, tedious and error-prone task. Fully automatic tools that integrate computer vision algorithms to extract and identify the elements of interest are not error free, and false positive and false negative detections need to be corrected in a post-processing stage. We present an interactive, semi-automatic tool for the annotation of cooking videos that integrates computer vision techniques under the supervision of the user. The annotation accuracy is increased with respect to completely automatic tools and the human effort is reduced with respect to completely manual ones. The performance and usability of the proposed tool are evaluated on the basis of the time and effort required to annotate the same video sequences.
Semi-automatic semantic annotation of PubMed Queries: a study on quality, efficiency, satisfaction

PubMed Central

Névéol, Aurélie; Islamaj-Doğan, Rezarta; Lu, Zhiyong

2010-01-01

Information processing algorithms require significant amounts of annotated data for training and testing. The availability of such data is often hindered by the complexity and high cost of production. In this paper, we investigate the benefits of a state-of-the-art tool to help with the semantic annotation of a large set of biomedical information queries. Seven annotators were recruited to annotate a set of 10,000 PubMed® queries with 16 biomedical and bibliographic categories. About half of the queries were annotated from scratch, while the other half were automatically pre-annotated and manually corrected. The impact of the automatic pre-annotations was assessed on several aspects of the task: time, number of actions, annotator satisfaction, inter-annotator agreement, quality and number of the resulting annotations. The analysis of annotation results showed that the number of required hand annotations is 28.9% less when using pre-annotated results from automatic tools. As a result, the overall annotation time was substantially lower when pre-annotations were used, while inter-annotator agreement was significantly higher. In addition, there was no statistically significant difference in the semantic distribution or number of annotations produced when pre-annotations were used. The annotated query corpus is freely available to the research community. This study shows that automatic pre-annotations are found helpful by most annotators. Our experience suggests using an automatic tool to assist large-scale manual annotation projects. This helps speed-up the annotation time and improve annotation consistency while maintaining high quality of the final annotations. PMID:21094696
Utilization of Automatic Tagging Using Web Information to Datamining

NASA Astrophysics Data System (ADS)

Sugimura, Hiroshi; Matsumoto, Kazunori

This paper proposes a data annotation system using the automatic tagging approach. Although annotations of data are useful for deep analysis and mining of it, the cost of providing them becomes huge in most of the cases. In order to solve this problem, we develop a semi-automatic method that consists of two stages. In the first stage, it searches the Web space for relating information, and discovers candidates of effective annotations. The second stage uses knowledge of a human user. The candidates are investigated and refined by the user, and then they become annotations. We in this paper focus on time-series data, and show effectiveness of a GUI tool that supports the above process.
Towards Automated Annotation of Benthic Survey Images: Variability of Human Experts and Operational Modes of Automation

PubMed Central

Beijbom, Oscar; Edmunds, Peter J.; Roelfsema, Chris; Smith, Jennifer; Kline, David I.; Neal, Benjamin P.; Dunlap, Matthew J.; Moriarty, Vincent; Fan, Tung-Yung; Tan, Chih-Jui; Chan, Stephen; Treibitz, Tali; Gamst, Anthony; Mitchell, B. Greg; Kriegman, David

2015-01-01

Global climate change and other anthropogenic stressors have heightened the need to rapidly characterize ecological changes in marine benthic communities across large scales. Digital photography enables rapid collection of survey images to meet this need, but the subsequent image annotation is typically a time consuming, manual task. We investigated the feasibility of using automated point-annotation to expedite cover estimation of the 17 dominant benthic categories from survey-images captured at four Pacific coral reefs. Inter- and intra- annotator variability among six human experts was quantified and compared to semi- and fully- automated annotation methods, which are made available at coralnet.ucsd.edu. Our results indicate high expert agreement for identification of coral genera, but lower agreement for algal functional groups, in particular between turf algae and crustose coralline algae. This indicates the need for unequivocal definitions of algal groups, careful training of multiple annotators, and enhanced imaging technology. Semi-automated annotation, where 50% of the annotation decisions were performed automatically, yielded cover estimate errors comparable to those of the human experts. Furthermore, fully-automated annotation yielded rapid, unbiased cover estimates but with increased variance. These results show that automated annotation can increase spatial coverage and decrease time and financial outlay for image-based reef surveys. PMID:26154157
Supporting the annotation of chronic obstructive pulmonary disease (COPD) phenotypes with text mining workflows.

PubMed

Fu, Xiao; Batista-Navarro, Riza; Rak, Rafal; Ananiadou, Sophia

2015-01-01

Chronic obstructive pulmonary disease (COPD) is a life-threatening lung disorder whose recent prevalence has led to an increasing burden on public healthcare. Phenotypic information in electronic clinical records is essential in providing suitable personalised treatment to patients with COPD. However, as phenotypes are often "hidden" within free text in clinical records, clinicians could benefit from text mining systems that facilitate their prompt recognition. This paper reports on a semi-automatic methodology for producing a corpus that can ultimately support the development of text mining tools that, in turn, will expedite the process of identifying groups of COPD patients. A corpus of 30 full-text papers was formed based on selection criteria informed by the expertise of COPD specialists. We developed an annotation scheme that is aimed at producing fine-grained, expressive and computable COPD annotations without burdening our curators with a highly complicated task. This was implemented in the Argo platform by means of a semi-automatic annotation workflow that integrates several text mining tools, including a graphical user interface for marking up documents. When evaluated using gold standard (i.e., manually validated) annotations, the semi-automatic workflow was shown to obtain a micro-averaged F-score of 45.70% (with relaxed matching). Utilising the gold standard data to train new concept recognisers, we demonstrated that our corpus, although still a work in progress, can foster the development of significantly better performing COPD phenotype extractors. We describe in this work the means by which we aim to eventually support the process of COPD phenotype curation, i.e., by the application of various text mining tools integrated into an annotation workflow. Although the corpus being described is still under development, our results thus far are encouraging and show great potential in stimulating the development of further automatic COPD phenotype extractors.
Combining evidence, biomedical literature and statistical dependence: new insights for functional annotation of gene sets

PubMed Central

Aubry, Marc; Monnier, Annabelle; Chicault, Celine; de Tayrac, Marie; Galibert, Marie-Dominique; Burgun, Anita; Mosser, Jean

2006-01-01

Background Large-scale genomic studies based on transcriptome technologies provide clusters of genes that need to be functionally annotated. The Gene Ontology (GO) implements a controlled vocabulary organised into three hierarchies: cellular components, molecular functions and biological processes. This terminology allows a coherent and consistent description of the knowledge about gene functions. The GO terms related to genes come primarily from semi-automatic annotations made by trained biologists (annotation based on evidence) or text-mining of the published scientific literature (literature profiling). Results We report an original functional annotation method based on a combination of evidence and literature that overcomes the weaknesses and the limitations of each approach. It relies on the Gene Ontology Annotation database (GOA Human) and the PubGene biomedical literature index. We support these annotations with statistically associated GO terms and retrieve associative relations across the three GO hierarchies to emphasise the major pathways involved by a gene cluster. Both annotation methods and associative relations were quantitatively evaluated with a reference set of 7397 genes and a multi-cluster study of 14 clusters. We also validated the biological appropriateness of our hybrid method with the annotation of a single gene (cdc2) and that of a down-regulated cluster of 37 genes identified by a transcriptome study of an in vitro enterocyte differentiation model (CaCo-2 cells). Conclusion The combination of both approaches is more informative than either separate approach: literature mining can enrich an annotation based only on evidence. Text-mining of the literature can also find valuable associated MEDLINE references that confirm the relevance of the annotation. Eventually, GO terms networks can be built with associative relations in order to highlight cooperative and competitive pathways and their connected molecular functions. PMID:16674810
A semi-supervised learning framework for biomedical event extraction based on hidden topics.

PubMed

Zhou, Deyu; Zhong, Dayou

2015-05-01

Scientists have devoted decades of efforts to understanding the interaction between proteins or RNA production. The information might empower the current knowledge on drug reactions or the development of certain diseases. Nevertheless, due to the lack of explicit structure, literature in life science, one of the most important sources of this information, prevents computer-based systems from accessing. Therefore, biomedical event extraction, automatically acquiring knowledge of molecular events in research articles, has attracted community-wide efforts recently. Most approaches are based on statistical models, requiring large-scale annotated corpora to precisely estimate models' parameters. However, it is usually difficult to obtain in practice. Therefore, employing un-annotated data based on semi-supervised learning for biomedical event extraction is a feasible solution and attracts more interests. In this paper, a semi-supervised learning framework based on hidden topics for biomedical event extraction is presented. In this framework, sentences in the un-annotated corpus are elaborately and automatically assigned with event annotations based on their distances to these sentences in the annotated corpus. More specifically, not only the structures of the sentences, but also the hidden topics embedded in the sentences are used for describing the distance. The sentences and newly assigned event annotations, together with the annotated corpus, are employed for training. Experiments were conducted on the multi-level event extraction corpus, a golden standard corpus. Experimental results show that more than 2.2% improvement on F-score on biomedical event extraction is achieved by the proposed framework when compared to the state-of-the-art approach. The results suggest that by incorporating un-annotated data, the proposed framework indeed improves the performance of the state-of-the-art event extraction system and the similarity between sentences might be precisely described by hidden topics and structures of the sentences. Copyright © 2015 Elsevier B.V. All rights reserved.
Semi Automatic Ontology Instantiation in the domain of Risk Management

NASA Astrophysics Data System (ADS)

Makki, Jawad; Alquier, Anne-Marie; Prince, Violaine

One of the challenging tasks in the context of Ontological Engineering is to automatically or semi-automatically support the process of Ontology Learning and Ontology Population from semi-structured documents (texts). In this paper we describe a Semi-Automatic Ontology Instantiation method from natural language text, in the domain of Risk Management. This method is composed from three steps 1 ) Annotation with part-of-speech tags, 2) Semantic Relation Instances Extraction, 3) Ontology instantiation process. It's based on combined NLP techniques using human intervention between steps 2 and 3 for control and validation. Since it heavily relies on linguistic knowledge it is not domain dependent which is a good feature for portability between the different fields of risk management application. The proposed methodology uses the ontology of the PRIMA1 project (supported by the European community) as a Generic Domain Ontology and populates it via an available corpus. A first validation of the approach is done through an experiment with Chemical Fact Sheets from Environmental Protection Agency2.
Argo: enabling the development of bespoke workflows and services for disease annotation.

PubMed

Batista-Navarro, Riza; Carter, Jacob; Ananiadou, Sophia

2016-01-01

Argo (http://argo.nactem.ac.uk) is a generic text mining workbench that can cater to a variety of use cases, including the semi-automatic annotation of literature. It enables its technical users to build their own customised text mining solutions by providing a wide array of interoperable and configurable elementary components that can be seamlessly integrated into processing workflows. With Argo's graphical annotation interface, domain experts can then make use of the workflows' automatically generated output to curate information of interest.With the continuously rising need to understand the aetiology of diseases as well as the demand for their informed diagnosis and personalised treatment, the curation of disease-relevant information from medical and clinical documents has become an indispensable scientific activity. In the Fifth BioCreative Challenge Evaluation Workshop (BioCreative V), there was substantial interest in the mining of literature for disease-relevant information. Apart from a panel discussion focussed on disease annotations, the chemical-disease relations (CDR) track was also organised to foster the sharing and advancement of disease annotation tools and resources.This article presents the application of Argo's capabilities to the literature-based annotation of diseases. As part of our participation in BioCreative V's User Interactive Track (IAT), we demonstrated and evaluated Argo's suitability to the semi-automatic curation of chronic obstructive pulmonary disease (COPD) phenotypes. Furthermore, the workbench facilitated the development of some of the CDR track's top-performing web services for normalising disease mentions against the Medical Subject Headings (MeSH) database. In this work, we highlight Argo's support for developing various types of bespoke workflows ranging from ones which enabled us to easily incorporate information from various databases, to those which train and apply machine learning-based concept recognition models, through to user-interactive ones which allow human curators to manually provide their corrections to automatically generated annotations. Our participation in the BioCreative V challenges shows Argo's potential as an enabling technology for curating disease and phenotypic information from literature.Database URL: http://argo.nactem.ac.uk. © The Author(s) 2016. Published by Oxford University Press.
Argo: enabling the development of bespoke workflows and services for disease annotation

PubMed Central

Batista-Navarro, Riza; Carter, Jacob; Ananiadou, Sophia

2016-01-01

Argo (http://argo.nactem.ac.uk) is a generic text mining workbench that can cater to a variety of use cases, including the semi-automatic annotation of literature. It enables its technical users to build their own customised text mining solutions by providing a wide array of interoperable and configurable elementary components that can be seamlessly integrated into processing workflows. With Argo's graphical annotation interface, domain experts can then make use of the workflows' automatically generated output to curate information of interest. With the continuously rising need to understand the aetiology of diseases as well as the demand for their informed diagnosis and personalised treatment, the curation of disease-relevant information from medical and clinical documents has become an indispensable scientific activity. In the Fifth BioCreative Challenge Evaluation Workshop (BioCreative V), there was substantial interest in the mining of literature for disease-relevant information. Apart from a panel discussion focussed on disease annotations, the chemical-disease relations (CDR) track was also organised to foster the sharing and advancement of disease annotation tools and resources. This article presents the application of Argo’s capabilities to the literature-based annotation of diseases. As part of our participation in BioCreative V’s User Interactive Track (IAT), we demonstrated and evaluated Argo’s suitability to the semi-automatic curation of chronic obstructive pulmonary disease (COPD) phenotypes. Furthermore, the workbench facilitated the development of some of the CDR track’s top-performing web services for normalising disease mentions against the Medical Subject Headings (MeSH) database. In this work, we highlight Argo’s support for developing various types of bespoke workflows ranging from ones which enabled us to easily incorporate information from various databases, to those which train and apply machine learning-based concept recognition models, through to user-interactive ones which allow human curators to manually provide their corrections to automatically generated annotations. Our participation in the BioCreative V challenges shows Argo’s potential as an enabling technology for curating disease and phenotypic information from literature. Database URL: http://argo.nactem.ac.uk PMID:27189607
Comparison of a semi-automatic annotation tool and a natural language processing application for the generation of clinical statement entries.

PubMed

Lin, Ching-Heng; Wu, Nai-Yuan; Lai, Wei-Shao; Liou, Der-Ming

2015-01-01

Electronic medical records with encoded entries should enhance the semantic interoperability of document exchange. However, it remains a challenge to encode the narrative concept and to transform the coded concepts into a standard entry-level document. This study aimed to use a novel approach for the generation of entry-level interoperable clinical documents. Using HL7 clinical document architecture (CDA) as the example, we developed three pipelines to generate entry-level CDA documents. The first approach was a semi-automatic annotation pipeline (SAAP), the second was a natural language processing (NLP) pipeline, and the third merged the above two pipelines. We randomly selected 50 test documents from the i2b2 corpora to evaluate the performance of the three pipelines. The 50 randomly selected test documents contained 9365 words, including 588 Observation terms and 123 Procedure terms. For the Observation terms, the merged pipeline had a significantly higher F-measure than the NLP pipeline (0.89 vs 0.80, p<0.0001), but a similar F-measure to that of the SAAP (0.89 vs 0.87). For the Procedure terms, the F-measure was not significantly different among the three pipelines. The combination of a semi-automatic annotation approach and the NLP application seems to be a solution for generating entry-level interoperable clinical documents. © The Author 2014. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.comFor numbered affiliation see end of article.
Semantator: semantic annotator for converting biomedical text to linked data.

PubMed

Tao, Cui; Song, Dezhao; Sharma, Deepak; Chute, Christopher G

2013-10-01

More than 80% of biomedical data is embedded in plain text. The unstructured nature of these text-based documents makes it challenging to easily browse and query the data of interest in them. One approach to facilitate browsing and querying biomedical text is to convert the plain text to a linked web of data, i.e., converting data originally in free text to structured formats with defined meta-level semantics. In this paper, we introduce Semantator (Semantic Annotator), a semantic-web-based environment for annotating data of interest in biomedical documents, browsing and querying the annotated data, and interactively refining annotation results if needed. Through Semantator, information of interest can be either annotated manually or semi-automatically using plug-in information extraction tools. The annotated results will be stored in RDF and can be queried using the SPARQL query language. In addition, semantic reasoners can be directly applied to the annotated data for consistency checking and knowledge inference. Semantator has been released online and was used by the biomedical ontology community who provided positive feedbacks. Our evaluation results indicated that (1) Semantator can perform the annotation functionalities as designed; (2) Semantator can be adopted in real applications in clinical and transactional research; and (3) the annotated results using Semantator can be easily used in Semantic-web-based reasoning tools for further inference. Copyright © 2013 Elsevier Inc. All rights reserved.
Sma3s: a three-step modular annotator for large sequence datasets.

PubMed

Muñoz-Mérida, Antonio; Viguera, Enrique; Claros, M Gonzalo; Trelles, Oswaldo; Pérez-Pulido, Antonio J

2014-08-01

Automatic sequence annotation is an essential component of modern 'omics' studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ~85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes. © The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
Clever generation of rich SPARQL queries from annotated relational schema: application to Semantic Web Service creation for biological databases.

PubMed

Wollbrett, Julien; Larmande, Pierre; de Lamotte, Frédéric; Ruiz, Manuel

2013-04-15

In recent years, a large amount of "-omics" data have been produced. However, these data are stored in many different species-specific databases that are managed by different institutes and laboratories. Biologists often need to find and assemble data from disparate sources to perform certain analyses. Searching for these data and assembling them is a time-consuming task. The Semantic Web helps to facilitate interoperability across databases. A common approach involves the development of wrapper systems that map a relational database schema onto existing domain ontologies. However, few attempts have been made to automate the creation of such wrappers. We developed a framework, named BioSemantic, for the creation of Semantic Web Services that are applicable to relational biological databases. This framework makes use of both Semantic Web and Web Services technologies and can be divided into two main parts: (i) the generation and semi-automatic annotation of an RDF view; and (ii) the automatic generation of SPARQL queries and their integration into Semantic Web Services backbones. We have used our framework to integrate genomic data from different plant databases. BioSemantic is a framework that was designed to speed integration of relational databases. We present how it can be used to speed the development of Semantic Web Services for existing relational biological databases. Currently, it creates and annotates RDF views that enable the automatic generation of SPARQL queries. Web Services are also created and deployed automatically, and the semantic annotations of our Web Services are added automatically using SAWSDL attributes. BioSemantic is downloadable at http://southgreen.cirad.fr/?q=content/Biosemantic.
Clever generation of rich SPARQL queries from annotated relational schema: application to Semantic Web Service creation for biological databases

PubMed Central

2013-01-01

Background In recent years, a large amount of “-omics” data have been produced. However, these data are stored in many different species-specific databases that are managed by different institutes and laboratories. Biologists often need to find and assemble data from disparate sources to perform certain analyses. Searching for these data and assembling them is a time-consuming task. The Semantic Web helps to facilitate interoperability across databases. A common approach involves the development of wrapper systems that map a relational database schema onto existing domain ontologies. However, few attempts have been made to automate the creation of such wrappers. Results We developed a framework, named BioSemantic, for the creation of Semantic Web Services that are applicable to relational biological databases. This framework makes use of both Semantic Web and Web Services technologies and can be divided into two main parts: (i) the generation and semi-automatic annotation of an RDF view; and (ii) the automatic generation of SPARQL queries and their integration into Semantic Web Services backbones. We have used our framework to integrate genomic data from different plant databases. Conclusions BioSemantic is a framework that was designed to speed integration of relational databases. We present how it can be used to speed the development of Semantic Web Services for existing relational biological databases. Currently, it creates and annotates RDF views that enable the automatic generation of SPARQL queries. Web Services are also created and deployed automatically, and the semantic annotations of our Web Services are added automatically using SAWSDL attributes. BioSemantic is downloadable at http://southgreen.cirad.fr/?q=content/Biosemantic. PMID:23586394
Semantic annotation of Web data applied to risk in food.

PubMed

Hignette, Gaëlle; Buche, Patrice; Couvert, Olivier; Dibie-Barthélemy, Juliette; Doussot, David; Haemmerlé, Ollivier; Mettler, Eric; Soler, Lydie

2008-11-30

A preliminary step to risk in food assessment is the gathering of experimental data. In the framework of the Sym'Previus project (http://www.symprevius.org), a complete data integration system has been designed, grouping data provided by industrial partners and data extracted from papers published in the main scientific journals of the domain. Those data have been classified by means of a predefined vocabulary, called ontology. Our aim is to complement the database with data extracted from the Web. In the framework of the WebContent project (www.webcontent.fr), we have designed a semi-automatic acquisition tool, called @WEB, which retrieves scientific documents from the Web. During the @WEB process, data tables are extracted from the documents and then annotated with the ontology. We focus on the data tables as they contain, in general, a synthesis of data published in the documents. In this paper, we explain how the columns of the data tables are automatically annotated with data types of the ontology and how the relations represented by the table are recognised. We also give the results of our experimentation to assess the quality of such an annotation.
BrEPS 2.0: Optimization of sequence pattern prediction for enzyme annotation.

PubMed

Dudek, Christian-Alexander; Dannheim, Henning; Schomburg, Dietmar

2017-01-01

The prediction of gene functions is crucial for a large number of different life science areas. Faster high throughput sequencing techniques generate more and larger datasets. The manual annotation by classical wet-lab experiments is not suitable for these large amounts of data. We showed earlier that the automatic sequence pattern-based BrEPS protocol, based on manually curated sequences, can be used for the prediction of enzymatic functions of genes. The growing sequence databases provide the opportunity for more reliable patterns, but are also a challenge for the implementation of automatic protocols. We reimplemented and optimized the BrEPS pattern generation to be applicable for larger datasets in an acceptable timescale. Primary improvement of the new BrEPS protocol is the enhanced data selection step. Manually curated annotations from Swiss-Prot are used as reliable source for function prediction of enzymes observed on protein level. The pool of sequences is extended by highly similar sequences from TrEMBL and SwissProt. This allows us to restrict the selection of Swiss-Prot entries, without losing the diversity of sequences needed to generate significant patterns. Additionally, a supporting pattern type was introduced by extending the patterns at semi-conserved positions with highly similar amino acids. Extended patterns have an increased complexity, increasing the chance to match more sequences, without losing the essential structural information of the pattern. To enhance the usability of the database, we introduced enzyme function prediction based on consensus EC numbers and IUBMB enzyme nomenclature. BrEPS is part of the Braunschweig Enzyme Database (BRENDA) and is available on a completely redesigned website and as download. The database can be downloaded and used with the BrEPScmd command line tool for large scale sequence analysis. The BrEPS website and downloads for the database creation tool, command line tool and database are freely accessible at http://breps.tu-bs.de.
BrEPS 2.0: Optimization of sequence pattern prediction for enzyme annotation

PubMed Central

Schomburg, Dietmar

2017-01-01

The prediction of gene functions is crucial for a large number of different life science areas. Faster high throughput sequencing techniques generate more and larger datasets. The manual annotation by classical wet-lab experiments is not suitable for these large amounts of data. We showed earlier that the automatic sequence pattern-based BrEPS protocol, based on manually curated sequences, can be used for the prediction of enzymatic functions of genes. The growing sequence databases provide the opportunity for more reliable patterns, but are also a challenge for the implementation of automatic protocols. We reimplemented and optimized the BrEPS pattern generation to be applicable for larger datasets in an acceptable timescale. Primary improvement of the new BrEPS protocol is the enhanced data selection step. Manually curated annotations from Swiss-Prot are used as reliable source for function prediction of enzymes observed on protein level. The pool of sequences is extended by highly similar sequences from TrEMBL and SwissProt. This allows us to restrict the selection of Swiss-Prot entries, without losing the diversity of sequences needed to generate significant patterns. Additionally, a supporting pattern type was introduced by extending the patterns at semi-conserved positions with highly similar amino acids. Extended patterns have an increased complexity, increasing the chance to match more sequences, without losing the essential structural information of the pattern. To enhance the usability of the database, we introduced enzyme function prediction based on consensus EC numbers and IUBMB enzyme nomenclature. BrEPS is part of the Braunschweig Enzyme Database (BRENDA) and is available on a completely redesigned website and as download. The database can be downloaded and used with the BrEPScmd command line tool for large scale sequence analysis. The BrEPS website and downloads for the database creation tool, command line tool and database are freely accessible at http://breps.tu-bs.de. PMID:28750104
Extending bicluster analysis to annotate unclassified ORFs and predict novel functional modules using expression data

PubMed Central

Bryan, Kenneth; Cunningham, Pádraig

2008-01-01

Background Microarrays have the capacity to measure the expressions of thousands of genes in parallel over many experimental samples. The unsupervised classification technique of bicluster analysis has been employed previously to uncover gene expression correlations over subsets of samples with the aim of providing a more accurate model of the natural gene functional classes. This approach also has the potential to aid functional annotation of unclassified open reading frames (ORFs). Until now this aspect of biclustering has been under-explored. In this work we illustrate how bicluster analysis may be extended into a 'semi-supervised' ORF annotation approach referred to as BALBOA. Results The efficacy of the BALBOA ORF classification technique is first assessed via cross validation and compared to a multi-class k-Nearest Neighbour (kNN) benchmark across three independent gene expression datasets. BALBOA is then used to assign putative functional annotations to unclassified yeast ORFs. These predictions are evaluated using existing experimental and protein sequence information. Lastly, we employ a related semi-supervised method to predict the presence of novel functional modules within yeast. Conclusion In this paper we demonstrate how unsupervised classification methods, such as bicluster analysis, may be extended using of available annotations to form semi-supervised approaches within the gene expression analysis domain. We show that such methods have the potential to improve upon supervised approaches and shed new light on the functions of unclassified ORFs and their co-regulation. PMID:18831786
LTRsift: a graphical user interface for semi-automatic classification and postprocessing of de novo detected LTR retrotransposons

PubMed Central

2012-01-01

Background Long terminal repeat (LTR) retrotransposons are a class of eukaryotic mobile elements characterized by a distinctive sequence similarity-based structure. Hence they are well suited for computational identification. Current software allows for a comprehensive genome-wide de novo detection of such elements. The obvious next step is the classification of newly detected candidates resulting in (super-)families. Such a de novo classification approach based on sequence-based clustering of transposon features has been proposed before, resulting in a preliminary assignment of candidates to families as a basis for subsequent manual refinement. However, such a classification workflow is typically split across a heterogeneous set of glue scripts and generic software (for example, spreadsheets), making it tedious for a human expert to inspect, curate and export the putative families produced by the workflow. Results We have developed LTRsift, an interactive graphical software tool for semi-automatic postprocessing of de novo predicted LTR retrotransposon annotations. Its user-friendly interface offers customizable filtering and classification functionality, displaying the putative candidate groups, their members and their internal structure in a hierarchical fashion. To ease manual work, it also supports graphical user interface-driven reassignment, splitting and further annotation of candidates. Export of grouped candidate sets in standard formats is possible. In two case studies, we demonstrate how LTRsift can be employed in the context of a genome-wide LTR retrotransposon survey effort. Conclusions LTRsift is a useful and convenient tool for semi-automated classification of newly detected LTR retrotransposons based on their internal features. Its efficient implementation allows for convenient and seamless filtering and classification in an integrated environment. Developed for life scientists, it is helpful in postprocessing and refining the output of software for predicting LTR retrotransposons up to the stage of preparing full-length reference sequence libraries. The LTRsift software is freely available at http://www.zbh.uni-hamburg.de/LTRsift under an open-source license. PMID:23131050

LTRsift: a graphical user interface for semi-automatic classification and postprocessing of de novo detected LTR retrotransposons.

PubMed

Steinbiss, Sascha; Kastens, Sascha; Kurtz, Stefan

2012-11-07

Long terminal repeat (LTR) retrotransposons are a class of eukaryotic mobile elements characterized by a distinctive sequence similarity-based structure. Hence they are well suited for computational identification. Current software allows for a comprehensive genome-wide de novo detection of such elements. The obvious next step is the classification of newly detected candidates resulting in (super-)families. Such a de novo classification approach based on sequence-based clustering of transposon features has been proposed before, resulting in a preliminary assignment of candidates to families as a basis for subsequent manual refinement. However, such a classification workflow is typically split across a heterogeneous set of glue scripts and generic software (for example, spreadsheets), making it tedious for a human expert to inspect, curate and export the putative families produced by the workflow. We have developed LTRsift, an interactive graphical software tool for semi-automatic postprocessing of de novo predicted LTR retrotransposon annotations. Its user-friendly interface offers customizable filtering and classification functionality, displaying the putative candidate groups, their members and their internal structure in a hierarchical fashion. To ease manual work, it also supports graphical user interface-driven reassignment, splitting and further annotation of candidates. Export of grouped candidate sets in standard formats is possible. In two case studies, we demonstrate how LTRsift can be employed in the context of a genome-wide LTR retrotransposon survey effort. LTRsift is a useful and convenient tool for semi-automated classification of newly detected LTR retrotransposons based on their internal features. Its efficient implementation allows for convenient and seamless filtering and classification in an integrated environment. Developed for life scientists, it is helpful in postprocessing and refining the output of software for predicting LTR retrotransposons up to the stage of preparing full-length reference sequence libraries. The LTRsift software is freely available at http://www.zbh.uni-hamburg.de/LTRsift under an open-source license.
GFam: a platform for automatic annotation of gene families.

PubMed

Sasidharan, Rajkumar; Nepusz, Tamás; Swarbreck, David; Huala, Eva; Paccanaro, Alberto

2012-10-01

We have developed GFam, a platform for automatic annotation of gene/protein families. GFam provides a framework for genome initiatives and model organism resources to build domain-based families, derive meaningful functional labels and offers a seamless approach to propagate functional annotation across periodic genome updates. GFam is a hybrid approach that uses a greedy algorithm to chain component domains from InterPro annotation provided by its 12 member resources followed by a sequence-based connected component analysis of un-annotated sequence regions to derive consensus domain architecture for each sequence and subsequently generate families based on common architectures. Our integrated approach increases sequence coverage by 7.2 percentage points and residue coverage by 14.6 percentage points higher than the coverage relative to the best single-constituent database within InterPro for the proteome of Arabidopsis. The true power of GFam lies in maximizing annotation provided by the different InterPro data sources that offer resource-specific coverage for different regions of a sequence. GFam's capability to capture higher sequence and residue coverage can be useful for genome annotation, comparative genomics and functional studies. GFam is a general-purpose software and can be used for any collection of protein sequences. The software is open source and can be obtained from http://www.paccanarolab.org/software/gfam/.
Association algorithm to mine the rules that govern enzyme definition and to classify protein sequences.

PubMed

Chiu, Shih-Hau; Chen, Chien-Chi; Yuan, Gwo-Fang; Lin, Thy-Hou

2006-06-15

The number of sequences compiled in many genome projects is growing exponentially, but most of them have not been characterized experimentally. An automatic annotation scheme must be in an urgent need to reduce the gap between the amount of new sequences produced and reliable functional annotation. This work proposes rules for automatically classifying the fungus genes. The approach involves elucidating the enzyme classifying rule that is hidden in UniProt protein knowledgebase and then applying it for classification. The association algorithm, Apriori, is utilized to mine the relationship between the enzyme class and significant InterPro entries. The candidate rules are evaluated for their classificatory capacity. There were five datasets collected from the Swiss-Prot for establishing the annotation rules. These were treated as the training sets. The TrEMBL entries were treated as the testing set. A correct enzyme classification rate of 70% was obtained for the prokaryote datasets and a similar rate of about 80% was obtained for the eukaryote datasets. The fungus training dataset which lacks an enzyme class description was also used to evaluate the fungus candidate rules. A total of 88 out of 5085 test entries were matched with the fungus rule set. These were otherwise poorly annotated using their functional descriptions. The feasibility of using the method presented here to classify enzyme classes based on the enzyme domain rules is evident. The rules may be also employed by the protein annotators in manual annotation or implemented in an automatic annotation flowchart.
CpGAVAS, an integrated web server for the annotation, visualization, analysis, and GenBank submission of completely sequenced chloroplast genome sequences

PubMed Central

2012-01-01

Background The complete sequences of chloroplast genomes provide wealthy information regarding the evolutionary history of species. With the advance of next-generation sequencing technology, the number of completely sequenced chloroplast genomes is expected to increase exponentially, powerful computational tools annotating the genome sequences are in urgent need. Results We have developed a web server CPGAVAS. The server accepts a complete chloroplast genome sequence as input. First, it predicts protein-coding and rRNA genes based on the identification and mapping of the most similar, full-length protein, cDNA and rRNA sequences by integrating results from Blastx, Blastn, protein2genome and est2genome programs. Second, tRNA genes and inverted repeats (IR) are identified using tRNAscan, ARAGORN and vmatch respectively. Third, it calculates the summary statistics for the annotated genome. Fourth, it generates a circular map ready for publication. Fifth, it can create a Sequin file for GenBank submission. Last, it allows the extractions of protein and mRNA sequences for given list of genes and species. The annotation results in GFF3 format can be edited using any compatible annotation editing tools. The edited annotations can then be uploaded to CPGAVAS for update and re-analyses repeatedly. Using known chloroplast genome sequences as test set, we show that CPGAVAS performs comparably to another application DOGMA, while having several superior functionalities. Conclusions CPGAVAS allows the semi-automatic and complete annotation of a chloroplast genome sequence, and the visualization, editing and analysis of the annotation results. It will become an indispensible tool for researchers studying chloroplast genomes. The software is freely accessible from http://www.herbalgenomics.org/cpgavas. PMID:23256920
MIPS bacterial genomes functional annotation benchmark dataset.

PubMed

Tetko, Igor V; Brauner, Barbara; Dunger-Kaltenbach, Irmtraud; Frishman, Goar; Montrone, Corinna; Fobo, Gisela; Ruepp, Andreas; Antonov, Alexey V; Surmeli, Dimitrij; Mewes, Hans-Wernen

2005-05-15

Any development of new methods for automatic functional annotation of proteins according to their sequences requires high-quality data (as benchmark) as well as tedious preparatory work to generate sequence parameters required as input data for the machine learning methods. Different program settings and incompatible protocols make a comparison of the analyzed methods difficult. The MIPS Bacterial Functional Annotation Benchmark dataset (MIPS-BFAB) is a new, high-quality resource comprising four bacterial genomes manually annotated according to the MIPS functional catalogue (FunCat). These resources include precalculated sequence parameters, such as sequence similarity scores, InterPro domain composition and other parameters that could be used to develop and benchmark methods for functional annotation of bacterial protein sequences. These data are provided in XML format and can be used by scientists who are not necessarily experts in genome annotation. BFAB is available at http://mips.gsf.de/proj/bfab
Association algorithm to mine the rules that govern enzyme definition and to classify protein sequences

PubMed Central

Chiu, Shih-Hau; Chen, Chien-Chi; Yuan, Gwo-Fang; Lin, Thy-Hou

2006-01-01

Background The number of sequences compiled in many genome projects is growing exponentially, but most of them have not been characterized experimentally. An automatic annotation scheme must be in an urgent need to reduce the gap between the amount of new sequences produced and reliable functional annotation. This work proposes rules for automatically classifying the fungus genes. The approach involves elucidating the enzyme classifying rule that is hidden in UniProt protein knowledgebase and then applying it for classification. The association algorithm, Apriori, is utilized to mine the relationship between the enzyme class and significant InterPro entries. The candidate rules are evaluated for their classificatory capacity. Results There were five datasets collected from the Swiss-Prot for establishing the annotation rules. These were treated as the training sets. The TrEMBL entries were treated as the testing set. A correct enzyme classification rate of 70% was obtained for the prokaryote datasets and a similar rate of about 80% was obtained for the eukaryote datasets. The fungus training dataset which lacks an enzyme class description was also used to evaluate the fungus candidate rules. A total of 88 out of 5085 test entries were matched with the fungus rule set. These were otherwise poorly annotated using their functional descriptions. Conclusion The feasibility of using the method presented here to classify enzyme classes based on the enzyme domain rules is evident. The rules may be also employed by the protein annotators in manual annotation or implemented in an automatic annotation flowchart. PMID:16776838
PipeOnline 2.0: automated EST processing and functional data sorting.

PubMed

Ayoubi, Patricia; Jin, Xiaojing; Leite, Saul; Liu, Xianghui; Martajaja, Jeson; Abduraham, Abdurashid; Wan, Qiaolan; Yan, Wei; Misawa, Eduardo; Prade, Rolf A

2002-11-01

Expressed sequence tags (ESTs) are generated and deposited in the public domain, as redundant, unannotated, single-pass reactions, with virtually no biological content. PipeOnline automatically analyses and transforms large collections of raw DNA-sequence data from chromatograms or FASTA files by calling the quality of bases, screening and removing vector sequences, assembling and rewriting consensus sequences of redundant input files into a unigene EST data set and finally through translation, amino acid sequence similarity searches, annotation of public databases and functional data. PipeOnline generates an annotated database, retaining the processed unigene sequence, clone/file history, alignments with similar sequences, and proposed functional classification, if available. Functional annotation is automatic and based on a novel method that relies on homology of amino acid sequence multiplicity within GenBank records. Records are examined through a function ordered browser or keyword queries with automated export of results. PipeOnline offers customization for individual projects (MyPipeOnline), automated updating and alert service. PipeOnline is available at http://stress-genomics.org.
EUCLID: automatic classification of proteins in functional classes by their database annotations.

PubMed

Tamames, J; Ouzounis, C; Casari, G; Sander, C; Valencia, A

1998-01-01

A tool is described for the automatic classification of sequences in functional classes using their database annotations. The Euclid system is based on a simple learning procedure from examples provided by human experts. Euclid is freely available for academics at http://www.gredos.cnb.uam.es/EUCLID, with the corresponding dictionaries for the generation of three, eight and 14 functional classes. E-mail: valencia@cnb.uam.es The results of the EUCLID classification of different genomes are available at http://www.sander.ebi.ac. uk/genequiz/. A detailed description of the different applications mentioned in the text is available at http://www.gredos.cnb.uam. es/EUCLID/Full_Paper
Current and future trends in marine image annotation software

NASA Astrophysics Data System (ADS)

Gomes-Pereira, Jose Nuno; Auger, Vincent; Beisiegel, Kolja; Benjamin, Robert; Bergmann, Melanie; Bowden, David; Buhl-Mortensen, Pal; De Leo, Fabio C.; Dionísio, Gisela; Durden, Jennifer M.; Edwards, Luke; Friedman, Ariell; Greinert, Jens; Jacobsen-Stout, Nancy; Lerner, Steve; Leslie, Murray; Nattkemper, Tim W.; Sameoto, Jessica A.; Schoening, Timm; Schouten, Ronald; Seager, James; Singh, Hanumant; Soubigou, Olivier; Tojeira, Inês; van den Beld, Inge; Dias, Frederico; Tempera, Fernando; Santos, Ricardo S.

2016-12-01

Given the need to describe, analyze and index large quantities of marine imagery data for exploration and monitoring activities, a range of specialized image annotation tools have been developed worldwide. Image annotation - the process of transposing objects or events represented in a video or still image to the semantic level, may involve human interactions and computer-assisted solutions. Marine image annotation software (MIAS) have enabled over 500 publications to date. We review the functioning, application trends and developments, by comparing general and advanced features of 23 different tools utilized in underwater image analysis. MIAS requiring human input are basically a graphical user interface, with a video player or image browser that recognizes a specific time code or image code, allowing to log events in a time-stamped (and/or geo-referenced) manner. MIAS differ from similar software by the capability of integrating data associated to video collection, the most simple being the position coordinates of the video recording platform. MIAS have three main characteristics: annotating events in real time, posteriorly to annotation and interact with a database. These range from simple annotation interfaces, to full onboard data management systems, with a variety of toolboxes. Advanced packages allow to input and display data from multiple sensors or multiple annotators via intranet or internet. Posterior human-mediated annotation often include tools for data display and image analysis, e.g. length, area, image segmentation, point count; and in a few cases the possibility of browsing and editing previous dive logs or to analyze the annotations. The interaction with a database allows the automatic integration of annotations from different surveys, repeated annotation and collaborative annotation of shared datasets, browsing and querying of data. Progress in the field of automated annotation is mostly in post processing, for stable platforms or still images. Integration into available MIAS is currently limited to semi-automated processes of pixel recognition through computer-vision modules that compile expert-based knowledge. Important topics aiding the choice of a specific software are outlined, the ideal software is discussed and future trends are presented.
FOCIH: Form-Based Ontology Creation and Information Harvesting

NASA Astrophysics Data System (ADS)

Tao, Cui; Embley, David W.; Liddle, Stephen W.

Creating an ontology and populating it with data are both labor-intensive tasks requiring a high degree of expertise. Thus, scaling ontology creation and population to the size of the web in an effort to create a web of data—which some see as Web 3.0—is prohibitive. Can we find ways to streamline these tasks and lower the barrier enough to enable Web 3.0? Toward this end we offer a form-based approach to ontology creation that provides a way to create Web 3.0 ontologies without the need for specialized training. And we offer a way to semi-automatically harvest data from the current web of pages for a Web 3.0 ontology. In addition to harvesting information with respect to an ontology, the approach also annotates web pages and links facts in web pages to ontological concepts, resulting in a web of data superimposed over the web of pages. Experience with our prototype system shows that mappings between conceptual-model-based ontologies and forms are sufficient for creating the kind of ontologies needed for Web 3.0, and experiments with our prototype system show that automatic harvesting, automatic annotation, and automatic superimposition of a web of data over a web of pages work well.
MPEG-7 based video annotation and browsing

NASA Astrophysics Data System (ADS)

Hoeynck, Michael; Auweiler, Thorsten; Wellhausen, Jens

2003-11-01

The huge amount of multimedia data produced worldwide requires annotation in order to enable universal content access and to provide content-based search-and-retrieval functionalities. Since manual video annotation can be time consuming, automatic annotation systems are required. We review recent approaches to content-based indexing and annotation of videos for different kind of sports and describe our approach to automatic annotation of equestrian sports videos. We especially concentrate on MPEG-7 based feature extraction and content description, where we apply different visual descriptors for cut detection. Further, we extract the temporal positions of single obstacles on the course by analyzing MPEG-7 edge information. Having determined single shot positions as well as the visual highlights, the information is jointly stored with meta-textual information in an MPEG-7 description scheme. Based on this information, we generate content summaries which can be utilized in a user-interface in order to provide content-based access to the video stream, but further for media browsing on a streaming server.
A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC

PubMed Central

Clematide, Simon; Akhondi, Saber A; van Mulligen, Erik M; Rebholz-Schuhmann, Dietrich

2015-01-01

Objective To create a multilingual gold-standard corpus for biomedical concept recognition. Materials and methods We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified Medical Language System and covering a wide range of semantic groups. To reduce the annotation workload, automatically generated preannotations were provided. Individual annotations were automatically harmonized and then adjudicated, and cross-language consistency checks were carried out to arrive at the final annotations. Results The number of final annotations was 5530. Inter-annotator agreement scores indicate good agreement (median F-score 0.79), and are similar to those between individual annotators and the gold standard. The automatically generated harmonized annotation set for each language performed equally well as the best annotator for that language. Discussion The use of automatic preannotations, harmonized annotations, and parallel corpora helped to keep the manual annotation efforts manageable. The inter-annotator agreement scores provide a reference standard for gauging the performance of automatic annotation techniques. Conclusion To our knowledge, this is the first gold-standard corpus for biomedical concept recognition in languages other than English. Other distinguishing features are the wide variety of semantic groups that are being covered, and the diversity of text genres that were annotated. PMID:25948699
Automatic annotation of protein motif function with Gene Ontology terms.

PubMed

Lu, Xinghua; Zhai, Chengxiang; Gopalakrishnan, Vanathi; Buchanan, Bruce G

2004-09-02

Conserved protein sequence motifs are short stretches of amino acid sequence patterns that potentially encode the function of proteins. Several sequence pattern searching algorithms and programs exist foridentifying candidate protein motifs at the whole genome level. However, a much needed and important task is to determine the functions of the newly identified protein motifs. The Gene Ontology (GO) project is an endeavor to annotate the function of genes or protein sequences with terms from a dynamic, controlled vocabulary and these annotations serve well as a knowledge base. This paper presents methods to mine the GO knowledge base and use the association between the GO terms assigned to a sequence and the motifs matched by the same sequence as evidence for predicting the functions of novel protein motifs automatically. The task of assigning GO terms to protein motifs is viewed as both a binary classification and information retrieval problem, where PROSITE motifs are used as samples for mode training and functional prediction. The mutual information of a motif and aGO term association is found to be a very useful feature. We take advantage of the known motifs to train a logistic regression classifier, which allows us to combine mutual information with other frequency-based features and obtain a probability of correct association. The trained logistic regression model has intuitively meaningful and logically plausible parameter values, and performs very well empirically according to our evaluation criteria. In this research, different methods for automatic annotation of protein motifs have been investigated. Empirical result demonstrated that the methods have a great potential for detecting and augmenting information about the functions of newly discovered candidate protein motifs.
ExAtlas: An interactive online tool for meta-analysis of gene expression data.

PubMed

Sharov, Alexei A; Schlessinger, David; Ko, Minoru S H

2015-12-01

We have developed ExAtlas, an on-line software tool for meta-analysis and visualization of gene expression data. In contrast to existing software tools, ExAtlas compares multi-component data sets and generates results for all combinations (e.g. all gene expression profiles versus all Gene Ontology annotations). ExAtlas handles both users' own data and data extracted semi-automatically from the public repository (GEO/NCBI database). ExAtlas provides a variety of tools for meta-analyses: (1) standard meta-analysis (fixed effects, random effects, z-score, and Fisher's methods); (2) analyses of global correlations between gene expression data sets; (3) gene set enrichment; (4) gene set overlap; (5) gene association by expression profile; (6) gene specificity; and (7) statistical analysis (ANOVA, pairwise comparison, and PCA). ExAtlas produces graphical outputs, including heatmaps, scatter-plots, bar-charts, and three-dimensional images. Some of the most widely used public data sets (e.g. GNF/BioGPS, Gene Ontology, KEGG, GAD phenotypes, BrainScan, ENCODE ChIP-seq, and protein-protein interaction) are pre-loaded and can be used for functional annotations.
The identification of complete domains within protein sequences using accurate E-values for semi-global alignment

PubMed Central

Kann, Maricel G.; Sheetlin, Sergey L.; Park, Yonil; Bryant, Stephen H.; Spouge, John L.

2007-01-01

The sequencing of complete genomes has created a pressing need for automated annotation of gene function. Because domains are the basic units of protein function and evolution, a gene can be annotated from a domain database by aligning domains to the corresponding protein sequence. Ideally, complete domains are aligned to protein subsequences, in a ‘semi-global alignment’. Local alignment, which aligns pieces of domains to subsequences, is common in high-throughput annotation applications, however. It is a mature technique, with the heuristics and accurate E-values required for screening large databases and evaluating the screening results. Hidden Markov models (HMMs) provide an alternative theoretical framework for semi-global alignment, but their use is limited because they lack heuristic acceleration and accurate E-values. Our new tool, GLOBAL, overcomes some limitations of previous semi-global HMMs: it has accurate E-values and the possibility of the heuristic acceleration required for high-throughput applications. Moreover, according to a standard of truth based on protein structure, two semi-global HMM alignment tools (GLOBAL and HMMer) had comparable performance in identifying complete domains, but distinctly outperformed two tools based on local alignment. When searching for complete protein domains, therefore, GLOBAL avoids disadvantages commonly associated with HMMs, yet maintains their superior retrieval performance. PMID:17596268
A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC.

PubMed

Kors, Jan A; Clematide, Simon; Akhondi, Saber A; van Mulligen, Erik M; Rebholz-Schuhmann, Dietrich

2015-09-01

To create a multilingual gold-standard corpus for biomedical concept recognition. We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified Medical Language System and covering a wide range of semantic groups. To reduce the annotation workload, automatically generated preannotations were provided. Individual annotations were automatically harmonized and then adjudicated, and cross-language consistency checks were carried out to arrive at the final annotations. The number of final annotations was 5530. Inter-annotator agreement scores indicate good agreement (median F-score 0.79), and are similar to those between individual annotators and the gold standard. The automatically generated harmonized annotation set for each language performed equally well as the best annotator for that language. The use of automatic preannotations, harmonized annotations, and parallel corpora helped to keep the manual annotation efforts manageable. The inter-annotator agreement scores provide a reference standard for gauging the performance of automatic annotation techniques. To our knowledge, this is the first gold-standard corpus for biomedical concept recognition in languages other than English. Other distinguishing features are the wide variety of semantic groups that are being covered, and the diversity of text genres that were annotated. © The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association.
Generating Customized Verifiers for Automatically Generated Code

NASA Technical Reports Server (NTRS)

Denney, Ewen; Fischer, Bernd

2008-01-01

Program verification using Hoare-style techniques requires many logical annotations. We have previously developed a generic annotation inference algorithm that weaves in all annotations required to certify safety properties for automatically generated code. It uses patterns to capture generator- and property-specific code idioms and property-specific meta-program fragments to construct the annotations. The algorithm is customized by specifying the code patterns and integrating them with the meta-program fragments for annotation construction. However, this is difficult since it involves tedious and error-prone low-level term manipulations. Here, we describe an annotation schema compiler that largely automates this customization task using generative techniques. It takes a collection of high-level declarative annotation schemas tailored towards a specific code generator and safety property, and generates all customized analysis functions and glue code required for interfacing with the generic algorithm core, thus effectively creating a customized annotation inference algorithm. The compiler raises the level of abstraction and simplifies schema development and maintenance. It also takes care of some more routine aspects of formulating patterns and schemas, in particular handling of irrelevant program fragments and irrelevant variance in the program structure, which reduces the size, complexity, and number of different patterns and annotation schemas that are required. The improvements described here make it easier and faster to customize the system to a new safety property or a new generator, and we demonstrate this by customizing it to certify frame safety of space flight navigation code that was automatically generated from Simulink models by MathWorks' Real-Time Workshop.
Surface smoothness: cartilage biomarkers for knee OA beyond the radiologist

NASA Astrophysics Data System (ADS)

Tummala, Sudhakar; Dam, Erik B.

2010-03-01

Fully automatic imaging biomarkers may allow quantification of patho-physiological processes that a radiologist would not be able to assess reliably. This can introduce new insight but is problematic to validate due to lack of meaningful ground truth expert measurements. Rather than quantification accuracy, such novel markers must therefore be validated against clinically meaningful end-goals such as the ability to allow correct diagnosis. We present a method for automatic cartilage surface smoothness quantification in the knee joint. The quantification is based on a curvature flow method used on tibial and femoral cartilage compartments resulting from an automatic segmentation scheme. These smoothness estimates are validated for their ability to diagnose osteoarthritis and compared to smoothness estimates based on manual expert segmentations and to conventional cartilage volume quantification. We demonstrate that the fully automatic markers eliminate the time required for radiologist annotations, and in addition provide a diagnostic marker superior to the evaluated semi-manual markers.
AutoFACT: An Automatic Functional Annotation and Classification Tool

PubMed Central

Koski, Liisa B; Gray, Michael W; Lang, B Franz; Burger, Gertraud

2005-01-01

Background Assignment of function to new molecular sequence data is an essential step in genomics projects. The usual process involves similarity searches of a given sequence against one or more databases, an arduous process for large datasets. Results We present AutoFACT, a fully automated and customizable annotation tool that assigns biologically informative functions to a sequence. Key features of this tool are that it (1) analyzes nucleotide and protein sequence data; (2) determines the most informative functional description by combining multiple BLAST reports from several user-selected databases; (3) assigns putative metabolic pathways, functional classes, enzyme classes, GeneOntology terms and locus names; and (4) generates output in HTML, text and GFF formats for the user's convenience. We have compared AutoFACT to four well-established annotation pipelines. The error rate of functional annotation is estimated to be only between 1–2%. Comparison of AutoFACT to the traditional top-BLAST-hit annotation method shows that our procedure increases the number of functionally informative annotations by approximately 50%. Conclusion AutoFACT will serve as a useful annotation tool for smaller sequencing groups lacking dedicated bioinformatics staff. It is implemented in PERL and runs on LINUX/UNIX platforms. AutoFACT is available at . PMID:15960857
Caveat emptor: limitations of the automated reconstruction of metabolic pathways in Plasmodium.

PubMed

Ginsburg, Hagai

2009-01-01

The functional reconstruction of metabolic pathways from an annotated genome is a tedious and demanding enterprise. Automation of this endeavor using bioinformatics algorithms could cope with the ever-increasing number of sequenced genomes and accelerate the process. Here, the manual reconstruction of metabolic pathways in the functional genomic database of Plasmodium falciparum--Malaria Parasite Metabolic Pathways--is described and compared with pathways generated automatically as they appear in PlasmoCyc, metaSHARK and the Kyoto Encyclopedia for Genes and Genomes. A critical evaluation of this comparison discloses that the automatic reconstruction of pathways generates manifold paths that need an expert manual verification to accept some and reject most others based on manually curated gene annotation.

Rule Mining Techniques to Predict Prokaryotic Metabolic Pathways.

PubMed

Saidi, Rabie; Boudellioua, Imane; Martin, Maria J; Solovyev, Victor

2017-01-01

It is becoming more evident that computational methods are needed for the identification and the mapping of pathways in new genomes. We introduce an automatic annotation system (ARBA4Path Association Rule-Based Annotator for Pathways) that utilizes rule mining techniques to predict metabolic pathways across wide range of prokaryotes. It was demonstrated that specific combinations of protein domains (recorded in our rules) strongly determine pathways in which proteins are involved and thus provide information that let us very accurately assign pathway membership (with precision of 0.999 and recall of 0.966) to proteins of a given prokaryotic taxon. Our system can be used to enhance the quality of automatically generated annotations as well as annotating proteins with unknown function. The prediction models are represented in the form of human-readable rules, and they can be used effectively to add absent pathway information to many proteins in UniProtKB/TrEMBL database.
iPad: Semantic annotation and markup of radiological images.

PubMed

Rubin, Daniel L; Rodriguez, Cesar; Shah, Priyanka; Beaulieu, Chris

2008-11-06

Radiological images contain a wealth of information,such as anatomy and pathology, which is often not explicit and computationally accessible. Information schemes are being developed to describe the semantic content of images, but such schemes can be unwieldy to operationalize because there are few tools to enable users to capture structured information easily as part of the routine research workflow. We have created iPad, an open source tool enabling researchers and clinicians to create semantic annotations on radiological images. iPad hides the complexity of the underlying image annotation information model from users, permitting them to describe images and image regions using a graphical interface that maps their descriptions to structured ontologies semi-automatically. Image annotations are saved in a variety of formats,enabling interoperability among medical records systems, image archives in hospitals, and the Semantic Web. Tools such as iPad can help reduce the burden of collecting structured information from images, and it could ultimately enable researchers and physicians to exploit images on a very large scale and glean the biological and physiological significance of image content.
Semi-automatic tracking, smoothing and segmentation of hyoid bone motion from videofluoroscopic swallowing study.

PubMed

Kim, Won-Seok; Zeng, Pengcheng; Shi, Jian Qing; Lee, Youngjo; Paik, Nam-Jong

2017-01-01

Motion analysis of the hyoid bone via videofluoroscopic study has been used in clinical research, but the classical manual tracking method is generally labor intensive and time consuming. Although some automatic tracking methods have been developed, masked points could not be tracked and smoothing and segmentation, which are necessary for functional motion analysis prior to registration, were not provided by the previous software. We developed software to track the hyoid bone motion semi-automatically. It works even in the situation where the hyoid bone is masked by the mandible and has been validated in dysphagia patients with stroke. In addition, we added the function of semi-automatic smoothing and segmentation. A total of 30 patients' data were used to develop the software, and data collected from 17 patients were used for validation, of which the trajectories of 8 patients were partly masked. Pearson correlation coefficients between the manual and automatic tracking are high and statistically significant (0.942 to 0.991, P-value<0.0001). Relative errors between automatic tracking and manual tracking in terms of the x-axis, y-axis and 2D range of hyoid bone excursion range from 3.3% to 9.2%. We also developed an automatic method to segment each hyoid bone trajectory into four phases (elevation phase, anterior movement phase, descending phase and returning phase). The semi-automatic hyoid bone tracking from VFSS data by our software is valid compared to the conventional manual tracking method. In addition, the ability of automatic indication to switch the automatic mode to manual mode in extreme cases and calibration without attaching the radiopaque object is convenient and useful for users. Semi-automatic smoothing and segmentation provide further information for functional motion analysis which is beneficial to further statistical analysis such as functional classification and prognostication for dysphagia. Therefore, this software could provide the researchers in the field of dysphagia with a convenient, useful, and all-in-one platform for analyzing the hyoid bone motion. Further development of our method to track the other swallowing related structures or objects such as epiglottis and bolus and to carry out the 2D curve registration may be needed for a more comprehensive functional data analysis for dysphagia with big data.
Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining.

PubMed

Hettne, Kristina M; Williams, Antony J; van Mulligen, Erik M; Kleinjans, Jos; Tkachenko, Valery; Kors, Jan A

2010-03-23

Previously, we developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text based on a number of publicly available databases and tested it on an annotated corpus. To achieve an acceptable recall and precision we used a number of automatic and semi-automatic processing steps together with disambiguation rules. However, it remained to be investigated which impact an extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text. ChemSpider is a chemical database that has undergone extensive manual curation aimed at establishing valid chemical name-to-structure relationships. We acquired the component of ChemSpider containing only manually curated names and synonyms. Rule-based term filtering, semi-automatic manual curation, and disambiguation rules were applied. We tested the dictionary from ChemSpider on an annotated corpus and compared the results with those for the Chemlist dictionary. The ChemSpider dictionary of ca. 80 k names was only a 1/3 to a 1/4 the size of Chemlist at around 300 k. The ChemSpider dictionary had a precision of 0.43 and a recall of 0.19 before the application of filtering and disambiguation and a precision of 0.87 and a recall of 0.19 after filtering and disambiguation. The Chemlist dictionary had a precision of 0.20 and a recall of 0.47 before the application of filtering and disambiguation and a precision of 0.67 and a recall of 0.40 after filtering and disambiguation. We conclude the following: (1) The ChemSpider dictionary achieved the best precision but the Chemlist dictionary had a higher recall and the best F-score; (2) Rule-based filtering and disambiguation is necessary to achieve a high precision for both the automatically generated and the manually curated dictionary. ChemSpider is available as a web service at http://www.chemspider.com/ and the Chemlist dictionary is freely available as an XML file in Simple Knowledge Organization System format on the web at http://www.biosemantics.org/chemlist.
Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining

PubMed Central

2010-01-01

Background Previously, we developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text based on a number of publicly available databases and tested it on an annotated corpus. To achieve an acceptable recall and precision we used a number of automatic and semi-automatic processing steps together with disambiguation rules. However, it remained to be investigated which impact an extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text. ChemSpider is a chemical database that has undergone extensive manual curation aimed at establishing valid chemical name-to-structure relationships. Results We acquired the component of ChemSpider containing only manually curated names and synonyms. Rule-based term filtering, semi-automatic manual curation, and disambiguation rules were applied. We tested the dictionary from ChemSpider on an annotated corpus and compared the results with those for the Chemlist dictionary. The ChemSpider dictionary of ca. 80 k names was only a 1/3 to a 1/4 the size of Chemlist at around 300 k. The ChemSpider dictionary had a precision of 0.43 and a recall of 0.19 before the application of filtering and disambiguation and a precision of 0.87 and a recall of 0.19 after filtering and disambiguation. The Chemlist dictionary had a precision of 0.20 and a recall of 0.47 before the application of filtering and disambiguation and a precision of 0.67 and a recall of 0.40 after filtering and disambiguation. Conclusions We conclude the following: (1) The ChemSpider dictionary achieved the best precision but the Chemlist dictionary had a higher recall and the best F-score; (2) Rule-based filtering and disambiguation is necessary to achieve a high precision for both the automatically generated and the manually curated dictionary. ChemSpider is available as a web service at http://www.chemspider.com/ and the Chemlist dictionary is freely available as an XML file in Simple Knowledge Organization System format on the web at http://www.biosemantics.org/chemlist. PMID:20331846
MetMSLine: an automated and fully integrated pipeline for rapid processing of high-resolution LC-MS metabolomic datasets.

PubMed

Edmands, William M B; Barupal, Dinesh K; Scalbert, Augustin

2015-03-01

MetMSLine represents a complete collection of functions in the R programming language as an accessible GUI for biomarker discovery in large-scale liquid-chromatography high-resolution mass spectral datasets from acquisition through to final metabolite identification forming a backend to output from any peak-picking software such as XCMS. MetMSLine automatically creates subdirectories, data tables and relevant figures at the following steps: (i) signal smoothing, normalization, filtration and noise transformation (PreProc.QC.LSC.R); (ii) PCA and automatic outlier removal (Auto.PCA.R); (iii) automatic regression, biomarker selection, hierarchical clustering and cluster ion/artefact identification (Auto.MV.Regress.R); (iv) Biomarker-MS/MS fragmentation spectra matching and fragment/neutral loss annotation (Auto.MS.MS.match.R) and (v) semi-targeted metabolite identification based on a list of theoretical masses obtained from public databases (DBAnnotate.R). All source code and suggested parameters are available in an un-encapsulated layout on http://wmbedmands.github.io/MetMSLine/. Readme files and a synthetic dataset of both X-variables (simulated LC-MS data), Y-variables (simulated continuous variables) and metabolite theoretical masses are also available on our GitHub repository. © The Author 2014. Published by Oxford University Press.
MetMSLine: an automated and fully integrated pipeline for rapid processing of high-resolution LC–MS metabolomic datasets

PubMed Central

Edmands, William M. B.; Barupal, Dinesh K.; Scalbert, Augustin

2015-01-01

Summary: MetMSLine represents a complete collection of functions in the R programming language as an accessible GUI for biomarker discovery in large-scale liquid-chromatography high-resolution mass spectral datasets from acquisition through to final metabolite identification forming a backend to output from any peak-picking software such as XCMS. MetMSLine automatically creates subdirectories, data tables and relevant figures at the following steps: (i) signal smoothing, normalization, filtration and noise transformation (PreProc.QC.LSC.R); (ii) PCA and automatic outlier removal (Auto.PCA.R); (iii) automatic regression, biomarker selection, hierarchical clustering and cluster ion/artefact identification (Auto.MV.Regress.R); (iv) Biomarker—MS/MS fragmentation spectra matching and fragment/neutral loss annotation (Auto.MS.MS.match.R) and (v) semi-targeted metabolite identification based on a list of theoretical masses obtained from public databases (DBAnnotate.R). Availability and implementation: All source code and suggested parameters are available in an un-encapsulated layout on http://wmbedmands.github.io/MetMSLine/. Readme files and a synthetic dataset of both X-variables (simulated LC–MS data), Y-variables (simulated continuous variables) and metabolite theoretical masses are also available on our GitHub repository. Contact: ScalbertA@iarc.fr PMID:25348215
Automated prediction of protein function and detection of functional sites from structure.

PubMed

Pazos, Florencio; Sternberg, Michael J E

2004-10-12

Current structural genomics projects are yielding structures for proteins whose functions are unknown. Accordingly, there is a pressing requirement for computational methods for function prediction. Here we present PHUNCTIONER, an automatic method for structure-based function prediction using automatically extracted functional sites (residues associated to functions). The method relates proteins with the same function through structural alignments and extracts 3D profiles of conserved residues. Functional features to train the method are extracted from the Gene Ontology (GO) database. The method extracts these features from the entire GO hierarchy and hence is applicable across the whole range of function specificity. 3D profiles associated with 121 GO annotations were extracted. We tested the power of the method both for the prediction of function and for the extraction of functional sites. The success of function prediction by our method was compared with the standard homology-based method. In the zone of low sequence similarity (approximately 15%), our method assigns the correct GO annotation in 90% of the protein structures considered, approximately 20% higher than inheritance of function from the closest homologue.
K-Nearest Neighbors Relevance Annotation Model for Distance Education

ERIC Educational Resources Information Center

Ke, Xiao; Li, Shaozi; Cao, Donglin

2011-01-01

With the rapid development of Internet technologies, distance education has become a popular educational mode. In this paper, the authors propose an online image automatic annotation distance education system, which could effectively help children learn interrelations between image content and corresponding keywords. Image automatic annotation is…
A computational platform to maintain and migrate manual functional annotations for BioCyc databases.

PubMed

Walsh, Jesse R; Sen, Taner Z; Dickerson, Julie A

2014-10-12

BioCyc databases are an important resource for information on biological pathways and genomic data. Such databases represent the accumulation of biological data, some of which has been manually curated from literature. An essential feature of these databases is the continuing data integration as new knowledge is discovered. As functional annotations are improved, scalable methods are needed for curators to manage annotations without detailed knowledge of the specific design of the BioCyc database. We have developed CycTools, a software tool which allows curators to maintain functional annotations in a model organism database. This tool builds on existing software to improve and simplify annotation data imports of user provided data into BioCyc databases. Additionally, CycTools automatically resolves synonyms and alternate identifiers contained within the database into the appropriate internal identifiers. Automating steps in the manual data entry process can improve curation efforts for major biological databases. The functionality of CycTools is demonstrated by transferring GO term annotations from MaizeCyc to matching proteins in CornCyc, both maize metabolic pathway databases available at MaizeGDB, and by creating strain specific databases for metabolic engineering.
Semi-Automatic Segmentation Software for Quantitative Clinical Brain Glioblastoma Evaluation

PubMed Central

Zhu, Y; Young, G; Xue, Z; Huang, R; You, H; Setayesh, K; Hatabu, H; Cao, F; Wong, S.T.

2012-01-01

Rationale and Objectives Quantitative measurement provides essential information about disease progression and treatment response in patients with Glioblastoma multiforme (GBM). The goal of this paper is to present and validate a software pipeline for semi-automatic GBM segmentation, called AFINITI (Assisted Follow-up in NeuroImaging of Therapeutic Intervention), using clinical data from GBM patients. Materials and Methods Our software adopts the current state-of-the-art tumor segmentation algorithms and combines them into one clinically usable pipeline. Both the advantages of the traditional voxel-based and the deformable shape-based segmentation are embedded into the software pipeline. The former provides an automatic tumor segmentation scheme based on T1- and T2-weighted MR brain data, and the latter refines the segmentation results with minimal manual input. Results Twenty six clinical MR brain images of GBM patients were processed and compared with manual results. The results can be visualized using the embedded graphic user interface (GUI). Conclusion Validation results using clinical GBM data showed high correlation between the AFINITI results and manual annotation. Compared to the voxel-wise segmentation, AFINITI yielded more accurate results in segmenting the enhanced GBM from multimodality MRI data. The proposed pipeline could be used as additional information to interpret MR brain images in neuroradiology. PMID:22591720
The language of gene ontology: a Zipf's law analysis.

PubMed

Kalankesh, Leila Ranandeh; Stevens, Robert; Brass, Andy

2012-06-07

Most major genome projects and sequence databases provide a GO annotation of their data, either automatically or through human annotators, creating a large corpus of data written in the language of GO. Texts written in natural language show a statistical power law behaviour, Zipf's law, the exponent of which can provide useful information on the nature of the language being used. We have therefore explored the hypothesis that collections of GO annotations will show similar statistical behaviours to natural language. Annotations from the Gene Ontology Annotation project were found to follow Zipf's law. Surprisingly, the measured power law exponents were consistently different between annotation captured using the three GO sub-ontologies in the corpora (function, process and component). On filtering the corpora using GO evidence codes we found that the value of the measured power law exponent responded in a predictable way as a function of the evidence codes used to support the annotation. Techniques from computational linguistics can provide new insights into the annotation process. GO annotations show similar statistical behaviours to those seen in natural language with measured exponents that provide a signal which correlates with the nature of the evidence codes used to support the annotations, suggesting that the measured exponent might provide a signal regarding the information content of the annotation.
Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases.

PubMed

Gobeill, Julien; Pasche, Emilie; Vishnyakova, Dina; Ruch, Patrick

2013-01-01

The available curated data lag behind current biological knowledge contained in the literature. Text mining can assist biologists and curators to locate and access this knowledge, for instance by characterizing the functional profile of publications. Gene Ontology (GO) category assignment in free text already supports various applications, such as powering ontology-based search engines, finding curation-relevant articles (triage) or helping the curator to identify and encode functions. Popular text mining tools for GO classification are based on so called thesaurus-based--or dictionary-based--approaches, which exploit similarities between the input text and GO terms themselves. But their effectiveness remains limited owing to the complex nature of GO terms, which rarely occur in text. In contrast, machine learning approaches exploit similarities between the input text and already curated instances contained in a knowledge base to infer a functional profile. GO Annotations (GOA) and MEDLINE make possible to exploit a growing amount of curated abstracts (97 000 in November 2012) for populating this knowledge base. Our study compares a state-of-the-art thesaurus-based system with a machine learning system (based on a k-Nearest Neighbours algorithm) for the task of proposing a functional profile for unseen MEDLINE abstracts, and shows how resources and performances have evolved. Systems are evaluated on their ability to propose for a given abstract the GO terms (2.8 on average) used for curation in GOA. We show that since 2006, although a massive effort was put into adding synonyms in GO (+300%), our thesaurus-based system effectiveness is rather constant, reaching from 0.28 to 0.31 for Recall at 20 (R20). In contrast, thanks to its knowledge base growth, our machine learning system has steadily improved, reaching from 0.38 in 2006 to 0.56 for R20 in 2012. Integrated in semi-automatic workflows or in fully automatic pipelines, such systems are more and more efficient to provide assistance to biologists. DATABASE URL: http://eagl.unige.ch/GOCat/
Automatic annotation of histopathological images using a latent topic model based on non-negative matrix factorization

PubMed Central

Cruz-Roa, Angel; Díaz, Gloria; Romero, Eduardo; González, Fabio A.

2011-01-01

Histopathological images are an important resource for clinical diagnosis and biomedical research. From an image understanding point of view, the automatic annotation of these images is a challenging problem. This paper presents a new method for automatic histopathological image annotation based on three complementary strategies, first, a part-based image representation, called the bag of features, which takes advantage of the natural redundancy of histopathological images for capturing the fundamental patterns of biological structures, second, a latent topic model, based on non-negative matrix factorization, which captures the high-level visual patterns hidden in the image, and, third, a probabilistic annotation model that links visual appearance of morphological and architectural features associated to 10 histopathological image annotations. The method was evaluated using 1,604 annotated images of skin tissues, which included normal and pathological architectural and morphological features, obtaining a recall of 74% and a precision of 50%, which improved a baseline annotation method based on support vector machines in a 64% and 24%, respectively. PMID:22811960
Automatic Annotation Method on Learners' Opinions in Case Method Discussion

ERIC Educational Resources Information Center

Samejima, Masaki; Hisakane, Daichi; Komoda, Norihisa

2015-01-01

Purpose: The purpose of this paper is to annotate an attribute of a problem, a solution or no annotation on learners' opinions automatically for supporting the learners' discussion without a facilitator. The case method aims at discussing problems and solutions in a target case. However, the learners miss discussing some of problems and solutions.…
Biotea: semantics for Pubmed Central.

PubMed

Garcia, Alexander; Lopez, Federico; Garcia, Leyla; Giraldo, Olga; Bucheli, Victor; Dumontier, Michel

2018-01-01

A significant portion of biomedical literature is represented in a manner that makes it difficult for consumers to find or aggregate content through a computational query. One approach to facilitate reuse of the scientific literature is to structure this information as linked data using standardized web technologies. In this paper we present the second version of Biotea, a semantic, linked data version of the open-access subset of PubMed Central that has been enhanced with specialized annotation pipelines that uses existing infrastructure from the National Center for Biomedical Ontology. We expose our models, services, software and datasets. Our infrastructure enables manual and semi-automatic annotation, resulting data are represented as RDF-based linked data and can be readily queried using the SPARQL query language. We illustrate the utility of our system with several use cases. Our datasets, methods and techniques are available at http://biotea.github.io.
Evaluation and integration of functional annotation pipelines for newly sequenced organisms: the potato genome as a test case.

PubMed

Amar, David; Frades, Itziar; Danek, Agnieszka; Goldberg, Tatyana; Sharma, Sanjeev K; Hedley, Pete E; Proux-Wera, Estelle; Andreasson, Erik; Shamir, Ron; Tzfadia, Oren; Alexandersson, Erik

2014-12-05

For most organisms, even if their genome sequence is available, little functional information about individual genes or proteins exists. Several annotation pipelines have been developed for functional analysis based on sequence, 'omics', and literature data. However, researchers encounter little guidance on how well they perform. Here, we used the recently sequenced potato genome as a case study. The potato genome was selected since its genome is newly sequenced and it is a non-model plant even if there is relatively ample information on individual potato genes, and multiple gene expression profiles are available. We show that the automatic gene annotations of potato have low accuracy when compared to a "gold standard" based on experimentally validated potato genes. Furthermore, we evaluate six state-of-the-art annotation pipelines and show that their predictions are markedly dissimilar (Jaccard similarity coefficient of 0.27 between pipelines on average). To overcome this discrepancy, we introduce a simple GO structure-based algorithm that reconciles the predictions of the different pipelines. We show that the integrated annotation covers more genes, increases by over 50% the number of highly co-expressed GO processes, and obtains much higher agreement with the gold standard. We find that different annotation pipelines produce different results, and show how to integrate them into a unified annotation that is of higher quality than each single pipeline. We offer an improved functional annotation of both PGSC and ITAG potato gene models, as well as tools that can be applied to additional pipelines and improve annotation in other organisms. This will greatly aid future functional analysis of '-omics' datasets from potato and other organisms with newly sequenced genomes. The new potato annotations are available with this paper.
BRENDA in 2013: integrated reactions, kinetic data, enzyme function data, improved disease classification: new options and contents in BRENDA.

PubMed

Schomburg, Ida; Chang, Antje; Placzek, Sandra; Söhngen, Carola; Rother, Michael; Lang, Maren; Munaretto, Cornelia; Ulas, Susanne; Stelzer, Michael; Grote, Andreas; Scheer, Maurice; Schomburg, Dietmar

2013-01-01

The BRENDA (BRaunschweig ENzyme DAtabase) enzyme portal (http://www.brenda-enzymes.org) is the main information system of functional biochemical and molecular enzyme data and provides access to seven interconnected databases. BRENDA contains 2.7 million manually annotated data on enzyme occurrence, function, kinetics and molecular properties. Each entry is connected to a reference and the source organism. Enzyme ligands are stored with their structures and can be accessed via their names, synonyms or via a structure search. FRENDA (Full Reference ENzyme DAta) and AMENDA (Automatic Mining of ENzyme DAta) are based on text mining methods and represent a complete survey of PubMed abstracts with information on enzymes in different organisms, tissues or organelles. The supplemental database DRENDA provides more than 910 000 new EC number-disease relations in more than 510 000 references from automatic search and a classification of enzyme-disease-related information. KENDA (Kinetic ENzyme DAta), a new amendment extracts and displays kinetic values from PubMed abstracts. The integration of the EnzymeDetector offers an automatic comparison, evaluation and prediction of enzyme function annotations for prokaryotic genomes. The biochemical reaction database BKM-react contains non-redundant enzyme-catalysed and spontaneous reactions and was developed to facilitate and accelerate the construction of biochemical models.
PANDORA: keyword-based analysis of protein sets by integration of annotation sources.

PubMed

Kaplan, Noam; Vaaknin, Avishay; Linial, Michal

2003-10-01

Recent advances in high-throughput methods and the application of computational tools for automatic classification of proteins have made it possible to carry out large-scale proteomic analyses. Biological analysis and interpretation of sets of proteins is a time-consuming undertaking carried out manually by experts. We have developed PANDORA (Protein ANnotation Diagram ORiented Analysis), a web-based tool that provides an automatic representation of the biological knowledge associated with any set of proteins. PANDORA uses a unique approach of keyword-based graphical analysis that focuses on detecting subsets of proteins that share unique biological properties and the intersections of such sets. PANDORA currently supports SwissProt keywords, NCBI Taxonomy, InterPro entries and the hierarchical classification terms from ENZYME, SCOP and GO databases. The integrated study of several annotation sources simultaneously allows a representation of biological relations of structure, function, cellular location, taxonomy, domains and motifs. PANDORA is also integrated into the ProtoNet system, thus allowing testing thousands of automatically generated clusters. We illustrate how PANDORA enhances the biological understanding of large, non-uniform sets of proteins originating from experimental and computational sources, without the need for prior biological knowledge on individual proteins.
Automatic short axis orientation of the left ventricle in 3D ultrasound recordings

NASA Astrophysics Data System (ADS)

Pedrosa, João.; Heyde, Brecht; Heeren, Laurens; Engvall, Jan; Zamorano, Jose; Papachristidis, Alexandros; Edvardsen, Thor; Claus, Piet; D'hooge, Jan

2016-04-01

The recent advent of three-dimensional echocardiography has led to an increased interest from the scientific community in left ventricle segmentation frameworks for cardiac volume and function assessment. An automatic orientation of the segmented left ventricular mesh is an important step to obtain a point-to-point correspondence between the mesh and the cardiac anatomy. Furthermore, this would allow for an automatic division of the left ventricle into the standard 17 segments and, thus, fully automatic per-segment analysis, e.g. regional strain assessment. In this work, a method for fully automatic short axis orientation of the segmented left ventricle is presented. The proposed framework aims at detecting the inferior right ventricular insertion point. 211 three-dimensional echocardiographic images were used to validate this framework by comparison to manual annotation of the inferior right ventricular insertion point. A mean unsigned error of 8, 05° +/- 18, 50° was found, whereas the mean signed error was 1, 09°. Large deviations between the manual and automatic annotations (> 30°) only occurred in 3, 79% of cases. The average computation time was 666ms in a non-optimized MATLAB environment, which potentiates real-time application. In conclusion, a successful automatic real-time method for orientation of the segmented left ventricle is proposed.

PFAAT version 2.0: a tool for editing, annotating, and analyzing multiple sequence alignments.

PubMed

Caffrey, Daniel R; Dana, Paul H; Mathur, Vidhya; Ocano, Marco; Hong, Eun-Jong; Wang, Yaoyu E; Somaroo, Shyamal; Caffrey, Brian E; Potluri, Shobha; Huang, Enoch S

2007-10-11

By virtue of their shared ancestry, homologous sequences are similar in their structure and function. Consequently, multiple sequence alignments are routinely used to identify trends that relate to function. This type of analysis is particularly productive when it is combined with structural and phylogenetic analysis. Here we describe the release of PFAAT version 2.0, a tool for editing, analyzing, and annotating multiple sequence alignments. Support for multiple annotations is a key component of this release as it provides a framework for most of the new functionalities. The sequence annotations are accessible from the alignment and tree, where they are typically used to label sequences or hyperlink them to related databases. Sequence annotations can be created manually or extracted automatically from UniProt entries. Once a multiple sequence alignment is populated with sequence annotations, sequences can be easily selected and sorted through a sophisticated search dialog. The selected sequences can be further analyzed using statistical methods that explicitly model relationships between the sequence annotations and residue properties. Residue annotations are accessible from the alignment viewer and are typically used to designate binding sites or properties for a particular residue. Residue annotations are also searchable, and allow one to quickly select alignment columns for further sequence analysis, e.g. computing percent identities. Other features include: novel algorithms to compute sequence conservation, mapping conservation scores to a 3D structure in Jmol, displaying secondary structure elements, and sorting sequences by residue composition. PFAAT provides a framework whereby end-users can specify knowledge for a protein family in the form of annotation. The annotations can be combined with sophisticated analysis to test hypothesis that relate to sequence, structure and function.
Nonlinear Deep Kernel Learning for Image Annotation.

PubMed

Jiu, Mingyuan; Sahbi, Hichem

2017-02-08

Multiple kernel learning (MKL) is a widely used technique for kernel design. Its principle consists in learning, for a given support vector classifier, the most suitable convex (or sparse) linear combination of standard elementary kernels. However, these combinations are shallow and often powerless to capture the actual similarity between highly semantic data, especially for challenging classification tasks such as image annotation. In this paper, we redefine multiple kernels using deep multi-layer networks. In this new contribution, a deep multiple kernel is recursively defined as a multi-layered combination of nonlinear activation functions, each one involves a combination of several elementary or intermediate kernels, and results into a positive semi-definite deep kernel. We propose four different frameworks in order to learn the weights of these networks: supervised, unsupervised, kernel-based semisupervised and Laplacian-based semi-supervised. When plugged into support vector machines (SVMs), the resulting deep kernel networks show clear gain, compared to several shallow kernels for the task of image annotation. Extensive experiments and analysis on the challenging ImageCLEF photo annotation benchmark, the COREL5k database and the Banana dataset validate the effectiveness of the proposed method.
A domain-centric solution to functional genomics via dcGO Predictor

PubMed Central

2013-01-01

Background Computational/manual annotations of protein functions are one of the first routes to making sense of a newly sequenced genome. Protein domain predictions form an essential part of this annotation process. This is due to the natural modularity of proteins with domains as structural, evolutionary and functional units. Sometimes two, three, or more adjacent domains (called supra-domains) are the operational unit responsible for a function, e.g. via a binding site at the interface. These supra-domains have contributed to functional diversification in higher organisms. Traditionally functional ontologies have been applied to individual proteins, rather than families of related domains and supra-domains. We expect, however, to some extent functional signals can be carried by protein domains and supra-domains, and consequently used in function prediction and functional genomics. Results Here we present a domain-centric Gene Ontology (dcGO) perspective. We generalize a framework for automatically inferring ontological terms associated with domains and supra-domains from full-length sequence annotations. This general framework has been applied specifically to primary protein-level annotations from UniProtKB-GOA, generating GO term associations with SCOP domains and supra-domains. The resulting 'dcGO Predictor', can be used to provide functional annotation to protein sequences. The functional annotation of sequences in the Critical Assessment of Function Annotation (CAFA) has been used as a valuable opportunity to validate our method and to be assessed by the community. The functional annotation of all completely sequenced genomes has demonstrated the potential for domain-centric GO enrichment analysis to yield functional insights into newly sequenced or yet-to-be-annotated genomes. This generalized framework we have presented has also been applied to other domain classifications such as InterPro and Pfam, and other ontologies such as mammalian phenotype and disease ontology. The dcGO and its predictor are available at http://supfam.org/SUPERFAMILY/dcGO including an enrichment analysis tool. Conclusions As functional units, domains offer a unique perspective on function prediction regardless of whether proteins are multi-domain or single-domain. The 'dcGO Predictor' holds great promise for contributing to a domain-centric functional understanding of genomes in the next generation sequencing era. PMID:23514627
Standardized evaluation framework for evaluating coronary artery stenosis detection, stenosis quantification and lumen segmentation algorithms in computed tomography angiography.

PubMed

Kirişli, H A; Schaap, M; Metz, C T; Dharampal, A S; Meijboom, W B; Papadopoulou, S L; Dedic, A; Nieman, K; de Graaf, M A; Meijs, M F L; Cramer, M J; Broersen, A; Cetin, S; Eslami, A; Flórez-Valencia, L; Lor, K L; Matuszewski, B; Melki, I; Mohr, B; Oksüz, I; Shahzad, R; Wang, C; Kitslaar, P H; Unal, G; Katouzian, A; Örkisz, M; Chen, C M; Precioso, F; Najman, L; Masood, S; Ünay, D; van Vliet, L; Moreno, R; Goldenberg, R; Vuçini, E; Krestin, G P; Niessen, W J; van Walsum, T

2013-12-01

Though conventional coronary angiography (CCA) has been the standard of reference for diagnosing coronary artery disease in the past decades, computed tomography angiography (CTA) has rapidly emerged, and is nowadays widely used in clinical practice. Here, we introduce a standardized evaluation framework to reliably evaluate and compare the performance of the algorithms devised to detect and quantify the coronary artery stenoses, and to segment the coronary artery lumen in CTA data. The objective of this evaluation framework is to demonstrate the feasibility of dedicated algorithms to: (1) (semi-)automatically detect and quantify stenosis on CTA, in comparison with quantitative coronary angiography (QCA) and CTA consensus reading, and (2) (semi-)automatically segment the coronary lumen on CTA, in comparison with expert's manual annotation. A database consisting of 48 multicenter multivendor cardiac CTA datasets with corresponding reference standards are described and made available. The algorithms from 11 research groups were quantitatively evaluated and compared. The results show that (1) some of the current stenosis detection/quantification algorithms may be used for triage or as a second-reader in clinical practice, and that (2) automatic lumen segmentation is possible with a precision similar to that obtained by experts. The framework is open for new submissions through the website, at http://coronary.bigr.nl/stenoses/. Copyright © 2013 Elsevier B.V. All rights reserved.
MicroScope: a platform for microbial genome annotation and comparative genomics

PubMed Central

Vallenet, D.; Engelen, S.; Mornico, D.; Cruveiller, S.; Fleury, L.; Lajus, A.; Rouy, Z.; Roche, D.; Salvignol, G.; Scarpelli, C.; Médigue, C.

2009-01-01

The initial outcome of genome sequencing is the creation of long text strings written in a four letter alphabet. The role of in silico sequence analysis is to assist biologists in the act of associating biological knowledge with these sequences, allowing investigators to make inferences and predictions that can be tested experimentally. A wide variety of software is available to the scientific community, and can be used to identify genomic objects, before predicting their biological functions. However, only a limited number of biologically interesting features can be revealed from an isolated sequence. Comparative genomics tools, on the other hand, by bringing together the information contained in numerous genomes simultaneously, allow annotators to make inferences based on the idea that evolution and natural selection are central to the definition of all biological processes. We have developed the MicroScope platform in order to offer a web-based framework for the systematic and efficient revision of microbial genome annotation and comparative analysis (http://www.genoscope.cns.fr/agc/microscope). Starting with the description of the flow chart of the annotation processes implemented in the MicroScope pipeline, and the development of traditional and novel microbial annotation and comparative analysis tools, this article emphasizes the essential role of expert annotation as a complement of automatic annotation. Several examples illustrate the use of implemented tools for the review and curation of annotations of both new and publicly available microbial genomes within MicroScope’s rich integrated genome framework. The platform is used as a viewer in order to browse updated annotation information of available microbial genomes (more than 440 organisms to date), and in the context of new annotation projects (117 bacterial genomes). The human expertise gathered in the MicroScope database (about 280,000 independent annotations) contributes to improve the quality of microbial genome annotation, especially for genomes initially analyzed by automatic procedures alone. Database URLs: http://www.genoscope.cns.fr/agc/mage and http://www.genoscope.cns.fr/agc/microcyc PMID:20157493
MicroScope: a platform for microbial genome annotation and comparative genomics.

PubMed

Vallenet, D; Engelen, S; Mornico, D; Cruveiller, S; Fleury, L; Lajus, A; Rouy, Z; Roche, D; Salvignol, G; Scarpelli, C; Médigue, C

2009-01-01

The initial outcome of genome sequencing is the creation of long text strings written in a four letter alphabet. The role of in silico sequence analysis is to assist biologists in the act of associating biological knowledge with these sequences, allowing investigators to make inferences and predictions that can be tested experimentally. A wide variety of software is available to the scientific community, and can be used to identify genomic objects, before predicting their biological functions. However, only a limited number of biologically interesting features can be revealed from an isolated sequence. Comparative genomics tools, on the other hand, by bringing together the information contained in numerous genomes simultaneously, allow annotators to make inferences based on the idea that evolution and natural selection are central to the definition of all biological processes. We have developed the MicroScope platform in order to offer a web-based framework for the systematic and efficient revision of microbial genome annotation and comparative analysis (http://www.genoscope.cns.fr/agc/microscope). Starting with the description of the flow chart of the annotation processes implemented in the MicroScope pipeline, and the development of traditional and novel microbial annotation and comparative analysis tools, this article emphasizes the essential role of expert annotation as a complement of automatic annotation. Several examples illustrate the use of implemented tools for the review and curation of annotations of both new and publicly available microbial genomes within MicroScope's rich integrated genome framework. The platform is used as a viewer in order to browse updated annotation information of available microbial genomes (more than 440 organisms to date), and in the context of new annotation projects (117 bacterial genomes). The human expertise gathered in the MicroScope database (about 280,000 independent annotations) contributes to improve the quality of microbial genome annotation, especially for genomes initially analyzed by automatic procedures alone.Database URLs: http://www.genoscope.cns.fr/agc/mage and http://www.genoscope.cns.fr/agc/microcyc.
Homology to peptide pattern for annotation of carbohydrate-active enzymes and prediction of function.

PubMed

Busk, P K; Pilgaard, B; Lezyk, M J; Meyer, A S; Lange, L

2017-04-12

Carbohydrate-active enzymes are found in all organisms and participate in key biological processes. These enzymes are classified in 274 families in the CAZy database but the sequence diversity within each family makes it a major task to identify new family members and to provide basis for prediction of enzyme function. A fast and reliable method for de novo annotation of genes encoding carbohydrate-active enzymes is to identify conserved peptides in the curated enzyme families followed by matching of the conserved peptides to the sequence of interest as demonstrated for the glycosyl hydrolase and the lytic polysaccharide monooxygenase families. This approach not only assigns the enzymes to families but also provides functional prediction of the enzymes with high accuracy. We identified conserved peptides for all enzyme families in the CAZy database with Peptide Pattern Recognition. The conserved peptides were matched to protein sequence for de novo annotation and functional prediction of carbohydrate-active enzymes with the Hotpep method. Annotation of protein sequences from 12 bacterial and 16 fungal genomes to families with Hotpep had an accuracy of 0.84 (measured as F1-score) compared to semiautomatic annotation by the CAZy database whereas the dbCAN HMM-based method had an accuracy of 0.77 with optimized parameters. Furthermore, Hotpep provided a functional prediction with 86% accuracy for the annotated genes. Hotpep is available as a stand-alone application for MS Windows. Hotpep is a state-of-the-art method for automatic annotation and functional prediction of carbohydrate-active enzymes.
Modeling loosely annotated images using both given and imagined annotations

NASA Astrophysics Data System (ADS)

Tang, Hong; Boujemaa, Nozha; Chen, Yunhao; Deng, Lei

2011-12-01

In this paper, we present an approach to learn latent semantic analysis models from loosely annotated images for automatic image annotation and indexing. The given annotation in training images is loose due to: 1. ambiguous correspondences between visual features and annotated keywords; 2. incomplete lists of annotated keywords. The second reason motivates us to enrich the incomplete annotation in a simple way before learning a topic model. In particular, some ``imagined'' keywords are poured into the incomplete annotation through measuring similarity between keywords in terms of their co-occurrence. Then, both given and imagined annotations are employed to learn probabilistic topic models for automatically annotating new images. We conduct experiments on two image databases (i.e., Corel and ESP) coupled with their loose annotations, and compare the proposed method with state-of-the-art discrete annotation methods. The proposed method improves word-driven probability latent semantic analysis (PLSA-words) up to a comparable performance with the best discrete annotation method, while a merit of PLSA-words is still kept, i.e., a wider semantic range.
Application of MPEG-7 descriptors for content-based indexing of sports videos

NASA Astrophysics Data System (ADS)

Hoeynck, Michael; Auweiler, Thorsten; Ohm, Jens-Rainer

2003-06-01

The amount of multimedia data available worldwide is increasing every day. There is a vital need to annotate multimedia data in order to allow universal content access and to provide content-based search-and-retrieval functionalities. Since supervised video annotation can be time consuming, an automatic solution is appreciated. We review recent approaches to content-based indexing and annotation of videos for different kind of sports, and present our application for the automatic annotation of equestrian sports videos. Thereby, we especially concentrate on MPEG-7 based feature extraction and content description. We apply different visual descriptors for cut detection. Further, we extract the temporal positions of single obstacles on the course by analyzing MPEG-7 edge information and taking specific domain knowledge into account. Having determined single shot positions as well as the visual highlights, the information is jointly stored together with additional textual information in an MPEG-7 description scheme. Using this information, we generate content summaries which can be utilized in a user front-end in order to provide content-based access to the video stream, but further content-based queries and navigation on a video-on-demand streaming server.
EuCAP, a Eukaryotic Community Annotation Package, and its application to the rice genome

PubMed Central

Thibaud-Nissen, Françoise; Campbell, Matthew; Hamilton, John P; Zhu, Wei; Buell, C Robin

2007-01-01

Background Despite the improvements of tools for automated annotation of genome sequences, manual curation at the structural and functional level can provide an increased level of refinement to genome annotation. The Institute for Genomic Research Rice Genome Annotation (hereafter named the Osa1 Genome Annotation) is the product of an automated pipeline and, for this reason, will benefit from the input of biologists with expertise in rice and/or particular gene families. Leveraging knowledge from a dispersed community of scientists is a demonstrated way of improving a genome annotation. This requires tools that facilitate 1) the submission of gene annotation to an annotation project, 2) the review of the submitted models by project annotators, and 3) the incorporation of the submitted models in the ongoing annotation effort. Results We have developed the Eukaryotic Community Annotation Package (EuCAP), an annotation tool, and have applied it to the rice genome. The primary level of curation by community annotators (CA) has been the annotation of gene families. Annotation can be submitted by email or through the EuCAP Web Tool. The CA models are aligned to the rice pseudomolecules and the coordinates of these alignments, along with functional annotation, are stored in the MySQL EuCAP Gene Model database. Web pages displaying the alignments of the CA models to the Osa1 Genome models are automatically generated from the EuCAP Gene Model database. The alignments are reviewed by the project annotators (PAs) in the context of experimental evidence. Upon approval by the PAs, the CA models, along with the corresponding functional annotations, are integrated into the Osa1 Genome Annotation. The CA annotations, grouped by family, are displayed on the Community Annotation pages of the project website , as well as in the Community Annotation track of the Genome Browser. Conclusion We have applied EuCAP to rice. As of July 2007, the structural and/or functional annotation of 1,094 genes representing 57 families have been deposited and integrated into the current gene set. All of the EuCAP components are open-source, thereby allowing the implementation of EuCAP for the annotation of other genomes. EuCAP is available at . PMID:17961238
An ontology-based annotation of cardiac implantable electronic devices to detect therapy changes in a national registry.

PubMed

Rosier, Arnaud; Mabo, Philippe; Chauvin, Michel; Burgun, Anita

2015-05-01

The patient population benefitting from cardiac implantable electronic devices (CIEDs) is increasing. This study introduces a device annotation method that supports the consistent description of the functional attributes of cardiac devices and evaluates how this method can detect device changes from a CIED registry. We designed the Cardiac Device Ontology, an ontology of CIEDs and device functions. We annotated 146 cardiac devices with this ontology and used it to detect therapy changes with respect to atrioventricular pacing, cardiac resynchronization therapy, and defibrillation capability in a French national registry of patients with implants (STIDEFIX). We then analyzed a set of 6905 device replacements from the STIDEFIX registry. Ontology-based identification of therapy changes (upgraded, downgraded, or similar) was accurate (6905 cases) and performed better than straightforward analysis of the registry codes (F-measure 1.00 versus 0.75 to 0.97). This study demonstrates the feasibility and effectiveness of ontology-based functional annotation of devices in the cardiac domain. Such annotation allowed a better description and in-depth analysis of STIDEFIX. This method was useful for the automatic detection of therapy changes and may be reused for analyzing data from other device registries.
larvalign: Aligning Gene Expression Patterns from the Larval Brain of Drosophila melanogaster.

PubMed

Muenzing, Sascha E A; Strauch, Martin; Truman, James W; Bühler, Katja; Thum, Andreas S; Merhof, Dorit

2018-01-01

The larval brain of the fruit fly Drosophila melanogaster is a small, tractable model system for neuroscience. Genes for fluorescent marker proteins can be expressed in defined, spatially restricted neuron populations. Here, we introduce the methods for 1) generating a standard template of the larval central nervous system (CNS), 2) spatial mapping of expression patterns from different larvae into a reference space defined by the standard template. We provide a manually annotated gold standard that serves for evaluation of the registration framework involved in template generation and mapping. A method for registration quality assessment enables the automatic detection of registration errors, and a semi-automatic registration method allows one to correct registrations, which is a prerequisite for a high-quality, curated database of expression patterns. All computational methods are available within the larvalign software package: https://github.com/larvalign/larvalign/releases/tag/v1.0.
Real-time pulse oximetry artifact annotation on computerized anaesthetic records.

PubMed

Gostt, Richard Karl; Rathbone, Graeme Dennis; Tucker, Adam Paul

2002-01-01

Adoption of computerised anaesthesia record keeping systems has been limited by the concern that they record artifactual data and accurate data indiscriminately. Data resulting from artifacts does not reflect the patient's true condition and presents a problem in later analysis of the record, with associated medico-legal implications. This study developed an algorithm to automatically annotate pulse oximetry artifacts and sought to evaluate the algorithm's accuracy in routine surgical procedures. MacAnaesthetist is a semi-automatic anaesthetic record keeping system developed for the Apple Macintosh computer, which incorporated an algorithm designed to automatically detect pulse oximetry artifacts. The algorithm labeled artifactual oxygen saturation values < 90%. This was done in real-time by analyzing physiological data captured from a Datex AS/3 Anaesthesia Monitor. An observational study was conducted to evaluate the accuracy of the algorithm during routine surgical procedures (n = 20). An anaesthetic record was made by an anaesthetist using the Datex AS/3 record keeper, while a second anaesthetic record was produced in parallel using MacAnaesthetist. A copy of the Datex AS/3 record was kept for later review by a group of anaesthetists (n = 20), who judged oxygen saturation values < 90% to be either genuine or artifact. MacAnaesthetist correctly labeled 12 out of 13 oxygen saturations < 90% (92.3% accuracy). A post-operative review of the Datex AS/3 anaesthetic records (n = 8) by twenty anaesthetists resulted in 127 correct responses out of total of 200 (63.5% accuracy). The remaining Datex AS/3 records (n = 12) were not reviewed, as they did not contain any oxygen saturations <90%. The real-time artifact detection algorithm developed in this study was more accurate than anaesthetists who post-operatively reviewed records produced by an existing computerised anaesthesia record keeping system. Algorithms have the potential to more accurately identify and annotate artifacts on computerised anaesthetic records, assisting clinicians to more correctly interpret abnormal data.
Haptic exploratory behavior during object discrimination: a novel automatic annotation method.

PubMed

Jansen, Sander E M; Bergmann Tiest, Wouter M; Kappers, Astrid M L

2015-01-01

In order to acquire information concerning the geometry and material of handheld objects, people tend to execute stereotypical hand movement patterns called haptic Exploratory Procedures (EPs). Manual annotation of haptic exploration trials with these EPs is a laborious task that is affected by subjectivity, attentional lapses, and viewing angle limitations. In this paper we propose an automatic EP annotation method based on position and orientation data from motion tracking sensors placed on both hands and inside a stimulus. A set of kinematic variables is computed from these data and compared to sets of predefined criteria for each of four EPs. Whenever all criteria for a specific EP are met, it is assumed that that particular hand movement pattern was performed. This method is applied to data from an experiment where blindfolded participants haptically discriminated between objects differing in hardness, roughness, volume, and weight. In order to validate the method, its output is compared to manual annotation based on video recordings of the same trials. Although mean pairwise agreement is less between human-automatic pairs than between human-human pairs (55.7% vs 74.5%), the proposed method performs much better than random annotation (2.4%). Furthermore, each EP is linked to a specific object property for which it is optimal (e.g., Lateral Motion for roughness). We found that the percentage of trials where the expected EP was found does not differ between manual and automatic annotation. For now, this method cannot yet completely replace a manual annotation procedure. However, it could be used as a starting point that can be supplemented by manual annotation.
Semi-Supervised Learning to Identify UMLS Semantic Relations.

PubMed

Luo, Yuan; Uzuner, Ozlem

2014-01-01

The UMLS Semantic Network is constructed by experts and requires periodic expert review to update. We propose and implement a semi-supervised approach for automatically identifying UMLS semantic relations from narrative text in PubMed. Our method analyzes biomedical narrative text to collect semantic entity pairs, and extracts multiple semantic, syntactic and orthographic features for the collected pairs. We experiment with seeded k-means clustering with various distance metrics. We create and annotate a ground truth corpus according to the top two levels of the UMLS semantic relation hierarchy. We evaluate our system on this corpus and characterize the learning curves of different clustering configuration. Using KL divergence consistently performs the best on the held-out test data. With full seeding, we obtain macro-averaged F-measures above 70% for clustering the top level UMLS relations (2-way), and above 50% for clustering the second level relations (7-way).
Achieving Accurate Automatic Sleep Staging on Manually Pre-processed EEG Data Through Synchronization Feature Extraction and Graph Metrics.

PubMed

Chriskos, Panteleimon; Frantzidis, Christos A; Gkivogkli, Polyxeni T; Bamidis, Panagiotis D; Kourtidou-Papadeli, Chrysoula

2018-01-01

Sleep staging, the process of assigning labels to epochs of sleep, depending on the stage of sleep they belong, is an arduous, time consuming and error prone process as the initial recordings are quite often polluted by noise from different sources. To properly analyze such data and extract clinical knowledge, noise components must be removed or alleviated. In this paper a pre-processing and subsequent sleep staging pipeline for the sleep analysis of electroencephalographic signals is described. Two novel methods of functional connectivity estimation (Synchronization Likelihood/SL and Relative Wavelet Entropy/RWE) are comparatively investigated for automatic sleep staging through manually pre-processed electroencephalographic recordings. A multi-step process that renders signals suitable for further analysis is initially described. Then, two methods that rely on extracting synchronization features from electroencephalographic recordings to achieve computerized sleep staging are proposed, based on bivariate features which provide a functional overview of the brain network, contrary to most proposed methods that rely on extracting univariate time and frequency features. Annotation of sleep epochs is achieved through the presented feature extraction methods by training classifiers, which are in turn able to accurately classify new epochs. Analysis of data from sleep experiments on a randomized, controlled bed-rest study, which was organized by the European Space Agency and was conducted in the "ENVIHAB" facility of the Institute of Aerospace Medicine at the German Aerospace Center (DLR) in Cologne, Germany attains high accuracy rates, over 90% based on ground truth that resulted from manual sleep staging by two experienced sleep experts. Therefore, it can be concluded that the above feature extraction methods are suitable for semi-automatic sleep staging.
Achieving Accurate Automatic Sleep Staging on Manually Pre-processed EEG Data Through Synchronization Feature Extraction and Graph Metrics

PubMed Central

Chriskos, Panteleimon; Frantzidis, Christos A.; Gkivogkli, Polyxeni T.; Bamidis, Panagiotis D.; Kourtidou-Papadeli, Chrysoula

2018-01-01

Sleep staging, the process of assigning labels to epochs of sleep, depending on the stage of sleep they belong, is an arduous, time consuming and error prone process as the initial recordings are quite often polluted by noise from different sources. To properly analyze such data and extract clinical knowledge, noise components must be removed or alleviated. In this paper a pre-processing and subsequent sleep staging pipeline for the sleep analysis of electroencephalographic signals is described. Two novel methods of functional connectivity estimation (Synchronization Likelihood/SL and Relative Wavelet Entropy/RWE) are comparatively investigated for automatic sleep staging through manually pre-processed electroencephalographic recordings. A multi-step process that renders signals suitable for further analysis is initially described. Then, two methods that rely on extracting synchronization features from electroencephalographic recordings to achieve computerized sleep staging are proposed, based on bivariate features which provide a functional overview of the brain network, contrary to most proposed methods that rely on extracting univariate time and frequency features. Annotation of sleep epochs is achieved through the presented feature extraction methods by training classifiers, which are in turn able to accurately classify new epochs. Analysis of data from sleep experiments on a randomized, controlled bed-rest study, which was organized by the European Space Agency and was conducted in the “ENVIHAB” facility of the Institute of Aerospace Medicine at the German Aerospace Center (DLR) in Cologne, Germany attains high accuracy rates, over 90% based on ground truth that resulted from manual sleep staging by two experienced sleep experts. Therefore, it can be concluded that the above feature extraction methods are suitable for semi-automatic sleep staging. PMID:29628883
A fully automatic end-to-end method for content-based image retrieval of CT scans with similar liver lesion annotations.

PubMed

Spanier, A B; Caplan, N; Sosna, J; Acar, B; Joskowicz, L

2018-01-01

The goal of medical content-based image retrieval (M-CBIR) is to assist radiologists in the decision-making process by retrieving medical cases similar to a given image. One of the key interests of radiologists is lesions and their annotations, since the patient treatment depends on the lesion diagnosis. Therefore, a key feature of M-CBIR systems is the retrieval of scans with the most similar lesion annotations. To be of value, M-CBIR systems should be fully automatic to handle large case databases. We present a fully automatic end-to-end method for the retrieval of CT scans with similar liver lesion annotations. The input is a database of abdominal CT scans labeled with liver lesions, a query CT scan, and optionally one radiologist-specified lesion annotation of interest. The output is an ordered list of the database CT scans with the most similar liver lesion annotations. The method starts by automatically segmenting the liver in the scan. It then extracts a histogram-based features vector from the segmented region, learns the features' relative importance, and ranks the database scans according to the relative importance measure. The main advantages of our method are that it fully automates the end-to-end querying process, that it uses simple and efficient techniques that are scalable to large datasets, and that it produces quality retrieval results using an unannotated CT scan. Our experimental results on 9 CT queries on a dataset of 41 volumetric CT scans from the 2014 Image CLEF Liver Annotation Task yield an average retrieval accuracy (Normalized Discounted Cumulative Gain index) of 0.77 and 0.84 without/with annotation, respectively. Fully automatic end-to-end retrieval of similar cases based on image information alone, rather that on disease diagnosis, may help radiologists to better diagnose liver lesions.
MEGANTE: A Web-Based System for Integrated Plant Genome Annotation

PubMed Central

Numa, Hisataka; Itoh, Takeshi

2014-01-01

The recent advancement of high-throughput genome sequencing technologies has resulted in a considerable increase in demands for large-scale genome annotation. While annotation is a crucial step for downstream data analyses and experimental studies, this process requires substantial expertise and knowledge of bioinformatics. Here we present MEGANTE, a web-based annotation system that makes plant genome annotation easy for researchers unfamiliar with bioinformatics. Without any complicated configuration, users can perform genomic sequence annotations simply by uploading a sequence and selecting the species to query. MEGANTE automatically runs several analysis programs and integrates the results to select the appropriate consensus exon–intron structures and to predict open reading frames (ORFs) at each locus. Functional annotation, including a similarity search against known proteins and a functional domain search, are also performed for the predicted ORFs. The resultant annotation information is visualized with a widely used genome browser, GBrowse. For ease of analysis, the results can be downloaded in Microsoft Excel format. All of the query sequences and annotation results are stored on the server side so that users can access their own data from virtually anywhere on the web. The current release of MEGANTE targets 24 plant species from the Brassicaceae, Fabaceae, Musaceae, Poaceae, Salicaceae, Solanaceae, Rosaceae and Vitaceae families, and it allows users to submit a sequence up to 10 Mb in length and to save up to 100 sequences with the annotation information on the server. The MEGANTE web service is available at https://megante.dna.affrc.go.jp/. PMID:24253915
Overview of the gene ontology task at BioCreative IV.

PubMed

Mao, Yuqing; Van Auken, Kimberly; Li, Donghui; Arighi, Cecilia N; McQuilton, Peter; Hayman, G Thomas; Tweedie, Susan; Schaeffer, Mary L; Laulederkind, Stanley J F; Wang, Shur-Jen; Gobeill, Julien; Ruch, Patrick; Luu, Anh Tuan; Kim, Jung-Jae; Chiang, Jung-Hsien; Chen, Yu-De; Yang, Chia-Jung; Liu, Hongfang; Zhu, Dongqing; Li, Yanpeng; Yu, Hong; Emadzadeh, Ehsan; Gonzalez, Graciela; Chen, Jian-Ming; Dai, Hong-Jie; Lu, Zhiyong

2014-01-01

Gene ontology (GO) annotation is a common task among model organism databases (MODs) for capturing gene function data from journal articles. It is a time-consuming and labor-intensive task, and is thus often considered as one of the bottlenecks in literature curation. There is a growing need for semiautomated or fully automated GO curation techniques that will help database curators to rapidly and accurately identify gene function information in full-length articles. Despite multiple attempts in the past, few studies have proven to be useful with regard to assisting real-world GO curation. The shortage of sentence-level training data and opportunities for interaction between text-mining developers and GO curators has limited the advances in algorithm development and corresponding use in practical circumstances. To this end, we organized a text-mining challenge task for literature-based GO annotation in BioCreative IV. More specifically, we developed two subtasks: (i) to automatically locate text passages that contain GO-relevant information (a text retrieval task) and (ii) to automatically identify relevant GO terms for the genes in a given article (a concept-recognition task). With the support from five MODs, we provided teams with >4000 unique text passages that served as the basis for each GO annotation in our task data. Such evidence text information has long been recognized as critical for text-mining algorithm development but was never made available because of the high cost of curation. In total, seven teams participated in the challenge task. From the team results, we conclude that the state of the art in automatically mining GO terms from literature has improved over the past decade while much progress is still needed for computer-assisted GO curation. Future work should focus on addressing remaining technical challenges for improved performance of automatic GO concept recognition and incorporating practical benefits of text-mining tools into real-world GO annotation. http://www.biocreative.org/tasks/biocreative-iv/track-4-GO/. Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US.

Optimizing Automatic Deployment Using Non-functional Requirement Annotations

NASA Astrophysics Data System (ADS)

Kugele, Stefan; Haberl, Wolfgang; Tautschnig, Michael; Wechs, Martin

Model-driven development has become common practice in design of safety-critical real-time systems. High-level modeling constructs help to reduce the overall system complexity apparent to developers. This abstraction caters for fewer implementation errors in the resulting systems. In order to retain correctness of the model down to the software executed on a concrete platform, human faults during implementation must be avoided. This calls for an automatic, unattended deployment process including allocation, scheduling, and platform configuration.
Exogean: a framework for annotating protein-coding genes in eukaryotic genomic DNA

PubMed Central

Djebali, Sarah; Delaplace, Franck; Crollius, Hugues Roest

2006-01-01

Background Accurate and automatic gene identification in eukaryotic genomic DNA is more than ever of crucial importance to efficiently exploit the large volume of assembled genome sequences available to the community. Automatic methods have always been considered less reliable than human expertise. This is illustrated in the EGASP project, where reference annotations against which all automatic methods are measured are generated by human annotators and experimentally verified. We hypothesized that replicating the accuracy of human annotators in an automatic method could be achieved by formalizing the rules and decisions that they use, in a mathematical formalism. Results We have developed Exogean, a flexible framework based on directed acyclic colored multigraphs (DACMs) that can represent biological objects (for example, mRNA, ESTs, protein alignments, exons) and relationships between them. Graphs are analyzed to process the information according to rules that replicate those used by human annotators. Simple individual starting objects given as input to Exogean are thus combined and synthesized into complex objects such as protein coding transcripts. Conclusion We show here, in the context of the EGASP project, that Exogean is currently the method that best reproduces protein coding gene annotations from human experts, in terms of identifying at least one exact coding sequence per gene. We discuss current limitations of the method and several avenues for improvement. PMID:16925841
Text-mining-assisted biocuration workflows in Argo

PubMed Central

Rak, Rafal; Batista-Navarro, Riza Theresa; Rowley, Andrew; Carter, Jacob; Ananiadou, Sophia

2014-01-01

Biocuration activities have been broadly categorized into the selection of relevant documents, the annotation of biological concepts of interest and identification of interactions between the concepts. Text mining has been shown to have a potential to significantly reduce the effort of biocurators in all the three activities, and various semi-automatic methodologies have been integrated into curation pipelines to support them. We investigate the suitability of Argo, a workbench for building text-mining solutions with the use of a rich graphical user interface, for the process of biocuration. Central to Argo are customizable workflows that users compose by arranging available elementary analytics to form task-specific processing units. A built-in manual annotation editor is the single most used biocuration tool of the workbench, as it allows users to create annotations directly in text, as well as modify or delete annotations created by automatic processing components. Apart from syntactic and semantic analytics, the ever-growing library of components includes several data readers and consumers that support well-established as well as emerging data interchange formats such as XMI, RDF and BioC, which facilitate the interoperability of Argo with other platforms or resources. To validate the suitability of Argo for curation activities, we participated in the BioCreative IV challenge whose purpose was to evaluate Web-based systems addressing user-defined biocuration tasks. Argo proved to have the edge over other systems in terms of flexibility of defining biocuration tasks. As expected, the versatility of the workbench inevitably lengthened the time the curators spent on learning the system before taking on the task, which may have affected the usability of Argo. The participation in the challenge gave us an opportunity to gather valuable feedback and identify areas of improvement, some of which have already been introduced. Database URL: http://argo.nactem.ac.uk PMID:25037308
A robust data-driven approach for gene ontology annotation.

PubMed

Li, Yanpeng; Yu, Hong

2014-01-01

Gene ontology (GO) and GO annotation are important resources for biological information management and knowledge discovery, but the speed of manual annotation became a major bottleneck of database curation. BioCreative IV GO annotation task aims to evaluate the performance of system that automatically assigns GO terms to genes based on the narrative sentences in biomedical literature. This article presents our work in this task as well as the experimental results after the competition. For the evidence sentence extraction subtask, we built a binary classifier to identify evidence sentences using reference distance estimator (RDE), a recently proposed semi-supervised learning method that learns new features from around 10 million unlabeled sentences, achieving an F1 of 19.3% in exact match and 32.5% in relaxed match. In the post-submission experiment, we obtained 22.1% and 35.7% F1 performance by incorporating bigram features in RDE learning. In both development and test sets, RDE-based method achieved over 20% relative improvement on F1 and AUC performance against classical supervised learning methods, e.g. support vector machine and logistic regression. For the GO term prediction subtask, we developed an information retrieval-based method to retrieve the GO term most relevant to each evidence sentence using a ranking function that combined cosine similarity and the frequency of GO terms in documents, and a filtering method based on high-level GO classes. The best performance of our submitted runs was 7.8% F1 and 22.2% hierarchy F1. We found that the incorporation of frequency information and hierarchy filtering substantially improved the performance. In the post-submission evaluation, we obtained a 10.6% F1 using a simpler setting. Overall, the experimental analysis showed our approaches were robust in both the two tasks. © The Author(s) 2014. Published by Oxford University Press.
Response Evaluation of Malignant Liver Lesions After TACE/SIRT: Comparison of Manual and Semi-Automatic Measurement of Different Response Criteria in Multislice CT.

PubMed

Höink, Anna Janina; Schülke, Christoph; Koch, Raphael; Löhnert, Annika; Kammerer, Sara; Fortkamp, Rasmus; Heindel, Walter; Buerke, Boris

2017-11-01

Purpose To compare measurement precision and interobserver variability in the evaluation of hepatocellular carcinoma (HCC) and liver metastases in MSCT before and after transarterial local ablative therapies. Materials and Methods Retrospective study of 72 patients with malignant liver lesions (42 metastases; 30 HCCs) before and after therapy (43 SIRT procedures; 29 TACE procedures). Established (LAD; SAD; WHO) and vitality-based parameters (mRECIST; mLAD; mSAD; EASL) were assessed manually and semi-automatically by two readers. The relative interobserver difference (RID) and intraclass correlation coefficient (ICC) were calculated. Results The median RID for vitality-based parameters was lower from semi-automatic than from manual measurement of mLAD (manual 12.5 %; semi-automatic 3.4 %), mSAD (manual 12.7 %; semi-automatic 5.7 %) and EASL (manual 10.4 %; semi-automatic 1.8 %). The difference in established parameters was not statistically noticeable (p > 0.05). The ICCs of LAD (manual 0.984; semi-automatic 0.982), SAD (manual 0.975; semi-automatic 0.958) and WHO (manual 0.984; semi-automatic 0.978) are high, both in manual and semi-automatic measurements. The ICCs of manual measurements of mLAD (0.897), mSAD (0.844) and EASL (0.875) are lower. This decrease cannot be found in semi-automatic measurements of mLAD (0.997), mSAD (0.992) and EASL (0.998). Conclusion Vitality-based tumor measurements of HCC and metastases after transarterial local therapies should be performed semi-automatically due to greater measurement precision, thus increasing the reproducibility and in turn the reliability of therapeutic decisions. Key points · Liver lesion measurements according to EASL and mRECIST are more precise when performed semi-automatically.. · The higher reproducibility may facilitate a more reliable classification of therapy response.. · Measurements according to RECIST and WHO offer equivalent precision semi-automatically and manually.. Citation Format · Höink AJ, Schülke C, Koch R et al. Response Evaluation of Malignant Liver Lesions After TACE/SIRT: Comparison of Manual and Semi-Automatic Measurement of Different Response Criteria in Multislice CT. Fortschr Röntgenstr 2017; 189: 1067 - 1075. © Georg Thieme Verlag KG Stuttgart · New York.
Design and implementation of a database for Brucella melitensis genome annotation.

PubMed

De Hertogh, Benoît; Lahlimi, Leïla; Lambert, Christophe; Letesson, Jean-Jacques; Depiereux, Eric

2008-03-18

The genome sequences of three Brucella biovars and of some species close to Brucella sp. have become available, leading to new relationship analysis. Moreover, the automatic genome annotation of the pathogenic bacteria Brucella melitensis has been manually corrected by a consortium of experts, leading to 899 modifications of start sites predictions among the 3198 open reading frames (ORFs) examined. This new annotation, coupled with the results of automatic annotation tools of the complete genome sequences of the B. melitensis genome (including BLASTs to 9 genomes close to Brucella), provides numerous data sets related to predicted functions, biochemical properties and phylogenic comparisons. To made these results available, alphaPAGe, a functional auto-updatable database of the corrected sequence genome of B. melitensis, has been built, using the entity-relationship (ER) approach and a multi-purpose database structure. A friendly graphical user interface has been designed, and users can carry out different kinds of information by three levels of queries: (1) the basic search use the classical keywords or sequence identifiers; (2) the original advanced search engine allows to combine (by using logical operators) numerous criteria: (a) keywords (textual comparison) related to the pCDS's function, family domains and cellular localization; (b) physico-chemical characteristics (numerical comparison) such as isoelectric point or molecular weight and structural criteria such as the nucleic length or the number of transmembrane helix (TMH); (c) similarity scores with Escherichia coli and 10 species phylogenetically close to B. melitensis; (3) complex queries can be performed by using a SQL field, which allows all queries respecting the database's structure. The database is publicly available through a Web server at the following url: http://www.fundp.ac.be/urbm/bioinfo/aPAGe.
Morphosyntactic annotation of CHILDES transcripts*

PubMed Central

SAGAE, KENJI; DAVIS, ERIC; LAVIE, ALON; MACWHINNEY, BRIAN; WINTNER, SHULY

2014-01-01

Corpora of child language are essential for research in child language acquisition and psycholinguistics. Linguistic annotation of the corpora provides researchers with better means for exploring the development of grammatical constructions and their usage. We describe a project whose goal is to annotate the English section of the CHILDES database with grammatical relations in the form of labeled dependency structures. We have produced a corpus of over 18,800 utterances (approximately 65,000 words) with manually curated gold-standard grammatical relation annotations. Using this corpus, we have developed a highly accurate data-driven parser for the English CHILDES data, which we used to automatically annotate the remainder of the English section of CHILDES. We have also extended the parser to Spanish, and are currently working on supporting more languages. The parser and the manually and automatically annotated data are freely available for research purposes. PMID:20334720
Literature classification for semi-automated updating of biological knowledgebases

PubMed Central

2013-01-01

Background As the output of biological assays increase in resolution and volume, the body of specialized biological data, such as functional annotations of gene and protein sequences, enables extraction of higher-level knowledge needed for practical application in bioinformatics. Whereas common types of biological data, such as sequence data, are extensively stored in biological databases, functional annotations, such as immunological epitopes, are found primarily in semi-structured formats or free text embedded in primary scientific literature. Results We defined and applied a machine learning approach for literature classification to support updating of TANTIGEN, a knowledgebase of tumor T-cell antigens. Abstracts from PubMed were downloaded and classified as either "relevant" or "irrelevant" for database update. Training and five-fold cross-validation of a k-NN classifier on 310 abstracts yielded classification accuracy of 0.95, thus showing significant value in support of data extraction from the literature. Conclusion We here propose a conceptual framework for semi-automated extraction of epitope data embedded in scientific literature using principles from text mining and machine learning. The addition of such data will aid in the transition of biological databases to knowledgebases. PMID:24564403
Automatic Tool for Local Assembly Structures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Whole community shotgun sequencing of total DNA (i.e. metagenomics) and total RNA (i.e. metatranscriptomics) has provided a wealth of information in the microbial community structure, predicted functions, metabolic networks, and is even able to reconstruct complete genomes directly. Here we present ATLAS (Automatic Tool for Local Assembly Structures) a comprehensive pipeline for assembly, annotation, genomic binning of metagenomic and metatranscriptomic data with an integrated framework for Multi-Omics. This will provide an open source tool for the Multi-Omic community at large.
Critical Assessment of Small Molecule Identification 2016: automated methods.

PubMed

Schymanski, Emma L; Ruttkies, Christoph; Krauss, Martin; Brouard, Céline; Kind, Tobias; Dührkop, Kai; Allen, Felicity; Vaniya, Arpana; Verdegem, Dries; Böcker, Sebastian; Rousu, Juho; Shen, Huibin; Tsugawa, Hiroshi; Sajed, Tanvir; Fiehn, Oliver; Ghesquière, Bart; Neumann, Steffen

2017-03-27

The fourth round of the Critical Assessment of Small Molecule Identification (CASMI) Contest ( www.casmi-contest.org ) was held in 2016, with two new categories for automated methods. This article covers the 208 challenges in Categories 2 and 3, without and with metadata, from organization, participation, results and post-contest evaluation of CASMI 2016 through to perspectives for future contests and small molecule annotation/identification. The Input Output Kernel Regression (CSI:IOKR) machine learning approach performed best in "Category 2: Best Automatic Structural Identification-In Silico Fragmentation Only", won by Team Brouard with 41% challenge wins. The winner of "Category 3: Best Automatic Structural Identification-Full Information" was Team Kind (MS-FINDER), with 76% challenge wins. The best methods were able to achieve over 30% Top 1 ranks in Category 2, with all methods ranking the correct candidate in the Top 10 in around 50% of challenges. This success rate rose to 70% Top 1 ranks in Category 3, with candidates in the Top 10 in over 80% of the challenges. The machine learning and chemistry-based approaches are shown to perform in complementary ways. The improvement in (semi-)automated fragmentation methods for small molecule identification has been substantial. The achieved high rates of correct candidates in the Top 1 and Top 10, despite large candidate numbers, open up great possibilities for high-throughput annotation of untargeted analysis for "known unknowns". As more high quality training data becomes available, the improvements in machine learning methods will likely continue, but the alternative approaches still provide valuable complementary information. Improved integration of experimental context will also improve identification success further for "real life" annotations. The true "unknown unknowns" remain to be evaluated in future CASMI contests. Graphical abstract .
Literature mining of protein-residue associations with graph rules learned through distant supervision.

PubMed

Ravikumar, Ke; Liu, Haibin; Cohn, Judith D; Wall, Michael E; Verspoor, Karin

2012-10-05

We propose a method for automatic extraction of protein-specific residue mentions from the biomedical literature. The method searches text for mentions of amino acids at specific sequence positions and attempts to correctly associate each mention with a protein also named in the text. The methods presented in this work will enable improved protein functional site extraction from articles, ultimately supporting protein function prediction. Our method made use of linguistic patterns for identifying the amino acid residue mentions in text. Further, we applied an automated graph-based method to learn syntactic patterns corresponding to protein-residue pairs mentioned in the text. We finally present an approach to automated construction of relevant training and test data using the distant supervision model. The performance of the method was assessed by extracting protein-residue relations from a new automatically generated test set of sentences containing high confidence examples found using distant supervision. It achieved a F-measure of 0.84 on automatically created silver corpus and 0.79 on a manually annotated gold data set for this task, outperforming previous methods. The primary contributions of this work are to (1) demonstrate the effectiveness of distant supervision for automatic creation of training data for protein-residue relation extraction, substantially reducing the effort and time involved in manual annotation of a data set and (2) show that the graph-based relation extraction approach we used generalizes well to the problem of protein-residue association extraction. This work paves the way towards effective extraction of protein functional residues from the literature.
Implementation of a microcontroller-based semi-automatic coagulator.

PubMed

Chan, K; Kirumira, A; Elkateeb, A

2001-01-01

The coagulator is an instrument used in hospitals to detect clot formation as a function of time. Generally, these coagulators are very expensive and therefore not affordable by a doctors' office and small clinics. The objective of this project is to design and implement a low cost semi-automatic coagulator (SAC) prototype. The SAC is capable of assaying up to 12 samples and can perform the following tests: prothrombin time (PT), activated partial thromboplastin time (APTT), and PT/APTT combination. The prototype has been tested successfully.
Automatic recognition of conceptualization zones in scientific articles and two life science applications.

PubMed

Liakata, Maria; Saha, Shyamasree; Dobnik, Simon; Batchelor, Colin; Rebholz-Schuhmann, Dietrich

2012-04-01

Scholarly biomedical publications report on the findings of a research investigation. Scientists use a well-established discourse structure to relate their work to the state of the art, express their own motivation and hypotheses and report on their methods, results and conclusions. In previous work, we have proposed ways to explicitly annotate the structure of scientific investigations in scholarly publications. Here we present the means to facilitate automatic access to the scientific discourse of articles by automating the recognition of 11 categories at the sentence level, which we call Core Scientific Concepts (CoreSCs). These include: Hypothesis, Motivation, Goal, Object, Background, Method, Experiment, Model, Observation, Result and Conclusion. CoreSCs provide the structure and context to all statements and relations within an article and their automatic recognition can greatly facilitate biomedical information extraction by characterizing the different types of facts, hypotheses and evidence available in a scientific publication. We have trained and compared machine learning classifiers (support vector machines and conditional random fields) on a corpus of 265 full articles in biochemistry and chemistry to automatically recognize CoreSCs. We have evaluated our automatic classifications against a manually annotated gold standard, and have achieved promising accuracies with 'Experiment', 'Background' and 'Model' being the categories with the highest F1-scores (76%, 62% and 53%, respectively). We have analysed the task of CoreSC annotation both from a sentence classification as well as sequence labelling perspective and we present a detailed feature evaluation. The most discriminative features are local sentence features such as unigrams, bigrams and grammatical dependencies while features encoding the document structure, such as section headings, also play an important role for some of the categories. We discuss the usefulness of automatically generated CoreSCs in two biomedical applications as well as work in progress. A web-based tool for the automatic annotation of articles with CoreSCs and corresponding documentation is available online at http://www.sapientaproject.com/software http://www.sapientaproject.com also contains detailed information pertaining to CoreSC annotation and links to annotation guidelines as well as a corpus of manually annotated articles, which served as our training data. liakata@ebi.ac.uk Supplementary data are available at Bioinformatics online.
A Case Study on Sepsis Using PubMed and Deep Learning for Ontology Learning.

PubMed

Arguello Casteleiro, Mercedes; Maseda Fernandez, Diego; Demetriou, George; Read, Warren; Fernandez Prieto, Maria Jesus; Des Diz, Julio; Nenadic, Goran; Keane, John; Stevens, Robert

2017-01-01

We investigate the application of distributional semantics models for facilitating unsupervised extraction of biomedical terms from unannotated corpora. Term extraction is used as the first step of an ontology learning process that aims to (semi-)automatic annotation of biomedical concepts and relations from more than 300K PubMed titles and abstracts. We experimented with both traditional distributional semantics methods such as Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) as well as the neural language models CBOW and Skip-gram from Deep Learning. The evaluation conducted concentrates on sepsis, a major life-threatening condition, and shows that Deep Learning models outperform LSA and LDA with much higher precision.
Management and analysis of genomic functional and phenotypic controlled annotations to support biomedical investigation and practice.

PubMed

Masseroli, Marco

2007-07-01

The growing available genomic information provides new opportunities for novel research approaches and original biomedical applications that can provide effective data management and analysis support. In fact, integration and comprehensive evaluation of available controlled data can highlight information patterns leading to unveil new biomedical knowledge. Here, we describe Genome Function INtegrated Discover (GFINDer), a Web-accessible three-tier multidatabase system we developed to automatically enrich lists of user-classified genes with several functional and phenotypic controlled annotations, and to statistically evaluate them in order to identify annotation categories significantly over- or underrepresented in each considered gene class. Genomic controlled annotations from Gene Ontology (GO), KEGG, Pfam, InterPro, and Online Mendelian Inheritance in Man (OMIM) were integrated in GFINDer and several categorical tests were implemented for their analysis. A controlled vocabulary of inherited disorder phenotypes was obtained by normalizing and hierarchically structuring disease accompanying signs and symptoms from OMIM Clinical Synopsis sections. GFINDer modular architecture is well suited for further system expansion and for sustaining increasing workload. Testing results showed that GFINDer analyses can highlight gene functional and phenotypic characteristics and differences, demonstrating its value in supporting genomic biomedical approaches aiming at understanding the complex biomolecular mechanisms underlying patho-physiological phenotypes, and in helping the transfer of genomic results to medical practice.
SADI, SHARE, and the in silico scientific method

PubMed Central

2010-01-01

Background The emergence and uptake of Semantic Web technologies by the Life Sciences provides exciting opportunities for exploring novel ways to conduct in silico science. Web Service Workflows are already becoming first-class objects in “the new way”, and serve as explicit, shareable, referenceable representations of how an experiment was done. In turn, Semantic Web Service projects aim to facilitate workflow construction by biological domain-experts such that workflows can be edited, re-purposed, and re-published by non-informaticians. However the aspects of the scientific method relating to explicit discourse, disagreement, and hypothesis generation have remained relatively impervious to new technologies. Results Here we present SADI and SHARE - a novel Semantic Web Service framework, and a reference implementation of its client libraries. Together, SADI and SHARE allow the semi- or fully-automatic discovery and pipelining of Semantic Web Services in response to ad hoc user queries. Conclusions The semantic behaviours exhibited by SADI and SHARE extend the functionalities provided by Description Logic Reasoners such that novel assertions can be automatically added to a data-set without logical reasoning, but rather by analytical or annotative services. This behaviour might be applied to achieve the “semantification” of those aspects of the in silico scientific method that are not yet supported by Semantic Web technologies. We support this suggestion using an example in the clinical research space. PMID:21210986
Fuzzy logic and image processing techniques for the interpretation of seismic data

NASA Astrophysics Data System (ADS)

Orozco-del-Castillo, M. G.; Ortiz-Alemán, C.; Urrutia-Fucugauchi, J.; Rodríguez-Castellanos, A.

2011-06-01

Since interpretation of seismic data is usually a tedious and repetitive task, the ability to do so automatically or semi-automatically has become an important objective of recent research. We believe that the vagueness and uncertainty in the interpretation process makes fuzzy logic an appropriate tool to deal with seismic data. In this work we developed a semi-automated fuzzy inference system to detect the internal architecture of a mass transport complex (MTC) in seismic images. We propose that the observed characteristics of a MTC can be expressed as fuzzy if-then rules consisting of linguistic values associated with fuzzy membership functions. The constructions of the fuzzy inference system and various image processing techniques are presented. We conclude that this is a well-suited problem for fuzzy logic since the application of the proposed methodology yields a semi-automatically interpreted MTC which closely resembles the MTC from expert manual interpretation.
Using Ontologies to Interlink Linguistic Annotations and Improve Their Accuracy

ERIC Educational Resources Information Center

Pareja-Lora, Antonio

2016-01-01

For the new approaches to language e-learning (e.g. language blended learning, language autonomous learning or mobile-assisted language learning) to succeed, some automatic functions for error correction (for instance, in exercises) will have to be included in the long run in the corresponding environments and/or applications. A possible way to…
Plexiform neurofibroma tissue classification

NASA Astrophysics Data System (ADS)

Weizman, L.; Hoch, L.; Ben Sira, L.; Joskowicz, L.; Pratt, L.; Constantini, S.; Ben Bashat, D.

2011-03-01

Plexiform Neurofibroma (PN) is a major complication of NeuroFibromatosis-1 (NF1), a common genetic disease that involving the nervous system. PNs are peripheral nerve sheath tumors extending along the length of the nerve in various parts of the body. Treatment decision is based on tumor volume assessment using MRI, which is currently time consuming and error prone, with limited semi-automatic segmentation support. We present in this paper a new method for the segmentation and tumor mass quantification of PN from STIR MRI scans. The method starts with a user-based delineation of the tumor area in a single slice and automatically detects the PN lesions in the entire image based on the tumor connectivity. Experimental results on seven datasets yield a mean volume overlap difference of 25% as compared to manual segmentation by expert radiologist with a mean computation and interaction time of 12 minutes vs. over an hour for manual annotation. Since the user interaction in the segmentation process is minimal, our method has the potential to successfully become part of the clinical workflow.
Interoperability between biomedical ontologies through relation expansion, upper-level ontologies and automatic reasoning.

PubMed

Hoehndorf, Robert; Dumontier, Michel; Oellrich, Anika; Rebholz-Schuhmann, Dietrich; Schofield, Paul N; Gkoutos, Georgios V

2011-01-01

Researchers design ontologies as a means to accurately annotate and integrate experimental data across heterogeneous and disparate data- and knowledge bases. Formal ontologies make the semantics of terms and relations explicit such that automated reasoning can be used to verify the consistency of knowledge. However, many biomedical ontologies do not sufficiently formalize the semantics of their relations and are therefore limited with respect to automated reasoning for large scale data integration and knowledge discovery. We describe a method to improve automated reasoning over biomedical ontologies and identify several thousand contradictory class definitions. Our approach aligns terms in biomedical ontologies with foundational classes in a top-level ontology and formalizes composite relations as class expressions. We describe the semi-automated repair of contradictions and demonstrate expressive queries over interoperable ontologies. Our work forms an important cornerstone for data integration, automatic inference and knowledge discovery based on formal representations of knowledge. Our results and analysis software are available at http://bioonto.de/pmwiki.php/Main/ReasonableOntologies.

Game-powered machine learning

PubMed Central

Barrington, Luke; Turnbull, Douglas; Lanckriet, Gert

2012-01-01

Searching for relevant content in a massive amount of multimedia information is facilitated by accurately annotating each image, video, or song with a large number of relevant semantic keywords, or tags. We introduce game-powered machine learning, an integrated approach to annotating multimedia content that combines the effectiveness of human computation, through online games, with the scalability of machine learning. We investigate this framework for labeling music. First, a socially-oriented music annotation game called Herd It collects reliable music annotations based on the “wisdom of the crowds.” Second, these annotated examples are used to train a supervised machine learning system. Third, the machine learning system actively directs the annotation games to collect new data that will most benefit future model iterations. Once trained, the system can automatically annotate a corpus of music much larger than what could be labeled using human computation alone. Automatically annotated songs can be retrieved based on their semantic relevance to text-based queries (e.g., “funky jazz with saxophone,” “spooky electronica,” etc.). Based on the results presented in this paper, we find that actively coupling annotation games with machine learning provides a reliable and scalable approach to making searchable massive amounts of multimedia data. PMID:22460786
Game-powered machine learning.

PubMed

Barrington, Luke; Turnbull, Douglas; Lanckriet, Gert

2012-04-24

Searching for relevant content in a massive amount of multimedia information is facilitated by accurately annotating each image, video, or song with a large number of relevant semantic keywords, or tags. We introduce game-powered machine learning, an integrated approach to annotating multimedia content that combines the effectiveness of human computation, through online games, with the scalability of machine learning. We investigate this framework for labeling music. First, a socially-oriented music annotation game called Herd It collects reliable music annotations based on the "wisdom of the crowds." Second, these annotated examples are used to train a supervised machine learning system. Third, the machine learning system actively directs the annotation games to collect new data that will most benefit future model iterations. Once trained, the system can automatically annotate a corpus of music much larger than what could be labeled using human computation alone. Automatically annotated songs can be retrieved based on their semantic relevance to text-based queries (e.g., "funky jazz with saxophone," "spooky electronica," etc.). Based on the results presented in this paper, we find that actively coupling annotation games with machine learning provides a reliable and scalable approach to making searchable massive amounts of multimedia data.
Application of semi-active RFID power meter in automatic verification pipeline and intelligent storage system

NASA Astrophysics Data System (ADS)

Chen, Xiangqun; Huang, Rui; Shen, Liman; chen, Hao; Xiong, Dezhi; Xiao, Xiangqi; Liu, Mouhai; Xu, Renheng

2018-03-01

In this paper, the semi-active RFID watt-hour meter is applied to automatic test lines and intelligent warehouse management, from the transmission system, test system and auxiliary system, monitoring system, realize the scheduling of watt-hour meter, binding, control and data exchange, and other functions, make its more accurate positioning, high efficiency of management, update the data quickly, all the information at a glance. Effectively improve the quality, efficiency and automation of verification, and realize more efficient data management and warehouse management.
Automated Detection of Microaneurysms Using Scale-Adapted Blob Analysis and Semi-Supervised Learning

DOE Office of Scientific and Technical Information (OSTI.GOV)

Adal, Kedir M.; Sidebe, Desire; Ali, Sharib

2014-01-07

Despite several attempts, automated detection of microaneurysm (MA) from digital fundus images still remains to be an open issue. This is due to the subtle nature of MAs against the surrounding tissues. In this paper, the microaneurysm detection problem is modeled as finding interest regions or blobs from an image and an automatic local-scale selection technique is presented. Several scale-adapted region descriptors are then introduced to characterize these blob regions. A semi-supervised based learning approach, which requires few manually annotated learning examples, is also proposed to train a classifier to detect true MAs. The developed system is built using onlymore » few manually labeled and a large number of unlabeled retinal color fundus images. The performance of the overall system is evaluated on Retinopathy Online Challenge (ROC) competition database. A competition performance measure (CPM) of 0.364 shows the competitiveness of the proposed system against state-of-the art techniques as well as the applicability of the proposed features to analyze fundus images.« less
Pulmonary lobar volumetry using novel volumetric computer-aided diagnosis and computed tomography

PubMed Central

Iwano, Shingo; Kitano, Mariko; Matsuo, Keiji; Kawakami, Kenichi; Koike, Wataru; Kishimoto, Mariko; Inoue, Tsutomu; Li, Yuanzhong; Naganawa, Shinji

2013-01-01

OBJECTIVES To compare the accuracy of pulmonary lobar volumetry using the conventional number of segments method and novel volumetric computer-aided diagnosis using 3D computed tomography images. METHODS We acquired 50 consecutive preoperative 3D computed tomography examinations for lung tumours reconstructed at 1-mm slice thicknesses. We calculated the lobar volume and the emphysematous lobar volume < −950 HU of each lobe using (i) the slice-by-slice method (reference standard), (ii) number of segments method, and (iii) semi-automatic and (iv) automatic computer-aided diagnosis. We determined Pearson correlation coefficients between the reference standard and the three other methods for lobar volumes and emphysematous lobar volumes. We also compared the relative errors among the three measurement methods. RESULTS Both semi-automatic and automatic computer-aided diagnosis results were more strongly correlated with the reference standard than the number of segments method. The correlation coefficients for automatic computer-aided diagnosis were slightly lower than those for semi-automatic computer-aided diagnosis because there was one outlier among 50 cases (2%) in the right upper lobe and two outliers among 50 cases (4%) in the other lobes. The number of segments method relative error was significantly greater than those for semi-automatic and automatic computer-aided diagnosis (P < 0.001). The computational time for automatic computer-aided diagnosis was 1/2 to 2/3 than that of semi-automatic computer-aided diagnosis. CONCLUSIONS A novel lobar volumetry computer-aided diagnosis system could more precisely measure lobar volumes than the conventional number of segments method. Because semi-automatic computer-aided diagnosis and automatic computer-aided diagnosis were complementary, in clinical use, it would be more practical to first measure volumes by automatic computer-aided diagnosis, and then use semi-automatic measurements if automatic computer-aided diagnosis failed. PMID:23526418
NCBI prokaryotic genome annotation pipeline.

PubMed

Tatusova, Tatiana; DiCuccio, Michael; Badretdin, Azat; Chetvernin, Vyacheslav; Nawrocki, Eric P; Zaslavsky, Leonid; Lomsadze, Alexandre; Pruitt, Kim D; Borodovsky, Mark; Ostell, James

2016-08-19

Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, the new NCBI's Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, while it relies more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/. Published by Oxford University Press on behalf of Nucleic Acids Research 2016. This work is written by (a) US Government employee(s) and is in the public domain in the US.
Semi-Automated Annotation of Biobank Data Using Standard Medical Terminologies in a Graph Database.

PubMed

Hofer, Philipp; Neururer, Sabrina; Goebel, Georg

2016-01-01

Data describing biobank resources frequently contains unstructured free-text information or insufficient coding standards. (Bio-) medical ontologies like Orphanet Rare Diseases Ontology (ORDO) or the Human Disease Ontology (DOID) provide a high number of concepts, synonyms and entity relationship properties. Such standard terminologies increase quality and granularity of input data by adding comprehensive semantic background knowledge from validated entity relationships. Moreover, cross-references between terminology concepts facilitate data integration across databases using different coding standards. In order to encourage the use of standard terminologies, our aim is to identify and link relevant concepts with free-text diagnosis inputs within a biobank registry. Relevant concepts are selected automatically by lexical matching and SPARQL queries against a RDF triplestore. To ensure correctness of annotations, proposed concepts have to be confirmed by medical data administration experts before they are entered into the registry database. Relevant (bio-) medical terminologies describing diseases and phenotypes were identified and stored in a graph database which was tied to a local biobank registry. Concept recommendations during data input trigger a structured description of medical data and facilitate data linkage between heterogeneous systems.
FIJI Macro 3D ART VeSElecT: 3D Automated Reconstruction Tool for Vesicle Structures of Electron Tomograms

PubMed Central

Kaltdorf, Kristin Verena; Schulze, Katja; Helmprobst, Frederik; Kollmannsberger, Philip; Stigloher, Christian

2017-01-01

Automatic image reconstruction is critical to cope with steadily increasing data from advanced microscopy. We describe here the Fiji macro 3D ART VeSElecT which we developed to study synaptic vesicles in electron tomograms. We apply this tool to quantify vesicle properties (i) in embryonic Danio rerio 4 and 8 days past fertilization (dpf) and (ii) to compare Caenorhabditis elegans N2 neuromuscular junctions (NMJ) wild-type and its septin mutant (unc-59(e261)). We demonstrate development-specific and mutant-specific changes in synaptic vesicle pools in both models. We confirm the functionality of our macro by applying our 3D ART VeSElecT on zebrafish NMJ showing smaller vesicles in 8 dpf embryos then 4 dpf, which was validated by manual reconstruction of the vesicle pool. Furthermore, we analyze the impact of C. elegans septin mutant unc-59(e261) on vesicle pool formation and vesicle size. Automated vesicle registration and characterization was implemented in Fiji as two macros (registration and measurement). This flexible arrangement allows in particular reducing false positives by an optional manual revision step. Preprocessing and contrast enhancement work on image-stacks of 1nm/pixel in x and y direction. Semi-automated cell selection was integrated. 3D ART VeSElecT removes interfering components, detects vesicles by 3D segmentation and calculates vesicle volume and diameter (spherical approximation, inner/outer diameter). Results are collected in color using the RoiManager plugin including the possibility of manual removal of non-matching confounder vesicles. Detailed evaluation considered performance (detected vesicles) and specificity (true vesicles) as well as precision and recall. We furthermore show gain in segmentation and morphological filtering compared to learning based methods and a large time gain compared to manual segmentation. 3D ART VeSElecT shows small error rates and its speed gain can be up to 68 times faster in comparison to manual annotation. Both automatic and semi-automatic modes are explained including a tutorial. PMID:28056033
Integrated protocol for reliable and fast quantification and documentation of electrophoresis gels.

PubMed

Rehbein, Peter; Schwalbe, Harald

2015-06-01

Quantitative analysis of electrophoresis gels is an important part in molecular cloning, as well as in protein expression and purification. Parallel quantifications in yield and purity can be most conveniently obtained from densitometric analysis. This communication reports a comprehensive, reliable and simple protocol for gel quantification and documentation, applicable for single samples and with special features for protein expression screens. As major component of the protocol, the fully annotated code of a proprietary open source computer program for semi-automatic densitometric quantification of digitized electrophoresis gels is disclosed. The program ("GelQuant") is implemented for the C-based macro-language of the widespread integrated development environment of IGOR Pro. Copyright © 2014 Elsevier Inc. All rights reserved.
Semantics-enabled service discovery framework in the SIMDAT pharma grid.

PubMed

Qu, Cangtao; Zimmermann, Falk; Kumpf, Kai; Kamuzinzi, Richard; Ledent, Valérie; Herzog, Robert

2008-03-01

We present the design and implementation of a semantics-enabled service discovery framework in the data Grids for process and product development using numerical simulation and knowledge discovery (SIMDAT) Pharma Grid, an industry-oriented Grid environment for integrating thousands of Grid-enabled biological data services and analysis services. The framework consists of three major components: the Web ontology language (OWL)-description logic (DL)-based biological domain ontology, OWL Web service ontology (OWL-S)-based service annotation, and semantic matchmaker based on the ontology reasoning. Built upon the framework, workflow technologies are extensively exploited in the SIMDAT to assist biologists in (semi)automatically performing in silico experiments. We present a typical usage scenario through the case study of a biological workflow: IXodus.
A Tutorial on Parallel and Concurrent Programming in Haskell

NASA Astrophysics Data System (ADS)

Peyton Jones, Simon; Singh, Satnam

This practical tutorial introduces the features available in Haskell for writing parallel and concurrent programs. We first describe how to write semi-explicit parallel programs by using annotations to express opportunities for parallelism and to help control the granularity of parallelism for effective execution on modern operating systems and processors. We then describe the mechanisms provided by Haskell for writing explicitly parallel programs with a focus on the use of software transactional memory to help share information between threads. Finally, we show how nested data parallelism can be used to write deterministically parallel programs which allows programmers to use rich data types in data parallel programs which are automatically transformed into flat data parallel versions for efficient execution on multi-core processors.
Clusters of orthologous genes for 41 archaeal genomes and implications for evolutionary genomics of archaea.

PubMed

Makarova, Kira S; Sorokin, Alexander V; Novichkov, Pavel S; Wolf, Yuri I; Koonin, Eugene V

2007-11-27

An evolutionary classification of genes from sequenced genomes that distinguishes between orthologs and paralogs is indispensable for genome annotation and evolutionary reconstruction. Shortly after multiple genome sequences of bacteria, archaea, and unicellular eukaryotes became available, an attempt on such a classification was implemented in Clusters of Orthologous Groups of proteins (COGs). Rapid accumulation of genome sequences creates opportunities for refining COGs but also represents a challenge because of error amplification. One of the practical strategies involves construction of refined COGs for phylogenetically compact subsets of genomes. New Archaeal Clusters of Orthologous Genes (arCOGs) were constructed for 41 archaeal genomes (13 Crenarchaeota, 27 Euryarchaeota and one Nanoarchaeon) using an improved procedure that employs a similarity tree between smaller, group-specific clusters, semi-automatically partitions orthology domains in multidomain proteins, and uses profile searches for identification of remote orthologs. The annotation of arCOGs is a consensus between three assignments based on the COGs, the CDD database, and the annotations of homologs in the NR database. The 7538 arCOGs, on average, cover approximately 88% of the genes in a genome compared to a approximately 76% coverage in COGs. The finer granularity of ortholog identification in the arCOGs is apparent from the fact that 4538 arCOGs correspond to 2362 COGs; approximately 40% of the arCOGs are new. The archaeal gene core (protein-coding genes found in all 41 genome) consists of 166 arCOGs. The arCOGs were used to reconstruct gene loss and gene gain events during archaeal evolution and gene sets of ancestral forms. The Last Archaeal Common Ancestor (LACA) is conservatively estimated to possess 996 genes compared to 1245 and 1335 genes for the last common ancestors of Crenarchaeota and Euryarchaeota, respectively. It is inferred that LACA was a chemoautotrophic hyperthermophile that, in addition to the core archaeal functions, encoded more idiosyncratic systems, e.g., the CASS systems of antivirus defense and some toxin-antitoxin systems. The arCOGs provide a convenient, flexible framework for functional annotation of archaeal genomes, comparative genomics and evolutionary reconstructions. Genomic reconstructions suggest that the last common ancestor of archaea might have been (nearly) as advanced as the modern archaeal hyperthermophiles. ArCOGs and related information are available at: ftp://ftp.ncbi.nih.gov/pub/koonin/arCOGs/.
Crowdsourcing for error detection in cortical surface delineations.

PubMed

Ganz, Melanie; Kondermann, Daniel; Andrulis, Jonas; Knudsen, Gitte Moos; Maier-Hein, Lena

2017-01-01

With the recent trend toward big data analysis, neuroimaging datasets have grown substantially in the past years. While larger datasets potentially offer important insights for medical research, one major bottleneck is the requirement for resources of medical experts needed to validate automatic processing results. To address this issue, the goal of this paper was to assess whether anonymous nonexperts from an online community can perform quality control of MR-based cortical surface delineations derived by an automatic algorithm. So-called knowledge workers from an online crowdsourcing platform were asked to annotate errors in automatic cortical surface delineations on 100 central, coronal slices of MR images. On average, annotations for 100 images were obtained in less than an hour. When using expert annotations as reference, the crowd on average achieves a sensitivity of 82 % and a precision of 42 %. Merging multiple annotations per image significantly improves the sensitivity of the crowd (up to 95 %), but leads to a decrease in precision (as low as 22 %). Our experiments show that the detection of errors in automatic cortical surface delineations generated by anonymous untrained workers is feasible. Future work will focus on increasing the sensitivity of our method further, such that the error detection tasks can be handled exclusively by the crowd and expert resources can be focused on error correction.
Support patient search on pathology reports with interactive online learning based data extraction.

PubMed

Zheng, Shuai; Lu, James J; Appin, Christina; Brat, Daniel; Wang, Fusheng

2015-01-01

Structural reporting enables semantic understanding and prompt retrieval of clinical findings about patients. While synoptic pathology reporting provides templates for data entries, information in pathology reports remains primarily in narrative free text form. Extracting data of interest from narrative pathology reports could significantly improve the representation of the information and enable complex structured queries. However, manual extraction is tedious and error-prone, and automated tools are often constructed with a fixed training dataset and not easily adaptable. Our goal is to extract data from pathology reports to support advanced patient search with a highly adaptable semi-automated data extraction system, which can adjust and self-improve by learning from a user's interaction with minimal human effort. We have developed an online machine learning based information extraction system called IDEAL-X. With its graphical user interface, the system's data extraction engine automatically annotates values for users to review upon loading each report text. The system analyzes users' corrections regarding these annotations with online machine learning, and incrementally enhances and refines the learning model as reports are processed. The system also takes advantage of customized controlled vocabularies, which can be adaptively refined during the online learning process to further assist the data extraction. As the accuracy of automatic annotation improves overtime, the effort of human annotation is gradually reduced. After all reports are processed, a built-in query engine can be applied to conveniently define queries based on extracted structured data. We have evaluated the system with a dataset of anatomic pathology reports from 50 patients. Extracted data elements include demographical data, diagnosis, genetic marker, and procedure. The system achieves F-1 scores of around 95% for the majority of tests. Extracting data from pathology reports could enable more accurate knowledge to support biomedical research and clinical diagnosis. IDEAL-X provides a bridge that takes advantage of online machine learning based data extraction and the knowledge from human's feedback. By combining iterative online learning and adaptive controlled vocabularies, IDEAL-X can deliver highly adaptive and accurate data extraction to support patient search.
Characterizing artifacts in RR stress test time series.

PubMed

Astudillo-Salinas, Fabian; Palacio-Baus, Kenneth; Solano-Quinde, Lizandro; Medina, Ruben; Wong, Sara

2016-08-01

Electrocardiographic stress test records have a lot of artifacts. In this paper we explore a simple method to characterize the amount of artifacts present in unprocessed RR stress test time series. Four time series classes were defined: Very good lead, Good lead, Low quality lead and Useless lead. 65 ECG, 8 lead, records of stress test series were analyzed. Firstly, RR-time series were annotated by two experts. The automatic methodology is based on dividing the RR-time series in non-overlapping windows. Each window is marked as noisy whenever it exceeds an established standard deviation threshold (SDT). Series are classified according to the percentage of windows that exceeds a given value, based upon the first manual annotation. Different SDT were explored. Results show that SDT close to 20% (as a percentage of the mean) provides the best results. The coincidence between annotators classification is 70.77% whereas, the coincidence between the second annotator and the automatic method providing the best matches is larger than 63%. Leads classified as Very good leads and Good leads could be combined to improve automatic heartbeat labeling.
Natural-Annotation-based Unsupervised Construction of Korean-Chinese Domain Dictionary

NASA Astrophysics Data System (ADS)

Liu, Wuying; Wang, Lin

2018-03-01

The large-scale bilingual parallel resource is significant to statistical learning and deep learning in natural language processing. This paper addresses the automatic construction issue of the Korean-Chinese domain dictionary, and presents a novel unsupervised construction method based on the natural annotation in the raw corpus. We firstly extract all Korean-Chinese word pairs from Korean texts according to natural annotations, secondly transform the traditional Chinese characters into the simplified ones, and finally distill out a bilingual domain dictionary after retrieving the simplified Chinese words in an extra Chinese domain dictionary. The experimental results show that our method can automatically build multiple Korean-Chinese domain dictionaries efficiently.
ChemBrowser: a flexible framework for mining chemical documents.

PubMed

Wu, Xian; Zhang, Li; Chen, Ying; Rhodes, James; Griffin, Thomas D; Boyer, Stephen K; Alba, Alfredo; Cai, Keke

2010-01-01

The ability to extract chemical and biological entities and relations from text documents automatically has great value to biochemical research and development activities. The growing maturity of text mining and artificial intelligence technologies shows promise in enabling such automatic chemical entity extraction capabilities (called "Chemical Annotation" in this paper). Many techniques have been reported in the literature, ranging from dictionary and rule-based techniques to machine learning approaches. In practice, we found that no single technique works well in all cases. A combinatorial approach that allows one to quickly compose different annotation techniques together for a given situation is most effective. In this paper, we describe the key challenges we face in real-world chemical annotation scenarios. We then present a solution called ChemBrowser which has a flexible framework for chemical annotation. ChemBrowser includes a suite of customizable processing units that might be utilized in a chemical annotator, a high-level language that describes the composition of various processing units that would form a chemical annotator, and an execution engine that translates the composition language to an actual annotator that can generate annotation results for a given set of documents. We demonstrate the impact of this approach by tailoring an annotator for extracting chemical names from patent documents and show how this annotator can be easily modified with simple configuration alone.
A semi-automatic computer-aided method for surgical template design

NASA Astrophysics Data System (ADS)

Chen, Xiaojun; Xu, Lu; Yang, Yue; Egger, Jan

2016-02-01

This paper presents a generalized integrated framework of semi-automatic surgical template design. Several algorithms were implemented including the mesh segmentation, offset surface generation, collision detection, ruled surface generation, etc., and a special software named TemDesigner was developed. With a simple user interface, a customized template can be semi- automatically designed according to the preoperative plan. Firstly, mesh segmentation with signed scalar of vertex is utilized to partition the inner surface from the input surface mesh based on the indicated point loop. Then, the offset surface of the inner surface is obtained through contouring the distance field of the inner surface, and segmented to generate the outer surface. Ruled surface is employed to connect inner and outer surfaces. Finally, drilling tubes are generated according to the preoperative plan through collision detection and merging. It has been applied to the template design for various kinds of surgeries, including oral implantology, cervical pedicle screw insertion, iliosacral screw insertion and osteotomy, demonstrating the efficiency, functionality and generality of our method.
A semi-automatic computer-aided method for surgical template design

PubMed Central

Chen, Xiaojun; Xu, Lu; Yang, Yue; Egger, Jan

2016-01-01

This paper presents a generalized integrated framework of semi-automatic surgical template design. Several algorithms were implemented including the mesh segmentation, offset surface generation, collision detection, ruled surface generation, etc., and a special software named TemDesigner was developed. With a simple user interface, a customized template can be semi- automatically designed according to the preoperative plan. Firstly, mesh segmentation with signed scalar of vertex is utilized to partition the inner surface from the input surface mesh based on the indicated point loop. Then, the offset surface of the inner surface is obtained through contouring the distance field of the inner surface, and segmented to generate the outer surface. Ruled surface is employed to connect inner and outer surfaces. Finally, drilling tubes are generated according to the preoperative plan through collision detection and merging. It has been applied to the template design for various kinds of surgeries, including oral implantology, cervical pedicle screw insertion, iliosacral screw insertion and osteotomy, demonstrating the efficiency, functionality and generality of our method. PMID:26843434
A semi-automatic computer-aided method for surgical template design.

PubMed

Chen, Xiaojun; Xu, Lu; Yang, Yue; Egger, Jan

2016-02-04

This paper presents a generalized integrated framework of semi-automatic surgical template design. Several algorithms were implemented including the mesh segmentation, offset surface generation, collision detection, ruled surface generation, etc., and a special software named TemDesigner was developed. With a simple user interface, a customized template can be semi- automatically designed according to the preoperative plan. Firstly, mesh segmentation with signed scalar of vertex is utilized to partition the inner surface from the input surface mesh based on the indicated point loop. Then, the offset surface of the inner surface is obtained through contouring the distance field of the inner surface, and segmented to generate the outer surface. Ruled surface is employed to connect inner and outer surfaces. Finally, drilling tubes are generated according to the preoperative plan through collision detection and merging. It has been applied to the template design for various kinds of surgeries, including oral implantology, cervical pedicle screw insertion, iliosacral screw insertion and osteotomy, demonstrating the efficiency, functionality and generality of our method.

ATLAS (Automatic Tool for Local Assembly Structures) - A Comprehensive Infrastructure for Assembly, Annotation, and Genomic Binning of Metagenomic and Metaranscripomic Data

DOE Office of Scientific and Technical Information (OSTI.GOV)

White, Richard A.; Brown, Joseph M.; Colby, Sean M.

ATLAS (Automatic Tool for Local Assembly Structures) is a comprehensive multiomics data analysis pipeline that is massively parallel and scalable. ATLAS contains a modular analysis pipeline for assembly, annotation, quantification and genome binning of metagenomics and metatranscriptomics data and a framework for reference metaproteomic database construction. ATLAS transforms raw sequence data into functional and taxonomic data at the microbial population level and provides genome-centric resolution through genome binning. ATLAS provides robust taxonomy based on majority voting of protein coding open reading frames rolled-up at the contig level using modified lowest common ancestor (LCA) analysis. ATLAS provides robust taxonomy based onmore » majority voting of protein coding open reading frames rolled-up at the contig level using modified lowest common ancestor (LCA) analysis. ATLAS is user-friendly, easy install through bioconda maintained as open-source on GitHub, and is implemented in Snakemake for modular customizable workflows.« less
Dictionary-driven protein annotation.

PubMed

Rigoutsos, Isidore; Huynh, Tien; Floratos, Aris; Parida, Laxmi; Platt, Daniel

2002-09-01

Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/ bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were released publicly after we built the Bio-Dictionary that is used in our experiments. Finally, we have computed the annotations of more than 70 complete genomes and made them available on the World Wide Web at http://cbcsrv.watson.ibm.com/Annotations/.
ISEScan: automated identification of insertion sequence elements in prokaryotic genomes.

PubMed

Xie, Zhiqun; Tang, Haixu

2017-11-01

The insertion sequence (IS) elements are the smallest but most abundant autonomous transposable elements in prokaryotic genomes, which play a key role in prokaryotic genome organization and evolution. With the fast growing genomic data, it is becoming increasingly critical for biology researchers to be able to accurately and automatically annotate ISs in prokaryotic genome sequences. The available automatic IS annotation systems are either providing only incomplete IS annotation or relying on the availability of existing genome annotations. Here, we present a new IS elements annotation pipeline to address these issues. ISEScan is a highly sensitive software pipeline based on profile hidden Markov models constructed from manually curated IS elements. ISEScan performs better than existing IS annotation systems when tested on prokaryotic genomes with curated annotations of IS elements. Applying it to 2784 prokaryotic genomes, we report the global distribution of IS families across taxonomic clades in Archaea and Bacteria. ISEScan is implemented in Python and released as an open source software at https://github.com/xiezhq/ISEScan. hatang@indiana.edu. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
MIPS: analysis and annotation of genome information in 2007

PubMed Central

Mewes, H. W.; Dietmann, S.; Frishman, D.; Gregory, R.; Mannhaupt, G.; Mayer, K. F. X.; Münsterkötter, M.; Ruepp, A.; Spannagl, M.; Stümpflen, V.; Rattei, T.

2008-01-01

The Munich Information Center for Protein Sequences (MIPS-GSF, Neuherberg, Germany) combines automatic processing of large amounts of sequences with manual annotation of selected model genomes. Due to the massive growth of the available data, the depth of annotation varies widely between independent databases. Also, the criteria for the transfer of information from known to orthologous sequences are diverse. To cope with the task of global in-depth genome annotation has become unfeasible. Therefore, our efforts are dedicated to three levels of annotation: (i) the curation of selected genomes, in particular from fungal and plant taxa (e.g. CYGD, MNCDB, MatDB), (ii) the comprehensive, consistent, automatic annotation employing exhaustive methods for the computation of sequence similarities and sequence-related attributes as well as the classification of individual sequences (SIMAP, PEDANT and FunCat) and (iii) the compilation of manually curated databases for protein interactions based on scrutinized information from the literature to serve as an accepted set of reliable annotated interaction data (MPACT, MPPI, CORUM). All databases and tools described as well as the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de). PMID:18158298
MIPS: analysis and annotation of genome information in 2007.

PubMed

Mewes, H W; Dietmann, S; Frishman, D; Gregory, R; Mannhaupt, G; Mayer, K F X; Münsterkötter, M; Ruepp, A; Spannagl, M; Stümpflen, V; Rattei, T

2008-01-01

The Munich Information Center for Protein Sequences (MIPS-GSF, Neuherberg, Germany) combines automatic processing of large amounts of sequences with manual annotation of selected model genomes. Due to the massive growth of the available data, the depth of annotation varies widely between independent databases. Also, the criteria for the transfer of information from known to orthologous sequences are diverse. To cope with the task of global in-depth genome annotation has become unfeasible. Therefore, our efforts are dedicated to three levels of annotation: (i) the curation of selected genomes, in particular from fungal and plant taxa (e.g. CYGD, MNCDB, MatDB), (ii) the comprehensive, consistent, automatic annotation employing exhaustive methods for the computation of sequence similarities and sequence-related attributes as well as the classification of individual sequences (SIMAP, PEDANT and FunCat) and (iii) the compilation of manually curated databases for protein interactions based on scrutinized information from the literature to serve as an accepted set of reliable annotated interaction data (MPACT, MPPI, CORUM). All databases and tools described as well as the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de).
On the creation of a clinical gold standard corpus in Spanish: Mining adverse drug reactions.

PubMed

Oronoz, Maite; Gojenola, Koldo; Pérez, Alicia; de Ilarraza, Arantza Díaz; Casillas, Arantza

2015-08-01

The advances achieved in Natural Language Processing make it possible to automatically mine information from electronically created documents. Many Natural Language Processing methods that extract information from texts make use of annotated corpora, but these are scarce in the clinical domain due to legal and ethical issues. In this paper we present the creation of the IxaMed-GS gold standard composed of real electronic health records written in Spanish and manually annotated by experts in pharmacology and pharmacovigilance. The experts mainly annotated entities related to diseases and drugs, but also relationships between entities indicating adverse drug reaction events. To help the experts in the annotation task, we adapted a general corpus linguistic analyzer to the medical domain. The quality of the annotation process in the IxaMed-GS corpus has been assessed by measuring the inter-annotator agreement, which was 90.53% for entities and 82.86% for events. In addition, the corpus has been used for the automatic extraction of adverse drug reaction events using machine learning. Copyright © 2015 Elsevier Inc. All rights reserved.
Comparison of topological clustering within protein networks using edge metrics that evaluate full sequence, full structure, and active site microenvironment similarity.

PubMed

Leuthaeuser, Janelle B; Knutson, Stacy T; Kumar, Kiran; Babbitt, Patricia C; Fetrow, Jacquelyn S

2015-09-01

The development of accurate protein function annotation methods has emerged as a major unsolved biological problem. Protein similarity networks, one approach to function annotation via annotation transfer, group proteins into similarity-based clusters. An underlying assumption is that the edge metric used to identify such clusters correlates with functional information. In this contribution, this assumption is evaluated by observing topologies in similarity networks using three different edge metrics: sequence (BLAST), structure (TM-Align), and active site similarity (active site profiling, implemented in DASP). Network topologies for four well-studied protein superfamilies (enolase, peroxiredoxin (Prx), glutathione transferase (GST), and crotonase) were compared with curated functional hierarchies and structure. As expected, network topology differs, depending on edge metric; comparison of topologies provides valuable information on structure/function relationships. Subnetworks based on active site similarity correlate with known functional hierarchies at a single edge threshold more often than sequence- or structure-based networks. Sequence- and structure-based networks are useful for identifying sequence and domain similarities and differences; therefore, it is important to consider the clustering goal before deciding appropriate edge metric. Further, conserved active site residues identified in enolase and GST active site subnetworks correspond with published functionally important residues. Extension of this analysis yields predictions of functionally determinant residues for GST subgroups. These results support the hypothesis that active site similarity-based networks reveal clusters that share functional details and lay the foundation for capturing functionally relevant hierarchies using an approach that is both automatable and can deliver greater precision in function annotation than current similarity-based methods. © 2015 The Authors Protein Science published by Wiley Periodicals, Inc. on behalf of The Protein Society.
Comparison of topological clustering within protein networks using edge metrics that evaluate full sequence, full structure, and active site microenvironment similarity

PubMed Central

Leuthaeuser, Janelle B; Knutson, Stacy T; Kumar, Kiran; Babbitt, Patricia C; Fetrow, Jacquelyn S

2015-01-01

The development of accurate protein function annotation methods has emerged as a major unsolved biological problem. Protein similarity networks, one approach to function annotation via annotation transfer, group proteins into similarity-based clusters. An underlying assumption is that the edge metric used to identify such clusters correlates with functional information. In this contribution, this assumption is evaluated by observing topologies in similarity networks using three different edge metrics: sequence (BLAST), structure (TM-Align), and active site similarity (active site profiling, implemented in DASP). Network topologies for four well-studied protein superfamilies (enolase, peroxiredoxin (Prx), glutathione transferase (GST), and crotonase) were compared with curated functional hierarchies and structure. As expected, network topology differs, depending on edge metric; comparison of topologies provides valuable information on structure/function relationships. Subnetworks based on active site similarity correlate with known functional hierarchies at a single edge threshold more often than sequence- or structure-based networks. Sequence- and structure-based networks are useful for identifying sequence and domain similarities and differences; therefore, it is important to consider the clustering goal before deciding appropriate edge metric. Further, conserved active site residues identified in enolase and GST active site subnetworks correspond with published functionally important residues. Extension of this analysis yields predictions of functionally determinant residues for GST subgroups. These results support the hypothesis that active site similarity-based networks reveal clusters that share functional details and lay the foundation for capturing functionally relevant hierarchies using an approach that is both automatable and can deliver greater precision in function annotation than current similarity-based methods. PMID:26073648
The Universal Protein Resource (UniProt): an expanding universe of protein information.

PubMed

Wu, Cathy H; Apweiler, Rolf; Bairoch, Amos; Natale, Darren A; Barker, Winona C; Boeckmann, Brigitte; Ferro, Serenella; Gasteiger, Elisabeth; Huang, Hongzhan; Lopez, Rodrigo; Magrane, Michele; Martin, Maria J; Mazumder, Raja; O'Donovan, Claire; Redaschi, Nicole; Suzek, Baris

2006-01-01

The Universal Protein Resource (UniProt) provides a central resource on protein sequences and functional annotation with three database components, each addressing a key need in protein bioinformatics. The UniProt Knowledgebase (UniProtKB), comprising the manually annotated UniProtKB/Swiss-Prot section and the automatically annotated UniProtKB/TrEMBL section, is the preeminent storehouse of protein annotation. The extensive cross-references, functional and feature annotations and literature-based evidence attribution enable scientists to analyse proteins and query across databases. The UniProt Reference Clusters (UniRef) speed similarity searches via sequence space compression by merging sequences that are 100% (UniRef100), 90% (UniRef90) or 50% (UniRef50) identical. Finally, the UniProt Archive (UniParc) stores all publicly available protein sequences, containing the history of sequence data with links to the source databases. UniProt databases continue to grow in size and in availability of information. Recent and upcoming changes to database contents, formats, controlled vocabularies and services are described. New download availability includes all major releases of UniProtKB, sequence collections by taxonomic division and complete proteomes. A bibliography mapping service has been added, and an ID mapping service will be available soon. UniProt databases can be accessed online at http://www.uniprot.org or downloaded at ftp://ftp.uniprot.org/pub/databases/.
The CHEMDNER corpus of chemicals and drugs and its annotation principles.

PubMed

Krallinger, Martin; Rabal, Obdulia; Leitner, Florian; Vazquez, Miguel; Salgado, David; Lu, Zhiyong; Leaman, Robert; Lu, Yanan; Ji, Donghong; Lowe, Daniel M; Sayle, Roger A; Batista-Navarro, Riza Theresa; Rak, Rafal; Huber, Torsten; Rocktäschel, Tim; Matos, Sérgio; Campos, David; Tang, Buzhou; Xu, Hua; Munkhdalai, Tsendsuren; Ryu, Keun Ho; Ramanan, S V; Nathan, Senthil; Žitnik, Slavko; Bajec, Marko; Weber, Lutz; Irmer, Matthias; Akhondi, Saber A; Kors, Jan A; Xu, Shuo; An, Xin; Sikdar, Utpal Kumar; Ekbal, Asif; Yoshioka, Masaharu; Dieb, Thaer M; Choi, Miji; Verspoor, Karin; Khabsa, Madian; Giles, C Lee; Liu, Hongfang; Ravikumar, Komandur Elayavilli; Lamurias, Andre; Couto, Francisco M; Dai, Hong-Jie; Tsai, Richard Tzong-Han; Ata, Caglar; Can, Tolga; Usié, Anabel; Alves, Rui; Segura-Bedmar, Isabel; Martínez, Paloma; Oyarzabal, Julen; Valencia, Alfonso

2015-01-01

The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative for all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text was measured using an agreement study between annotators, obtaining a percentage agreement of 91. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/.
The CHEMDNER corpus of chemicals and drugs and its annotation principles

PubMed Central

2015-01-01

The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative for all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text was measured using an agreement study between annotators, obtaining a percentage agreement of 91. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/ PMID:25810773
MitoFish and MitoAnnotator: A Mitochondrial Genome Database of Fish with an Accurate and Automatic Annotation Pipeline

PubMed Central

Iwasaki, Wataru; Fukunaga, Tsukasa; Isagozawa, Ryota; Yamada, Koichiro; Maeda, Yasunobu; Satoh, Takashi P.; Sado, Tetsuya; Mabuchi, Kohji; Takeshima, Hirohiko; Miya, Masaki; Nishida, Mutsumi

2013-01-01

Mitofish is a database of fish mitochondrial genomes (mitogenomes) that includes powerful and precise de novo annotations for mitogenome sequences. Fish occupy an important position in the evolution of vertebrates and the ecology of the hydrosphere, and mitogenomic sequence data have served as a rich source of information for resolving fish phylogenies and identifying new fish species. The importance of a mitogenomic database continues to grow at a rapid pace as massive amounts of mitogenomic data are generated with the advent of new sequencing technologies. A severe bottleneck seems likely to occur with regard to mitogenome annotation because of the overwhelming pace of data accumulation and the intrinsic difficulties in annotating sequences with degenerating transfer RNA structures, divergent start/stop codons of the coding elements, and the overlapping of adjacent elements. To ease this data backlog, we developed an annotation pipeline named MitoAnnotator. MitoAnnotator automatically annotates a fish mitogenome with a high degree of accuracy in approximately 5 min; thus, it is readily applicable to data sets of dozens of sequences. MitoFish also contains re-annotations of previously sequenced fish mitogenomes, enabling researchers to refer to them when they find annotations that are likely to be erroneous or while conducting comparative mitogenomic analyses. For users who need more information on the taxonomy, habitats, phenotypes, or life cycles of fish, MitoFish provides links to related databases. MitoFish and MitoAnnotator are freely available at http://mitofish.aori.u-tokyo.ac.jp/ (last accessed August 28, 2013); all of the data can be batch downloaded, and the annotation pipeline can be used via a web interface. PMID:23955518
Text-mining and information-retrieval services for molecular biology

PubMed Central

Krallinger, Martin; Valencia, Alfonso

2005-01-01

Text-mining in molecular biology - defined as the automatic extraction of information about genes, proteins and their functional relationships from text documents - has emerged as a hybrid discipline on the edges of the fields of information science, bioinformatics and computational linguistics. A range of text-mining applications have been developed recently that will improve access to knowledge for biologists and database annotators. PMID:15998455
COMP Superscalar, an interoperable programming framework

NASA Astrophysics Data System (ADS)

Badia, Rosa M.; Conejero, Javier; Diaz, Carlos; Ejarque, Jorge; Lezzi, Daniele; Lordan, Francesc; Ramon-Cortes, Cristian; Sirvent, Raul

2015-12-01

COMPSs is a programming framework that aims to facilitate the parallelization of existing applications written in Java, C/C++ and Python scripts. For that purpose, it offers a simple programming model based on sequential development in which the user is mainly responsible for (i) identifying the functions to be executed as asynchronous parallel tasks and (ii) annotating them with annotations or standard Python decorators. A runtime system is in charge of exploiting the inherent concurrency of the code, automatically detecting and enforcing the data dependencies between tasks and spawning these tasks to the available resources, which can be nodes in a cluster, clouds or grids. In cloud environments, COMPSs provides scalability and elasticity features allowing the dynamic provision of resources.
Incremental Ontology-Based Extraction and Alignment in Semi-structured Documents

NASA Astrophysics Data System (ADS)

Thiam, Mouhamadou; Bennacer, Nacéra; Pernelle, Nathalie; Lô, Moussa

SHIRIis an ontology-based system for integration of semi-structured documents related to a specific domain. The system’s purpose is to allow users to access to relevant parts of documents as answers to their queries. SHIRI uses RDF/OWL for representation of resources and SPARQL for their querying. It relies on an automatic, unsupervised and ontology-driven approach for extraction, alignment and semantic annotation of tagged elements of documents. In this paper, we focus on the Extract-Align algorithm which exploits a set of named entity and term patterns to extract term candidates to be aligned with the ontology. It proceeds in an incremental manner in order to populate the ontology with terms describing instances of the domain and to reduce the access to extern resources such as Web. We experiment it on a HTML corpus related to call for papers in computer science and the results that we obtain are very promising. These results show how the incremental behaviour of Extract-Align algorithm enriches the ontology and the number of terms (or named entities) aligned directly with the ontology increases.
Automated detection of microaneurysms using scale-adapted blob analysis and semi-supervised learning.

PubMed

Adal, Kedir M; Sidibé, Désiré; Ali, Sharib; Chaum, Edward; Karnowski, Thomas P; Mériaudeau, Fabrice

2014-04-01

Despite several attempts, automated detection of microaneurysm (MA) from digital fundus images still remains to be an open issue. This is due to the subtle nature of MAs against the surrounding tissues. In this paper, the microaneurysm detection problem is modeled as finding interest regions or blobs from an image and an automatic local-scale selection technique is presented. Several scale-adapted region descriptors are introduced to characterize these blob regions. A semi-supervised based learning approach, which requires few manually annotated learning examples, is also proposed to train a classifier which can detect true MAs. The developed system is built using only few manually labeled and a large number of unlabeled retinal color fundus images. The performance of the overall system is evaluated on Retinopathy Online Challenge (ROC) competition database. A competition performance measure (CPM) of 0.364 shows the competitiveness of the proposed system against state-of-the art techniques as well as the applicability of the proposed features to analyze fundus images. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
Unsupervised Decoding of Long-Term, Naturalistic Human Neural Recordings with Automated Video and Audio Annotations

PubMed Central

Wang, Nancy X. R.; Olson, Jared D.; Ojemann, Jeffrey G.; Rao, Rajesh P. N.; Brunton, Bingni W.

2016-01-01

Fully automated decoding of human activities and intentions from direct neural recordings is a tantalizing challenge in brain-computer interfacing. Implementing Brain Computer Interfaces (BCIs) outside carefully controlled experiments in laboratory settings requires adaptive and scalable strategies with minimal supervision. Here we describe an unsupervised approach to decoding neural states from naturalistic human brain recordings. We analyzed continuous, long-term electrocorticography (ECoG) data recorded over many days from the brain of subjects in a hospital room, with simultaneous audio and video recordings. We discovered coherent clusters in high-dimensional ECoG recordings using hierarchical clustering and automatically annotated them using speech and movement labels extracted from audio and video. To our knowledge, this represents the first time techniques from computer vision and speech processing have been used for natural ECoG decoding. Interpretable behaviors were decoded from ECoG data, including moving, speaking and resting; the results were assessed by comparison with manual annotation. Discovered clusters were projected back onto the brain revealing features consistent with known functional areas, opening the door to automated functional brain mapping in natural settings. PMID:27148018
Trust, control strategies and allocation of function in human-machine systems.

PubMed

Lee, J; Moray, N

1992-10-01

As automated controllers supplant human intervention in controlling complex systems, the operators' role often changes from that of an active controller to that of a supervisory controller. Acting as supervisors, operators can choose between automatic and manual control. Improperly allocating function between automatic and manual control can have negative consequences for the performance of a system. Previous research suggests that the decision to perform the job manually or automatically depends, in part, upon the trust the operators invest in the automatic controllers. This paper reports an experiment to characterize the changes in operators' trust during an interaction with a semi-automatic pasteurization plant, and investigates the relationship between changes in operators' control strategies and trust. A regression model identifies the causes of changes in trust, and a 'trust transfer function' is developed using time series analysis to describe the dynamics of trust. Based on a detailed analysis of operators' strategies in response to system faults we suggest a model for the choice between manual and automatic control, based on trust in automatic controllers and self-confidence in the ability to control the system manually.
Semi-Supervised Active Learning for Sound Classification in Hybrid Learning Environments.

PubMed

Han, Wenjing; Coutinho, Eduardo; Ruan, Huabin; Li, Haifeng; Schuller, Björn; Yu, Xiaojie; Zhu, Xuan

2016-01-01

Coping with scarcity of labeled data is a common problem in sound classification tasks. Approaches for classifying sounds are commonly based on supervised learning algorithms, which require labeled data which is often scarce and leads to models that do not generalize well. In this paper, we make an efficient combination of confidence-based Active Learning and Self-Training with the aim of minimizing the need for human annotation for sound classification model training. The proposed method pre-processes the instances that are ready for labeling by calculating their classifier confidence scores, and then delivers the candidates with lower scores to human annotators, and those with high scores are automatically labeled by the machine. We demonstrate the feasibility and efficacy of this method in two practical scenarios: pool-based and stream-based processing. Extensive experimental results indicate that our approach requires significantly less labeled instances to reach the same performance in both scenarios compared to Passive Learning, Active Learning and Self-Training. A reduction of 52.2% in human labeled instances is achieved in both of the pool-based and stream-based scenarios on a sound classification task considering 16,930 sound instances.
A Simple and Practical Dictionary-based Approach for Identification of Proteins in Medline Abstracts

PubMed Central

Egorov, Sergei; Yuryev, Anton; Daraselia, Nikolai

2004-01-01

Objective: The aim of this study was to develop a practical and efficient protein identification system for biomedical corpora. Design: The developed system, called ProtScan, utilizes a carefully constructed dictionary of mammalian proteins in conjunction with a specialized tokenization algorithm to identify and tag protein name occurrences in biomedical texts and also takes advantage of Medline “Name-of-Substance” (NOS) annotation. The dictionaries for ProtScan were constructed in a semi-automatic way from various public-domain sequence databases followed by an intensive expert curation step. Measurements: The recall and precision of the system have been determined using 1,000 randomly selected and hand-tagged Medline abstracts. Results: The developed system is capable of identifying protein occurrences in Medline abstracts with a 98% precision and 88% recall. It was also found to be capable of processing approximately 300 abstracts per second. Without utilization of NOS annotation, precision and recall were found to be 98.5% and 84%, respectively. Conclusion: The developed system appears to be well suited for protein-based Medline indexing and can help to improve biomedical information retrieval. Further approaches to ProtScan's recall improvement also are discussed. PMID:14764613

Semi-Supervised Active Learning for Sound Classification in Hybrid Learning Environments

PubMed Central

Han, Wenjing; Coutinho, Eduardo; Li, Haifeng; Schuller, Björn; Yu, Xiaojie; Zhu, Xuan

2016-01-01

Coping with scarcity of labeled data is a common problem in sound classification tasks. Approaches for classifying sounds are commonly based on supervised learning algorithms, which require labeled data which is often scarce and leads to models that do not generalize well. In this paper, we make an efficient combination of confidence-based Active Learning and Self-Training with the aim of minimizing the need for human annotation for sound classification model training. The proposed method pre-processes the instances that are ready for labeling by calculating their classifier confidence scores, and then delivers the candidates with lower scores to human annotators, and those with high scores are automatically labeled by the machine. We demonstrate the feasibility and efficacy of this method in two practical scenarios: pool-based and stream-based processing. Extensive experimental results indicate that our approach requires significantly less labeled instances to reach the same performance in both scenarios compared to Passive Learning, Active Learning and Self-Training. A reduction of 52.2% in human labeled instances is achieved in both of the pool-based and stream-based scenarios on a sound classification task considering 16,930 sound instances. PMID:27627768
A sentence sliding window approach to extract protein annotations from biomedical articles

PubMed Central

Krallinger, Martin; Padron, Maria; Valencia, Alfonso

2005-01-01

Background Within the emerging field of text mining and statistical natural language processing (NLP) applied to biomedical articles, a broad variety of techniques have been developed during the past years. Nevertheless, there is still a great ned of comparative assessment of the performance of the proposed methods and the development of common evaluation criteria. This issue was addressed by the Critical Assessment of Text Mining Methods in Molecular Biology (BioCreative) contest. The aim of this contest was to assess the performance of text mining systems applied to biomedical texts including tools which recognize named entities such as genes and proteins, and tools which automatically extract protein annotations. Results The "sentence sliding window" approach proposed here was found to efficiently extract text fragments from full text articles containing annotations on proteins, providing the highest number of correctly predicted annotations. Moreover, the number of correct extractions of individual entities (i.e. proteins and GO terms) involved in the relationships used for the annotations was significantly higher than the correct extractions of the complete annotations (protein-function relations). Conclusion We explored the use of averaging sentence sliding windows for information extraction, especially in a context where conventional training data is unavailable. The combination of our approach with more refined statistical estimators and machine learning techniques might be a way to improve annotation extraction for future biomedical text mining applications. PMID:15960831
INDIGO – INtegrated Data Warehouse of MIcrobial GenOmes with Examples from the Red Sea Extremophiles

PubMed Central

Alam, Intikhab; Antunes, André; Kamau, Allan Anthony; Ba alawi, Wail; Kalkatawi, Manal; Stingl, Ulrich; Bajic, Vladimir B.

2013-01-01

Background The next generation sequencing technologies substantially increased the throughput of microbial genome sequencing. To functionally annotate newly sequenced microbial genomes, a variety of experimental and computational methods are used. Integration of information from different sources is a powerful approach to enhance such annotation. Functional analysis of microbial genomes, necessary for downstream experiments, crucially depends on this annotation but it is hampered by the current lack of suitable information integration and exploration systems for microbial genomes. Results We developed a data warehouse system (INDIGO) that enables the integration of annotations for exploration and analysis of newly sequenced microbial genomes. INDIGO offers an opportunity to construct complex queries and combine annotations from multiple sources starting from genomic sequence to protein domain, gene ontology and pathway levels. This data warehouse is aimed at being populated with information from genomes of pure cultures and uncultured single cells of Red Sea bacteria and Archaea. Currently, INDIGO contains information from Salinisphaera shabanensis, Haloplasma contractile, and Halorhabdus tiamatea - extremophiles isolated from deep-sea anoxic brine lakes of the Red Sea. We provide examples of utilizing the system to gain new insights into specific aspects on the unique lifestyle and adaptations of these organisms to extreme environments. Conclusions We developed a data warehouse system, INDIGO, which enables comprehensive integration of information from various resources to be used for annotation, exploration and analysis of microbial genomes. It will be regularly updated and extended with new genomes. It is aimed to serve as a resource dedicated to the Red Sea microbes. In addition, through INDIGO, we provide our Automatic Annotation of Microbial Genomes (AAMG) pipeline. The INDIGO web server is freely available at http://www.cbrc.kaust.edu.sa/indigo. PMID:24324765
INDIGO - INtegrated data warehouse of microbial genomes with examples from the red sea extremophiles.

PubMed

Alam, Intikhab; Antunes, André; Kamau, Allan Anthony; Ba Alawi, Wail; Kalkatawi, Manal; Stingl, Ulrich; Bajic, Vladimir B

2013-01-01

The next generation sequencing technologies substantially increased the throughput of microbial genome sequencing. To functionally annotate newly sequenced microbial genomes, a variety of experimental and computational methods are used. Integration of information from different sources is a powerful approach to enhance such annotation. Functional analysis of microbial genomes, necessary for downstream experiments, crucially depends on this annotation but it is hampered by the current lack of suitable information integration and exploration systems for microbial genomes. We developed a data warehouse system (INDIGO) that enables the integration of annotations for exploration and analysis of newly sequenced microbial genomes. INDIGO offers an opportunity to construct complex queries and combine annotations from multiple sources starting from genomic sequence to protein domain, gene ontology and pathway levels. This data warehouse is aimed at being populated with information from genomes of pure cultures and uncultured single cells of Red Sea bacteria and Archaea. Currently, INDIGO contains information from Salinisphaera shabanensis, Haloplasma contractile, and Halorhabdus tiamatea - extremophiles isolated from deep-sea anoxic brine lakes of the Red Sea. We provide examples of utilizing the system to gain new insights into specific aspects on the unique lifestyle and adaptations of these organisms to extreme environments. We developed a data warehouse system, INDIGO, which enables comprehensive integration of information from various resources to be used for annotation, exploration and analysis of microbial genomes. It will be regularly updated and extended with new genomes. It is aimed to serve as a resource dedicated to the Red Sea microbes. In addition, through INDIGO, we provide our Automatic Annotation of Microbial Genomes (AAMG) pipeline. The INDIGO web server is freely available at http://www.cbrc.kaust.edu.sa/indigo.
Automatically classifying sentences in full-text biomedical articles into Introduction, Methods, Results and Discussion.

PubMed

Agarwal, Shashank; Yu, Hong

2009-12-01

Biomedical texts can be typically represented by four rhetorical categories: Introduction, Methods, Results and Discussion (IMRAD). Classifying sentences into these categories can benefit many other text-mining tasks. Although many studies have applied different approaches for automatically classifying sentences in MEDLINE abstracts into the IMRAD categories, few have explored the classification of sentences that appear in full-text biomedical articles. We first evaluated whether sentences in full-text biomedical articles could be reliably annotated into the IMRAD format and then explored different approaches for automatically classifying these sentences into the IMRAD categories. Our results show an overall annotation agreement of 82.14% with a Kappa score of 0.756. The best classification system is a multinomial naïve Bayes classifier trained on manually annotated data that achieved 91.95% accuracy and an average F-score of 91.55%, which is significantly higher than baseline systems. A web version of this system is available online at-http://wood.ims.uwm.edu/full_text_classifier/.
Breadboard activities for advanced protein crystal growth

NASA Technical Reports Server (NTRS)

Rosenberger, Franz; Banish, Michael

1993-01-01

The proposed work entails the design, assembly, testing, and delivery of a turn-key system for the semi-automated determination of protein solubilities as a function of temperature. The system will utilize optical scintillation as a means of detecting and monitoring nucleation and crystallite growth during temperature lowering (or raising, with retrograde solubility systems). The deliverables of this contract are: (1) turn-key scintillation system for the semi-automatic determination of protein solubilities as a function of temperature, (2) instructions and software package for the operation of the scintillation system, and (3) one semi-annual and one final report including the test results obtained for ovostatin with the above scintillation system.
Functional-to-form mapping for assembly design automation

NASA Astrophysics Data System (ADS)

Xu, Z. G.; Liu, W. M.; Shen, W. D.; Yang, D. Y.; Liu, T. T.

2017-11-01

Assembly-level function-to-form mapping is the most effective procedure towards design automation. The research work mainly includes: the assembly-level function definitions, product network model and the two-step mapping mechanisms. The function-to-form mapping is divided into two steps, i.e. mapping of function-to-behavior, called the first-step mapping, and the second-step mapping, i.e. mapping of behavior-to-structure. After the first step mapping, the three dimensional transmission chain (or 3D sketch) is studied, and the feasible design computing tools are developed. The mapping procedure is relatively easy to be implemented interactively, but, it is quite difficult to finish it automatically. So manual, semi-automatic, automatic and interactive modification of the mapping model are studied. A mechanical hand F-F mapping process is illustrated to verify the design methodologies.
Automatic acquisition of motion trajectories: tracking hockey players

NASA Astrophysics Data System (ADS)

Okuma, Kenji; Little, James J.; Lowe, David

2003-12-01

Computer systems that have the capability of analyzing complex and dynamic scenes play an essential role in video annotation. Scenes can be complex in such a way that there are many cluttered objects with different colors, shapes and sizes, and can be dynamic with multiple interacting moving objects and a constantly changing background. In reality, there are many scenes that are complex, dynamic, and challenging enough for computers to describe. These scenes include games of sports, air traffic, car traffic, street intersections, and cloud transformations. Our research is about the challenge of inventing a descriptive computer system that analyzes scenes of hockey games where multiple moving players interact with each other on a constantly moving background due to camera motions. Ultimately, such a computer system should be able to acquire reliable data by extracting the players" motion as their trajectories, querying them by analyzing the descriptive information of data, and predict the motions of some hockey players based on the result of the query. Among these three major aspects of the system, we primarily focus on visual information of the scenes, that is, how to automatically acquire motion trajectories of hockey players from video. More accurately, we automatically analyze the hockey scenes by estimating parameters (i.e., pan, tilt, and zoom) of the broadcast cameras, tracking hockey players in those scenes, and constructing a visual description of the data by displaying trajectories of those players. Many technical problems in vision such as fast and unpredictable players' motions and rapid camera motions make our challenge worth tackling. To the best of our knowledge, there have not been any automatic video annotation systems for hockey developed in the past. Although there are many obstacles to overcome, our efforts and accomplishments would hopefully establish the infrastructure of the automatic hockey annotation system and become a milestone for research in automatic video annotation in this domain.
Comparison between manual and semi-automatic segmentation of nasal cavity and paranasal sinuses from CT images.

PubMed

Tingelhoff, K; Moral, A I; Kunkel, M E; Rilk, M; Wagner, I; Eichhorn, K G; Wahl, F M; Bootz, F

2007-01-01

Segmentation of medical image data is getting more and more important over the last years. The results are used for diagnosis, surgical planning or workspace definition of robot-assisted systems. The purpose of this paper is to find out whether manual or semi-automatic segmentation is adequate for ENT surgical workflow or whether fully automatic segmentation of paranasal sinuses and nasal cavity is needed. We present a comparison of manual and semi-automatic segmentation of paranasal sinuses and the nasal cavity. Manual segmentation is performed by custom software whereas semi-automatic segmentation is realized by a commercial product (Amira). For this study we used a CT dataset of the paranasal sinuses which consists of 98 transversal slices, each 1.0 mm thick, with a resolution of 512 x 512 pixels. For the analysis of both segmentation procedures we used volume, extension (width, length and height), segmentation time and 3D-reconstruction. The segmentation time was reduced from 960 minutes with manual to 215 minutes with semi-automatic segmentation. We found highest variances segmenting nasal cavity. For the paranasal sinuses manual and semi-automatic volume differences are not significant. Dependent on the segmentation accuracy both approaches deliver useful results and could be used for e.g. robot-assisted systems. Nevertheless both procedures are not useful for everyday surgical workflow, because they take too much time. Fully automatic and reproducible segmentation algorithms are needed for segmentation of paranasal sinuses and nasal cavity.
Preliminary clinical evaluation of semi-automated nailfold capillaroscopy in the assessment of patients with Raynaud's phenomenon.

PubMed

Murray, Andrea K; Feng, Kaiyan; Moore, Tonia L; Allen, Phillip D; Taylor, Christopher J; Herrick, Ariane L

2011-08-01

Nailfold capillaroscopy is well established in screening patients with Raynaud's phenomenon for underlying SSc-spectrum disorders, by identifying abnormal capillaries. Our aim was to compare semi-automatic feature measurement from newly developed software with manual measurements, and determine the degree to which semi-automated data allows disease group classification. Images from 46 healthy controls, 21 patients with PRP and 49 with SSc were preprocessed, and semi-automated measurements of intercapillary distance and capillary width, tortuosity, and derangement were performed. These were compared with manual measurements. Features were used to classify images into the three subject groups. Comparison of automatic and manual measures for distance, width, tortuosity, and derangement had correlations of r=0.583, 0.624, 0.495 (p<0.001), and 0.195 (p=0.040). For automatic measures, correlations were found between width and intercapillary distance, r=0.374, and width and tortuosity, r=0.573 (p<0.001). Significant differences between subject groups were found for all features (p<0.002). Overall, 75% of images correctly matched clinical classification using semi-automated features, compared with 71% for manual measurements. Semi-automatic and manual measurements of distance, width, and tortuosity showed moderate (but statistically significant) correlations. Correlation for derangement was weaker. Semi-automatic measurements are faster than manual measurements. Semi-automatic parameters identify differences between groups, and are as good as manual measurements for between-group classification. © 2011 John Wiley & Sons Ltd.
Automatically Recognizing Medication and Adverse Event Information From Food and Drug Administration’s Adverse Event Reporting System Narratives

PubMed Central

Polepalli Ramesh, Balaji; Belknap, Steven M; Li, Zuofeng; Frid, Nadya; West, Dennis P

2014-01-01

Background The Food and Drug Administration’s (FDA) Adverse Event Reporting System (FAERS) is a repository of spontaneously-reported adverse drug events (ADEs) for FDA-approved prescription drugs. FAERS reports include both structured reports and unstructured narratives. The narratives often include essential information for evaluation of the severity, causality, and description of ADEs that are not present in the structured data. The timely identification of unknown toxicities of prescription drugs is an important, unsolved problem. Objective The objective of this study was to develop an annotated corpus of FAERS narratives and biomedical named entity tagger to automatically identify ADE related information in the FAERS narratives. Methods We developed an annotation guideline and annotate medication information and adverse event related entities on 122 FAERS narratives comprising approximately 23,000 word tokens. A named entity tagger using supervised machine learning approaches was built for detecting medication information and adverse event entities using various categories of features. Results The annotated corpus had an agreement of over .9 Cohen’s kappa for medication and adverse event entities. The best performing tagger achieves an overall performance of 0.73 F1 score for detection of medication, adverse event and other named entities. Conclusions In this study, we developed an annotated corpus of FAERS narratives and machine learning based models for automatically extracting medication and adverse event information from the FAERS narratives. Our study is an important step towards enriching the FAERS data for postmarketing pharmacovigilance. PMID:25600332
New directions in biomedical text annotation: definitions, guidelines and corpus construction

PubMed Central

Wilbur, W John; Rzhetsky, Andrey; Shatkay, Hagit

2006-01-01

Background While biomedical text mining is emerging as an important research area, practical results have proven difficult to achieve. We believe that an important first step towards more accurate text-mining lies in the ability to identify and characterize text that satisfies various types of information needs. We report here the results of our inquiry into properties of scientific text that have sufficient generality to transcend the confines of a narrow subject area, while supporting practical mining of text for factual information. Our ultimate goal is to annotate a significant corpus of biomedical text and train machine learning methods to automatically categorize such text along certain dimensions that we have defined. Results We have identified five qualitative dimensions that we believe characterize a broad range of scientific sentences, and are therefore useful for supporting a general approach to text-mining: focus, polarity, certainty, evidence, and directionality. We define these dimensions and describe the guidelines we have developed for annotating text with regard to them. To examine the effectiveness of the guidelines, twelve annotators independently annotated the same set of 101 sentences that were randomly selected from current biomedical periodicals. Analysis of these annotations shows 70–80% inter-annotator agreement, suggesting that our guidelines indeed present a well-defined, executable and reproducible task. Conclusion We present our guidelines defining a text annotation task, along with annotation results from multiple independently produced annotations, demonstrating the feasibility of the task. The annotation of a very large corpus of documents along these guidelines is currently ongoing. These annotations form the basis for the categorization of text along multiple dimensions, to support viable text mining for experimental results, methodology statements, and other forms of information. We are currently developing machine learning methods, to be trained and tested on the annotated corpus, that would allow for the automatic categorization of biomedical text along the general dimensions that we have presented. The guidelines in full detail, along with annotated examples, are publicly available. PMID:16867190
A repeated-measures analysis of the effects of soft tissues on wrist range of motion in the extant phylogenetic bracket of dinosaurs: Implications for the functional origins of an automatic wrist folding mechanism in Crocodilia.

PubMed

Hutson, Joel David; Hutson, Kelda Nadine

2014-07-01

A recent study hypothesized that avian-like wrist folding in quadrupedal dinosaurs could have aided their distinctive style of locomotion with semi-pronated and therefore medially facing palms. However, soft tissues that automatically guide avian wrist folding rarely fossilize, and automatic wrist folding of unknown function in extant crocodilians has not been used to test this hypothesis. Therefore, an investigation of the relative contributions of soft tissues to wrist range of motion (ROM) in the extant phylogenetic bracket of dinosaurs, and the quadrupedal function of crocodilian wrist folding, could inform these questions. Here, we repeatedly measured wrist ROM in degrees through fully fleshed, skinned, minus muscles/tendons, minus ligaments, and skeletonized stages in the American alligator Alligator mississippiensis and the ostrich Struthio camelus. The effects of dissection treatment and observer were statistically significant for alligator wrist folding and ostrich wrist flexion, but not ostrich wrist folding. Final skeletonized wrist folding ROM was higher than (ostrich) or equivalent to (alligator) initial fully fleshed ROM, while final ROM was lower than initial ROM for ostrich wrist flexion. These findings suggest that, unlike the hinge/ball and socket-type elbow and shoulder joints in these archosaurs, ROM within gliding/planar diarthrotic joints is more restricted to the extent of articular surfaces. The alligator data indicate that the crocodilian wrist mechanism functions to automatically lock their semi-pronated palms into a rigid column, which supports the hypothesis that this palmar orientation necessitated soft tissue stiffening mechanisms in certain dinosaurs, although ROM-restricted articulations argue against the presence of an extensive automatic mechanism. Anat Rec, 297:1228-1249, 2014. © 2014 Wiley Periodicals, Inc. © 2014 Wiley Periodicals, Inc.
Introducing meta-services for biomedical information extraction

PubMed Central

Leitner, Florian; Krallinger, Martin; Rodriguez-Penagos, Carlos; Hakenberg, Jörg; Plake, Conrad; Kuo, Cheng-Ju; Hsu, Chun-Nan; Tsai, Richard Tzong-Han; Hung, Hsi-Chuan; Lau, William W; Johnson, Calvin A; Sætre, Rune; Yoshida, Kazuhiro; Chen, Yan Hua; Kim, Sun; Shin, Soo-Yong; Zhang, Byoung-Tak; Baumgartner, William A; Hunter, Lawrence; Haddow, Barry; Matthews, Michael; Wang, Xinglong; Ruch, Patrick; Ehrler, Frédéric; Özgür, Arzucan; Erkan, Güneş; Radev, Dragomir R; Krauthammer, Michael; Luong, ThaiBinh; Hoffmann, Robert; Sander, Chris; Valencia, Alfonso

2008-01-01

We introduce the first meta-service for information extraction in molecular biology, the BioCreative MetaServer (BCMS; ). This prototype platform is a joint effort of 13 research groups and provides automatically generated annotations for PubMed/Medline abstracts. Annotation types cover gene names, gene IDs, species, and protein-protein interactions. The annotations are distributed by the meta-server in both human and machine readable formats (HTML/XML). This service is intended to be used by biomedical researchers and database annotators, and in biomedical language processing. The platform allows direct comparison, unified access, and result aggregation of the annotations. PMID:18834497
Dictionary-driven protein annotation

PubMed Central

Rigoutsos, Isidore; Huynh, Tien; Floratos, Aris; Parida, Laxmi; Platt, Daniel

2002-01-01

Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were released publicly after we built the Bio-Dictionary that is used in our experiments. Finally, we have computed the annotations of more than 70 complete genomes and made them available on the World Wide Web at http://cbcsrv.watson.ibm.com/Annotations/. PMID:12202776
Multi-dimensional classification of biomedical text: Toward automated, practical provision of high-utility text to diverse users

PubMed Central

Shatkay, Hagit; Pan, Fengxia; Rzhetsky, Andrey; Wilbur, W. John

2008-01-01

Motivation: Much current research in biomedical text mining is concerned with serving biologists by extracting certain information from scientific text. We note that there is no ‘average biologist’ client; different users have distinct needs. For instance, as noted in past evaluation efforts (BioCreative, TREC, KDD) database curators are often interested in sentences showing experimental evidence and methods. Conversely, lab scientists searching for known information about a protein may seek facts, typically stated with high confidence. Text-mining systems can target specific end-users and become more effective, if the system can first identify text regions rich in the type of scientific content that is of interest to the user, retrieve documents that have many such regions, and focus on fact extraction from these regions. Here, we study the ability to characterize and classify such text automatically. We have recently introduced a multi-dimensional categorization and annotation scheme, developed to be applicable to a wide variety of biomedical documents and scientific statements, while intended to support specific biomedical retrieval and extraction tasks. Results: The annotation scheme was applied to a large corpus in a controlled effort by eight independent annotators, where three individual annotators independently tagged each sentence. We then trained and tested machine learning classifiers to automatically categorize sentence fragments based on the annotation. We discuss here the issues involved in this task, and present an overview of the results. The latter strongly suggest that automatic annotation along most of the dimensions is highly feasible, and that this new framework for scientific sentence categorization is applicable in practice. Contact: shatkay@cs.queensu.ca PMID:18718948
Confidence-based ensemble for GBM brain tumor segmentation

NASA Astrophysics Data System (ADS)

Huo, Jing; van Rikxoort, Eva M.; Okada, Kazunori; Kim, Hyun J.; Pope, Whitney; Goldin, Jonathan; Brown, Matthew

2011-03-01

It is a challenging task to automatically segment glioblastoma multiforme (GBM) brain tumors on T1w post-contrast isotropic MR images. A semi-automated system using fuzzy connectedness has recently been developed for computing the tumor volume that reduces the cost of manual annotation. In this study, we propose a an ensemble method that combines multiple segmentation results into a final ensemble one. The method is evaluated on a dataset of 20 cases from a multi-center pharmaceutical drug trial and compared to the fuzzy connectedness method. Three individual methods were used in the framework: fuzzy connectedness, GrowCut, and voxel classification. The combination method is a confidence map averaging (CMA) method. The CMA method shows an improved ROC curve compared to the fuzzy connectedness method (p < 0.001). The CMA ensemble result is more robust compared to the three individual methods.
Automatic textual annotation of video news based on semantic visual object extraction

NASA Astrophysics Data System (ADS)

Boujemaa, Nozha; Fleuret, Francois; Gouet, Valerie; Sahbi, Hichem

2003-12-01

In this paper, we present our work for automatic generation of textual metadata based on visual content analysis of video news. We present two methods for semantic object detection and recognition from a cross modal image-text thesaurus. These thesaurus represent a supervised association between models and semantic labels. This paper is concerned with two semantic objects: faces and Tv logos. In the first part, we present our work for efficient face detection and recogniton with automatic name generation. This method allows us also to suggest the textual annotation of shots close-up estimation. On the other hand, we were interested to automatically detect and recognize different Tv logos present on incoming different news from different Tv Channels. This work was done jointly with the French Tv Channel TF1 within the "MediaWorks" project that consists on an hybrid text-image indexing and retrieval plateform for video news.
Common data model for natural language processing based on two existing standard information models: CDA+GrAF.

PubMed

Meystre, Stéphane M; Lee, Sanghoon; Jung, Chai Young; Chevrier, Raphaël D

2012-08-01

An increasing need for collaboration and resources sharing in the Natural Language Processing (NLP) research and development community motivates efforts to create and share a common data model and a common terminology for all information annotated and extracted from clinical text. We have combined two existing standards: the HL7 Clinical Document Architecture (CDA), and the ISO Graph Annotation Format (GrAF; in development), to develop such a data model entitled "CDA+GrAF". We experimented with several methods to combine these existing standards, and eventually selected a method wrapping separate CDA and GrAF parts in a common standoff annotation (i.e., separate from the annotated text) XML document. Two use cases, clinical document sections, and the 2010 i2b2/VA NLP Challenge (i.e., problems, tests, and treatments, with their assertions and relations), were used to create examples of such standoff annotation documents, and were successfully validated with the XML schemata provided with both standards. We developed a tool to automatically translate annotation documents from the 2010 i2b2/VA NLP Challenge format to GrAF, and automatically generated 50 annotation documents using this tool, all successfully validated. Finally, we adapted the XSL stylesheet provided with HL7 CDA to allow viewing annotation XML documents in a web browser, and plan to adapt existing tools for translating annotation documents between CDA+GrAF and the UIMA and GATE frameworks. This common data model may ease directly comparing NLP tools and applications, combining their output, transforming and "translating" annotations between different NLP applications, and eventually "plug-and-play" of different modules in NLP applications. Copyright © 2011 Elsevier Inc. All rights reserved.
Active Deep Learning-Based Annotation of Electroencephalography Reports for Cohort Identification

PubMed Central

Maldonado, Ramon; Goodwin, Travis R; Harabagiu, Sanda M

2017-01-01

The annotation of a large corpus of Electroencephalography (EEG) reports is a crucial step in the development of an EEG-specific patient cohort retrieval system. The annotation of multiple types of EEG-specific medical concepts, along with their polarity and modality, is challenging, especially when automatically performed on Big Data. To address this challenge, we present a novel framework which combines the advantages of active and deep learning while producing annotations that capture a variety of attributes of medical concepts. Results obtained through our novel framework show great promise. PMID:28815135

UniProtKB/Swiss-Prot, the Manually Annotated Section of the UniProt KnowledgeBase: How to Use the Entry View.

PubMed

Boutet, Emmanuel; Lieberherr, Damien; Tognolli, Michael; Schneider, Michel; Bansal, Parit; Bridge, Alan J; Poux, Sylvain; Bougueleret, Lydie; Xenarios, Ioannis

2016-01-01

The Universal Protein Resource (UniProt, http://www.uniprot.org ) consortium is an initiative of the SIB Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI) and the Protein Information Resource (PIR) to provide the scientific community with a central resource for protein sequences and functional information. The UniProt consortium maintains the UniProt KnowledgeBase (UniProtKB), updated every 4 weeks, and several supplementary databases including the UniProt Reference Clusters (UniRef) and the UniProt Archive (UniParc).The Swiss-Prot section of the UniProt KnowledgeBase (UniProtKB/Swiss-Prot) contains publicly available expertly manually annotated protein sequences obtained from a broad spectrum of organisms. Plant protein entries are produced in the frame of the Plant Proteome Annotation Program (PPAP), with an emphasis on characterized proteins of Arabidopsis thaliana and Oryza sativa. High level annotations provided by UniProtKB/Swiss-Prot are widely used to predict annotation of newly available proteins through automatic pipelines.The purpose of this chapter is to present a guided tour of a UniProtKB/Swiss-Prot entry. We will also present some of the tools and databases that are linked to each entry.
Synthesizing Certified Code

NASA Technical Reports Server (NTRS)

Whalen, Michael; Schumann, Johann; Fischer, Bernd

2002-01-01

Code certification is a lightweight approach to demonstrate software quality on a formal level. Its basic idea is to require producers to provide formal proofs that their code satisfies certain quality properties. These proofs serve as certificates which can be checked independently. Since code certification uses the same underlying technology as program verification, it also requires many detailed annotations (e.g., loop invariants) to make the proofs possible. However, manually adding theses annotations to the code is time-consuming and error-prone. We address this problem by combining code certification with automatic program synthesis. We propose an approach to generate simultaneously, from a high-level specification, code and all annotations required to certify generated code. Here, we describe a certification extension of AUTOBAYES, a synthesis tool which automatically generates complex data analysis programs from compact specifications. AUTOBAYES contains sufficient high-level domain knowledge to generate detailed annotations. This allows us to use a general-purpose verification condition generator to produce a set of proof obligations in first-order logic. The obligations are then discharged using the automated theorem E-SETHEO. We demonstrate our approach by certifying operator safety for a generated iterative data classification program without manual annotation of the code.
Validation of semi-automatic segmentation of the left atrium

NASA Astrophysics Data System (ADS)

Rettmann, M. E.; Holmes, D. R., III; Camp, J. J.; Packer, D. L.; Robb, R. A.

2008-03-01

Catheter ablation therapy has become increasingly popular for the treatment of left atrial fibrillation. The effect of this treatment on left atrial morphology, however, has not yet been completely quantified. Initial studies have indicated a decrease in left atrial size with a concomitant decrease in pulmonary vein diameter. In order to effectively study if catheter based therapies affect left atrial geometry, robust segmentations with minimal user interaction are required. In this work, we validate a method to semi-automatically segment the left atrium from computed-tomography scans. The first step of the technique utilizes seeded region growing to extract the entire blood pool including the four chambers of the heart, the pulmonary veins, aorta, superior vena cava, inferior vena cava, and other surrounding structures. Next, the left atrium and pulmonary veins are separated from the rest of the blood pool using an algorithm that searches for thin connections between user defined points in the volumetric data or on a surface rendering. Finally, pulmonary veins are separated from the left atrium using a three dimensional tracing tool. A single user segmented three datasets three times using both the semi-automatic technique as well as manual tracing. The user interaction time for the semi-automatic technique was approximately forty-five minutes per dataset and the manual tracing required between four and eight hours per dataset depending on the number of slices. A truth model was generated using a simple voting scheme on the repeated manual segmentations. A second user segmented each of the nine datasets using the semi-automatic technique only. Several metrics were computed to assess the agreement between the semi-automatic technique and the truth model including percent differences in left atrial volume, DICE overlap, and mean distance between the boundaries of the segmented left atria. Overall, the semi-automatic approach was demonstrated to be repeatable within and between raters, and accurate when compared to the truth model. Finally, we generated a visualization to assess the spatial variability in the segmentation errors between the semi-automatic approach and the truth model. The visualization demonstrates the highest errors occur at the boundaries between the left atium and pulmonary veins as well as the left atrium and left atrial appendage. In conclusion, we describe a semi-automatic approach for left atrial segmentation that demonstrates repeatability and accuracy, with the advantage of significant time reduction in user interaction time.
Computer systems for annotation of single molecule fragments

DOEpatents

Schwartz, David Charles; Severin, Jessica

2016-07-19

There are provided computer systems for visualizing and annotating single molecule images. Annotation systems in accordance with this disclosure allow a user to mark and annotate single molecules of interest and their restriction enzyme cut sites thereby determining the restriction fragments of single nucleic acid molecules. The markings and annotations may be automatically generated by the system in certain embodiments and they may be overlaid translucently onto the single molecule images. An image caching system may be implemented in the computer annotation systems to reduce image processing time. The annotation systems include one or more connectors connecting to one or more databases capable of storing single molecule data as well as other biomedical data. Such diverse array of data can be retrieved and used to validate the markings and annotations. The annotation systems may be implemented and deployed over a computer network. They may be ergonomically optimized to facilitate user interactions.
Next Generation Models for Storage and Representation of Microbial Biological Annotation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Quest, Daniel J; Land, Miriam L; Brettin, Thomas S

2010-01-01

Background Traditional genome annotation systems were developed in a very different computing era, one where the World Wide Web was just emerging. Consequently, these systems are built as centralized black boxes focused on generating high quality annotation submissions to GenBank/EMBL supported by expert manual curation. The exponential growth of sequence data drives a growing need for increasingly higher quality and automatically generated annotation. Typical annotation pipelines utilize traditional database technologies, clustered computing resources, Perl, C, and UNIX file systems to process raw sequence data, identify genes, and predict and categorize gene function. These technologies tightly couple the annotation software systemmore » to hardware and third party software (e.g. relational database systems and schemas). This makes annotation systems hard to reproduce, inflexible to modification over time, difficult to assess, difficult to partition across multiple geographic sites, and difficult to understand for those who are not domain experts. These systems are not readily open to scrutiny and therefore not scientifically tractable. The advent of Semantic Web standards such as Resource Description Framework (RDF) and OWL Web Ontology Language (OWL) enables us to construct systems that address these challenges in a new comprehensive way. Results Here, we develop a framework for linking traditional data to OWL-based ontologies in genome annotation. We show how data standards can decouple hardware and third party software tools from annotation pipelines, thereby making annotation pipelines easier to reproduce and assess. An illustrative example shows how TURTLE (Terse RDF Triple Language) can be used as a human readable, but also semantically-aware, equivalent to GenBank/EMBL files. Conclusions The power of this approach lies in its ability to assemble annotation data from multiple databases across multiple locations into a representation that is understandable to researchers. In this way, all researchers, experimental and computational, will more easily understand the informatics processes constructing genome annotation and ultimately be able to help improve the systems that produce them.« less
Biologically inspired EM image alignment and neural reconstruction.

PubMed

Knowles-Barley, Seymour; Butcher, Nancy J; Meinertzhagen, Ian A; Armstrong, J Douglas

2011-08-15

Three-dimensional reconstruction of consecutive serial-section transmission electron microscopy (ssTEM) images of neural tissue currently requires many hours of manual tracing and annotation. Several computational techniques have already been applied to ssTEM images to facilitate 3D reconstruction and ease this burden. Here, we present an alternative computational approach for ssTEM image analysis. We have used biologically inspired receptive fields as a basis for a ridge detection algorithm to identify cell membranes, synaptic contacts and mitochondria. Detected line segments are used to improve alignment between consecutive images and we have joined small segments of membrane into cell surfaces using a dynamic programming algorithm similar to the Needleman-Wunsch and Smith-Waterman DNA sequence alignment procedures. A shortest path-based approach has been used to close edges and achieve image segmentation. Partial reconstructions were automatically generated and used as a basis for semi-automatic reconstruction of neural tissue. The accuracy of partial reconstructions was evaluated and 96% of membrane could be identified at the cost of 13% false positive detections. An open-source reference implementation is available in the Supplementary information. seymour.kb@ed.ac.uk; douglas.armstrong@ed.ac.uk Supplementary data are available at Bioinformatics online.
Classification of Chemical Compounds to Support Complex Queries in a Pathway Database

PubMed Central

Weidemann, Andreas; Kania, Renate; Peiss, Christian; Rojas, Isabel

2004-01-01

Data quality in biological databases has become a topic of great discussion. To provide high quality data and to deal with the vast amount of biochemical data, annotators and curators need to be supported by software that carries out part of their work in an (semi-) automatic manner. The detection of errors and inconsistencies is a part that requires the knowledge of domain experts, thus in most cases it is done manually, making it very expensive and time-consuming. This paper presents two tools to partially support the curation of data on biochemical pathways. The tool enables the automatic classification of chemical compounds based on their respective SMILES strings. Such classification allows the querying and visualization of biochemical reactions at different levels of abstraction, according to the level of detail at which the reaction participants are described. Chemical compounds can be classified in a flexible manner based on different criteria. The support of the process of data curation is provided by facilitating the detection of compounds that are identified as different but that are actually the same. This is also used to identify similar reactions and, in turn, pathways. PMID:18629066
EnzML: multi-label prediction of enzyme classes using InterPro signatures

PubMed Central

2012-01-01

Background Manual annotation of enzymatic functions cannot keep up with automatic genome sequencing. In this work we explore the capacity of InterPro sequence signatures to automatically predict enzymatic function. Results We present EnzML, a multi-label classification method that can efficiently account also for proteins with multiple enzymatic functions: 50,000 in UniProt. EnzML was evaluated using a standard set of 300,747 proteins for which the manually curated Swiss-Prot and KEGG databases have agreeing Enzyme Commission (EC) annotations. EnzML achieved more than 98% subset accuracy (exact match of all correct Enzyme Commission classes of a protein) for the entire dataset and between 87 and 97% subset accuracy in reannotating eight entire proteomes: human, mouse, rat, mouse-ear cress, fruit fly, the S. pombe yeast, the E. coli bacterium and the M. jannaschii archaebacterium. To understand the role played by the dataset size, we compared the cross-evaluation results of smaller datasets, either constructed at random or from specific taxonomic domains such as archaea, bacteria, fungi, invertebrates, plants and vertebrates. The results were confirmed even when the redundancy in the dataset was reduced using UniRef100, UniRef90 or UniRef50 clusters. Conclusions InterPro signatures are a compact and powerful attribute space for the prediction of enzymatic function. This representation makes multi-label machine learning feasible in reasonable time (30 minutes to train on 300,747 instances with 10,852 attributes and 2,201 class values) using the Mulan Binary Relevance Nearest Neighbours algorithm implementation (BR-kNN). PMID:22533924
GeneTopics - interpretation of gene sets via literature-driven topic models

PubMed Central

2013-01-01

Background Annotation of a set of genes is often accomplished through comparison to a library of labelled gene sets such as biological processes or canonical pathways. However, this approach might fail if the employed libraries are not up to date with the latest research, don't capture relevant biological themes or are curated at a different level of granularity than is required to appropriately analyze the input gene set. At the same time, the vast biomedical literature offers an unstructured repository of the latest research findings that can be tapped to provide thematic sub-groupings for any input gene set. Methods Our proposed method relies on a gene-specific text corpus and extracts commonalities between documents in an unsupervised manner using a topic model approach. We automatically determine the number of topics summarizing the corpus and calculate a gene relevancy score for each topic allowing us to eliminate non-specific topics. As a result we obtain a set of literature topics in which each topic is associated with a subset of the input genes providing directly interpretable keywords and corresponding documents for literature research. Results We validate our method based on labelled gene sets from the KEGG metabolic pathway collection and the genetic association database (GAD) and show that the approach is able to detect topics consistent with the labelled annotation. Furthermore, we discuss the results on three different types of experimentally derived gene sets, (1) differentially expressed genes from a cardiac hypertrophy experiment in mice, (2) altered transcript abundance in human pancreatic beta cells, and (3) genes implicated by GWA studies to be associated with metabolite levels in a healthy population. In all three cases, we are able to replicate findings from the original papers in a quick and semi-automated manner. Conclusions Our approach provides a novel way of automatically generating meaningful annotations for gene sets that are directly tied to relevant articles in the literature. Extending a general topic model method, the approach introduced here establishes a workflow for the interpretation of gene sets generated from diverse experimental scenarios that can complement the classical approach of comparison to reference gene sets. PMID:24564875
Workflow and web application for annotating NCBI BioProject transcriptome data

PubMed Central

Vera Alvarez, Roberto; Medeiros Vidal, Newton; Garzón-Martínez, Gina A.; Barrero, Luz S.; Landsman, David

2017-01-01

Abstract The volume of transcriptome data is growing exponentially due to rapid improvement of experimental technologies. In response, large central resources such as those of the National Center for Biotechnology Information (NCBI) are continually adapting their computational infrastructure to accommodate this large influx of data. New and specialized databases, such as Transcriptome Shotgun Assembly Sequence Database (TSA) and Sequence Read Archive (SRA), have been created to aid the development and expansion of centralized repositories. Although the central resource databases are under continual development, they do not include automatic pipelines to increase annotation of newly deposited data. Therefore, third-party applications are required to achieve that aim. Here, we present an automatic workflow and web application for the annotation of transcriptome data. The workflow creates secondary data such as sequencing reads and BLAST alignments, which are available through the web application. They are based on freely available bioinformatics tools and scripts developed in-house. The interactive web application provides a search engine and several browser utilities. Graphical views of transcript alignments are available through SeqViewer, an embedded tool developed by NCBI for viewing biological sequence data. The web application is tightly integrated with other NCBI web applications and tools to extend the functionality of data processing and interconnectivity. We present a case study for the species Physalis peruviana with data generated from BioProject ID 67621. Database URL: http://www.ncbi.nlm.nih.gov/projects/physalis/ PMID:28605765
NaviGO: interactive tool for visualization and functional similarity and coherence analysis with gene ontology.

PubMed

Wei, Qing; Khan, Ishita K; Ding, Ziyun; Yerneni, Satwica; Kihara, Daisuke

2017-03-20

The number of genomics and proteomics experiments is growing rapidly, producing an ever-increasing amount of data that are awaiting functional interpretation. A number of function prediction algorithms were developed and improved to enable fast and automatic function annotation. With the well-defined structure and manual curation, Gene Ontology (GO) is the most frequently used vocabulary for representing gene functions. To understand relationship and similarity between GO annotations of genes, it is important to have a convenient pipeline that quantifies and visualizes the GO function analyses in a systematic fashion. NaviGO is a web-based tool for interactive visualization, retrieval, and computation of functional similarity and associations of GO terms and genes. Similarity of GO terms and gene functions is quantified with six different scores including protein-protein interaction and context based association scores we have developed in our previous works. Interactive navigation of the GO function space provides intuitive and effective real-time visualization of functional groupings of GO terms and genes as well as statistical analysis of enriched functions. We developed NaviGO, which visualizes and analyses functional similarity and associations of GO terms and genes. The NaviGO webserver is freely available at: http://kiharalab.org/web/navigo .
An Infrastructure for Indexing and Organizing Best Practices

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zhu, Liming; Staples, Mark; Gorton, Ian

Industry best practices are widely held but not necessarily empirically verified software engineering beliefs. Best practices can be documented in distributed web-based public repositories as pattern catalogues or practice libraries. There is a need to systematically index and organize these practices to enable their better practical use and scientific evaluation. In this paper, we propose a semi-automatic approach to index and organise best practices. A central repository acts as an information overlay on top of other pre-existing resources to facilitate organization, navigation, annotation and meta-analysis while maintaining synchronization with those resources. An initial population of the central repository is automatedmore » using Yahoo! contextual search services. The collected data is organized using semantic web technologies so that the data can be more easily shared and used for innovative analyses. A prototype has demonstrated the capability of the approach.« less
Incorporating Feature-Based Annotations into Automatically Generated Knowledge Representations

NASA Astrophysics Data System (ADS)

Lumb, L. I.; Lederman, J. I.; Aldridge, K. D.

2006-12-01

Earth Science Markup Language (ESML) is efficient and effective in representing scientific data in an XML- based formalism. However, features of the data being represented are not accounted for in ESML. Such features might derive from events (e.g., a gap in data collection due to instrument servicing), identifications (e.g., a scientifically interesting area/volume in an image), or some other source. In order to account for features in an ESML context, we consider them from the perspective of annotation, i.e., the addition of information to existing documents without changing the originals. Although it is possible to extend ESML to incorporate feature-based annotations internally (e.g., by extending the XML schema for ESML), there are a number of complicating factors that we identify. Rather than pursuing the ESML-extension approach, we focus on an external representation for feature-based annotations via XML Pointer Language (XPointer). In previous work (Lumb &Aldridge, HPCS 2006, IEEE, doi:10.1109/HPCS.2006.26), we have shown that it is possible to extract relationships from ESML-based representations, and capture the results in the Resource Description Format (RDF). Thus we explore and report on this same requirement for XPointer-based annotations of ESML representations. As in our past efforts, the Global Geodynamics Project (GGP) allows us to illustrate with a real-world example this approach for introducing annotations into automatically generated knowledge representations.
Enriching regulatory networks by bootstrap learning using optimised GO-based gene similarity and gene links mined from PubMed abstracts

DOE Office of Scientific and Technical Information (OSTI.GOV)

Taylor, Ronald C.; Sanfilippo, Antonio P.; McDermott, Jason E.

2011-02-18

Transcriptional regulatory networks are being determined using “reverse engineering” methods that infer connections based on correlations in gene state. Corroboration of such networks through independent means such as evidence from the biomedical literature is desirable. Here, we explore a novel approach, a bootstrapping version of our previous Cross-Ontological Analytic method (XOA) that can be used for semi-automated annotation and verification of inferred regulatory connections, as well as for discovery of additional functional relationships between the genes. First, we use our annotation and network expansion method on a biological network learned entirely from the literature. We show how new relevant linksmore » between genes can be iteratively derived using a gene similarity measure based on the Gene Ontology that is optimized on the input network at each iteration. Second, we apply our method to annotation, verification, and expansion of a set of regulatory connections found by the Context Likelihood of Relatedness algorithm.« less
Annotation of Korean Learner Corpora for Particle Error Detection

ERIC Educational Resources Information Center

Lee, Sun-Hee; Jang, Seok Bae; Seo, Sang-Kyu

2009-01-01

In this study, we focus on particle errors and discuss an annotation scheme for Korean learner corpora that can be used to extract heuristic patterns of particle errors efficiently. We investigate different properties of particle errors so that they can be later used to identify learner errors automatically, and we provide resourceful annotation…
Text Mining to Support Gene Ontology Curation and Vice Versa.

PubMed

Ruch, Patrick

2017-01-01

In this chapter, we explain how text mining can support the curation of molecular biology databases dealing with protein functions. We also show how curated data can play a disruptive role in the developments of text mining methods. We review a decade of efforts to improve the automatic assignment of Gene Ontology (GO) descriptors, the reference ontology for the characterization of genes and gene products. To illustrate the high potential of this approach, we compare the performances of an automatic text categorizer and show a large improvement of +225 % in both precision and recall on benchmarked data. We argue that automatic text categorization functions can ultimately be embedded into a Question-Answering (QA) system to answer questions related to protein functions. Because GO descriptors can be relatively long and specific, traditional QA systems cannot answer such questions. A new type of QA system, so-called Deep QA which uses machine learning methods trained with curated contents, is thus emerging. Finally, future advances of text mining instruments are directly dependent on the availability of high-quality annotated contents at every curation step. Databases workflows must start recording explicitly all the data they curate and ideally also some of the data they do not curate.
Ensembl 2002: accommodating comparative genomics.

PubMed

Clamp, M; Andrews, D; Barker, D; Bevan, P; Cameron, G; Chen, Y; Clark, L; Cox, T; Cuff, J; Curwen, V; Down, T; Durbin, R; Eyras, E; Gilbert, J; Hammond, M; Hubbard, T; Kasprzyk, A; Keefe, D; Lehvaslaiho, H; Iyer, V; Melsopp, C; Mongin, E; Pettett, R; Potter, S; Rust, A; Schmidt, E; Searle, S; Slater, G; Smith, J; Spooner, W; Stabenau, A; Stalker, J; Stupka, E; Ureta-Vidal, A; Vastrik, I; Birney, E

2003-01-01

The Ensembl (http://www.ensembl.org/) database project provides a bioinformatics framework to organise biology around the sequences of large genomes. It is a comprehensive source of stable automatic annotation of human, mouse and other genome sequences, available as either an interactive web site or as flat files. Ensembl also integrates manually annotated gene structures from external sources where available. As well as being one of the leading sources of genome annotation, Ensembl is an open source software engineering project to develop a portable system able to handle very large genomes and associated requirements. These range from sequence analysis to data storage and visualisation and installations exist around the world in both companies and at academic sites. With both human and mouse genome sequences available and more vertebrate sequences to follow, many of the recent developments in Ensembl have focusing on developing automatic comparative genome analysis and visualisation.
GRN2SBML: automated encoding and annotation of inferred gene regulatory networks complying with SBML.

PubMed

Vlaic, Sebastian; Hoffmann, Bianca; Kupfer, Peter; Weber, Michael; Dräger, Andreas

2013-09-01

GRN2SBML automatically encodes gene regulatory networks derived from several inference tools in systems biology markup language. Providing a graphical user interface, the networks can be annotated via the simple object access protocol (SOAP)-based application programming interface of BioMart Central Portal and minimum information required in the annotation of models registry. Additionally, we provide an R-package, which processes the output of supported inference algorithms and automatically passes all required parameters to GRN2SBML. Therefore, GRN2SBML closes a gap in the processing pipeline between the inference of gene regulatory networks and their subsequent analysis, visualization and storage. GRN2SBML is freely available under the GNU Public License version 3 and can be downloaded from http://www.hki-jena.de/index.php/0/2/490. General information on GRN2SBML, examples and tutorials are available at the tool's web page.
Addressing case specific biogas plant tasks: industry oriented methane yields derived from 5L Automatic Methane Potential Test Systems in batch or semi-continuous tests using realistic inocula, substrate particle sizes and organic loading.

PubMed

Kolbl, Sabina; Paloczi, Attila; Panjan, Jože; Stres, Blaž

2014-02-01

The primary aim of the study was to develop and validate an in-house upscale of Automatic Methane Potential Test System II for studying real-time inocula and real-scale substrates in batch, codigestion and enzyme enhanced hydrolysis experiments, in addition to semi-continuous operation of the developed equipment and experiments testing inoculum functional quality. The successful upscale to 5L enabled comparison of different process configurations in shorter preparation times with acceptable accuracy and high-through put intended for industrial decision making. The adoption of the same scales, equipment and methodologies in batch and semi-continuous tests mirroring those at full scale biogas plants resulted in matching methane yields between the two laboratory tests and full-scale, confirming thus the increased decision making value of the approach for industrial operations. Copyright © 2013 Elsevier Ltd. All rights reserved.
Comparison and assessment of semi-automatic image segmentation in computed tomography scans for image-guided kidney surgery.

PubMed

Glisson, Courtenay L; Altamar, Hernan O; Herrell, S Duke; Clark, Peter; Galloway, Robert L

2011-11-01

Image segmentation is integral to implementing intraoperative guidance for kidney tumor resection. Results seen in computed tomography (CT) data are affected by target organ physiology as well as by the segmentation algorithm used. This work studies variables involved in using level set methods found in the Insight Toolkit to segment kidneys from CT scans and applies the results to an image guidance setting. A composite algorithm drawing on the strengths of multiple level set approaches was built using the Insight Toolkit. This algorithm requires image contrast state and seed points to be identified as input, and functions independently thereafter, selecting and altering method and variable choice as needed. Semi-automatic results were compared to expert hand segmentation results directly and by the use of the resultant surfaces for registration of intraoperative data. Direct comparison using the Dice metric showed average agreement of 0.93 between semi-automatic and hand segmentation results. Use of the segmented surfaces in closest point registration of intraoperative laser range scan data yielded average closest point distances of approximately 1 mm. Application of both inverse registration transforms from the previous step to all hand segmented image space points revealed that the distance variability introduced by registering to the semi-automatically segmented surface versus the hand segmented surface was typically less than 3 mm both near the tumor target and at distal points, including subsurface points. Use of the algorithm shortened user interaction time and provided results which were comparable to the gold standard of hand segmentation. Further, the use of the algorithm's resultant surfaces in image registration provided comparable transformations to surfaces produced by hand segmentation. These data support the applicability and utility of such an algorithm as part of an image guidance workflow.

Ontorat: automatic generation of new ontology terms, annotations, and axioms based on ontology design patterns.

PubMed

Xiang, Zuoshuang; Zheng, Jie; Lin, Yu; He, Yongqun

2015-01-01

It is time-consuming to build an ontology with many terms and axioms. Thus it is desired to automate the process of ontology development. Ontology Design Patterns (ODPs) provide a reusable solution to solve a recurrent modeling problem in the context of ontology engineering. Because ontology terms often follow specific ODPs, the Ontology for Biomedical Investigations (OBI) developers proposed a Quick Term Templates (QTTs) process targeted at generating new ontology classes following the same pattern, using term templates in a spreadsheet format. Inspired by the ODPs and QTTs, the Ontorat web application is developed to automatically generate new ontology terms, annotations of terms, and logical axioms based on a specific ODP(s). The inputs of an Ontorat execution include axiom expression settings, an input data file, ID generation settings, and a target ontology (optional). The axiom expression settings can be saved as a predesigned Ontorat setting format text file for reuse. The input data file is generated based on a template file created by a specific ODP (text or Excel format). Ontorat is an efficient tool for ontology expansion. Different use cases are described. For example, Ontorat was applied to automatically generate over 1,000 Japan RIKEN cell line cell terms with both logical axioms and rich annotation axioms in the Cell Line Ontology (CLO). Approximately 800 licensed animal vaccines were represented and annotated in the Vaccine Ontology (VO) by Ontorat. The OBI team used Ontorat to add assay and device terms required by ENCODE project. Ontorat was also used to add missing annotations to all existing Biobank specific terms in the Biobank Ontology. A collection of ODPs and templates with examples are provided on the Ontorat website and can be reused to facilitate ontology development. With ever increasing ontology development and applications, Ontorat provides a timely platform for generating and annotating a large number of ontology terms by following design patterns. http://ontorat.hegroup.org/.
A study of the effectiveness of machine learning methods for classification of clinical interview fragments into a large number of categories.

PubMed

Hasan, Mehedi; Kotov, Alexander; Carcone, April; Dong, Ming; Naar, Sylvie; Hartlieb, Kathryn Brogan

2016-08-01

This study examines the effectiveness of state-of-the-art supervised machine learning methods in conjunction with different feature types for the task of automatic annotation of fragments of clinical text based on codebooks with a large number of categories. We used a collection of motivational interview transcripts consisting of 11,353 utterances, which were manually annotated by two human coders as the gold standard, and experimented with state-of-art classifiers, including Naïve Bayes, J48 Decision Tree, Support Vector Machine (SVM), Random Forest (RF), AdaBoost, DiscLDA, Conditional Random Fields (CRF) and Convolutional Neural Network (CNN) in conjunction with lexical, contextual (label of the previous utterance) and semantic (distribution of words in the utterance across the Linguistic Inquiry and Word Count dictionaries) features. We found out that, when the number of classes is large, the performance of CNN and CRF is inferior to SVM. When only lexical features were used, interview transcripts were automatically annotated by SVM with the highest classification accuracy among all classifiers of 70.8%, 61% and 53.7% based on the codebooks consisting of 17, 20 and 41 codes, respectively. Using contextual and semantic features, as well as their combination, in addition to lexical ones, improved the accuracy of SVM for annotation of utterances in motivational interview transcripts with a codebook consisting of 17 classes to 71.5%, 74.2%, and 75.1%, respectively. Our results demonstrate the potential of using machine learning methods in conjunction with lexical, semantic and contextual features for automatic annotation of clinical interview transcripts with near-human accuracy. Copyright © 2016 Elsevier Inc. All rights reserved.
Developing a standard for de-identifying electronic patient records written in Swedish: precision, recall and F-measure in a manual and computerized annotation trial.

PubMed

Velupillai, Sumithra; Dalianis, Hercules; Hassel, Martin; Nilsson, Gunnar H

2009-12-01

Electronic patient records (EPRs) contain a large amount of information written in free text. This information is considered very valuable for research but is also very sensitive since the free text parts may contain information that could reveal the identity of a patient. Therefore, methods for de-identifying EPRs are needed. The work presented here aims to perform a manual and automatic Protected Health Information (PHI)-annotation trial for EPRs written in Swedish. This study consists of two main parts: the initial creation of a manually PHI-annotated gold standard, and the porting and evaluation of an existing de-identification software written for American English to Swedish in a preliminary automatic de-identification trial. Results are measured with precision, recall and F-measure. This study reports fairly high Inter-Annotator Agreement (IAA) results on the manually created gold standard, especially for specific tags such as names. The average IAA over all tags was 0.65 F-measure (0.84 F-measure highest pairwise agreement). For name tags the average IAA was 0.80 F-measure (0.91 F-measure highest pairwise agreement). Porting a de-identification software written for American English to Swedish directly was unfortunately non-trivial, yielding poor results. Developing gold standard sets as well as automatic systems for de-identification tasks in Swedish is feasible. However, discussions and definitions on identifiable information is needed, as well as further developments both on the tag sets and the annotation guidelines, in order to get a reliable gold standard. A completely new de-identification software needs to be developed.
Automatic medical image annotation and keyword-based image retrieval using relevance feedback.

PubMed

Ko, Byoung Chul; Lee, JiHyeon; Nam, Jae-Yeal

2012-08-01

This paper presents novel multiple keywords annotation for medical images, keyword-based medical image retrieval, and relevance feedback method for image retrieval for enhancing image retrieval performance. For semantic keyword annotation, this study proposes a novel medical image classification method combining local wavelet-based center symmetric-local binary patterns with random forests. For keyword-based image retrieval, our retrieval system use the confidence score that is assigned to each annotated keyword by combining probabilities of random forests with predefined body relation graph. To overcome the limitation of keyword-based image retrieval, we combine our image retrieval system with relevance feedback mechanism based on visual feature and pattern classifier. Compared with other annotation and relevance feedback algorithms, the proposed method shows both improved annotation performance and accurate retrieval results.
Optimized Graph Learning Using Partial Tags and Multiple Features for Image and Video Annotation.

PubMed

Song, Jingkuan; Gao, Lianli; Nie, Feiping; Shen, Heng Tao; Yan, Yan; Sebe, Nicu

2016-11-01

In multimedia annotation, due to the time constraints and the tediousness of manual tagging, it is quite common to utilize both tagged and untagged data to improve the performance of supervised learning when only limited tagged training data are available. This is often done by adding a geometry-based regularization term in the objective function of a supervised learning model. In this case, a similarity graph is indispensable to exploit the geometrical relationships among the training data points, and the graph construction scheme essentially determines the performance of these graph-based learning algorithms. However, most of the existing works construct the graph empirically and are usually based on a single feature without using the label information. In this paper, we propose a semi-supervised annotation approach by learning an optimized graph (OGL) from multi-cues (i.e., partial tags and multiple features), which can more accurately embed the relationships among the data points. Since OGL is a transductive method and cannot deal with novel data points, we further extend our model to address the out-of-sample issue. Extensive experiments on image and video annotation show the consistent superiority of OGL over the state-of-the-art methods.
Automatic measurement of voice onset time using discriminative structured prediction.

PubMed

Sonderegger, Morgan; Keshet, Joseph

2012-12-01

A discriminative large-margin algorithm for automatic measurement of voice onset time (VOT) is described, considered as a case of predicting structured output from speech. Manually labeled data are used to train a function that takes as input a speech segment of an arbitrary length containing a voiceless stop, and outputs its VOT. The function is explicitly trained to minimize the difference between predicted and manually measured VOT; it operates on a set of acoustic feature functions designed based on spectral and temporal cues used by human VOT annotators. The algorithm is applied to initial voiceless stops from four corpora, representing different types of speech. Using several evaluation methods, the algorithm's performance is near human intertranscriber reliability, and compares favorably with previous work. Furthermore, the algorithm's performance is minimally affected by training and testing on different corpora, and remains essentially constant as the amount of training data is reduced to 50-250 manually labeled examples, demonstrating the method's practical applicability to new datasets.
Smart Annotation of Cyclic Data Using Hierarchical Hidden Markov Models.

PubMed

Martindale, Christine F; Hoenig, Florian; Strohrmann, Christina; Eskofier, Bjoern M

2017-10-13

Cyclic signals are an intrinsic part of daily life, such as human motion and heart activity. The detailed analysis of them is important for clinical applications such as pathological gait analysis and for sports applications such as performance analysis. Labeled training data for algorithms that analyze these cyclic data come at a high annotation cost due to only limited annotations available under laboratory conditions or requiring manual segmentation of the data under less restricted conditions. This paper presents a smart annotation method that reduces this cost of labeling for sensor-based data, which is applicable to data collected outside of strict laboratory conditions. The method uses semi-supervised learning of sections of cyclic data with a known cycle number. A hierarchical hidden Markov model (hHMM) is used, achieving a mean absolute error of 0.041 ± 0.020 s relative to a manually-annotated reference. The resulting model was also used to simultaneously segment and classify continuous, 'in the wild' data, demonstrating the applicability of using hHMM, trained on limited data sections, to label a complete dataset. This technique achieved comparable results to its fully-supervised equivalent. Our semi-supervised method has the significant advantage of reduced annotation cost. Furthermore, it reduces the opportunity for human error in the labeling process normally required for training of segmentation algorithms. It also lowers the annotation cost of training a model capable of continuous monitoring of cycle characteristics such as those employed to analyze the progress of movement disorders or analysis of running technique.
Comparative Omics-Driven Genome Annotation Refinement: Application across Yersiniae

DOE Office of Scientific and Technical Information (OSTI.GOV)

Rutledge, Alexandra C.; Jones, Marcus B.; Chauhan, Sadhana

2012-03-27

Genome sequencing continues to be a rapidly evolving technology, yet most downstream aspects of genome annotation pipelines remain relatively stable or are even being abandoned. To date, the perceived value of manual curation for genome annotations is not offset by the real cost and time associated with the process. In order to balance the large number of sequences generated, the annotation process is now performed almost exclusively in an automated fashion for most genome sequencing projects. One possible way to reduce errors inherent to automated computational annotations is to apply data from 'omics' measurements (i.e. transcriptional and proteomic) to themore » un-annotated genome with a proteogenomic-based approach. This approach does require additional experimental and bioinformatics methods to include omics technologies; however, the approach is readily automatable and can benefit from rapid developments occurring in those research domains as well. The annotation process can be improved by experimental validation of transcription and translation and aid in the discovery of annotation errors. Here the concept of annotation refinement has been extended to include a comparative assessment of genomes across closely related species, as is becoming common in sequencing efforts. Transcriptomic and proteomic data derived from three highly similar pathogenic Yersiniae (Y. pestis CO92, Y. pestis pestoides F, and Y. pseudotuberculosis PB1/+) was used to demonstrate a comprehensive comparative omic-based annotation methodology. Peptide and oligo measurements experimentally validated the expression of nearly 40% of each strain's predicted proteome and revealed the identification of 28 novel and 68 previously incorrect protein-coding sequences (e.g., observed frameshifts, extended start sites, and translated pseudogenes) within the three current Yersinia genome annotations. Gene loss is presumed to play a major role in Y. pestis acquiring its niche as a virulent pathogen, thus the discovery of many translated pseudogenes underscores a need for functional analyses to investigate hypotheses related to divergence. Refinements included the discovery of a seemingly essential ribosomal protein, several virulence-associated factors, and a transcriptional regulator, among other proteins, most of which are annotated as hypothetical, that were missed during annotation.« less
Application of a semi-automatic cartilage segmentation method for biomechanical modeling of the knee joint.

PubMed

Liukkonen, Mimmi K; Mononen, Mika E; Tanska, Petri; Saarakkala, Simo; Nieminen, Miika T; Korhonen, Rami K

2017-10-01

Manual segmentation of articular cartilage from knee joint 3D magnetic resonance images (MRI) is a time consuming and laborious task. Thus, automatic methods are needed for faster and reproducible segmentations. In the present study, we developed a semi-automatic segmentation method based on radial intensity profiles to generate 3D geometries of knee joint cartilage which were then used in computational biomechanical models of the knee joint. Six healthy volunteers were imaged with a 3T MRI device and their knee cartilages were segmented both manually and semi-automatically. The values of cartilage thicknesses and volumes produced by these two methods were compared. Furthermore, the influences of possible geometrical differences on cartilage stresses and strains in the knee were evaluated with finite element modeling. The semi-automatic segmentation and 3D geometry construction of one knee joint (menisci, femoral and tibial cartilages) was approximately two times faster than with manual segmentation. Differences in cartilage thicknesses, volumes, contact pressures, stresses, and strains between segmentation methods in femoral and tibial cartilage were mostly insignificant (p > 0.05) and random, i.e. there were no systematic differences between the methods. In conclusion, the devised semi-automatic segmentation method is a quick and accurate way to determine cartilage geometries; it may become a valuable tool for biomechanical modeling applications with large patient groups.
Transcription and Annotation of a Japanese Accented Spoken Corpus of L2 Spanish for the Development of CAPT Applications

ERIC Educational Resources Information Center

Carranza, Mario

2016-01-01

This paper addresses the process of transcribing and annotating spontaneous non-native speech with the aim of compiling a training corpus for the development of Computer Assisted Pronunciation Training (CAPT) applications, enhanced with Automatic Speech Recognition (ASR) technology. To better adapt ASR technology to CAPT tools, the recognition…
Three-dimensional reconstruction from serial sections in PC-Windows platform by using 3D_Viewer.

PubMed

Xu, Yi-Hua; Lahvis, Garet; Edwards, Harlene; Pitot, Henry C

2004-11-01

Three-dimensional (3D) reconstruction from serial sections allows identification of objects of interest in 3D and clarifies the relationship among these objects. 3D_Viewer, developed in our laboratory for this purpose, has four major functions: image alignment, movie frame production, movie viewing, and shift-overlay image generation. Color images captured from serial sections were aligned; then the contours of objects of interest were highlighted in a semi-automatic manner. These 2D images were then automatically stacked at different viewing angles, and their composite images on a projected plane were recorded by an image transform-shift-overlay technique. These composition images are used in the object-rotation movie show. The design considerations of the program and the procedures used for 3D reconstruction from serial sections are described. This program, with a digital image-capture system, a semi-automatic contours highlight method, and an automatic image transform-shift-overlay technique, greatly speeds up the reconstruction process. Since images generated by 3D_Viewer are in a general graphic format, data sharing with others is easy. 3D_Viewer is written in MS Visual Basic 6, obtainable from our laboratory on request.
Efficient Semi-Automatic 3D Segmentation for Neuron Tracing in Electron Microscopy Images

PubMed Central

Jones, Cory; Liu, Ting; Cohan, Nathaniel Wood; Ellisman, Mark; Tasdizen, Tolga

2015-01-01

0.1. Background In the area of connectomics, there is a significant gap between the time required for data acquisition and dense reconstruction of the neural processes contained in the same dataset. Automatic methods are able to eliminate this timing gap, but the state-of-the-art accuracy so far is insufficient for use without user corrections. If completed naively, this process of correction can be tedious and time consuming. 0.2. New Method We present a new semi-automatic method that can be used to perform 3D segmentation of neurites in EM image stacks. It utilizes an automatic method that creates a hierarchical structure for recommended merges of superpixels. The user is then guided through each predicted region to quickly identify errors and establish correct links. 0.3. Results We tested our method on three datasets with both novice and expert users. Accuracy and timing were compared with published automatic, semi-automatic, and manual results. 0.4. Comparison with Existing Methods Post-automatic correction methods have also been used in [1] and [2]. These methods do not provide navigation or suggestions in the manner we present. Other semi-automatic methods require user input prior to the automatic segmentation such as [3] and [4] and are inherently different than our method. 0.5. Conclusion Using this method on the three datasets, novice users achieved accuracy exceeding state-of-the-art automatic results, and expert users achieved accuracy on par with full manual labeling but with a 70% time improvement when compared with other examples in publication. PMID:25769273
Incorporating Semantics into Data Driven Workflows for Content Based Analysis

NASA Astrophysics Data System (ADS)

Argüello, M.; Fernandez-Prieto, M. J.

Finding meaningful associations between text elements and knowledge structures within clinical narratives in a highly verbal domain, such as psychiatry, is a challenging goal. The research presented here uses a small corpus of case histories and brings into play pre-existing knowledge, and therefore, complements other approaches that use large corpus (millions of words) and no pre-existing knowledge. The paper describes a variety of experiments for content-based analysis: Linguistic Analysis using NLP-oriented approaches, Sentiment Analysis, and Semantically Meaningful Analysis. Although it is not standard practice, the paper advocates providing automatic support to annotate the functionality as well as the data for each experiment by performing semantic annotation that uses OWL and OWL-S. Lessons learnt can be transmitted to legacy clinical databases facing the conversion of clinical narratives according to prominent Electronic Health Records standards.
MIPS: a database for genomes and protein sequences

PubMed Central

Mewes, H. W.; Frishman, D.; Güldener, U.; Mannhaupt, G.; Mayer, K.; Mokrejs, M.; Morgenstern, B.; Münsterkötter, M.; Rudd, S.; Weil, B.

2002-01-01

The Munich Information Center for Protein Sequences (MIPS-GSF, Neuherberg, Germany) continues to provide genome-related information in a systematic way. MIPS supports both national and European sequencing and functional analysis projects, develops and maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences, and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the databases for the comprehensive set of genomes (PEDANT genomes), the database of annotated human EST clusters (HIB), the database of complete cDNAs from the DHGP (German Human Genome Project), as well as the project specific databases for the GABI (Genome Analysis in Plants) and HNB (Helmholtz–Netzwerk Bioinformatik) networks. The Arabidospsis thaliana database (MATDB), the database of mitochondrial proteins (MITOP) and our contribution to the PIR International Protein Sequence Database have been described elsewhere [Schoof et al. (2002) Nucleic Acids Res., 30, 91–93; Scharfe et al. (2000) Nucleic Acids Res., 28, 155–158; Barker et al. (2001) Nucleic Acids Res., 29, 29–32]. All databases described, the protein analysis tools provided and the detailed descriptions of our projects can be accessed through the MIPS World Wide Web server (http://mips.gsf.de). PMID:11752246
MIPS: a database for genomes and protein sequences.

PubMed

Mewes, H W; Frishman, D; Güldener, U; Mannhaupt, G; Mayer, K; Mokrejs, M; Morgenstern, B; Münsterkötter, M; Rudd, S; Weil, B

2002-01-01

The Munich Information Center for Protein Sequences (MIPS-GSF, Neuherberg, Germany) continues to provide genome-related information in a systematic way. MIPS supports both national and European sequencing and functional analysis projects, develops and maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences, and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the databases for the comprehensive set of genomes (PEDANT genomes), the database of annotated human EST clusters (HIB), the database of complete cDNAs from the DHGP (German Human Genome Project), as well as the project specific databases for the GABI (Genome Analysis in Plants) and HNB (Helmholtz-Netzwerk Bioinformatik) networks. The Arabidospsis thaliana database (MATDB), the database of mitochondrial proteins (MITOP) and our contribution to the PIR International Protein Sequence Database have been described elsewhere [Schoof et al. (2002) Nucleic Acids Res., 30, 91-93; Scharfe et al. (2000) Nucleic Acids Res., 28, 155-158; Barker et al. (2001) Nucleic Acids Res., 29, 29-32]. All databases described, the protein analysis tools provided and the detailed descriptions of our projects can be accessed through the MIPS World Wide Web server (http://mips.gsf.de).
WikiGenomes: an open web application for community consumption and curation of gene annotation data in Wikidata

DOE Office of Scientific and Technical Information (OSTI.GOV)

Putman, Tim E.; Lelong, Sebastien; Burgstaller-Muehlbacher, Sebastian

With the advancement of genome-sequencing technologies, new genomes are being sequenced daily. Although these sequences are deposited in publicly available data warehouses, their functional and genomic annotations (beyond genes which are predicted automatically) mostly reside in the text of primary publications. Professional curators are hard at work extracting those annotations from the literature for the most studied organisms and depositing them in structured databases. However, the resources don’t exist to fund the comprehensive curation of the thousands of newly sequenced organisms in this manner. Here, we describe WikiGenomes (wikigenomes.org), a web application that facilitates the consumption and curation of genomicmore » data by the entire scientific community. WikiGenomes is based on Wikidata, an openly editable knowledge graph with the goal of aggregating published knowledge into a free and open database. WikiGenomes empowers the individual genomic researcher to contribute their expertise to the curation effort and integrates the knowledge into Wikidata, enabling it to be accessed by anyone without restriction.« less
WikiGenomes: an open web application for community consumption and curation of gene annotation data in Wikidata

DOE PAGES

Putman, Tim E.; Lelong, Sebastien; Burgstaller-Muehlbacher, Sebastian; ...

2017-03-06

With the advancement of genome-sequencing technologies, new genomes are being sequenced daily. Although these sequences are deposited in publicly available data warehouses, their functional and genomic annotations (beyond genes which are predicted automatically) mostly reside in the text of primary publications. Professional curators are hard at work extracting those annotations from the literature for the most studied organisms and depositing them in structured databases. However, the resources don’t exist to fund the comprehensive curation of the thousands of newly sequenced organisms in this manner. Here, we describe WikiGenomes (wikigenomes.org), a web application that facilitates the consumption and curation of genomicmore » data by the entire scientific community. WikiGenomes is based on Wikidata, an openly editable knowledge graph with the goal of aggregating published knowledge into a free and open database. WikiGenomes empowers the individual genomic researcher to contribute their expertise to the curation effort and integrates the knowledge into Wikidata, enabling it to be accessed by anyone without restriction.« less
Workflow and web application for annotating NCBI BioProject transcriptome data.

PubMed

Vera Alvarez, Roberto; Medeiros Vidal, Newton; Garzón-Martínez, Gina A; Barrero, Luz S; Landsman, David; Mariño-Ramírez, Leonardo

2017-01-01

The volume of transcriptome data is growing exponentially due to rapid improvement of experimental technologies. In response, large central resources such as those of the National Center for Biotechnology Information (NCBI) are continually adapting their computational infrastructure to accommodate this large influx of data. New and specialized databases, such as Transcriptome Shotgun Assembly Sequence Database (TSA) and Sequence Read Archive (SRA), have been created to aid the development and expansion of centralized repositories. Although the central resource databases are under continual development, they do not include automatic pipelines to increase annotation of newly deposited data. Therefore, third-party applications are required to achieve that aim. Here, we present an automatic workflow and web application for the annotation of transcriptome data. The workflow creates secondary data such as sequencing reads and BLAST alignments, which are available through the web application. They are based on freely available bioinformatics tools and scripts developed in-house. The interactive web application provides a search engine and several browser utilities. Graphical views of transcript alignments are available through SeqViewer, an embedded tool developed by NCBI for viewing biological sequence data. The web application is tightly integrated with other NCBI web applications and tools to extend the functionality of data processing and interconnectivity. We present a case study for the species Physalis peruviana with data generated from BioProject ID 67621. URL: http://www.ncbi.nlm.nih.gov/projects/physalis/. Published by Oxford University Press 2017. This work is written by US Government employees and is in the public domain in the US.
Improvement of the banana "Musa acuminata" reference sequence using NGS data and semi-automated bioinformatics methods.

PubMed

Martin, Guillaume; Baurens, Franc-Christophe; Droc, Gaëtan; Rouard, Mathieu; Cenci, Alberto; Kilian, Andrzej; Hastie, Alex; Doležel, Jaroslav; Aury, Jean-Marc; Alberti, Adriana; Carreel, Françoise; D'Hont, Angélique

2016-03-16

Recent advances in genomics indicate functional significance of a majority of genome sequences and their long range interactions. As a detailed examination of genome organization and function requires very high quality genome sequence, the objective of this study was to improve reference genome assembly of banana (Musa acuminata). We have developed a modular bioinformatics pipeline to improve genome sequence assemblies, which can handle various types of data. The pipeline comprises several semi-automated tools. However, unlike classical automated tools that are based on global parameters, the semi-automated tools proposed an expert mode for a user who can decide on suggested improvements through local compromises. The pipeline was used to improve the draft genome sequence of Musa acuminata. Genotyping by sequencing (GBS) of a segregating population and paired-end sequencing were used to detect and correct scaffold misassemblies. Long insert size paired-end reads identified scaffold junctions and fusions missed by automated assembly methods. GBS markers were used to anchor scaffolds to pseudo-molecules with a new bioinformatics approach that avoids the tedious step of marker ordering during genetic map construction. Furthermore, a genome map was constructed and used to assemble scaffolds into super scaffolds. Finally, a consensus gene annotation was projected on the new assembly from two pre-existing annotations. This approach reduced the total Musa scaffold number from 7513 to 1532 (i.e. by 80%), with an N50 that increased from 1.3 Mb (65 scaffolds) to 3.0 Mb (26 scaffolds). 89.5% of the assembly was anchored to the 11 Musa chromosomes compared to the previous 70%. Unknown sites (N) were reduced from 17.3 to 10.0%. The release of the Musa acuminata reference genome version 2 provides a platform for detailed analysis of banana genome variation, function and evolution. Bioinformatics tools developed in this work can be used to improve genome sequence assemblies in other species.
Automatic segmentation of triaxial accelerometry signals for falls risk estimation.

PubMed

Redmond, Stephen J; Scalzi, Maria Elena; Narayanan, Michael R; Lord, Stephen R; Cerutti, Sergio; Lovell, Nigel H

2010-01-01

Falls-related injuries in the elderly population represent one of the most significant contributors to rising health care expense in developed countries. In recent years, falls detection technologies have become more common. However, very few have adopted a preferable falls prevention strategy through unsupervised monitoring in the free-living environment. The basis of the monitoring described herein was a self-administered directed-routine (DR) comprising three separate tests measured by way of a waist-mounted triaxial accelerometer. Using features extracted from the manually segmented signals, a reasonable estimate of falls risk can be achieved. We describe here a series of algorithms for automatically segmenting these recordings, enabling the use of the DR assessment in the unsupervised and home environments. The accelerometry signals, from 68 subjects performing the DR, were manually annotated by an observer. Using the proposed signal segmentation routines, an good agreement was observed between the manually annotated markers and the automatically estimated values. However, a decrease in the correlation with falls risk to 0.73 was observed using the automatic segmentation, compared to 0.81 when using markers manually placed by an observer.

An orthology-based analysis of pathogenic protozoa impacting global health: an improved comparative genomics approach with prokaryotes and model eukaryote orthologs.

PubMed

Cuadrat, Rafael R C; da Serra Cruz, Sérgio Manuel; Tschoeke, Diogo Antônio; Silva, Edno; Tosta, Frederico; Jucá, Henrique; Jardim, Rodrigo; Campos, Maria Luiza M; Mattoso, Marta; Dávila, Alberto M R

2014-08-01

A key focus in 21(st) century integrative biology and drug discovery for neglected tropical and other diseases has been the use of BLAST-based computational methods for identification of orthologous groups in pathogenic organisms to discern orthologs, with a view to evaluate similarities and differences among species, and thus allow the transfer of annotation from known/curated proteins to new/non-annotated ones. We used here a profile-based sensitive methodology to identify distant homologs, coupled to the NCBI's COG (Unicellular orthologs) and KOG (Eukaryote orthologs), permitting us to perform comparative genomics analyses on five protozoan genomes. OrthoSearch was used in five protozoan proteomes showing that 3901 and 7473 orthologs can be identified by comparison with COG and KOG proteomes, respectively. The core protozoa proteome inferred was 418 Protozoa-COG orthologous groups and 704 Protozoa-KOG orthologous groups: (i) 31.58% (132/418) belongs to the category J (translation, ribosomal structure, and biogenesis), and 9.81% (41/418) to the category O (post-translational modification, protein turnover, chaperones) using COG; (ii) 21.45% (151/704) belongs to the categories J, and 13.92% (98/704) to the O using KOG. The phylogenomic analysis showed four well-supported clades for Eukarya, discriminating Multicellular [(i) human, fly, plant and worm] and Unicellular [(ii) yeast, (iii) fungi, and (iv) protozoa] species. These encouraging results attest to the usefulness of the profile-based methodology for comparative genomics to accelerate semi-automatic re-annotation, especially of the protozoan proteomes. This approach may also lend itself for applications in global health, for example, in the case of novel drug target discovery against pathogenic organisms previously considered difficult to research with traditional drug discovery tools.
An Orthology-Based Analysis of Pathogenic Protozoa Impacting Global Health: An Improved Comparative Genomics Approach with Prokaryotes and Model Eukaryote Orthologs

PubMed Central

Cuadrat, Rafael R. C.; da Serra Cruz, Sérgio Manuel; Tschoeke, Diogo Antônio; Silva, Edno; Tosta, Frederico; Jucá, Henrique; Jardim, Rodrigo; Campos, Maria Luiza M.; Mattoso, Marta

2014-01-01

Abstract A key focus in 21st century integrative biology and drug discovery for neglected tropical and other diseases has been the use of BLAST-based computational methods for identification of orthologous groups in pathogenic organisms to discern orthologs, with a view to evaluate similarities and differences among species, and thus allow the transfer of annotation from known/curated proteins to new/non-annotated ones. We used here a profile-based sensitive methodology to identify distant homologs, coupled to the NCBI's COG (Unicellular orthologs) and KOG (Eukaryote orthologs), permitting us to perform comparative genomics analyses on five protozoan genomes. OrthoSearch was used in five protozoan proteomes showing that 3901 and 7473 orthologs can be identified by comparison with COG and KOG proteomes, respectively. The core protozoa proteome inferred was 418 Protozoa-COG orthologous groups and 704 Protozoa-KOG orthologous groups: (i) 31.58% (132/418) belongs to the category J (translation, ribosomal structure, and biogenesis), and 9.81% (41/418) to the category O (post-translational modification, protein turnover, chaperones) using COG; (ii) 21.45% (151/704) belongs to the categories J, and 13.92% (98/704) to the O using KOG. The phylogenomic analysis showed four well-supported clades for Eukarya, discriminating Multicellular [(i) human, fly, plant and worm] and Unicellular [(ii) yeast, (iii) fungi, and (iv) protozoa] species. These encouraging results attest to the usefulness of the profile-based methodology for comparative genomics to accelerate semi-automatic re-annotation, especially of the protozoan proteomes. This approach may also lend itself for applications in global health, for example, in the case of novel drug target discovery against pathogenic organisms previously considered difficult to research with traditional drug discovery tools. PMID:24960463
Argo: an integrative, interactive, text mining-based workbench supporting curation

PubMed Central

Rak, Rafal; Rowley, Andrew; Black, William; Ananiadou, Sophia

2012-01-01

Curation of biomedical literature is often supported by the automatic analysis of textual content that generally involves a sequence of individual processing components. Text mining (TM) has been used to enhance the process of manual biocuration, but has been focused on specific databases and tasks rather than an environment integrating TM tools into the curation pipeline, catering for a variety of tasks, types of information and applications. Processing components usually come from different sources and often lack interoperability. The well established Unstructured Information Management Architecture is a framework that addresses interoperability by defining common data structures and interfaces. However, most of the efforts are targeted towards software developers and are not suitable for curators, or are otherwise inconvenient to use on a higher level of abstraction. To overcome these issues we introduce Argo, an interoperable, integrative, interactive and collaborative system for text analysis with a convenient graphic user interface to ease the development of processing workflows and boost productivity in labour-intensive manual curation. Robust, scalable text analytics follow a modular approach, adopting component modules for distinct levels of text analysis. The user interface is available entirely through a web browser that saves the user from going through often complicated and platform-dependent installation procedures. Argo comes with a predefined set of processing components commonly used in text analysis, while giving the users the ability to deposit their own components. The system accommodates various areas and levels of user expertise, from TM and computational linguistics to ontology-based curation. One of the key functionalities of Argo is its ability to seamlessly incorporate user-interactive components, such as manual annotation editors, into otherwise completely automatic pipelines. As a use case, we demonstrate the functionality of an in-built manual annotation editor that is well suited for in-text corpus annotation tasks. Database URL: http://www.nactem.ac.uk/Argo PMID:22434844
An evaluation of automatic coronary artery calcium scoring methods with cardiac CT using the orCaScore framework.

PubMed

Wolterink, Jelmer M; Leiner, Tim; de Vos, Bob D; Coatrieux, Jean-Louis; Kelm, B Michael; Kondo, Satoshi; Salgado, Rodrigo A; Shahzad, Rahil; Shu, Huazhong; Snoeren, Miranda; Takx, Richard A P; van Vliet, Lucas J; van Walsum, Theo; Willems, Tineke P; Yang, Guanyu; Zheng, Yefeng; Viergever, Max A; Išgum, Ivana

2016-05-01

The amount of coronary artery calcification (CAC) is a strong and independent predictor of cardiovascular disease (CVD) events. In clinical practice, CAC is manually identified and automatically quantified in cardiac CT using commercially available software. This is a tedious and time-consuming process in large-scale studies. Therefore, a number of automatic methods that require no interaction and semiautomatic methods that require very limited interaction for the identification of CAC in cardiac CT have been proposed. Thus far, a comparison of their performance has been lacking. The objective of this study was to perform an independent evaluation of (semi)automatic methods for CAC scoring in cardiac CT using a publicly available standardized framework. Cardiac CT exams of 72 patients distributed over four CVD risk categories were provided for (semi)automatic CAC scoring. Each exam consisted of a noncontrast-enhanced calcium scoring CT (CSCT) and a corresponding coronary CT angiography (CCTA) scan. The exams were acquired in four different hospitals using state-of-the-art equipment from four major CT scanner vendors. The data were divided into 32 training exams and 40 test exams. A reference standard for CAC in CSCT was defined by consensus of two experts following a clinical protocol. The framework organizers evaluated the performance of (semi)automatic methods on test CSCT scans, per lesion, artery, and patient. Five (semi)automatic methods were evaluated. Four methods used both CSCT and CCTA to identify CAC, and one method used only CSCT. The evaluated methods correctly detected between 52% and 94% of CAC lesions with positive predictive values between 65% and 96%. Lesions in distal coronary arteries were most commonly missed and aortic calcifications close to the coronary ostia were the most common false positive errors. The majority (between 88% and 98%) of correctly identified CAC lesions were assigned to the correct artery. Linearly weighted Cohen's kappa for patient CVD risk categorization by the evaluated methods ranged from 0.80 to 1.00. A publicly available standardized framework for the evaluation of (semi)automatic methods for CAC identification in cardiac CT is described. An evaluation of five (semi)automatic methods within this framework shows that automatic per patient CVD risk categorization is feasible. CAC lesions at ambiguous locations such as the coronary ostia remain challenging, but their detection had limited impact on CVD risk determination.
Annotated chemical patent corpus: a gold standard for text mining.

PubMed

Akhondi, Saber A; Klenner, Alexander G; Tyrchan, Christian; Manchala, Anil K; Boppana, Kiran; Lowe, Daniel; Zimmermann, Marc; Jagarlapudi, Sarma A R P; Sayle, Roger; Kors, Jan A; Muresan, Sorel

2014-01-01

Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line break due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at www.biosemantics.org.
Annotated Chemical Patent Corpus: A Gold Standard for Text Mining

PubMed Central

Akhondi, Saber A.; Klenner, Alexander G.; Tyrchan, Christian; Manchala, Anil K.; Boppana, Kiran; Lowe, Daniel; Zimmermann, Marc; Jagarlapudi, Sarma A. R. P.; Sayle, Roger; Kors, Jan A.; Muresan, Sorel

2014-01-01

Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line break due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at www.biosemantics.org. PMID:25268232
Annotation Graphs: A Graph-Based Visualization for Meta-Analysis of Data Based on User-Authored Annotations.

PubMed

Zhao, Jian; Glueck, Michael; Breslav, Simon; Chevalier, Fanny; Khan, Azam

2017-01-01

User-authored annotations of data can support analysts in the activity of hypothesis generation and sensemaking, where it is not only critical to document key observations, but also to communicate insights between analysts. We present annotation graphs, a dynamic graph visualization that enables meta-analysis of data based on user-authored annotations. The annotation graph topology encodes annotation semantics, which describe the content of and relations between data selections, comments, and tags. We present a mixed-initiative approach to graph layout that integrates an analyst's manual manipulations with an automatic method based on similarity inferred from the annotation semantics. Various visual graph layout styles reveal different perspectives on the annotation semantics. Annotation graphs are implemented within C8, a system that supports authoring annotations during exploratory analysis of a dataset. We apply principles of Exploratory Sequential Data Analysis (ESDA) in designing C8, and further link these to an existing task typology in the visualization literature. We develop and evaluate the system through an iterative user-centered design process with three experts, situated in the domain of analyzing HCI experiment data. The results suggest that annotation graphs are effective as a method of visually extending user-authored annotations to data meta-analysis for discovery and organization of ideas.
Semi-automatic, octave-spanning optical frequency counter.

PubMed

Liu, Tze-An; Shu, Ren-Huei; Peng, Jin-Long

2008-07-07

This work presents and demonstrates a semi-automatic optical frequency counter with octave-spanning counting capability using two fiber laser combs operated at different repetition rates. Monochromators are utilized to provide an approximate frequency of the laser under measurement to determine the mode number difference between the two laser combs. The exact mode number of the beating comb line is obtained from the mode number difference and the measured beat frequencies. The entire measurement process, except the frequency stabilization of the laser combs and the optimization of the beat signal-to-noise ratio, is controlled by a computer running a semi-automatic optical frequency counter.
A dorsolateral prefrontal cortex semi-automatic segmenter

NASA Astrophysics Data System (ADS)

Al-Hakim, Ramsey; Fallon, James; Nain, Delphine; Melonakos, John; Tannenbaum, Allen

2006-03-01

Structural, functional, and clinical studies in schizophrenia have, for several decades, consistently implicated dysfunction of the prefrontal cortex in the etiology of the disease. Functional and structural imaging studies, combined with clinical, psychometric, and genetic analyses in schizophrenia have confirmed the key roles played by the prefrontal cortex and closely linked "prefrontal system" structures such as the striatum, amygdala, mediodorsal thalamus, substantia nigra-ventral tegmental area, and anterior cingulate cortices. The nodal structure of the prefrontal system circuit is the dorsal lateral prefrontal cortex (DLPFC), or Brodmann area 46, which also appears to be the most commonly studied and cited brain area with respect to schizophrenia. 1, 2, 3, 4 In 1986, Weinberger et. al. tied cerebral blood flow in the DLPFC to schizophrenia.1 In 2001, Perlstein et. al. demonstrated that DLPFC activation is essential for working memory tasks commonly deficient in schizophrenia. 2 More recently, groups have linked morphological changes due to gene deletion and increased DLPFC glutamate concentration to schizophrenia. 3, 4 Despite the experimental and clinical focus on the DLPFC in structural and functional imaging, the variability of the location of this area, differences in opinion on exactly what constitutes DLPFC, and inherent difficulties in segmenting this highly convoluted cortical region have contributed to a lack of widely used standards for manual or semi-automated segmentation programs. Given these implications, we developed a semi-automatic tool to segment the DLPFC from brain MRI scans in a reproducible way to conduct further morphological and statistical studies. The segmenter is based on expert neuroanatomist rules (Fallon-Kindermann rules), inspired by cytoarchitectonic data and reconstructions presented by Rajkowska and Goldman-Rakic. 5 It is semi-automated to provide essential user interactivity. We present our results and provide details on our DLPFC open-source tool.
Accurate and consistent automatic seismocardiogram annotation without concurrent ECG.

PubMed

Laurin, A; Khosrow-Khavar, F; Blaber, A P; Tavakolian, Kouhyar

2016-09-01

Seismocardiography (SCG) is the measurement of vibrations in the sternum caused by the beating of the heart. Precise cardiac mechanical timings that are easily obtained from SCG are critically dependent on accurate identification of fiducial points. So far, SCG annotation has relied on concurrent ECG measurements. An algorithm capable of annotating SCG without the use any other concurrent measurement was designed. We subjected 18 participants to graded lower body negative pressure. We collected ECG and SCG, obtained R peaks from the former, and annotated the latter by hand, using these identified peaks. We also annotated the SCG automatically. We compared the isovolumic moment timings obtained by hand to those obtained using our algorithm. Mean ± confidence interval of the percentage of accurately annotated cardiac cycles were [Formula: see text], [Formula: see text], [Formula: see text], [Formula: see text], and [Formula: see text] for levels of negative pressure 0, -20, -30, -40, and -50 mmHg. LF/HF ratios, the relative power of low-frequency variations to high-frequency variations in heart beat intervals, obtained from isovolumic moments were also compared to those obtained from R peaks. The mean differences ± confidence interval were [Formula: see text], [Formula: see text], [Formula: see text], [Formula: see text], and [Formula: see text] for increasing levels of negative pressure. The accuracy and consistency of the algorithm enables the use of SCG as a stand-alone heart monitoring tool in healthy individuals at rest, and could serve as a basis for an eventual application in pathological cases.
HHsvm: fast and accurate classification of profile–profile matches identified by HHsearch

PubMed Central

Dlakić, Mensur

2009-01-01

Motivation: Recently developed profile–profile methods rival structural comparisons in their ability to detect homology between distantly related proteins. Despite this tremendous progress, many genuine relationships between protein families cannot be recognized as comparisons of their profiles result in scores that are statistically insignificant. Results: Using known evolutionary relationships among protein superfamilies in SCOP database, support vector machines were trained on four sets of discriminatory features derived from the output of HHsearch. Upon validation, it was shown that the automatic classification of all profile–profile matches was superior to fixed threshold-based annotation in terms of sensitivity and specificity. The effectiveness of this approach was demonstrated by annotating several domains of unknown function from the Pfam database. Availability: Programs and scripts implementing the methods described in this manuscript are freely available from http://hhsvm.dlakiclab.org/. Contact: mdlakic@montana.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:19773335
Real-time image annotation by manifold-based biased Fisher discriminant analysis

NASA Astrophysics Data System (ADS)

Ji, Rongrong; Yao, Hongxun; Wang, Jicheng; Sun, Xiaoshuai; Liu, Xianming

2008-01-01

Automatic Linguistic Annotation is a promising solution to bridge the semantic gap in content-based image retrieval. However, two crucial issues are not well addressed in state-of-art annotation algorithms: 1. The Small Sample Size (3S) problem in keyword classifier/model learning; 2. Most of annotation algorithms can not extend to real-time online usage due to their low computational efficiencies. This paper presents a novel Manifold-based Biased Fisher Discriminant Analysis (MBFDA) algorithm to address these two issues by transductive semantic learning and keyword filtering. To address the 3S problem, Co-Training based Manifold learning is adopted for keyword model construction. To achieve real-time annotation, a Bias Fisher Discriminant Analysis (BFDA) based semantic feature reduction algorithm is presented for keyword confidence discrimination and semantic feature reduction. Different from all existing annotation methods, MBFDA views image annotation from a novel Eigen semantic feature (which corresponds to keywords) selection aspect. As demonstrated in experiments, our manifold-based biased Fisher discriminant analysis annotation algorithm outperforms classical and state-of-art annotation methods (1.K-NN Expansion; 2.One-to-All SVM; 3.PWC-SVM) in both computational time and annotation accuracy with a large margin.
compMS2Miner: An Automatable Metabolite Identification, Visualization, and Data-Sharing R Package for High-Resolution LC-MS Data Sets.

PubMed

Edmands, William M B; Petrick, Lauren; Barupal, Dinesh K; Scalbert, Augustin; Wilson, Mark J; Wickliffe, Jeffrey K; Rappaport, Stephen M

2017-04-04

A long-standing challenge of untargeted metabolomic profiling by ultrahigh-performance liquid chromatography-high-resolution mass spectrometry (UHPLC-HRMS) is efficient transition from unknown mass spectral features to confident metabolite annotations. The compMS 2 Miner (Comprehensive MS 2 Miner) package was developed in the R language to facilitate rapid, comprehensive feature annotation using a peak-picker-output and MS 2 data files as inputs. The number of MS 2 spectra that can be collected during a metabolomic profiling experiment far outweigh the amount of time required for pain-staking manual interpretation; therefore, a degree of software workflow autonomy is required for broad-scale metabolite annotation. CompMS 2 Miner integrates many useful tools in a single workflow for metabolite annotation and also provides a means to overview the MS 2 data with a Web application GUI compMS 2 Explorer (Comprehensive MS 2 Explorer) that also facilitates data-sharing and transparency. The automatable compMS 2 Miner workflow consists of the following steps: (i) matching unknown MS 1 features to precursor MS 2 scans, (ii) filtration of spectral noise (dynamic noise filter), (iii) generation of composite mass spectra by multiple similar spectrum signal summation and redundant/contaminant spectra removal, (iv) interpretation of possible fragment ion substructure using an internal database, (v) annotation of unknowns with chemical and spectral databases with prediction of mammalian biotransformation metabolites, wrapper functions for in silico fragmentation software, nearest neighbor chemical similarity scoring, random forest based retention time prediction, text-mining based false positive removal/true positive ranking, chemical taxonomic prediction and differential evolution based global annotation score optimization, and (vi) network graph visualizations, data curation, and sharing are made possible via the compMS 2 Explorer application. Metabolite identities and comments can also be recorded using an interactive table within compMS 2 Explorer. The utility of the package is illustrated with a data set of blood serum samples from 7 diet induced obese (DIO) and 7 nonobese (NO) C57BL/6J mice, which were also treated with an antibiotic (streptomycin) to knockdown the gut microbiota. The results of fully autonomous and objective usage of compMS 2 Miner are presented here. All automatically annotated spectra output by the workflow are provided in the Supporting Information and can alternatively be explored as publically available compMS 2 Explorer applications for both positive and negative modes ( https://wmbedmands.shinyapps.io/compMS2_mouseSera_POS and https://wmbedmands.shinyapps.io/compMS2_mouseSera_NEG ). The workflow provided rapid annotation of a diversity of endogenous and gut microbially derived metabolites affected by both diet and antibiotic treatment, which conformed to previously published reports. Composite spectra (n = 173) were autonomously matched to entries of the Massbank of North America (MoNA) spectral repository. These experimental and virtual (lipidBlast) spectra corresponded to 29 common endogenous compound classes (e.g., 51 lysophosphatidylcholines spectra) and were then used to calculate the ranking capability of 7 individual scoring metrics. It was found that an average of the 7 individual scoring metrics provided the most effective weighted average ranking ability of 3 for the MoNA matched spectra in spite of potential risk of false positive annotations emerging from automation. Minor structural differences such as relative carbon-carbon double bond positions were found in several cases to affect the correct rank of the MoNA annotated metabolite. The latest release and an example workflow is available in the package vignette ( https://github.com/WMBEdmands/compMS2Miner ) and a version of the published application is available on the shinyapps.io site ( https://wmbedmands.shinyapps.io/compMS2Example ).
POPCORN: a Supervisory Control Simulation for Workload and Performance Research

NASA Technical Reports Server (NTRS)

Hart, S. G.; Battiste, V.; Lester, P. T.

1984-01-01

A multi-task simulation of a semi-automatic supervisory control system was developed to provide an environment in which training, operator strategy development, failure detection and resolution, levels of automation, and operator workload can be investigated. The goal was to develop a well-defined, but realistically complex, task that would lend itself to model-based analysis. The name of the task (POPCORN) reflects the visual display that depicts different task elements milling around waiting to be released and pop out to be performed. The operator's task was to complete each of 100 task elements that ere represented by different symbols, by selecting a target task and entering the desired a command. The simulated automatic system then completed the selected function automatically. Highly significant differences in performance, strategy, and rated workload were found as a function of all experimental manipulations (except reward/penalty).
TRAP: automated classification, quantification and annotation of tandemly repeated sequences.

PubMed

Sobreira, Tiago José P; Durham, Alan M; Gruber, Arthur

2006-02-01

TRAP, the Tandem Repeats Analysis Program, is a Perl program that provides a unified set of analyses for the selection, classification, quantification and automated annotation of tandemly repeated sequences. TRAP uses the results of the Tandem Repeats Finder program to perform a global analysis of the satellite content of DNA sequences, permitting researchers to easily assess the tandem repeat content for both individual sequences and whole genomes. The results can be generated in convenient formats such as HTML and comma-separated values. TRAP can also be used to automatically generate annotation data in the format of feature table and GFF files.
Calling on a million minds for community annotation in WikiProteins

PubMed Central

Mons, Barend; Ashburner, Michael; Chichester, Christine; van Mulligen, Erik; Weeber, Marc; den Dunnen, Johan; van Ommen, Gert-Jan; Musen, Mark; Cockerill, Matthew; Hermjakob, Henning; Mons, Albert; Packer, Abel; Pacheco, Roberto; Lewis, Suzanna; Berkeley, Alfred; Melton, William; Barris, Nickolas; Wales, Jimmy; Meijssen, Gerard; Moeller, Erik; Roes, Peter Jan; Borner, Katy; Bairoch, Amos

2008-01-01

WikiProteins enables community annotation in a Wiki-based system. Extracts of major data sources have been fused into an editable environment that links out to the original sources. Data from community edits create automatic copies of the original data. Semantic technology captures concepts co-occurring in one sentence and thus potential factual statements. In addition, indirect associations between concepts have been calculated. We call on a 'million minds' to annotate a 'million concepts' and to collect facts from the literature with the reward of collaborative knowledge discovery. The system is available for beta testing at . PMID:18507872
Automated annotation of functional imaging experiments via multi-label classification

PubMed Central

Turner, Matthew D.; Chakrabarti, Chayan; Jones, Thomas B.; Xu, Jiawei F.; Fox, Peter T.; Luger, George F.; Laird, Angela R.; Turner, Jessica A.

2013-01-01

Identifying the experimental methods in human neuroimaging papers is important for grouping meaningfully similar experiments for meta-analyses. Currently, this can only be done by human readers. We present the performance of common machine learning (text mining) methods applied to the problem of automatically classifying or labeling this literature. Labeling terms are from the Cognitive Paradigm Ontology (CogPO), the text corpora are abstracts of published functional neuroimaging papers, and the methods use the performance of a human expert as training data. We aim to replicate the expert's annotation of multiple labels per abstract identifying the experimental stimuli, cognitive paradigms, response types, and other relevant dimensions of the experiments. We use several standard machine learning methods: naive Bayes (NB), k-nearest neighbor, and support vector machines (specifically SMO or sequential minimal optimization). Exact match performance ranged from only 15% in the worst cases to 78% in the best cases. NB methods combined with binary relevance transformations performed strongly and were robust to overfitting. This collection of results demonstrates what can be achieved with off-the-shelf software components and little to no pre-processing of raw text. PMID:24409112
Label free cell-tracking and division detection based on 2D time-lapse images for lineage analysis of early embryo development.

PubMed

Cicconet, Marcelo; Gutwein, Michelle; Gunsalus, Kristin C; Geiger, Davi

2014-08-01

In this paper we report a database and a series of techniques related to the problem of tracking cells, and detecting their divisions, in time-lapse movies of mammalian embryos. Our contributions are (1) a method for counting embryos in a well, and cropping each individual embryo across frames, to create individual movies for cell tracking; (2) a semi-automated method for cell tracking that works up to the 8-cell stage, along with a software implementation available to the public (this software was used to build the reported database); (3) an algorithm for automatic tracking up to the 4-cell stage, based on histograms of mirror symmetry coefficients captured using wavelets; (4) a cell-tracking database containing 100 annotated examples of mammalian embryos up to the 8-cell stage; and (5) statistical analysis of various timing distributions obtained from those examples. Copyright © 2014 Elsevier Ltd. All rights reserved.
Automatic Generation of Cycle-Approximate TLMs with Timed RTOS Model Support

NASA Astrophysics Data System (ADS)

Hwang, Yonghyun; Schirner, Gunar; Abdi, Samar

This paper presents a technique for automatically generating cycle-approximate transaction level models (TLMs) for multi-process applications mapped to embedded platforms. It incorporates three key features: (a) basic block level timing annotation, (b) RTOS model integration, and (c) RTOS overhead delay modeling. The inputs to TLM generation are application C processes and their mapping to processors in the platform. A processor data model, including pipelined datapath, memory hierarchy and branch delay model is used to estimate basic block execution delays. The delays are annotated to the C code, which is then integrated with a generated SystemC RTOS model. Our abstract RTOS provides dynamic scheduling and inter-process communication (IPC) with processor- and RTOS-specific pre-characterized timing. Our experiments using a MP3 decoder and a JPEG encoder show that timed TLMs, with integrated RTOS models, can be automatically generated in less than a minute. Our generated TLMs simulated three times faster than real-time and showed less than 10% timing error compared to board measurements.
A bootstrapping method for development of Treebank

NASA Astrophysics Data System (ADS)

Zarei, F.; Basirat, A.; Faili, H.; Mirain, M.

2017-01-01

Using statistical approaches beside the traditional methods of natural language processing could significantly improve both the quality and performance of several natural language processing (NLP) tasks. The effective usage of these approaches is subject to the availability of the informative, accurate and detailed corpora on which the learners are trained. This article introduces a bootstrapping method for developing annotated corpora based on a complex and rich linguistically motivated elementary structure called supertag. To this end, a hybrid method for supertagging is proposed that combines both of the generative and discriminative methods of supertagging. The method was applied on a subset of Wall Street Journal (WSJ) in order to annotate its sentences with a set of linguistically motivated elementary structures of the English XTAG grammar that is using a lexicalised tree-adjoining grammar formalism. The empirical results confirm that the bootstrapping method provides a satisfactory way for annotating the English sentences with the mentioned structures. The experiments show that the method could automatically annotate about 20% of WSJ with the accuracy of F-measure about 80% of which is particularly 12% higher than the F-measure of the XTAG Treebank automatically generated from the approach proposed by Basirat and Faili [(2013). Bridge the gap between statistical and hand-crafted grammars. Computer Speech and Language, 27, 1085-1104].

MicroScope—an integrated microbial resource for the curation and comparative analysis of genomic and metabolic data

PubMed Central

Vallenet, David; Belda, Eugeni; Calteau, Alexandra; Cruveiller, Stéphane; Engelen, Stefan; Lajus, Aurélie; Le Fèvre, François; Longin, Cyrille; Mornico, Damien; Roche, David; Rouy, Zoé; Salvignol, Gregory; Scarpelli, Claude; Thil Smith, Adam Alexander; Weiman, Marion; Médigue, Claudine

2013-01-01

MicroScope is an integrated platform dedicated to both the methodical updating of microbial genome annotation and to comparative analysis. The resource provides data from completed and ongoing genome projects (automatic and expert annotations), together with data sources from post-genomic experiments (i.e. transcriptomics, mutant collections) allowing users to perfect and improve the understanding of gene functions. MicroScope (http://www.genoscope.cns.fr/agc/microscope) combines tools and graphical interfaces to analyse genomes and to perform the manual curation of gene annotations in a comparative context. Since its first publication in January 2006, the system (previously named MaGe for Magnifying Genomes) has been continuously extended both in terms of data content and analysis tools. The last update of MicroScope was published in 2009 in the Database journal. Today, the resource contains data for >1600 microbial genomes, of which ∼300 are manually curated and maintained by biologists (1200 personal accounts today). Expert annotations are continuously gathered in the MicroScope database (∼50 000 a year), contributing to the improvement of the quality of microbial genomes annotations. Improved data browsing and searching tools have been added, original tools useful in the context of expert annotation have been developed and integrated and the website has been significantly redesigned to be more user-friendly. Furthermore, in the context of the European project Microme (Framework Program 7 Collaborative Project), MicroScope is becoming a resource providing for the curation and analysis of both genomic and metabolic data. An increasing number of projects are related to the study of environmental bacterial (meta)genomes that are able to metabolize a large variety of chemical compounds that may be of high industrial interest. PMID:23193269
Community annotation experiment for ground truth generation for the i2b2 medication challenge

PubMed Central

Solti, Imre; Xia, Fei; Cadag, Eithon

2010-01-01

Objective Within the context of the Third i2b2 Workshop on Natural Language Processing Challenges for Clinical Records, the authors (also referred to as ‘the i2b2 medication challenge team’ or ‘the i2b2 team’ for short) organized a community annotation experiment. Design For this experiment, the authors released annotation guidelines and a small set of annotated discharge summaries. They asked the participants of the Third i2b2 Workshop to annotate 10 discharge summaries per person; each discharge summary was annotated by two annotators from two different teams, and a third annotator from a third team resolved disagreements. Measurements In order to evaluate the reliability of the annotations thus produced, the authors measured community inter-annotator agreement and compared it with the inter-annotator agreement of expert annotators when both the community and the expert annotators generated ground truth based on pooled system outputs. For this purpose, the pool consisted of the three most densely populated automatic annotations of each record. The authors also compared the community inter-annotator agreement with expert inter-annotator agreement when the experts annotated raw records without using the pool. Finally, they measured the quality of the community ground truth by comparing it with the expert ground truth. Results and conclusions The authors found that the community annotators achieved comparable inter-annotator agreement to expert annotators, regardless of whether the experts annotated from the pool. Furthermore, the ground truth generated by the community obtained F-measures above 0.90 against the ground truth of the experts, indicating the value of the community as a source of high-quality ground truth even on intricate and domain-specific annotation tasks. PMID:20819855
Improved systematic tRNA gene annotation allows new insights into the evolution of mitochondrial tRNA structures and into the mechanisms of mitochondrial genome rearrangements

PubMed Central

Jühling, Frank; Pütz, Joern; Bernt, Matthias; Donath, Alexander; Middendorf, Martin; Florentz, Catherine; Stadler, Peter F.

2012-01-01

Transfer RNAs (tRNAs) are present in all types of cells as well as in organelles. tRNAs of animal mitochondria show a low level of primary sequence conservation and exhibit ‘bizarre’ secondary structures, lacking complete domains of the common cloverleaf. Such sequences are hard to detect and hence frequently missed in computational analyses and mitochondrial genome annotation. Here, we introduce an automatic annotation procedure for mitochondrial tRNA genes in Metazoa based on sequence and structural information in manually curated covariance models. The method, applied to re-annotate 1876 available metazoan mitochondrial RefSeq genomes, allows to distinguish between remaining functional genes and degrading ‘pseudogenes’, even at early stages of divergence. The subsequent analysis of a comprehensive set of mitochondrial tRNA genes gives new insights into the evolution of structures of mitochondrial tRNA sequences as well as into the mechanisms of genome rearrangements. We find frequent losses of tRNA genes concentrated in basal Metazoa, frequent independent losses of individual parts of tRNA genes, particularly in Arthropoda, and wide-spread conserved overlaps of tRNAs in opposite reading direction. Direct evidence for several recent Tandem Duplication-Random Loss events is gained, demonstrating that this mechanism has an impact on the appearance of new mitochondrial gene orders. PMID:22139921
MIPS: analysis and annotation of proteins from whole genomes

PubMed Central

Mewes, H. W.; Amid, C.; Arnold, R.; Frishman, D.; Güldener, U.; Mannhaupt, G.; Münsterkötter, M.; Pagel, P.; Strack, N.; Stümpflen, V.; Warfsmann, J.; Ruepp, A.

2004-01-01

The Munich Information Center for Protein Sequences (MIPS-GSF), Neuherberg, Germany, provides protein sequence-related information based on whole-genome analysis. The main focus of the work is directed toward the systematic organization of sequence-related attributes as gathered by a variety of algorithms, primary information from experimental data together with information compiled from the scientific literature. MIPS maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the database of complete cDNAs (German Human Genome Project, NGFN), the database of mammalian protein–protein interactions (MPPI), the database of FASTA homologies (SIMAP), and the interface for the fast retrieval of protein-associated information (QUIPOS). The Arabidopsis thaliana database, the rice database, the plant EST databases (MATDB, MOsDB, SPUTNIK), as well as the databases for the comprehensive set of genomes (PEDANT genomes) are described elsewhere in the 2003 and 2004 NAR database issues, respectively. All databases described, and the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de). PMID:14681354
MIPS: analysis and annotation of proteins from whole genomes.

PubMed

Mewes, H W; Amid, C; Arnold, R; Frishman, D; Güldener, U; Mannhaupt, G; Münsterkötter, M; Pagel, P; Strack, N; Stümpflen, V; Warfsmann, J; Ruepp, A

2004-01-01

The Munich Information Center for Protein Sequences (MIPS-GSF), Neuherberg, Germany, provides protein sequence-related information based on whole-genome analysis. The main focus of the work is directed toward the systematic organization of sequence-related attributes as gathered by a variety of algorithms, primary information from experimental data together with information compiled from the scientific literature. MIPS maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the database of complete cDNAs (German Human Genome Project, NGFN), the database of mammalian protein-protein interactions (MPPI), the database of FASTA homologies (SIMAP), and the interface for the fast retrieval of protein-associated information (QUIPOS). The Arabidopsis thaliana database, the rice database, the plant EST databases (MATDB, MOsDB, SPUTNIK), as well as the databases for the comprehensive set of genomes (PEDANT genomes) are described elsewhere in the 2003 and 2004 NAR database issues, respectively. All databases described, and the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de).
Diagnostic accuracy of semi-automatic quantitative metrics as an alternative to expert reading of CT myocardial perfusion in the CORE320 study.

PubMed

Ostovaneh, Mohammad R; Vavere, Andrea L; Mehra, Vishal C; Kofoed, Klaus F; Matheson, Matthew B; Arbab-Zadeh, Armin; Fujisawa, Yasuko; Schuijf, Joanne D; Rochitte, Carlos E; Scholte, Arthur J; Kitagawa, Kakuya; Dewey, Marc; Cox, Christopher; DiCarli, Marcelo F; George, Richard T; Lima, Joao A C

To determine the diagnostic accuracy of semi-automatic quantitative metrics compared to expert reading for interpretation of computed tomography perfusion (CTP) imaging. The CORE320 multicenter diagnostic accuracy clinical study enrolled patients between 45 and 85 years of age who were clinically referred for invasive coronary angiography (ICA). Computed tomography angiography (CTA), CTP, single photon emission computed tomography (SPECT), and ICA images were interpreted manually in blinded core laboratories by two experienced readers. Additionally, eight quantitative CTP metrics as continuous values were computed semi-automatically from myocardial and blood attenuation and were combined using logistic regression to derive a final quantitative CTP metric score. For the reference standard, hemodynamically significant coronary artery disease (CAD) was defined as a quantitative ICA stenosis of 50% or greater and a corresponding perfusion defect by SPECT. Diagnostic accuracy was determined by area under the receiver operating characteristic curve (AUC). Of the total 377 included patients, 66% were male, median age was 62 (IQR: 56, 68) years, and 27% had prior myocardial infarction. In patient based analysis, the AUC (95% CI) for combined CTA-CTP expert reading and combined CTA-CTP semi-automatic quantitative metrics was 0.87(0.84-0.91) and 0.86 (0.83-0.9), respectively. In vessel based analyses the AUC's were 0.85 (0.82-0.88) and 0.84 (0.81-0.87), respectively. No significant difference in AUC was found between combined CTA-CTP expert reading and CTA-CTP semi-automatic quantitative metrics in patient based or vessel based analyses(p > 0.05 for all). Combined CTA-CTP semi-automatic quantitative metrics is as accurate as CTA-CTP expert reading to detect hemodynamically significant CAD. Copyright © 2018 Society of Cardiovascular Computed Tomography. Published by Elsevier Inc. All rights reserved.
Does semi-automatic bone-fragment segmentation improve the reproducibility of the Letournel acetabular fracture classification?

PubMed

Boudissa, M; Orfeuvre, B; Chabanas, M; Tonetti, J

2017-09-01

The Letournel classification of acetabular fracture shows poor reproducibility in inexperienced observers, despite the introduction of 3D imaging. We therefore developed a method of semi-automatic segmentation based on CT data. The present prospective study aimed to assess: (1) whether semi-automatic bone-fragment segmentation increased the rate of correct classification; (2) if so, in which fracture types; and (3) feasibility using the open-source itksnap 3.0 software package without incurring extra cost for users. Semi-automatic segmentation of acetabular fractures significantly increases the rate of correct classification by orthopedic surgery residents. Twelve orthopedic surgery residents classified 23 acetabular fractures. Six used conventional 3D reconstructions provided by the center's radiology department (conventional group) and 6 others used reconstructions obtained by semi-automatic segmentation using the open-source itksnap 3.0 software package (segmentation group). Bone fragments were identified by specific colors. Correct classification rates were compared between groups on Chi 2 test. Assessment was repeated 2 weeks later, to determine intra-observer reproducibility. Correct classification rates were significantly higher in the "segmentation" group: 114/138 (83%) versus 71/138 (52%); P<0.0001. The difference was greater for simple (36/36 (100%) versus 17/36 (47%); P<0.0001) than complex fractures (79/102 (77%) versus 54/102 (53%); P=0.0004). Mean segmentation time per fracture was 27±3min [range, 21-35min]. The segmentation group showed excellent intra-observer correlation coefficients, overall (ICC=0.88), and for simple (ICC=0.92) and complex fractures (ICC=0.84). Semi-automatic segmentation, identifying the various bone fragments, was effective in increasing the rate of correct acetabular fracture classification on the Letournel system by orthopedic surgery residents. It may be considered for routine use in education and training. III: prospective case-control study of a diagnostic procedure. Copyright © 2017 Elsevier Masson SAS. All rights reserved.
JGI Plant Genomics Gene Annotation Pipeline

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shu, Shengqiang; Rokhsar, Dan; Goodstein, David

2014-07-14

Plant genomes vary in size and are highly complex with a high amount of repeats, genome duplication and tandem duplication. Gene encodes a wealth of information useful in studying organism and it is critical to have high quality and stable gene annotation. Thanks to advancement of sequencing technology, many plant species genomes have been sequenced and transcriptomes are also sequenced. To use these vastly large amounts of sequence data to make gene annotation or re-annotation in a timely fashion, an automatic pipeline is needed. JGI plant genomics gene annotation pipeline, called integrated gene call (IGC), is our effort toward thismore » aim with aid of a RNA-seq transcriptome assembly pipeline. It utilizes several gene predictors based on homolog peptides and transcript ORFs. See Methods for detail. Here we present genome annotation of JGI flagship green plants produced by this pipeline plus Arabidopsis and rice except for chlamy which is done by a third party. The genome annotations of these species and others are used in our gene family build pipeline and accessible via JGI Phytozome portal whose URL and front page snapshot are shown below.« less
Inductive creation of an annotation schema for manually indexing clinical conditions from emergency department reports

PubMed Central

Chapman, Wendy W.; Dowling, John N.

2006-01-01

Evaluating automated indexing applications requires comparing automatically indexed terms against manual reference standard annotations. However, there are no standard guidelines for determining which words from a textual document to include in manual annotations, and the vague task can result in substantial variation among manual indexers. We applied grounded theory to emergency department reports to create an annotation schema representing syntactic and semantic variables that could be annotated when indexing clinical conditions. We describe the annotation schema, which includes variables representing medical concepts (e.g., symptom, demographics), linguistic form (e.g., noun, adjective), and modifier types (e.g., anatomic location, severity). We measured the schema’s quality and found: (1) the schema was comprehensive enough to be applied to 20 unseen reports without changes to the schema; (2) agreement between author annotators applying the schema was high, with an F measure of 93%; and (3) an error analysis showed that the authors made complementary errors when applying the schema, demonstrating that the schema incorporates both linguistic and medical expertise. PMID:16230050
Triage by ranking to support the curation of protein interactions

PubMed Central

Pasche, Emilie; Gobeill, Julien; Rech de Laval, Valentine; Gleizes, Anne; Michel, Pierre-André; Bairoch, Amos

2017-01-01

Abstract Today, molecular biology databases are the cornerstone of knowledge sharing for life and health sciences. The curation and maintenance of these resources are labour intensive. Although text mining is gaining impetus among curators, its integration in curation workflow has not yet been widely adopted. The Swiss Institute of Bioinformatics Text Mining and CALIPHO groups joined forces to design a new curation support system named nextA5. In this report, we explore the integration of novel triage services to support the curation of two types of biological data: protein–protein interactions (PPIs) and post-translational modifications (PTMs). The recognition of PPIs and PTMs poses a special challenge, as it not only requires the identification of biological entities (proteins or residues), but also that of particular relationships (e.g. binding or position). These relationships cannot be described with onto-terminological descriptors such as the Gene Ontology for molecular functions, which makes the triage task more challenging. Prioritizing papers for these tasks thus requires the development of different approaches. In this report, we propose a new method to prioritize articles containing information specific to PPIs and PTMs. The new resources (RESTful APIs, semantically annotated MEDLINE library) enrich the neXtA5 platform. We tuned the article prioritization model on a set of 100 proteins previously annotated by the CALIPHO group. The effectiveness of the triage service was tested with a dataset of 200 annotated proteins. We defined two sets of descriptors to support automatic triage: the first set to enrich for papers with PPI data, and the second for PTMs. All occurrences of these descriptors were marked-up in MEDLINE and indexed, thus constituting a semantically annotated version of MEDLINE. These annotations were then used to estimate the relevance of a particular article with respect to the chosen annotation type. This relevance score was combined with a local vector-space search engine to generate a ranked list of PMIDs. We also evaluated a query refinement strategy, which adds specific keywords (such as ‘binds’ or ‘interacts’) to the original query. Compared to PubMed, the search effectiveness of the nextA5 triage service is improved by 190% for the prioritization of papers with PPIs information and by 260% for papers with PTMs information. Combining advanced retrieval and query refinement strategies with automatically enriched MEDLINE contents is effective to improve triage in complex curation tasks such as the curation of protein PPIs and PTMs. Database URL: http://candy.hesge.ch/nextA5 PMID:29220432
Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments

DTIC Science & Technology

2010-01-01

mostly American English1 tweets from October 27, 2010, automatically tokenized them using a Twitter tok - enizer (O’Connor et al., 2010b),2 and pre...for which our tagset, annotation guidelines, and tokenization rules were deficient or ambiguous. Based on these considerations we revised the tok ...tagger. 6Possessive endings only appear when a user or the tok - enizer has separated the possessive ending from a possessor; the tokenizer only does
Genic insights from integrated human proteomics in GeneCards.

PubMed

Fishilevich, Simon; Zimmerman, Shahar; Kohn, Asher; Iny Stein, Tsippi; Olender, Tsviya; Kolker, Eugene; Safran, Marilyn; Lancet, Doron

2016-01-01

GeneCards is a one-stop shop for searchable human gene annotations (http://www.genecards.org/). Data are automatically mined from ∼120 sources and presented in an integrated web card for every human gene. We report the application of recent advances in proteomics to enhance gene annotation and classification in GeneCards. First, we constructed the Human Integrated Protein Expression Database (HIPED), a unified database of protein abundance in human tissues, based on the publically available mass spectrometry (MS)-based proteomics sources ProteomicsDB, Multi-Omics Profiling Expression Database, Protein Abundance Across Organisms and The MaxQuant DataBase. The integrated database, residing within GeneCards, compares favourably with its individual sources, covering nearly 90% of human protein-coding genes. For gene annotation and comparisons, we first defined a protein expression vector for each gene, based on normalized abundances in 69 normal human tissues. This vector is portrayed in the GeneCards expression section as a bar graph, allowing visual inspection and comparison. These data are juxtaposed with transcriptome bar graphs. Using the protein expression vectors, we further defined a pairwise metric that helps assess expression-based pairwise proximity. This new metric for finding functional partners complements eight others, including sharing of pathways, gene ontology (GO) terms and domains, implemented in the GeneCards Suite. In parallel, we calculated proteome-based differential expression, highlighting a subset of tissues that overexpress a gene and subserving gene classification. This textual annotation allows users of VarElect, the suite's next-generation phenotyper, to more effectively discover causative disease variants. Finally, we define the protein-RNA expression ratio and correlation as yet another attribute of every gene in each tissue, adding further annotative information. The results constitute a significant enhancement of several GeneCards sections and help promote and organize the genome-wide structural and functional knowledge of the human proteome. Database URL:http://www.genecards.org/. © The Author(s) 2016. Published by Oxford University Press.
BEACON: automated tool for Bacterial GEnome Annotation ComparisON.

PubMed

Kalkatawi, Manal; Alam, Intikhab; Bajic, Vladimir B

2015-08-18

Genome annotation is one way of summarizing the existing knowledge about genomic characteristics of an organism. There has been an increased interest during the last several decades in computer-based structural and functional genome annotation. Many methods for this purpose have been developed for eukaryotes and prokaryotes. Our study focuses on comparison of functional annotations of prokaryotic genomes. To the best of our knowledge there is no fully automated system for detailed comparison of functional genome annotations generated by different annotation methods (AMs). The presence of many AMs and development of new ones introduce needs to: a/ compare different annotations for a single genome, and b/ generate annotation by combining individual ones. To address these issues we developed an Automated Tool for Bacterial GEnome Annotation ComparisON (BEACON) that benefits both AM developers and annotation analysers. BEACON provides detailed comparison of gene function annotations of prokaryotic genomes obtained by different AMs and generates extended annotations through combination of individual ones. For the illustration of BEACON's utility, we provide a comparison analysis of multiple different annotations generated for four genomes and show on these examples that the extended annotation can increase the number of genes annotated by putative functions up to 27%, while the number of genes without any function assignment is reduced. We developed BEACON, a fast tool for an automated and a systematic comparison of different annotations of single genomes. The extended annotation assigns putative functions to many genes with unknown functions. BEACON is available under GNU General Public License version 3.0 and is accessible at: http://www.cbrc.kaust.edu.sa/BEACON/ .
Improved annotation of the insect vector of citrus greening disease: biocuration by a diverse genomics community

PubMed Central

Hosmani, Prashant S.; Villalobos-Ayala, Krystal; Miller, Sherry; Shippy, Teresa; Flores, Mirella; Rosendale, Andrew; Cordola, Chris; Bell, Tracey; Mann, Hannah; DeAvila, Gabe; DeAvila, Daniel; Moore, Zachary; Buller, Kyle; Ciolkevich, Kathryn; Nandyal, Samantha; Mahoney, Robert; Van Voorhis, Joshua; Dunlevy, Megan; Farrow, David; Hunter, David; Morgan, Taylar; Shore, Kayla; Guzman, Victoria; Izsak, Allison; Dixon, Danielle E.; Cridge, Andrew; Cano, Liliana; Cao, Xiaolong; Jiang, Haobo; Leng, Nan; Johnson, Shannon; Cantarel, Brandi L.; Richards, Stephen; English, Adam; Shatters, Robert G.; Childers, Chris; Chen, Mei-Ju; Hunter, Wayne; Cilia, Michelle; Mueller, Lukas A.; Munoz-Torres, Monica; Nelson, David; Poelchau, Monica F.; Benoit, Joshua B.; Wiersma-Koch, Helen; D’Elia, Tom; Brown, Susan J.

2017-01-01

Abstract The Asian citrus psyllid (Diaphorina citri Kuwayama) is the insect vector of the bacterium Candidatus Liberibacter asiaticus (CLas), the pathogen associated with citrus Huanglongbing (HLB, citrus greening). HLB threatens citrus production worldwide. Suppression or reduction of the insect vector using chemical insecticides has been the primary method to inhibit the spread of citrus greening disease. Accurate structural and functional annotation of the Asian citrus psyllid genome, as well as a clear understanding of the interactions between the insect and CLas, are required for development of new molecular-based HLB control methods. A draft assembly of the D. citri genome has been generated and annotated with automated pipelines. However, knowledge transfer from well-curated reference genomes such as that of Drosophila melanogaster to newly sequenced ones is challenging due to the complexity and diversity of insect genomes. To identify and improve gene models as potential targets for pest control, we manually curated several gene families with a focus on genes that have key functional roles in D. citri biology and CLas interactions. This community effort produced 530 manually curated gene models across developmental, physiological, RNAi regulatory and immunity-related pathways. As previously shown in the pea aphid, RNAi machinery genes putatively involved in the microRNA pathway have been specifically duplicated. A comprehensive transcriptome enabled us to identify a number of gene families that are either missing or misassembled in the draft genome. In order to develop biocuration as a training experience, we included undergraduate and graduate students from multiple institutions, as well as experienced annotators from the insect genomics research community. The resulting gene set (OGS v1.0) combines both automatically predicted and manually curated gene models. Database URL: https://citrusgreening.org/ PMID:29220441
A new user-assisted segmentation and tracking technique for an object-based video editing system

NASA Astrophysics Data System (ADS)

Yu, Hong Y.; Hong, Sung-Hoon; Lee, Mike M.; Choi, Jae-Gark

2004-03-01

This paper presents a semi-automatic segmentation method which can be used to generate video object plane (VOP) for object based coding scheme and multimedia authoring environment. Semi-automatic segmentation can be considered as a user-assisted segmentation technique. A user can initially mark objects of interest around the object boundaries and then the user-guided and selected objects are continuously separated from the unselected areas through time evolution in the image sequences. The proposed segmentation method consists of two processing steps: partially manual intra-frame segmentation and fully automatic inter-frame segmentation. The intra-frame segmentation incorporates user-assistance to define the meaningful complete visual object of interest to be segmentation and decides precise object boundary. The inter-frame segmentation involves boundary and region tracking to obtain temporal coherence of moving object based on the object boundary information of previous frame. The proposed method shows stable efficient results that could be suitable for many digital video applications such as multimedia contents authoring, content based coding and indexing. Based on these results, we have developed objects based video editing system with several convenient editing functions.
TOPSAN: a dynamic web database for structural genomics.

PubMed

Ellrott, Kyle; Zmasek, Christian M; Weekes, Dana; Sri Krishna, S; Bakolitsa, Constantina; Godzik, Adam; Wooley, John

2011-01-01

The Open Protein Structure Annotation Network (TOPSAN) is a web-based collaboration platform for exploring and annotating structures determined by structural genomics efforts. Characterization of those structures presents a challenge since the majority of the proteins themselves have not yet been characterized. Responding to this challenge, the TOPSAN platform facilitates collaborative annotation and investigation via a user-friendly web-based interface pre-populated with automatically generated information. Semantic web technologies expand and enrich TOPSAN's content through links to larger sets of related databases, and thus, enable data integration from disparate sources and data mining via conventional query languages. TOPSAN can be found at http://www.topsan.org.
Image annotation based on positive-negative instances learning

NASA Astrophysics Data System (ADS)

Zhang, Kai; Hu, Jiwei; Liu, Quan; Lou, Ping

2017-07-01

Automatic image annotation is now a tough task in computer vision, the main sense of this tech is to deal with managing the massive image on the Internet and assisting intelligent retrieval. This paper designs a new image annotation model based on visual bag of words, using the low level features like color and texture information as well as mid-level feature as SIFT, and mixture the pic2pic, label2pic and label2label correlation to measure the correlation degree of labels and images. We aim to prune the specific features for each single label and formalize the annotation task as a learning process base on Positive-Negative Instances Learning. Experiments are performed using the Corel5K Dataset, and provide a quite promising result when comparing with other existing methods.
neXtA5: accelerating annotation of articles via automated approaches in neXtProt.

PubMed

Mottin, Luc; Gobeill, Julien; Pasche, Emilie; Michel, Pierre-André; Cusin, Isabelle; Gaudet, Pascale; Ruch, Patrick

2016-01-01

The rapid increase in the number of published articles poses a challenge for curated databases to remain up-to-date. To help the scientific community and database curators deal with this issue, we have developed an application, neXtA5, which prioritizes the literature for specific curation requirements. Our system, neXtA5, is a curation service composed of three main elements. The first component is a named-entity recognition module, which annotates MEDLINE over some predefined axes. This report focuses on three axes: Diseases, the Molecular Function and Biological Process sub-ontologies of the Gene Ontology (GO). The automatic annotations are then stored in a local database, BioMed, for each annotation axis. Additional entities such as species and chemical compounds are also identified. The second component is an existing search engine, which retrieves the most relevant MEDLINE records for any given query. The third component uses the content of BioMed to generate an axis-specific ranking, which takes into account the density of named-entities as stored in the Biomed database. The two ranked lists are ultimately merged using a linear combination, which has been specifically tuned to support the annotation of each axis. The fine-tuning of the coefficients is formally reported for each axis-driven search. Compared with PubMed, which is the system used by most curators, the improvement is the following: +231% for Diseases, +236% for Molecular Functions and +3153% for Biological Process when measuring the precision of the top-returned PMID (P0 or mean reciprocal rank). The current search methods significantly improve the search effectiveness of curators for three important curation axes. Further experiments are being performed to extend the curation types, in particular protein-protein interactions, which require specific relationship extraction capabilities. In parallel, user-friendly interfaces powered with a set of JSON web services are currently being implemented into the neXtProt annotation pipeline.Available on: http://babar.unige.ch:8082/neXtA5Database URL: http://babar.unige.ch:8082/neXtA5/fetcher.jsp. © The Author(s) 2016. Published by Oxford University Press.
neXtA5: accelerating annotation of articles via automated approaches in neXtProt

PubMed Central

Mottin, Luc; Gobeill, Julien; Pasche, Emilie; Michel, Pierre-André; Cusin, Isabelle; Gaudet, Pascale; Ruch, Patrick

2016-01-01

The rapid increase in the number of published articles poses a challenge for curated databases to remain up-to-date. To help the scientific community and database curators deal with this issue, we have developed an application, neXtA5, which prioritizes the literature for specific curation requirements. Our system, neXtA5, is a curation service composed of three main elements. The first component is a named-entity recognition module, which annotates MEDLINE over some predefined axes. This report focuses on three axes: Diseases, the Molecular Function and Biological Process sub-ontologies of the Gene Ontology (GO). The automatic annotations are then stored in a local database, BioMed, for each annotation axis. Additional entities such as species and chemical compounds are also identified. The second component is an existing search engine, which retrieves the most relevant MEDLINE records for any given query. The third component uses the content of BioMed to generate an axis-specific ranking, which takes into account the density of named-entities as stored in the Biomed database. The two ranked lists are ultimately merged using a linear combination, which has been specifically tuned to support the annotation of each axis. The fine-tuning of the coefficients is formally reported for each axis-driven search. Compared with PubMed, which is the system used by most curators, the improvement is the following: +231% for Diseases, +236% for Molecular Functions and +3153% for Biological Process when measuring the precision of the top-returned PMID (P0 or mean reciprocal rank). The current search methods significantly improve the search effectiveness of curators for three important curation axes. Further experiments are being performed to extend the curation types, in particular protein–protein interactions, which require specific relationship extraction capabilities. In parallel, user-friendly interfaces powered with a set of JSON web services are currently being implemented into the neXtProt annotation pipeline. Available on: http://babar.unige.ch:8082/neXtA5 Database URL: http://babar.unige.ch:8082/neXtA5/fetcher.jsp PMID:27374119
A Modular Hierarchical Approach to 3D Electron Microscopy Image Segmentation

PubMed Central

Liu, Ting; Jones, Cory; Seyedhosseini, Mojtaba; Tasdizen, Tolga

2014-01-01

The study of neural circuit reconstruction, i.e., connectomics, is a challenging problem in neuroscience. Automated and semi-automated electron microscopy (EM) image analysis can be tremendously helpful for connectomics research. In this paper, we propose a fully automatic approach for intra-section segmentation and inter-section reconstruction of neurons using EM images. A hierarchical merge tree structure is built to represent multiple region hypotheses and supervised classification techniques are used to evaluate their potentials, based on which we resolve the merge tree with consistency constraints to acquire final intra-section segmentation. Then, we use a supervised learning based linking procedure for the inter-section neuron reconstruction. Also, we develop a semi-automatic method that utilizes the intermediate outputs of our automatic algorithm and achieves intra-segmentation with minimal user intervention. The experimental results show that our automatic method can achieve close-to-human intra-segmentation accuracy and state-of-the-art inter-section reconstruction accuracy. We also show that our semi-automatic method can further improve the intra-segmentation accuracy. PMID:24491638

GARNET--gene set analysis with exploration of annotation relations.

PubMed

Rho, Kyoohyoung; Kim, Bumjin; Jang, Youngjun; Lee, Sanghyun; Bae, Taejeong; Seo, Jihae; Seo, Chaehwa; Lee, Jihyun; Kang, Hyunjung; Yu, Ungsik; Kim, Sunghoon; Lee, Sanghyuk; Kim, Wan Kyu

2011-02-15

Gene set analysis is a powerful method of deducing biological meaning for an a priori defined set of genes. Numerous tools have been developed to test statistical enrichment or depletion in specific pathways or gene ontology (GO) terms. Major difficulties towards biological interpretation are integrating diverse types of annotation categories and exploring the relationships between annotation terms of similar information. GARNET (Gene Annotation Relationship NEtwork Tools) is an integrative platform for gene set analysis with many novel features. It includes tools for retrieval of genes from annotation database, statistical analysis & visualization of annotation relationships, and managing gene sets. In an effort to allow access to a full spectrum of amassed biological knowledge, we have integrated a variety of annotation data that include the GO, domain, disease, drug, chromosomal location, and custom-defined annotations. Diverse types of molecular networks (pathways, transcription and microRNA regulations, protein-protein interaction) are also included. The pair-wise relationship between annotation gene sets was calculated using kappa statistics. GARNET consists of three modules--gene set manager, gene set analysis and gene set retrieval, which are tightly integrated to provide virtually automatic analysis for gene sets. A dedicated viewer for annotation network has been developed to facilitate exploration of the related annotations. GARNET (gene annotation relationship network tools) is an integrative platform for diverse types of gene set analysis, where complex relationships among gene annotations can be easily explored with an intuitive network visualization tool (http://garnet.isysbio.org/ or http://ercsb.ewha.ac.kr/garnet/).
Impact of translation on named-entity recognition in radiology texts

PubMed Central

Pedro, Vasco

2017-01-01

Abstract Radiology reports describe the results of radiography procedures and have the potential of being a useful source of information which can bring benefits to health care systems around the world. One way to automatically extract information from the reports is by using Text Mining tools. The problem is that these tools are mostly developed for English and reports are usually written in the native language of the radiologist, which is not necessarily English. This creates an obstacle to the sharing of Radiology information between different communities. This work explores the solution of translating the reports to English before applying the Text Mining tools, probing the question of what translation approach should be used. We created MRRAD (Multilingual Radiology Research Articles Dataset), a parallel corpus of Portuguese research articles related to Radiology and a number of alternative translations (human, automatic and semi-automatic) to English. This is a novel corpus which can be used to move forward the research on this topic. Using MRRAD we studied which kind of automatic or semi-automatic translation approach is more effective on the Named-entity recognition task of finding RadLex terms in the English version of the articles. Considering the terms extracted from human translations as our gold standard, we calculated how similar to this standard were the terms extracted using other translations. We found that a completely automatic translation approach using Google leads to F-scores (between 0.861 and 0.868, depending on the extraction approach) similar to the ones obtained through a more expensive semi-automatic translation approach using Unbabel (between 0.862 and 0.870). To better understand the results we also performed a qualitative analysis of the type of errors found in the automatic and semi-automatic translations. Database URL: https://github.com/lasigeBioTM/MRRAD PMID:29220455
Automatically identifying health outcome information in MEDLINE records.

PubMed

Demner-Fushman, Dina; Few, Barbara; Hauser, Susan E; Thoma, George

2006-01-01

Understanding the effect of a given intervention on the patient's health outcome is one of the key elements in providing optimal patient care. This study presents a methodology for automatic identification of outcomes-related information in medical text and evaluates its potential in satisfying clinical information needs related to health care outcomes. An annotation scheme based on an evidence-based medicine model for critical appraisal of evidence was developed and used to annotate 633 MEDLINE citations. Textual, structural, and meta-information features essential to outcome identification were learned from the created collection and used to develop an automatic system. Accuracy of automatic outcome identification was assessed in an intrinsic evaluation and in an extrinsic evaluation, in which ranking of MEDLINE search results obtained using PubMed Clinical Queries relied on identified outcome statements. The accuracy and positive predictive value of outcome identification were calculated. Effectiveness of the outcome-based ranking was measured using mean average precision and precision at rank 10. Automatic outcome identification achieved 88% to 93% accuracy. The positive predictive value of individual sentences identified as outcomes ranged from 30% to 37%. Outcome-based ranking improved retrieval accuracy, tripling mean average precision and achieving 389% improvement in precision at rank 10. Preliminary results in outcome-based document ranking show potential validity of the evidence-based medicine-model approach in timely delivery of information critical to clinical decision support at the point of service.
Deformably registering and annotating whole CLARITY brains to an atlas via masked LDDMM

NASA Astrophysics Data System (ADS)

Kutten, Kwame S.; Vogelstein, Joshua T.; Charon, Nicolas; Ye, Li; Deisseroth, Karl; Miller, Michael I.

2016-04-01

The CLARITY method renders brains optically transparent to enable high-resolution imaging in the structurally intact brain. Anatomically annotating CLARITY brains is necessary for discovering which regions contain signals of interest. Manually annotating whole-brain, terabyte CLARITY images is difficult, time-consuming, subjective, and error-prone. Automatically registering CLARITY images to a pre-annotated brain atlas offers a solution, but is difficult for several reasons. Removal of the brain from the skull and subsequent storage and processing cause variable non-rigid deformations, thus compounding inter-subject anatomical variability. Additionally, the signal in CLARITY images arises from various biochemical contrast agents which only sparsely label brain structures. This sparse labeling challenges the most commonly used registration algorithms that need to match image histogram statistics to the more densely labeled histological brain atlases. The standard method is a multiscale Mutual Information B-spline algorithm that dynamically generates an average template as an intermediate registration target. We determined that this method performs poorly when registering CLARITY brains to the Allen Institute's Mouse Reference Atlas (ARA), because the image histogram statistics are poorly matched. Therefore, we developed a method (Mask-LDDMM) for registering CLARITY images, that automatically finds the brain boundary and learns the optimal deformation between the brain and atlas masks. Using Mask-LDDMM without an average template provided better results than the standard approach when registering CLARITY brains to the ARA. The LDDMM pipelines developed here provide a fast automated way to anatomically annotate CLARITY images; our code is available as open source software at http://NeuroData.io.
Watch-and-Comment as an Approach to Collaboratively Annotate Points of Interest in Video and Interactive-TV Programs

NASA Astrophysics Data System (ADS)

Pimentel, Maria Da Graça C.; Cattelan, Renan G.; Melo, Erick L.; Freitas, Giliard B.; Teixeira, Cesar A.

In earlier work we proposed the Watch-and-Comment (WaC) paradigm as the seamless capture of multimodal comments made by one or more users while watching a video, resulting in the automatic generation of multimedia documents specifying annotated interactive videos. The aim is to allow services to be offered by applying document engineering techniques to the multimedia document generated automatically. The WaC paradigm was demonstrated with a WaCTool prototype application which supports multimodal annotation over video frames and segments, producing a corresponding interactive video. In this chapter, we extend the WaC paradigm to consider contexts in which several viewers may use their own mobile devices while watching and commenting on an interactive-TV program. We first review our previous work. Next, we discuss scenarios in which mobile users can collaborate via the WaC paradigm. We then present a new prototype application which allows users to employ their mobile devices to collaboratively annotate points of interest in video and interactive-TV programs. We also detail the current software infrastructure which supports our new prototype; the infrastructure extends the Ginga middleware for the Brazilian Digital TV with an implementation of the UPnP protocol - the aim is to provide the seamless integration of the users' mobile devices into the TV environment. As a result, the work reported in this chapter defines the WaC paradigm for the mobile-user as an approach to allow the collaborative annotation of the points of interest in video and interactive-TV programs.
Design and development of a prototypical software for semi-automatic generation of test methodologies and security checklists for IT vulnerability assessment in small- and medium-sized enterprises (SME)

NASA Astrophysics Data System (ADS)

Möller, Thomas; Bellin, Knut; Creutzburg, Reiner

2015-03-01

The aim of this paper is to show the recent progress in the design and prototypical development of a software suite Copra Breeder* for semi-automatic generation of test methodologies and security checklists for IT vulnerability assessment in small and medium-sized enterprises.
Build your own low-cost seismic/bathymetric recorder annotator

USGS Publications Warehouse

Robinson, W.

1994-01-01

An inexpensive programmable annotator, completely compatible with at least three models of widely used graphic recorders (Raytheon LSR-1811, Raytheon LSR-1807 M, and EDO 550) has been developed to automatically write event marks and print up to sixteen numbers on the paper record. Event mark and character printout intervals, character height and character position are all selectable with front panel switches. Operation is completely compatible with recorders running in either continuous or start-stop mode. ?? 1994.
Towards a Framework for Generating Tests to Satisfy Complex Code Coverage in Java Pathfinder

NASA Technical Reports Server (NTRS)

Staats, Matt

2009-01-01

We present work on a prototype tool based on the JavaPathfinder (JPF) model checker for automatically generating tests satisfying the MC/DC code coverage criterion. Using the Eclipse IDE, developers and testers can quickly instrument Java source code with JPF annotations covering all MC/DC coverage obligations, and JPF can then be used to automatically generate tests that satisfy these obligations. The prototype extension to JPF enables various tasks useful in automatic test generation to be performed, such as test suite reduction and execution of generated tests.
Comparative analysis of semantic localization accuracies between adult and pediatric DICOM CT images

NASA Astrophysics Data System (ADS)

Robertson, Duncan; Pathak, Sayan D.; Criminisi, Antonio; White, Steve; Haynor, David; Chen, Oliver; Siddiqui, Khan

2012-02-01

Existing literature describes a variety of techniques for semantic annotation of DICOM CT images, i.e. the automatic detection and localization of anatomical structures. Semantic annotation facilitates enhanced image navigation, linkage of DICOM image content and non-image clinical data, content-based image retrieval, and image registration. A key challenge for semantic annotation algorithms is inter-patient variability. However, while the algorithms described in published literature have been shown to cope adequately with the variability in test sets comprising adult CT scans, the problem presented by the even greater variability in pediatric anatomy has received very little attention. Most existing semantic annotation algorithms can only be extended to work on scans of both adult and pediatric patients by adapting parameters heuristically in light of patient size. In contrast, our approach, which uses random regression forests ('RRF'), learns an implicit model of scale variation automatically using training data. In consequence, anatomical structures can be localized accurately in both adult and pediatric CT studies without the need for parameter adaptation or additional information about patient scale. We show how the RRF algorithm is able to learn scale invariance from a combined training set containing a mixture of pediatric and adult scans. Resulting localization accuracy for both adult and pediatric data remains comparable with that obtained using RRFs trained and tested using only adult data.
Determinants of wood dust exposure in the Danish furniture industry.

PubMed

Mikkelsen, Anders B; Schlunssen, Vivi; Sigsgaard, Torben; Schaumburg, Inger

2002-11-01

This paper investigates the relation between wood dust exposure in the furniture industry and occupational hygiene variables. During the winter 1997-98 54 factories were visited and 2362 personal, passive inhalable dust samples were obtained; the geometric mean was 0.95 mg/m(3) and the geometric standard deviation was 2.08. In a first measuring round 1685 dust concentrations were obtained. For some of the workers repeated measurements were carried out 1 (351) and 2 weeks (326) after the first measurement. Hygiene variables like job, exhaust ventilation, cleaning procedures, etc., were documented. A multivariate analysis based on mixed effects models was used with hygiene variables being fixed effects and worker, machine, department and factory being random effects. A modified stepwise strategy of model making was adopted taking into account the hierarchically structured variables and making possible the exclusion of non-influential random as well as fixed effects. For woodworking, the following determinants of exposure increase the dust concentration: manual and automatic sanding and use of compressed air with fully automatic and semi-automatic machines and for cleaning of work pieces. Decreased dust exposure resulted from the use of compressed air with manual machines, working at fully automatic or semi-automatic machines, functioning exhaust ventilation, work on the night shift, daily cleaning of rooms, cleaning of work pieces with a brush, vacuum cleaning of machines, supplementary fresh air intake and safety representative elected within the last 2 yr. For handling and assembling, increased exposure results from work at automatic machines and presence of wood dust on the workpieces. Work on the evening shift, supplementary fresh air intake, work in a chair factory and special cleaning staff produced decreased exposure to wood dust. The implications of the results for the prevention of wood dust exposure are discussed.
DiffNet: automatic differential functional summarization of dE-MAP networks.

PubMed

Seah, Boon-Siew; Bhowmick, Sourav S; Dewey, C Forbes

2014-10-01

The study of genetic interaction networks that respond to changing conditions is an emerging research problem. Recently, Bandyopadhyay et al. (2010) proposed a technique to construct a differential network (dE-MAPnetwork) from two static gene interaction networks in order to map the interaction differences between them under environment or condition change (e.g., DNA-damaging agent). This differential network is then manually analyzed to conclude that DNA repair is differentially effected by the condition change. Unfortunately, manual construction of differential functional summary from a dE-MAP network that summarizes all pertinent functional responses is time-consuming, laborious and error-prone, impeding large-scale analysis on it. To this end, we propose DiffNet, a novel data-driven algorithm that leverages Gene Ontology (go) annotations to automatically summarize a dE-MAP network to obtain a high-level map of functional responses due to condition change. We tested DiffNet on the dynamic interaction networks following MMS treatment and demonstrated the superiority of our approach in generating differential functional summaries compared to state-of-the-art graph clustering methods. We studied the effects of parameters in DiffNet in controlling the quality of the summary. We also performed a case study that illustrates its utility. Copyright © 2014 Elsevier Inc. All rights reserved.
A semi-automatic traffic sign detection, classification, and positioning system

NASA Astrophysics Data System (ADS)

Creusen, I. M.; Hazelhoff, L.; de With, P. H. N.

2012-01-01

The availability of large-scale databases containing street-level panoramic images offers the possibility to perform semi-automatic surveying of real-world objects such as traffic signs. These inventories can be performed significantly more efficiently than using conventional methods. Governmental agencies are interested in these inventories for maintenance and safety reasons. This paper introduces a complete semi-automatic traffic sign inventory system. The system consists of several components. First, a detection algorithm locates the 2D position of the traffic signs in the panoramic images. Second, a classification algorithm is used to identify the traffic sign. Third, the 3D position of the traffic sign is calculated using the GPS position of the photographs. Finally, the results are listed in a table for quick inspection and are also visualized in a web browser.
ELECTROMAGNETIC AND ELECTROSTATIC GENERATORS: ANNOTATED BIBLIOGRAPHY.

DTIC Science & Technology

generator with split poles, ultrasonic-frequency generator, unipolar generator, single-phase micromotors , synchronous motor, asynchronous motor...asymmetrical rotor, magnetic circuit, dc micromotors , circuit for the automatic control of synchronized induction motors, induction torque micromotors , electric
T-lex2: genotyping, frequency estimation and re-annotation of transposable elements using single or pooled next-generation sequencing data.

PubMed

Fiston-Lavier, Anna-Sophie; Barrón, Maite G; Petrov, Dmitri A; González, Josefa

2015-02-27

Transposable elements (TEs) constitute the most active, diverse and ancient component in a broad range of genomes. Complete understanding of genome function and evolution cannot be achieved without a thorough understanding of TE impact and biology. However, in-depth analysis of TEs still represents a challenge due to the repetitive nature of these genomic entities. In this work, we present a broadly applicable and flexible tool: T-lex2. T-lex2 is the only available software that allows routine, automatic and accurate genotyping of individual TE insertions and estimation of their population frequencies both using individual strain and pooled next-generation sequencing data. Furthermore, T-lex2 also assesses the quality of the calls allowing the identification of miss-annotated TEs and providing the necessary information to re-annotate them. The flexible and customizable design of T-lex2 allows running it in any genome and for any type of TE insertion. Here, we tested the fidelity of T-lex2 using the fly and human genomes. Overall, T-lex2 represents a significant improvement in our ability to analyze the contribution of TEs to genome function and evolution as well as learning about the biology of TEs. T-lex2 is freely available online at http://sourceforge.net/projects/tlex. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
Annotating spatio-temporal datasets for meaningful analysis in the Web

NASA Astrophysics Data System (ADS)

Stasch, Christoph; Pebesma, Edzer; Scheider, Simon

2014-05-01

More and more environmental datasets that vary in space and time are available in the Web. This comes along with an advantage of using the data for other purposes than originally foreseen, but also with the danger that users may apply inappropriate analysis procedures due to lack of important assumptions made during the data collection process. In order to guide towards a meaningful (statistical) analysis of spatio-temporal datasets available in the Web, we have developed a Higher-Order-Logic formalism that captures some relevant assumptions in our previous work [1]. It allows to proof on meaningful spatial prediction and aggregation in a semi-automated fashion. In this poster presentation, we will present a concept for annotating spatio-temporal datasets available in the Web with concepts defined in our formalism. Therefore, we have defined a subset of the formalism as a Web Ontology Language (OWL) pattern. It allows capturing the distinction between the different spatio-temporal variable types, i.e. point patterns, fields, lattices and trajectories, that in turn determine whether a particular dataset can be interpolated or aggregated in a meaningful way using a certain procedure. The actual annotations that link spatio-temporal datasets with the concepts in the ontology pattern are provided as Linked Data. In order to allow data producers to add the annotations to their datasets, we have implemented a Web portal that uses a triple store at the backend to store the annotations and to make them available in the Linked Data cloud. Furthermore, we have implemented functions in the statistical environment R to retrieve the RDF annotations and, based on these annotations, to support a stronger typing of spatio-temporal datatypes guiding towards a meaningful analysis in R. [1] Stasch, C., Scheider, S., Pebesma, E., Kuhn, W. (2014): "Meaningful spatial prediction and aggregation", Environmental Modelling & Software, 51, 149-165.
Pythran: enabling static optimization of scientific Python programs

NASA Astrophysics Data System (ADS)

Guelton, Serge; Brunet, Pierrick; Amini, Mehdi; Merlini, Adrien; Corbillon, Xavier; Raynaud, Alan

2015-01-01

Pythran is an open source static compiler that turns modules written in a subset of Python language into native ones. Assuming that scientific modules do not rely much on the dynamic features of the language, it trades them for powerful, possibly inter-procedural, optimizations. These optimizations include detection of pure functions, temporary allocation removal, constant folding, Numpy ufunc fusion and parallelization, explicit thread-level parallelism through OpenMP annotations, false variable polymorphism pruning, and automatic vector instruction generation such as AVX or SSE. In addition to these compilation steps, Pythran provides a C++ runtime library that leverages the C++ STL to provide generic containers, and the Numeric Template Toolbox for Numpy support. It takes advantage of modern C++11 features such as variadic templates, type inference, move semantics and perfect forwarding, as well as classical idioms such as expression templates. Unlike the Cython approach, Pythran input code remains compatible with the Python interpreter. Output code is generally as efficient as the annotated Cython equivalent, if not more, but without the backward compatibility loss.
The Influence of Endmember Selection Method in Extracting Impervious Surface from Airborne Hyperspectral Imagery

NASA Astrophysics Data System (ADS)

Wang, J.; Feng, B.

2016-12-01

Impervious surface area (ISA) has long been studied as an important input into moisture flux models. In general, ISA impedes groundwater recharge, increases stormflow/flood frequency, and alters in-stream and riparian habitats. Urban area is recognized as one of the richest ISA environment. Urban ISA mapping assists flood prevention and urban planning. Hyperspectral imagery (HI), for its ability to detect subtle spectral signature, becomes an ideal candidate in urban ISA mapping. To map ISA from HI involves endmember (EM) selection. The high degree of spatial and spectral heterogeneity of urban environment puts great difficulty in this task: a compromise point is needed between the automatic degree and the good representativeness of the method. The study tested one manual and two semi-automatic EM selection strategies. The manual and the first semi-automatic methods have been widely used in EM selection. The second semi-automatic EM selection method is rather new and has been only proposed for moderate spatial resolution satellite. The manual method visually selected the EM candidates from eight landcover types in the original image. The first semi-automatic method chose the EM candidates using a threshold over the pixel purity index (PPI) map. The second semi-automatic method used the triangle shape of the HI scatter plot in the n-Dimension visualizer to identify the V-I-S (vegetation-impervious surface-soil) EM candidates: the pixels locate at the triangle points. The initial EM candidates from the three methods were further refined by three indexes (EM average RMSE, minimum average spectral angle, and count based EM selection) and generated three spectral libraries, which were used to classify the test image. Spectral angle mapper was applied. The accuracy reports for the classification results were generated. The overall accuracy are 85% for the manual method, 81% for the PPI method, and 87% for the V-I-S method. The V-I-S EM selection method performs best in this study. This fact proves the value of V-I-S EM selection method in not only moderate spatial resolution satellite image but also the more and more accessible high spatial resolution airborne image. This semi-automatic EM selection method can be adopted into a wide range of remote sensing images and provide ISA map for hydrology analysis.
BiobankConnect: software to rapidly connect data elements for pooled analysis across biobanks using ontological and lexical indexing.

PubMed

Pang, Chao; Hendriksen, Dennis; Dijkstra, Martijn; van der Velde, K Joeri; Kuiper, Joel; Hillege, Hans L; Swertz, Morris A

2015-01-01

Pooling data across biobanks is necessary to increase statistical power, reveal more subtle associations, and synergize the value of data sources. However, searching for desired data elements among the thousands of available elements and harmonizing differences in terminology, data collection, and structure, is arduous and time consuming. To speed up biobank data pooling we developed BiobankConnect, a system to semi-automatically match desired data elements to available elements by: (1) annotating the desired elements with ontology terms using BioPortal; (2) automatically expanding the query for these elements with synonyms and subclass information using OntoCAT; (3) automatically searching available elements for these expanded terms using Lucene lexical matching; and (4) shortlisting relevant matches sorted by matching score. We evaluated BiobankConnect using human curated matches from EU-BioSHaRE, searching for 32 desired data elements in 7461 available elements from six biobanks. We found 0.75 precision at rank 1 and 0.74 recall at rank 10 compared to a manually curated set of relevant matches. In addition, best matches chosen by BioSHaRE experts ranked first in 63.0% and in the top 10 in 98.4% of cases, indicating that our system has the potential to significantly reduce manual matching work. BiobankConnect provides an easy user interface to significantly speed up the biobank harmonization process. It may also prove useful for other forms of biomedical data integration. All the software can be downloaded as a MOLGENIS open source app from http://www.github.com/molgenis, with a demo available at http://www.biobankconnect.org. © The Author 2014. Published by Oxford University Press on behalf of the American Medical Informatics Association.
BiobankConnect: software to rapidly connect data elements for pooled analysis across biobanks using ontological and lexical indexing

PubMed Central

Pang, Chao; Hendriksen, Dennis; Dijkstra, Martijn; van der Velde, K Joeri; Kuiper, Joel; Hillege, Hans L; Swertz, Morris A

2015-01-01

Objective Pooling data across biobanks is necessary to increase statistical power, reveal more subtle associations, and synergize the value of data sources. However, searching for desired data elements among the thousands of available elements and harmonizing differences in terminology, data collection, and structure, is arduous and time consuming. Materials and methods To speed up biobank data pooling we developed BiobankConnect, a system to semi-automatically match desired data elements to available elements by: (1) annotating the desired elements with ontology terms using BioPortal; (2) automatically expanding the query for these elements with synonyms and subclass information using OntoCAT; (3) automatically searching available elements for these expanded terms using Lucene lexical matching; and (4) shortlisting relevant matches sorted by matching score. Results We evaluated BiobankConnect using human curated matches from EU-BioSHaRE, searching for 32 desired data elements in 7461 available elements from six biobanks. We found 0.75 precision at rank 1 and 0.74 recall at rank 10 compared to a manually curated set of relevant matches. In addition, best matches chosen by BioSHaRE experts ranked first in 63.0% and in the top 10 in 98.4% of cases, indicating that our system has the potential to significantly reduce manual matching work. Conclusions BiobankConnect provides an easy user interface to significantly speed up the biobank harmonization process. It may also prove useful for other forms of biomedical data integration. All the software can be downloaded as a MOLGENIS open source app from http://www.github.com/molgenis, with a demo available at http://www.biobankconnect.org. PMID:25361575
A Graph-Based Recovery and Decomposition of Swanson’s Hypothesis using Semantic Predications

PubMed Central

Cameron, Delroy; Bodenreider, Olivier; Yalamanchili, Hima; Danh, Tu; Vallabhaneni, Sreeram; Thirunarayan, Krishnaprasad; Sheth, Amit P.; Rindflesch, Thomas C.

2014-01-01

Objectives This paper presents a methodology for recovering and decomposing Swanson’s Raynaud Syndrome–Fish Oil Hypothesis semi-automatically. The methodology leverages the semantics of assertions extracted from biomedical literature (called semantic predications) along with structured background knowledge and graph-based algorithms to semi-automatically capture the informative associations originally discovered manually by Swanson. Demonstrating that Swanson’s manually intensive techniques can be undertaken semi-automatically, paves the way for fully automatic semantics-based hypothesis generation from scientific literature. Methods Semantic predications obtained from biomedical literature allow the construction of labeled directed graphs which contain various associations among concepts from the literature. By aggregating such associations into informative subgraphs, some of the relevant details originally articulated by Swanson has been uncovered. However, by leveraging background knowledge to bridge important knowledge gaps in the literature, a methodology for semi-automatically capturing the detailed associations originally explicated in natural language by Swanson has been developed. Results Our methodology not only recovered the 3 associations commonly recognized as Swanson’s Hypothesis, but also decomposed them into an additional 16 detailed associations, formulated as chains of semantic predications. Altogether, 14 out of the 19 associations that can be attributed to Swanson were retrieved using our approach. To the best of our knowledge, such an in-depth recovery and decomposition of Swanson’s Hypothesis has never been attempted. Conclusion In this work therefore, we presented a methodology for semi- automatically recovering and decomposing Swanson’s RS-DFO Hypothesis using semantic representations and graph algorithms. Our methodology provides new insights into potential prerequisites for semantics-driven Literature-Based Discovery (LBD). These suggest that three critical aspects of LBD include: 1) the need for more expressive representations beyond Swanson’s ABC model; 2) an ability to accurately extract semantic information from text; and 3) the semantic integration of scientific literature with structured background knowledge. PMID:23026233

Semi-Automatic Extraction Algorithm for Images of the Ciliary Muscle

PubMed Central

Kao, Chiu-Yen; Richdale, Kathryn; Sinnott, Loraine T.; Ernst, Lauren E.; Bailey, Melissa D.

2011-01-01

Purpose To development and evaluate a semi-automatic algorithm for segmentation and morphological assessment of the dimensions of the ciliary muscle in Visante™ Anterior Segment Optical Coherence Tomography images. Methods Geometric distortions in Visante images analyzed as binary files were assessed by imaging an optical flat and human donor tissue. The appropriate pixel/mm conversion factor to use for air (n = 1) was estimated by imaging calibration spheres. A semi-automatic algorithm was developed to extract the dimensions of the ciliary muscle from Visante images. Measurements were also made manually using Visante software calipers. Interclass correlation coefficients (ICC) and Bland-Altman analyses were used to compare the methods. A multilevel model was fitted to estimate the variance of algorithm measurements that was due to differences within- and between-examiners in scleral spur selection versus biological variability. Results The optical flat and the human donor tissue were imaged and appeared without geometric distortions in binary file format. Bland-Altman analyses revealed that caliper measurements tended to underestimate ciliary muscle thickness at 3 mm posterior to the scleral spur in subjects with the thickest ciliary muscles (t = 3.6, p < 0.001). The percent variance due to within- or between-examiner differences in scleral spur selection was found to be small (6%) when compared to the variance due to biological difference across subjects (80%). Using the mean of measurements from three images achieved an estimated ICC of 0.85. Conclusions The semi-automatic algorithm successfully segmented the ciliary muscle for further measurement. Using the algorithm to follow the scleral curvature to locate more posterior measurements is critical to avoid underestimating thickness measurements. This semi-automatic algorithm will allow for repeatable, efficient, and masked ciliary muscle measurements in large datasets. PMID:21169877
Ontology design patterns to disambiguate relations between genes and gene products in GENIA

PubMed Central

2011-01-01

Motivation Annotated reference corpora play an important role in biomedical information extraction. A semantic annotation of the natural language texts in these reference corpora using formal ontologies is challenging due to the inherent ambiguity of natural language. The provision of formal definitions and axioms for semantic annotations offers the means for ensuring consistency as well as enables the development of verifiable annotation guidelines. Consistent semantic annotations facilitate the automatic discovery of new information through deductive inferences. Results We provide a formal characterization of the relations used in the recent GENIA corpus annotations. For this purpose, we both select existing axiom systems based on the desired properties of the relations within the domain and develop new axioms for several relations. To apply this ontology of relations to the semantic annotation of text corpora, we implement two ontology design patterns. In addition, we provide a software application to convert annotated GENIA abstracts into OWL ontologies by combining both the ontology of relations and the design patterns. As a result, the GENIA abstracts become available as OWL ontologies and are amenable for automated verification, deductive inferences and other knowledge-based applications. Availability Documentation, implementation and examples are available from http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/. PMID:22166341
Neuronal Morphology goes Digital: A Research Hub for Cellular and System Neuroscience

PubMed Central

Parekh, Ruchi; Ascoli, Giorgio A.

2013-01-01

Summary The importance of neuronal morphology in brain function has been recognized for over a century. The broad applicability of “digital reconstructions” of neuron morphology across neuroscience sub-disciplines has stimulated the rapid development of numerous synergistic tools for data acquisition, anatomical analysis, three-dimensional rendering, electrophysiological simulation, growth models, and data sharing. Here we discuss the processes of histological labeling, microscopic imaging, and semi-automated tracing. Moreover, we provide an annotated compilation of currently available resources in this rich research “ecosystem” as a central reference for experimental and computational neuroscience. PMID:23522039
Active shape models incorporating isolated landmarks for medical image annotation

NASA Astrophysics Data System (ADS)

Norajitra, Tobias; Meinzer, Hans-Peter; Stieltjes, Bram; Maier-Hein, Klaus H.

2014-03-01

Apart from their robustness in anatomic surface segmentation, purely surface based 3D Active Shape Models lack the ability to automatically detect and annotate non-surface key points of interest. However, annotation of anatomic landmarks is desirable, as it yields additional anatomic and functional information. Moreover, landmark detection might help to further improve accuracy during ASM segmentation. We present an extension of surface-based 3D Active Shape Models incorporating isolated non-surface landmarks. Positions of isolated and surface landmarks are modeled conjoint within a point distribution model (PDM). Isolated landmark appearance is described by a set of haar-like features, supporting local landmark detection on the PDM estimates using a kNN-Classi er. Landmark detection was evaluated in a leave-one-out cross validation on a reference dataset comprising 45 CT volumes of the human liver after shape space projection. Depending on the anatomical landmark to be detected, our experiments have shown in about 1/4 up to more than 1/2 of all test cases a signi cant improvement in detection accuracy compared to the position estimates delivered by the PDM. Our results encourage further research with regard to the combination of shape priors and machine learning for landmark detection within the Active Shape Model Framework.
MIPS: curated databases and comprehensive secondary data resources in 2010.

PubMed

Mewes, H Werner; Ruepp, Andreas; Theis, Fabian; Rattei, Thomas; Walter, Mathias; Frishman, Dmitrij; Suhre, Karsten; Spannagl, Manuel; Mayer, Klaus F X; Stümpflen, Volker; Antonov, Alexey

2011-01-01

The Munich Information Center for Protein Sequences (MIPS at the Helmholtz Center for Environmental Health, Neuherberg, Germany) has many years of experience in providing annotated collections of biological data. Selected data sets of high relevance, such as model genomes, are subjected to careful manual curation, while the bulk of high-throughput data is annotated by automatic means. High-quality reference resources developed in the past and still actively maintained include Saccharomyces cerevisiae, Neurospora crassa and Arabidopsis thaliana genome databases as well as several protein interaction data sets (MPACT, MPPI and CORUM). More recent projects are PhenomiR, the database on microRNA-related phenotypes, and MIPS PlantsDB for integrative and comparative plant genome research. The interlinked resources SIMAP and PEDANT provide homology relationships as well as up-to-date and consistent annotation for 38,000,000 protein sequences. PPLIPS and CCancer are versatile tools for proteomics and functional genomics interfacing to a database of compilations from gene lists extracted from literature. A novel literature-mining tool, EXCERBT, gives access to structured information on classified relations between genes, proteins, phenotypes and diseases extracted from Medline abstracts by semantic analysis. All databases described here, as well as the detailed descriptions of our projects can be accessed through the MIPS WWW server (http://mips.helmholtz-muenchen.de).
MIPS: curated databases and comprehensive secondary data resources in 2010

PubMed Central

Mewes, H. Werner; Ruepp, Andreas; Theis, Fabian; Rattei, Thomas; Walter, Mathias; Frishman, Dmitrij; Suhre, Karsten; Spannagl, Manuel; Mayer, Klaus F.X.; Stümpflen, Volker; Antonov, Alexey

2011-01-01

The Munich Information Center for Protein Sequences (MIPS at the Helmholtz Center for Environmental Health, Neuherberg, Germany) has many years of experience in providing annotated collections of biological data. Selected data sets of high relevance, such as model genomes, are subjected to careful manual curation, while the bulk of high-throughput data is annotated by automatic means. High-quality reference resources developed in the past and still actively maintained include Saccharomyces cerevisiae, Neurospora crassa and Arabidopsis thaliana genome databases as well as several protein interaction data sets (MPACT, MPPI and CORUM). More recent projects are PhenomiR, the database on microRNA-related phenotypes, and MIPS PlantsDB for integrative and comparative plant genome research. The interlinked resources SIMAP and PEDANT provide homology relationships as well as up-to-date and consistent annotation for 38 000 000 protein sequences. PPLIPS and CCancer are versatile tools for proteomics and functional genomics interfacing to a database of compilations from gene lists extracted from literature. A novel literature-mining tool, EXCERBT, gives access to structured information on classified relations between genes, proteins, phenotypes and diseases extracted from Medline abstracts by semantic analysis. All databases described here, as well as the detailed descriptions of our projects can be accessed through the MIPS WWW server (http://mips.helmholtz-muenchen.de). PMID:21109531
MIPS: analysis and annotation of proteins from whole genomes in 2005

PubMed Central

Mewes, H. W.; Frishman, D.; Mayer, K. F. X.; Münsterkötter, M.; Noubibou, O.; Pagel, P.; Rattei, T.; Oesterheld, M.; Ruepp, A.; Stümpflen, V.

2006-01-01

The Munich Information Center for Protein Sequences (MIPS at the GSF), Neuherberg, Germany, provides resources related to genome information. Manually curated databases for several reference organisms are maintained. Several of these databases are described elsewhere in this and other recent NAR database issues. In a complementary effort, a comprehensive set of >400 genomes automatically annotated with the PEDANT system are maintained. The main goal of our current work on creating and maintaining genome databases is to extend gene centered information to information on interactions within a generic comprehensive framework. We have concentrated our efforts along three lines (i) the development of suitable comprehensive data structures and database technology, communication and query tools to include a wide range of different types of information enabling the representation of complex information such as functional modules or networks Genome Research Environment System, (ii) the development of databases covering computable information such as the basic evolutionary relations among all genes, namely SIMAP, the sequence similarity matrix and the CABiNet network analysis framework and (iii) the compilation and manual annotation of information related to interactions such as protein–protein interactions or other types of relations (e.g. MPCDB, MPPI, CYGD). All databases described and the detailed descriptions of our projects can be accessed through the MIPS WWW server (). PMID:16381839
MIPS: analysis and annotation of proteins from whole genomes in 2005.

PubMed

Mewes, H W; Frishman, D; Mayer, K F X; Münsterkötter, M; Noubibou, O; Pagel, P; Rattei, T; Oesterheld, M; Ruepp, A; Stümpflen, V

2006-01-01

The Munich Information Center for Protein Sequences (MIPS at the GSF), Neuherberg, Germany, provides resources related to genome information. Manually curated databases for several reference organisms are maintained. Several of these databases are described elsewhere in this and other recent NAR database issues. In a complementary effort, a comprehensive set of >400 genomes automatically annotated with the PEDANT system are maintained. The main goal of our current work on creating and maintaining genome databases is to extend gene centered information to information on interactions within a generic comprehensive framework. We have concentrated our efforts along three lines (i) the development of suitable comprehensive data structures and database technology, communication and query tools to include a wide range of different types of information enabling the representation of complex information such as functional modules or networks Genome Research Environment System, (ii) the development of databases covering computable information such as the basic evolutionary relations among all genes, namely SIMAP, the sequence similarity matrix and the CABiNet network analysis framework and (iii) the compilation and manual annotation of information related to interactions such as protein-protein interactions or other types of relations (e.g. MPCDB, MPPI, CYGD). All databases described and the detailed descriptions of our projects can be accessed through the MIPS WWW server (http://mips.gsf.de).
ALLocator: an interactive web platform for the analysis of metabolomic LC-ESI-MS datasets, enabling semi-automated, user-revised compound annotation and mass isotopomer ratio analysis.

PubMed

Kessler, Nikolas; Walter, Frederik; Persicke, Marcus; Albaum, Stefan P; Kalinowski, Jörn; Goesmann, Alexander; Niehaus, Karsten; Nattkemper, Tim W

2014-01-01

Adduct formation, fragmentation events and matrix effects impose special challenges to the identification and quantitation of metabolites in LC-ESI-MS datasets. An important step in compound identification is the deconvolution of mass signals. During this processing step, peaks representing adducts, fragments, and isotopologues of the same analyte are allocated to a distinct group, in order to separate peaks from coeluting compounds. From these peak groups, neutral masses and pseudo spectra are derived and used for metabolite identification via mass decomposition and database matching. Quantitation of metabolites is hampered by matrix effects and nonlinear responses in LC-ESI-MS measurements. A common approach to correct for these effects is the addition of a U-13C-labeled internal standard and the calculation of mass isotopomer ratios for each metabolite. Here we present a new web-platform for the analysis of LC-ESI-MS experiments. ALLocator covers the workflow from raw data processing to metabolite identification and mass isotopomer ratio analysis. The integrated processing pipeline for spectra deconvolution "ALLocatorSD" generates pseudo spectra and automatically identifies peaks emerging from the U-13C-labeled internal standard. Information from the latter improves mass decomposition and annotation of neutral losses. ALLocator provides an interactive and dynamic interface to explore and enhance the results in depth. Pseudo spectra of identified metabolites can be stored in user- and method-specific reference lists that can be applied on succeeding datasets. The potential of the software is exemplified in an experiment, in which abundance fold-changes of metabolites of the l-arginine biosynthesis in C. glutamicum type strain ATCC 13032 and l-arginine producing strain ATCC 21831 are compared. Furthermore, the capability for detection and annotation of uncommon large neutral losses is shown by the identification of (γ-)glutamyl dipeptides in the same strains. ALLocator is available online at: https://allocator.cebitec.uni-bielefeld.de. A login is required, but freely available.
An empirical study of ensemble-based semi-supervised learning approaches for imbalanced splice site datasets.

PubMed

Stanescu, Ana; Caragea, Doina

2015-01-01

Recent biochemical advances have led to inexpensive, time-efficient production of massive volumes of raw genomic data. Traditional machine learning approaches to genome annotation typically rely on large amounts of labeled data. The process of labeling data can be expensive, as it requires domain knowledge and expert involvement. Semi-supervised learning approaches that can make use of unlabeled data, in addition to small amounts of labeled data, can help reduce the costs associated with labeling. In this context, we focus on the problem of predicting splice sites in a genome using semi-supervised learning approaches. This is a challenging problem, due to the highly imbalanced distribution of the data, i.e., small number of splice sites as compared to the number of non-splice sites. To address this challenge, we propose to use ensembles of semi-supervised classifiers, specifically self-training and co-training classifiers. Our experiments on five highly imbalanced splice site datasets, with positive to negative ratios of 1-to-99, showed that the ensemble-based semi-supervised approaches represent a good choice, even when the amount of labeled data consists of less than 1% of all training data. In particular, we found that ensembles of co-training and self-training classifiers that dynamically balance the set of labeled instances during the semi-supervised iterations show improvements over the corresponding supervised ensemble baselines. In the presence of limited amounts of labeled data, ensemble-based semi-supervised approaches can successfully leverage the unlabeled data to enhance supervised ensembles learned from highly imbalanced data distributions. Given that such distributions are common for many biological sequence classification problems, our work can be seen as a stepping stone towards more sophisticated ensemble-based approaches to biological sequence annotation in a semi-supervised framework.
An empirical study of ensemble-based semi-supervised learning approaches for imbalanced splice site datasets

PubMed Central

2015-01-01

Background Recent biochemical advances have led to inexpensive, time-efficient production of massive volumes of raw genomic data. Traditional machine learning approaches to genome annotation typically rely on large amounts of labeled data. The process of labeling data can be expensive, as it requires domain knowledge and expert involvement. Semi-supervised learning approaches that can make use of unlabeled data, in addition to small amounts of labeled data, can help reduce the costs associated with labeling. In this context, we focus on the problem of predicting splice sites in a genome using semi-supervised learning approaches. This is a challenging problem, due to the highly imbalanced distribution of the data, i.e., small number of splice sites as compared to the number of non-splice sites. To address this challenge, we propose to use ensembles of semi-supervised classifiers, specifically self-training and co-training classifiers. Results Our experiments on five highly imbalanced splice site datasets, with positive to negative ratios of 1-to-99, showed that the ensemble-based semi-supervised approaches represent a good choice, even when the amount of labeled data consists of less than 1% of all training data. In particular, we found that ensembles of co-training and self-training classifiers that dynamically balance the set of labeled instances during the semi-supervised iterations show improvements over the corresponding supervised ensemble baselines. Conclusions In the presence of limited amounts of labeled data, ensemble-based semi-supervised approaches can successfully leverage the unlabeled data to enhance supervised ensembles learned from highly imbalanced data distributions. Given that such distributions are common for many biological sequence classification problems, our work can be seen as a stepping stone towards more sophisticated ensemble-based approaches to biological sequence annotation in a semi-supervised framework. PMID:26356316
Quantitative analysis of the patellofemoral motion pattern using semi-automatic processing of 4D CT data.

PubMed

Forsberg, Daniel; Lindblom, Maria; Quick, Petter; Gauffin, Håkan

2016-09-01

To present a semi-automatic method with minimal user interaction for quantitative analysis of the patellofemoral motion pattern. 4D CT data capturing the patellofemoral motion pattern of a continuous flexion and extension were collected for five patients prone to patellar luxation both pre- and post-surgically. For the proposed method, an observer would place landmarks in a single 3D volume, which then are automatically propagated to the other volumes in a time sequence. From the landmarks in each volume, the measures patellar displacement, patellar tilt and angle between femur and tibia were computed. Evaluation of the observer variability showed the proposed semi-automatic method to be favorable over a fully manual counterpart, with an observer variability of approximately 1.5[Formula: see text] for the angle between femur and tibia, 1.5 mm for the patellar displacement, and 4.0[Formula: see text]-5.0[Formula: see text] for the patellar tilt. The proposed method showed that surgery reduced the patellar displacement and tilt at maximum extension with approximately 10-15 mm and 15[Formula: see text]-20[Formula: see text] for three patients but with less evident differences for two of the patients. A semi-automatic method suitable for quantification of the patellofemoral motion pattern as captured by 4D CT data has been presented. Its observer variability is on par with that of other methods but with the distinct advantage to support continuous motions during the image acquisition.
Application of a Novel Semi-Automatic Technique for Determining the Bilateral Symmetry Plane of the Facial Skeleton of Normal Adult Males.

PubMed

Roumeliotis, Grayson; Willing, Ryan; Neuert, Mark; Ahluwalia, Romy; Jenkyn, Thomas; Yazdani, Arjang

2015-09-01

The accurate assessment of symmetry in the craniofacial skeleton is important for cosmetic and reconstructive craniofacial surgery. Although there have been several published attempts to develop an accurate system for determining the correct plane of symmetry, all are inaccurate and time consuming. Here, the authors applied a novel semi-automatic method for the calculation of craniofacial symmetry, based on principal component analysis and iterative corrective point computation, to a large sample of normal adult male facial computerized tomography scans obtained clinically (n = 32). The authors hypothesized that this method would generate planes of symmetry that would result in less error when one side of the face was compared to the other than a symmetry plane generated using a plane defined by cephalometric landmarks. When a three-dimensional model of one side of the face was reflected across the semi-automatic plane of symmetry there was less error than when reflected across the cephalometric plane. The semi-automatic plane was also more accurate when the locations of bilateral cephalometric landmarks (eg, frontozygomatic sutures) were compared across the face. The authors conclude that this method allows for accurate and fast measurements of craniofacial symmetry. This has important implications for studying the development of the facial skeleton, and clinical application for reconstruction.
Multi-label literature classification based on the Gene Ontology graph.

PubMed

Jin, Bo; Muller, Brian; Zhai, Chengxiang; Lu, Xinghua

2008-12-08

The Gene Ontology is a controlled vocabulary for representing knowledge related to genes and proteins in a computable form. The current effort of manually annotating proteins with the Gene Ontology is outpaced by the rate of accumulation of biomedical knowledge in literature, which urges the development of text mining approaches to facilitate the process by automatically extracting the Gene Ontology annotation from literature. The task is usually cast as a text classification problem, and contemporary methods are confronted with unbalanced training data and the difficulties associated with multi-label classification. In this research, we investigated the methods of enhancing automatic multi-label classification of biomedical literature by utilizing the structure of the Gene Ontology graph. We have studied three graph-based multi-label classification algorithms, including a novel stochastic algorithm and two top-down hierarchical classification methods for multi-label literature classification. We systematically evaluated and compared these graph-based classification algorithms to a conventional flat multi-label algorithm. The results indicate that, through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods can significantly improve predictions of the Gene Ontology terms implied by the analyzed text. Furthermore, the graph-based multi-label classifiers are capable of suggesting Gene Ontology annotations (to curators) that are closely related to the true annotations even if they fail to predict the true ones directly. A software package implementing the studied algorithms is available for the research community. Through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods have better potential than the conventional flat multi-label classification approach to facilitate protein annotation based on the literature.
An algorithm for optimal fusion of atlases with different labeling protocols

PubMed Central

Iglesias, Juan Eugenio; Sabuncu, Mert Rory; Aganj, Iman; Bhatt, Priyanka; Casillas, Christen; Salat, David; Boxer, Adam; Fischl, Bruce; Van Leemput, Koen

2014-01-01

In this paper we present a novel label fusion algorithm suited for scenarios in which different manual delineation protocols with potentially disparate structures have been used to annotate the training scans (hereafter referred to as “atlases”). Such scenarios arise when atlases have missing structures, when they have been labeled with different levels of detail, or when they have been taken from different heterogeneous databases. The proposed algorithm can be used to automatically label a novel scan with any of the protocols from the training data. Further, it enables us to generate new labels that are not present in any delineation protocol by defining intersections on the underling labels. We first use probabilistic models of label fusion to generalize three popular label fusion techniques to the multi-protocol setting: majority voting, semi-locally weighted voting and STAPLE. Then, we identify some shortcomings of the generalized methods, namely the inability to produce meaningful posterior probabilities for the different labels (majority voting, semi-locally weighted voting) and to exploit the similarities between the atlases (all three methods). Finally, we propose a novel generative label fusion model that can overcome these drawbacks. We use the proposed method to combine four brain MRI datasets labeled with different protocols (with a total of 102 unique labeled structures) to produce segmentations of 148 brain regions. Using cross-validation, we show that the proposed algorithm outperforms the generalizations of majority voting, semi-locally weighted voting and STAPLE (mean Dice score 83%, vs. 77%, 80% and 79%, respectively). We also evaluated the proposed algorithm in an aging study, successfully reproducing some well-known results in cortical and subcortical structures. PMID:25463466
Semantic Annotation of Computational Components

NASA Technical Reports Server (NTRS)

Vanderbilt, Peter; Mehrotra, Piyush

2004-01-01

This paper describes a methodology to specify machine-processable semantic descriptions of computational components to enable them to be shared and reused. A particular focus of this scheme is to enable automatic compositon of such components into simple work-flows.
Generating quality word sense disambiguation test sets based on MeSH indexing.

PubMed

Fan, Jung-Wei; Friedman, Carol

2009-11-14

Word sense disambiguation (WSD) determines the correct meaning of a word that has more than one meaning, and is a critical step in biomedical natural language processing, as interpretation of information in text can be correct only if the meanings of their component terms are correctly identified first. Quality evaluation sets are important to WSD because they can be used as representative samples for developing automatic programs and as referees for comparing different WSD programs. To help create quality test sets for WSD, we developed a MeSH-based automatic sense-tagging method that preferentially annotates terms being topical of the text. Preliminary results were promising and revealed important issues to be addressed in biomedical WSD research. We also suggest that, by cross-validating with 2 or 3 annotators, the method should be able to efficiently generate quality WSD test sets. Online supplement is available at: http://www.dbmi.columbia.edu/~juf7002/AMIA09.
AdjScales: Visualizing Differences between Adjectives for Language Learners

NASA Astrophysics Data System (ADS)

Sheinman, Vera; Tokunaga, Takenobu

In this study we introduce AdjScales, a method for scaling similar adjectives by their strength. It combines existing Web-based computational linguistic techniques in order to automatically differentiate between similar adjectives that describe the same property by strength. Though this kind of information is rarely present in most of the lexical resources and dictionaries, it may be useful for language learners that try to distinguish between similar words. Additionally, learners might gain from a simple visualization of these differences using unidimensional scales. The method is evaluated by comparison with annotation on a subset of adjectives from WordNet by four native English speakers. It is also compared against two non-native speakers of English. The collected annotation is an interesting resource in its own right. This work is a first step toward automatic differentiation of meaning between similar words for language learners. AdjScales can be useful for lexical resource enhancement.
DoctorEye: A clinically driven multifunctional platform, for accurate processing of tumors in medical images.

PubMed

Skounakis, Emmanouil; Farmaki, Christina; Sakkalis, Vangelis; Roniotis, Alexandros; Banitsas, Konstantinos; Graf, Norbert; Marias, Konstantinos

2010-01-01

This paper presents a novel, open access interactive platform for 3D medical image analysis, simulation and visualization, focusing in oncology images. The platform was developed through constant interaction and feedback from expert clinicians integrating a thorough analysis of their requirements while having an ultimate goal of assisting in accurately delineating tumors. It allows clinicians not only to work with a large number of 3D tomographic datasets but also to efficiently annotate multiple regions of interest in the same session. Manual and semi-automatic segmentation techniques combined with integrated correction tools assist in the quick and refined delineation of tumors while different users can add different components related to oncology such as tumor growth and simulation algorithms for improving therapy planning. The platform has been tested by different users and over large number of heterogeneous tomographic datasets to ensure stability, usability, extensibility and robustness with promising results. the platform, a manual and tutorial videos are available at: http://biomodeling.ics.forth.gr. it is free to use under the GNU General Public License.
KUTE-BASE: storing, downloading and exporting MIAME-compliant microarray experiments in minutes rather than hours.

PubMed

Draghici, Sorin; Tarca, Adi L; Yu, Longfei; Ethier, Stephen; Romero, Roberto

2008-03-01

The BioArray Software Environment (BASE) is a very popular MIAME-compliant, web-based microarray data repository. However in BASE, like in most other microarray data repositories, the experiment annotation and raw data uploading can be very timeconsuming, especially for large microarray experiments. We developed KUTE (Karmanos Universal daTabase for microarray Experiments), as a plug-in for BASE 2.0 that addresses these issues. KUTE provides an automatic experiment annotation feature and a completely redesigned data work-flow that dramatically reduce the human-computer interaction time. For instance, in BASE 2.0 a typical Affymetrix experiment involving 100 arrays required 4 h 30 min of user interaction time forexperiment annotation, and 45 min for data upload/download. In contrast, for the same experiment, KUTE required only 28 min of user interaction time for experiment annotation, and 3.3 min for data upload/download. http://vortex.cs.wayne.edu/kute/index.html.

System for definition of the central-chest vasculature

NASA Astrophysics Data System (ADS)

Taeprasartsit, Pinyo; Higgins, William E.

2009-02-01

Accurate definition of the central-chest vasculature from three-dimensional (3D) multi-detector CT (MDCT) images is important for pulmonary applications. For instance, the aorta and pulmonary artery help in automatic definition of the Mountain lymph-node stations for lung-cancer staging. This work presents a system for defining major vascular structures in the central chest. The system provides automatic methods for extracting the aorta and pulmonary artery and semi-automatic methods for extracting the other major central chest arteries/veins, such as the superior vena cava and azygos vein. Automatic aorta and pulmonary artery extraction are performed by model fitting and selection. The system also extracts certain vascular structure information to validate outputs. A semi-automatic method extracts vasculature by finding the medial axes between provided important sites. Results of the system are applied to lymph-node station definition and guidance of bronchoscopic biopsy.
BioCreative V CDR task corpus: a resource for chemical disease relation extraction.

PubMed

Li, Jiao; Sun, Yueping; Johnson, Robin J; Sciaky, Daniela; Wei, Chih-Hsuan; Leaman, Robert; Davis, Allan Peter; Mattingly, Carolyn J; Wiegers, Thomas C; Lu, Zhiyong

2016-01-01

Community-run, formal evaluations and manually annotated text corpora are critically important for advancing biomedical text-mining research. Recently in BioCreative V, a new challenge was organized for the tasks of disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction. Given the nature of both tasks, a test collection is required to contain both disease/chemical annotations and relation annotations in the same set of articles. Despite previous efforts in biomedical corpus construction, none was found to be sufficient for the task. Thus, we developed our own corpus called BC5CDR during the challenge by inviting a team of Medical Subject Headings (MeSH) indexers for disease/chemical entity annotation and Comparative Toxicogenomics Database (CTD) curators for CID relation annotation. To ensure high annotation quality and productivity, detailed annotation guidelines and automatic annotation tools were provided. The resulting BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions. Each entity annotation includes both the mention text spans and normalized concept identifiers, using MeSH as the controlled vocabulary. To ensure accuracy, the entities were first captured independently by two annotators followed by a consensus annotation: The average inter-annotator agreement (IAA) scores were 87.49% and 96.05% for the disease and chemicals, respectively, in the test set according to the Jaccard similarity coefficient. Our corpus was successfully used for the BioCreative V challenge tasks and should serve as a valuable resource for the text-mining research community.Database URL: http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/. Published by Oxford University Press 2016. This work is written by US Government employees and is in the public domain in the United States.
HEPMath 1.4: A mathematica package for semi-automatic computations in high energy physics

NASA Astrophysics Data System (ADS)

Wiebusch, Martin

2015-10-01

This article introduces the Mathematica package HEPMath which provides a number of utilities and algorithms for High Energy Physics computations in Mathematica. Its functionality is similar to packages like FormCalc or FeynCalc, but it takes a more complete and extensible approach to implementing common High Energy Physics notations in the Mathematica language, in particular those related to tensors and index contractions. It also provides a more flexible method for the generation of numerical code which is based on new features for C code generation in Mathematica. In particular it can automatically generate Python extension modules which make the compiled functions callable from Python, thus eliminating the need to write any code in a low-level language like C or Fortran. It also contains seamless interfaces to LHAPDF, FeynArts, and LoopTools.
Preclinical Biokinetic Modelling of Tc-99m Radiophamaceuticals Obtained from Semi-Automatic Image Processing.

PubMed

Cornejo-Aragón, Luz G; Santos-Cuevas, Clara L; Ocampo-García, Blanca E; Chairez-Oria, Isaac; Diaz-Nieto, Lorenza; García-Quiroz, Janice

2017-01-01

The aim of this study was to develop a semi automatic image processing algorithm (AIPA) based on the simultaneous information provided by X-ray and radioisotopic images to determine the biokinetic models of Tc-99m radiopharmaceuticals from quantification of image radiation activity in murine models. These radioisotopic images were obtained by a CCD (charge couple device) camera coupled to an ultrathin phosphorous screen in a preclinical multimodal imaging system (Xtreme, Bruker). The AIPA consisted of different image processing methods for background, scattering and attenuation correction on the activity quantification. A set of parametric identification algorithms was used to obtain the biokinetic models that characterize the interaction between different tissues and the radiopharmaceuticals considered in the study. The set of biokinetic models corresponded to the Tc-99m biodistribution observed in different ex vivo studies. This fact confirmed the contribution of the semi-automatic image processing technique developed in this study.
Chromatin accessibility prediction via a hybrid deep convolutional neural network.

PubMed

Liu, Qiao; Xia, Fei; Yin, Qijin; Jiang, Rui

2018-03-01

A majority of known genetic variants associated with human-inherited diseases lie in non-coding regions that lack adequate interpretation, making it indispensable to systematically discover functional sites at the whole genome level and precisely decipher their implications in a comprehensive manner. Although computational approaches have been complementing high-throughput biological experiments towards the annotation of the human genome, it still remains a big challenge to accurately annotate regulatory elements in the context of a specific cell type via automatic learning of the DNA sequence code from large-scale sequencing data. Indeed, the development of an accurate and interpretable model to learn the DNA sequence signature and further enable the identification of causative genetic variants has become essential in both genomic and genetic studies. We proposed Deopen, a hybrid framework mainly based on a deep convolutional neural network, to automatically learn the regulatory code of DNA sequences and predict chromatin accessibility. In a series of comparison with existing methods, we show the superior performance of our model in not only the classification of accessible regions against background sequences sampled at random, but also the regression of DNase-seq signals. Besides, we further visualize the convolutional kernels and show the match of identified sequence signatures and known motifs. We finally demonstrate the sensitivity of our model in finding causative noncoding variants in the analysis of a breast cancer dataset. We expect to see wide applications of Deopen with either public or in-house chromatin accessibility data in the annotation of the human genome and the identification of non-coding variants associated with diseases. Deopen is freely available at https://github.com/kimmo1019/Deopen. ruijiang@tsinghua.edu.cn. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
Self-organizing ontology of biochemically relevant small molecules

PubMed Central

2012-01-01

Background The advent of high-throughput experimentation in biochemistry has led to the generation of vast amounts of chemical data, necessitating the development of novel analysis, characterization, and cataloguing techniques and tools. Recently, a movement to publically release such data has advanced biochemical structure-activity relationship research, while providing new challenges, the biggest being the curation, annotation, and classification of this information to facilitate useful biochemical pattern analysis. Unfortunately, the human resources currently employed by the organizations supporting these efforts (e.g. ChEBI) are expanding linearly, while new useful scientific information is being released in a seemingly exponential fashion. Compounding this, currently existing chemical classification and annotation systems are not amenable to automated classification, formal and transparent chemical class definition axiomatization, facile class redefinition, or novel class integration, thus further limiting chemical ontology growth by necessitating human involvement in curation. Clearly, there is a need for the automation of this process, especially for novel chemical entities of biological interest. Results To address this, we present a formal framework based on Semantic Web technologies for the automatic design of chemical ontology which can be used for automated classification of novel entities. We demonstrate the automatic self-assembly of a structure-based chemical ontology based on 60 MeSH and 40 ChEBI chemical classes. This ontology is then used to classify 200 compounds with an accuracy of 92.7%. We extend these structure-based classes with molecular feature information and demonstrate the utility of our framework for classification of functionally relevant chemicals. Finally, we discuss an iterative approach that we envision for future biochemical ontology development. Conclusions We conclude that the proposed methodology can ease the burden of chemical data annotators and dramatically increase their productivity. We anticipate that the use of formal logic in our proposed framework will make chemical classification criteria more transparent to humans and machines alike and will thus facilitate predictive and integrative bioactivity model development. PMID:22221313
AIRS Maps from Space Processing Software

NASA Technical Reports Server (NTRS)

Thompson, Charles K.; Licata, Stephen J.

2012-01-01

This software package processes Atmospheric Infrared Sounder (AIRS) Level 2 swath standard product geophysical parameters, and generates global, colorized, annotated maps. It automatically generates daily and multi-day averaged colorized and annotated maps of various AIRS Level 2 swath geophysical parameters. It also generates AIRS input data sets for Eyes on Earth, Puffer-sphere, and Magic Planet. This program is tailored to AIRS Level 2 data products. It re-projects data into 1/4-degree grids that can be combined and averaged for any number of days. The software scales and colorizes global grids utilizing AIRS-specific color tables, and annotates images with title and color bar. This software can be tailored for use with other swath data products for the purposes of visualization.
A semi-automatic framework of measuring pulmonary arterial metrics at anatomic airway locations using CT imaging

NASA Astrophysics Data System (ADS)

Jin, Dakai; Guo, Junfeng; Dougherty, Timothy M.; Iyer, Krishna S.; Hoffman, Eric A.; Saha, Punam K.

2016-03-01

Pulmonary vascular dysfunction has been implicated in smoking-related susceptibility to emphysema. With the growing interest in characterizing arterial morphology for early evaluation of the vascular role in pulmonary diseases, there is an increasing need for the standardization of a framework for arterial morphological assessment at airway segmental levels. In this paper, we present an effective and robust semi-automatic framework to segment pulmonary arteries at different anatomic airway branches and measure their cross-sectional area (CSA). The method starts with user-specified endpoints of a target arterial segment through a custom-built graphical user interface. It then automatically detect the centerline joining the endpoints, determines the local structure orientation and computes the CSA along the centerline after filtering out the adjacent pulmonary structures, such as veins or airway walls. Several new techniques are presented, including collision-impact based cost function for centerline detection, radial sample-line based CSA computation, and outlier analysis of radial distance to subtract adjacent neighboring structures in the CSA measurement. The method was applied to repeat-scan pulmonary multirow detector CT (MDCT) images from ten healthy subjects (age: 21-48 Yrs, mean: 28.5 Yrs; 7 female) at functional residual capacity (FRC). The reproducibility of computed arterial CSA from four airway segmental regions in middle and lower lobes was analyzed. The overall repeat-scan intra-class correlation (ICC) of the computed CSA from all four airway regions in ten subjects was 96% with maximum ICC found at LB10 and RB4 regions.
Automatic reconstruction of a bacterial regulatory network using Natural Language Processing

PubMed Central

Rodríguez-Penagos, Carlos; Salgado, Heladia; Martínez-Flores, Irma; Collado-Vides, Julio

2007-01-01

Background Manual curation of biological databases, an expensive and labor-intensive process, is essential for high quality integrated data. In this paper we report the implementation of a state-of-the-art Natural Language Processing system that creates computer-readable networks of regulatory interactions directly from different collections of abstracts and full-text papers. Our major aim is to understand how automatic annotation using Text-Mining techniques can complement manual curation of biological databases. We implemented a rule-based system to generate networks from different sets of documents dealing with regulation in Escherichia coli K-12. Results Performance evaluation is based on the most comprehensive transcriptional regulation database for any organism, the manually-curated RegulonDB, 45% of which we were able to recreate automatically. From our automated analysis we were also able to find some new interactions from papers not already curated, or that were missed in the manual filtering and review of the literature. We also put forward a novel Regulatory Interaction Markup Language better suited than SBML for simultaneously representing data of interest for biologists and text miners. Conclusion Manual curation of the output of automatic processing of text is a good way to complement a more detailed review of the literature, either for validating the results of what has been already annotated, or for discovering facts and information that might have been overlooked at the triage or curation stages. PMID:17683642
Chest wall segmentation in automated 3D breast ultrasound scans.

PubMed

Tan, Tao; Platel, Bram; Mann, Ritse M; Huisman, Henkjan; Karssemeijer, Nico

2013-12-01

In this paper, we present an automatic method to segment the chest wall in automated 3D breast ultrasound images. Determining the location of the chest wall in automated 3D breast ultrasound images is necessary in computer-aided detection systems to remove automatically detected cancer candidates beyond the chest wall and it can be of great help for inter- and intra-modal image registration. We show that the visible part of the chest wall in an automated 3D breast ultrasound image can be accurately modeled by a cylinder. We fit the surface of our cylinder model to a set of automatically detected rib-surface points. The detection of the rib-surface points is done by a classifier using features representing local image intensity patterns and presence of rib shadows. Due to attenuation of the ultrasound signal, a clear shadow is visible behind the ribs. Evaluation of our segmentation method is done by computing the distance of manually annotated rib points to the surface of the automatically detected chest wall. We examined the performance on images obtained with the two most common 3D breast ultrasound devices in the market. In a dataset of 142 images, the average mean distance of the annotated points to the segmented chest wall was 5.59 ± 3.08 mm. Copyright © 2012 Elsevier B.V. All rights reserved.
Automatic 2.5-D Facial Landmarking and Emotion Annotation for Social Interaction Assistance.

PubMed

Zhao, Xi; Zou, Jianhua; Li, Huibin; Dellandrea, Emmanuel; Kakadiaris, Ioannis A; Chen, Liming

2016-09-01

People with low vision, Alzheimer's disease, and autism spectrum disorder experience difficulties in perceiving or interpreting facial expression of emotion in their social lives. Though automatic facial expression recognition (FER) methods on 2-D videos have been extensively investigated, their performance was constrained by challenges in head pose and lighting conditions. The shape information in 3-D facial data can reduce or even overcome these challenges. However, high expenses of 3-D cameras prevent their widespread use. Fortunately, 2.5-D facial data from emerging portable RGB-D cameras provide a good balance for this dilemma. In this paper, we propose an automatic emotion annotation solution on 2.5-D facial data collected from RGB-D cameras. The solution consists of a facial landmarking method and a FER method. Specifically, we propose building a deformable partial face model and fit the model to a 2.5-D face for localizing facial landmarks automatically. In FER, a novel action unit (AU) space-based FER method has been proposed. Facial features are extracted using landmarks and further represented as coordinates in the AU space, which are classified into facial expressions. Evaluated on three publicly accessible facial databases, namely EURECOM, FRGC, and Bosphorus databases, the proposed facial landmarking and expression recognition methods have achieved satisfactory results. Possible real-world applications using our algorithms have also been discussed.
Automatic 3D reconstruction of electrophysiology catheters from two-view monoplane C-arm image sequences.

PubMed

Baur, Christoph; Milletari, Fausto; Belagiannis, Vasileios; Navab, Nassir; Fallavollita, Pascal

2016-07-01

Catheter guidance is a vital task for the success of electrophysiology interventions. It is usually provided through fluoroscopic images that are taken intra-operatively. The cardiologists, who are typically equipped with C-arm systems, scan the patient from multiple views rotating the fluoroscope around one of its axes. The resulting sequences allow the cardiologists to build a mental model of the 3D position of the catheters and interest points from the multiple views. We describe and compare different 3D catheter reconstruction strategies and ultimately propose a novel and robust method for the automatic reconstruction of 3D catheters in non-synchronized fluoroscopic sequences. This approach does not purely rely on triangulation but incorporates prior knowledge about the catheters. In conjunction with an automatic detection method, we demonstrate the performance of our method compared to ground truth annotations. In our experiments that include 20 biplane datasets, we achieve an average reprojection error of 0.43 mm and an average reconstruction error of 0.67 mm compared to gold standard annotation. In clinical practice, catheters suffer from complex motion due to the combined effect of heartbeat and respiratory motion. As a result, any 3D reconstruction algorithm via triangulation is imprecise. We have proposed a new method that is fully automatic and highly accurate to reconstruct catheters in three dimensions.
Miniaturization of the Clonogenic Assay Using Confluence Measurement

PubMed Central

Mayr, Christian; Beyreis, Marlena; Dobias, Heidemarie; Gaisberger, Martin; Pichler, Martin; Ritter, Markus; Jakab, Martin; Neureiter, Daniel; Kiesslich, Tobias

2018-01-01

The clonogenic assay is a widely used method to study the ability of cells to ‘infinitely’ produce progeny and is, therefore, used as a tool in tumor biology to measure tumor-initiating capacity and stem cell status. However, the standard protocol of using 6-well plates has several disadvantages. By miniaturizing the assay to a 96-well microplate format, as well as by utilizing the confluence detection function of a multimode reader, we here describe a new and modified protocol that allows comprehensive experimental setups and a non-endpoint, label-free semi-automatic analysis. Comparison of bright field images with confluence images demonstrated robust and reproducible detection of clones by the confluence detection function. Moreover, time-resolved non-endpoint confluence measurement of the same well showed that semi-automatic analysis was suitable for determining the mean size and colony number. By treating cells with an inhibitor of clonogenic growth (PTC-209), we show that our modified protocol is suitable for comprehensive (broad concentration range, addition of technical replicates) concentration- and time-resolved analysis of the effect of substances or treatments on clonogenic growth. In summary, this protocol represents a time- and cost-effective alternative to the commonly used 6-well protocol (with endpoint staining) and also provides additional information about the kinetics of clonogenic growth. PMID:29510509
FunSimMat: a comprehensive functional similarity database

PubMed Central

Schlicker, Andreas; Albrecht, Mario

2008-01-01

Functional similarity based on Gene Ontology (GO) annotation is used in diverse applications like gene clustering, gene expression data analysis, protein interaction prediction and evaluation. However, there exists no comprehensive resource of functional similarity values although such a database would facilitate the use of functional similarity measures in different applications. Here, we describe FunSimMat (Functional Similarity Matrix, http://funsimmat.bioinf.mpi-inf.mpg.de/), a large new database that provides several different semantic similarity measures for GO terms. It offers various precomputed functional similarity values for proteins contained in UniProtKB and for protein families in Pfam and SMART. The web interface allows users to efficiently perform both semantic similarity searches with GO terms and functional similarity searches with proteins or protein families. All results can be downloaded in tab-delimited files for use with other tools. An additional XML–RPC interface gives automatic online access to FunSimMat for programs and remote services. PMID:17932054
Towards the automated analysis and database development of defibrillator data from cardiac arrest.

PubMed

Eftestøl, Trygve; Sherman, Lawrence D

2014-01-01

During resuscitation of cardiac arrest victims a variety of information in electronic format is recorded as part of the documentation of the patient care contact and in order to be provided for case review for quality improvement. Such review requires considerable effort and resources. There is also the problem of interobserver effects. We show that it is possible to efficiently analyze resuscitation episodes automatically using a minimal set of the available information. A minimal set of variables is defined which describe therapeutic events (compression sequences and defibrillations) and corresponding patient response events (annotated rhythm transitions). From this a state sequence representation of the resuscitation episode is constructed and an algorithm is developed for reasoning with this representation and extract review variables automatically. As a case study, the method is applied to the data abstraction process used in the King County EMS. The automatically generated variables are compared to the original ones with accuracies ≥ 90% for 18 variables and ≥ 85% for the remaining four variables. It is possible to use the information present in the CPR process data recorded by the AED along with rhythm and chest compression annotations to automate the episode review.
Annotation of phenotypic diversity: decoupling data curation and ontology curation using Phenex.

PubMed

Balhoff, James P; Dahdul, Wasila M; Dececchi, T Alexander; Lapp, Hilmar; Mabee, Paula M; Vision, Todd J

2014-01-01

Phenex (http://phenex.phenoscape.org/) is a desktop application for semantically annotating the phenotypic character matrix datasets common in evolutionary biology. Since its initial publication, we have added new features that address several major bottlenecks in the efficiency of the phenotype curation process: allowing curators during the data curation phase to provisionally request terms that are not yet available from a relevant ontology; supporting quality control against annotation guidelines to reduce later manual review and revision; and enabling the sharing of files for collaboration among curators. We decoupled data annotation from ontology development by creating an Ontology Request Broker (ORB) within Phenex. Curators can use the ORB to request a provisional term for use in data annotation; the provisional term can be automatically replaced with a permanent identifier once the term is added to an ontology. We added a set of annotation consistency checks to prevent common curation errors, reducing the need for later correction. We facilitated collaborative editing by improving the reliability of Phenex when used with online folder sharing services, via file change monitoring and continual autosave. With the addition of these new features, and in particular the Ontology Request Broker, Phenex users have been able to focus more effectively on data annotation. Phenoscape curators using Phenex have reported a smoother annotation workflow, with much reduced interruptions from ontology maintenance and file management issues.
Large-scale inference of gene function through phylogenetic annotation of Gene Ontology terms: case study of the apoptosis and autophagy cellular processes.

PubMed

Feuermann, Marc; Gaudet, Pascale; Mi, Huaiyu; Lewis, Suzanna E; Thomas, Paul D

2016-01-01

We previously reported a paradigm for large-scale phylogenomic analysis of gene families that takes advantage of the large corpus of experimentally supported Gene Ontology (GO) annotations. This 'GO Phylogenetic Annotation' approach integrates GO annotations from evolutionarily related genes across ∼100 different organisms in the context of a gene family tree, in which curators build an explicit model of the evolution of gene functions. GO Phylogenetic Annotation models the gain and loss of functions in a gene family tree, which is used to infer the functions of uncharacterized (or incompletely characterized) gene products, even for human proteins that are relatively well studied. Here, we report our results from applying this paradigm to two well-characterized cellular processes, apoptosis and autophagy. This revealed several important observations with respect to GO annotations and how they can be used for function inference. Notably, we applied only a small fraction of the experimentally supported GO annotations to infer function in other family members. The majority of other annotations describe indirect effects, phenotypes or results from high throughput experiments. In addition, we show here how feedback from phylogenetic annotation leads to significant improvements in the PANTHER trees, the GO annotations and GO itself. Thus GO phylogenetic annotation both increases the quantity and improves the accuracy of the GO annotations provided to the research community. We expect these phylogenetically based annotations to be of broad use in gene enrichment analysis as well as other applications of GO annotations.Database URL: http://amigo.geneontology.org/amigo. © The Author(s) 2016. Published by Oxford University Press.
User Interaction in Semi-Automatic Segmentation of Organs at Risk: a Case Study in Radiotherapy.

PubMed

Ramkumar, Anjana; Dolz, Jose; Kirisli, Hortense A; Adebahr, Sonja; Schimek-Jasch, Tanja; Nestle, Ursula; Massoptier, Laurent; Varga, Edit; Stappers, Pieter Jan; Niessen, Wiro J; Song, Yu

2016-04-01

Accurate segmentation of organs at risk is an important step in radiotherapy planning. Manual segmentation being a tedious procedure and prone to inter- and intra-observer variability, there is a growing interest in automated segmentation methods. However, automatic methods frequently fail to provide satisfactory result, and post-processing corrections are often needed. Semi-automatic segmentation methods are designed to overcome these problems by combining physicians' expertise and computers' potential. This study evaluates two semi-automatic segmentation methods with different types of user interactions, named the "strokes" and the "contour", to provide insights into the role and impact of human-computer interaction. Two physicians participated in the experiment. In total, 42 case studies were carried out on five different types of organs at risk. For each case study, both the human-computer interaction process and quality of the segmentation results were measured subjectively and objectively. Furthermore, different measures of the process and the results were correlated. A total of 36 quantifiable and ten non-quantifiable correlations were identified for each type of interaction. Among those pairs of measures, 20 of the contour method and 22 of the strokes method were strongly or moderately correlated, either directly or inversely. Based on those correlated measures, it is concluded that: (1) in the design of semi-automatic segmentation methods, user interactions need to be less cognitively challenging; (2) based on the observed workflows and preferences of physicians, there is a need for flexibility in the interface design; (3) the correlated measures provide insights that can be used in improving user interaction design.
Annotations of Mexican bullfighting videos for semantic index

NASA Astrophysics Data System (ADS)

Montoya Obeso, Abraham; Oropesa Morales, Lester Arturo; Fernando Vázquez, Luis; Cocolán Almeda, Sara Ivonne; Stoian, Andrei; García Vázquez, Mireya Saraí; Zamudio Fuentes, Luis Miguel; Montiel Perez, Jesús Yalja; de la O Torres, Saul; Ramírez Acosta, Alejandro Alvaro

2015-09-01

The video annotation is important for web indexing and browsing systems. Indeed, in order to evaluate the performance of video query and mining techniques, databases with concept annotations are required. Therefore, it is necessary generate a database with a semantic indexing that represents the digital content of the Mexican bullfighting atmosphere. This paper proposes a scheme to make complex annotations in a video in the frame of multimedia search engine project. Each video is partitioned using our segmentation algorithm that creates shots of different length and different number of frames. In order to make complex annotations about the video, we use ELAN software. The annotations are done in two steps: First, we take note about the whole content in each shot. Second, we describe the actions as parameters of the camera like direction, position and deepness. As a consequence, we obtain a more complete descriptor of every action. In both cases we use the concepts of the TRECVid 2014 dataset. We also propose new concepts. This methodology allows to generate a database with the necessary information to create descriptors and algorithms capable to detect actions to automatically index and classify new bullfighting multimedia content.
A web-based video annotation system for crowdsourcing surveillance videos

NASA Astrophysics Data System (ADS)

Gadgil, Neeraj J.; Tahboub, Khalid; Kirsh, David; Delp, Edward J.

2014-03-01

Video surveillance systems are of a great value to prevent threats and identify/investigate criminal activities. Manual analysis of a huge amount of video data from several cameras over a long period of time often becomes impracticable. The use of automatic detection methods can be challenging when the video contains many objects with complex motion and occlusions. Crowdsourcing has been proposed as an effective method for utilizing human intelligence to perform several tasks. Our system provides a platform for the annotation of surveillance video in an organized and controlled way. One can monitor a surveillance system using a set of tools such as training modules, roles and labels, task management. This system can be used in a real-time streaming mode to detect any potential threats or as an investigative tool to analyze past events. Annotators can annotate video contents assigned to them for suspicious activity or criminal acts. First responders are then able to view the collective annotations and receive email alerts about a newly reported incident. They can also keep track of the annotators' training performance, manage their activities and reward their success. By providing this system, the process of video analysis is made more efficient.

Considerations to improve functional annotations in biological databases.

PubMed

Benítez-Páez, Alfonso

2009-12-01

Despite the great effort to design efficient systems allowing the electronic indexation of information concerning genes, proteins, structures, and interactions published daily in scientific journals, some problems are still observed in specific tasks such as functional annotation. The annotation of function is a critical issue for bioinformatic routines, such as for instance, in functional genomics and the further prediction of unknown protein function, which are highly dependent of the quality of existing annotations. Some information management systems evolve to efficiently incorporate information from large-scale projects, but often, annotation of single records from the literature is difficult and slow. In this short report, functional characterizations of a representative sample of the entire set of uncharacterized proteins from Escherichia coli K12 was compiled from Swiss-Prot, PubMed, and EcoCyc and demonstrate a functional annotation deficit in biological databases. Some issues are postulated as causes of the lack of annotation, and different solutions are evaluated and proposed to avoid them. The hope is that as a consequence of these observations, there will be new impetus to improve the speed and quality of functional annotation and ultimately provide updated, reliable information to the scientific community.
A statistical framework to predict functional non-coding regions in the human genome through integrated analysis of annotation data.

PubMed

Lu, Qiongshi; Hu, Yiming; Sun, Jiehuan; Cheng, Yuwei; Cheung, Kei-Hoi; Zhao, Hongyu

2015-05-27

Identifying functional regions in the human genome is a major goal in human genetics. Great efforts have been made to functionally annotate the human genome either through computational predictions, such as genomic conservation, or high-throughput experiments, such as the ENCODE project. These efforts have resulted in a rich collection of functional annotation data of diverse types that need to be jointly analyzed for integrated interpretation and annotation. Here we present GenoCanyon, a whole-genome annotation method that performs unsupervised statistical learning using 22 computational and experimental annotations thereby inferring the functional potential of each position in the human genome. With GenoCanyon, we are able to predict many of the known functional regions. The ability of predicting functional regions as well as its generalizable statistical framework makes GenoCanyon a unique and powerful tool for whole-genome annotation. The GenoCanyon web server is available at http://genocanyon.med.yale.edu.
Initial Experience With Ultra High-Density Mapping of Human Right Atria.

PubMed

Bollmann, Andreas; Hilbert, Sebastian; John, Silke; Kosiuk, Jedrzej; Hindricks, Gerhard

2016-02-01

Recently, an automatic, high-resolution mapping system has been presented to accurately and quickly identify right atrial geometry and activation patterns in animals, but human data are lacking. This study aims to assess the clinical feasibility and accuracy of high-density electroanatomical mapping of various RA arrhythmias. Electroanatomical maps of the RA (35 partial and 24 complete) were created in 23 patients using a novel mini-basket catheter with 64 electrodes and automatic electrogram annotation. Median acquisition time was 6:43 minutes (0:39-23:05 minutes) with shorter times for partial (4.03 ± 4.13 minutes) than for complete maps (9.41 ± 4.92 minutes). During mapping 3,236 (710-16,306) data points were automatically annotated without manual correction. Maps obtained during sinus rhythm created geometry consistent with CT imaging and demonstrated activation originating at the middle to superior crista terminalis, while maps during CS pacing showed right atrial activation beginning at the infero-septal region. Activation patterns were consistent with cavotricuspid isthmus-dependent atrial flutter (n = 4), complex reentry tachycardia (n = 1), or ectopic atrial tachycardia (n = 2). His bundle and fractionated potentials in the slow pathway region were automatically detected in all patients. Ablation of the cavotricuspid isthmus (n = 9), the atrio-ventricular node (n = 2), atrial ectopy (n = 2), and the slow pathway (n = 3) was successfully and safely performed. RA mapping with this automatic high-density mapping system is fast, feasible, and safe. It is possible to reproducibly identify propagation of atrial activation during sinus rhythm, various tachycardias, and also complex reentrant arrhythmias. © 2015 Wiley Periodicals, Inc.
Semi-automatic segmentation of brain tumors using population and individual information.

PubMed

Wu, Yao; Yang, Wei; Jiang, Jun; Li, Shuanqian; Feng, Qianjin; Chen, Wufan

2013-08-01

Efficient segmentation of tumors in medical images is of great practical importance in early diagnosis and radiation plan. This paper proposes a novel semi-automatic segmentation method based on population and individual statistical information to segment brain tumors in magnetic resonance (MR) images. First, high-dimensional image features are extracted. Neighborhood components analysis is proposed to learn two optimal distance metrics, which contain population and patient-specific information, respectively. The probability of each pixel belonging to the foreground (tumor) and the background is estimated by the k-nearest neighborhood classifier under the learned optimal distance metrics. A cost function for segmentation is constructed through these probabilities and is optimized using graph cuts. Finally, some morphological operations are performed to improve the achieved segmentation results. Our dataset consists of 137 brain MR images, including 68 for training and 69 for testing. The proposed method overcomes segmentation difficulties caused by the uneven gray level distribution of the tumors and even can get satisfactory results if the tumors have fuzzy edges. Experimental results demonstrate that the proposed method is robust to brain tumor segmentation.
SNAD: Sequence Name Annotation-based Designer.

PubMed

Sidorov, Igor A; Reshetov, Denis A; Gorbalenya, Alexander E

2009-08-14

A growing diversity of biological data is tagged with unique identifiers (UIDs) associated with polynucleotides and proteins to ensure efficient computer-mediated data storage, maintenance, and processing. These identifiers, which are not informative for most people, are often substituted by biologically meaningful names in various presentations to facilitate utilization and dissemination of sequence-based knowledge. This substitution is commonly done manually that may be a tedious exercise prone to mistakes and omissions. Here we introduce SNAD (Sequence Name Annotation-based Designer) that mediates automatic conversion of sequence UIDs (associated with multiple alignment or phylogenetic tree, or supplied as plain text list) into biologically meaningful names and acronyms. This conversion is directed by precompiled or user-defined templates that exploit wealth of annotation available in cognate entries of external databases. Using examples, we demonstrate how this tool can be used to generate names for practical purposes, particularly in virology. A tool for controllable annotation-based conversion of sequence UIDs into biologically meaningful names and acronyms has been developed and placed into service, fostering links between quality of sequence annotation, and efficiency of communication and knowledge dissemination among researchers.
Automated Gene Ontology annotation for anonymous sequence data.

PubMed

Hennig, Steffen; Groth, Detlef; Lehrach, Hans

2003-07-01

Gene Ontology (GO) is the most widely accepted attempt to construct a unified and structured vocabulary for the description of genes and their products in any organism. Annotation by GO terms is performed in most of the current genome projects, which besides generality has the advantage of being very convenient for computer based classification methods. However, direct use of GO in small sequencing projects is not easy, especially for species not commonly represented in public databases. We present a software package (GOblet), which performs annotation based on GO terms for anonymous cDNA or protein sequences. It uses the species independent GO structure and vocabulary together with a series of protein databases collected from various sites, to perform a detailed GO annotation by sequence similarity searches. The sensitivity and the reference protein sets can be selected by the user. GOblet runs automatically and is available as a public service on our web server. The paper also addresses the reliability of automated GO annotations by using a reference set of more than 6000 human proteins. The GOblet server is accessible at http://goblet.molgen.mpg.de.
Semantic integration of data on transcriptional regulation

PubMed Central

Baitaluk, Michael; Ponomarenko, Julia

2010-01-01

Motivation: Experimental and predicted data concerning gene transcriptional regulation are distributed among many heterogeneous sources. However, there are no resources to integrate these data automatically or to provide a ‘one-stop shop’ experience for users seeking information essential for deciphering and modeling gene regulatory networks. Results: IntegromeDB, a semantic graph-based ‘deep-web’ data integration system that automatically captures, integrates and manages publicly available data concerning transcriptional regulation, as well as other relevant biological information, is proposed in this article. The problems associated with data integration are addressed by ontology-driven data mapping, multiple data annotation and heterogeneous data querying, also enabling integration of the user's data. IntegromeDB integrates over 100 experimental and computational data sources relating to genomics, transcriptomics, genetics, and functional and interaction data concerning gene transcriptional regulation in eukaryotes and prokaryotes. Availability: IntegromeDB is accessible through the integrated research environment BiologicalNetworks at http://www.BiologicalNetworks.org Contact: baitaluk@sdsc.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:20427517
Semantic integration of data on transcriptional regulation.

PubMed

Baitaluk, Michael; Ponomarenko, Julia

2010-07-01

Experimental and predicted data concerning gene transcriptional regulation are distributed among many heterogeneous sources. However, there are no resources to integrate these data automatically or to provide a 'one-stop shop' experience for users seeking information essential for deciphering and modeling gene regulatory networks. IntegromeDB, a semantic graph-based 'deep-web' data integration system that automatically captures, integrates and manages publicly available data concerning transcriptional regulation, as well as other relevant biological information, is proposed in this article. The problems associated with data integration are addressed by ontology-driven data mapping, multiple data annotation and heterogeneous data querying, also enabling integration of the user's data. IntegromeDB integrates over 100 experimental and computational data sources relating to genomics, transcriptomics, genetics, and functional and interaction data concerning gene transcriptional regulation in eukaryotes and prokaryotes. IntegromeDB is accessible through the integrated research environment BiologicalNetworks at http://www.BiologicalNetworks.org baitaluk@sdsc.edu Supplementary data are available at Bioinformatics online.
Quantitative analysis of hyperpolarized 129Xe ventilation imaging in healthy volunteers and subjects with chronic obstructive pulmonary disease

PubMed Central

Virgincar, Rohan S.; Cleveland, Zackary I.; Kaushik, S. Sivaram; Freeman, Matthew S.; Nouls, John; Cofer, Gary P.; Martinez-Jimenez, Santiago; He, Mu; Kraft, Monica; Wolber, Jan; McAdams, H. Page; Driehuys, Bastiaan

2013-01-01

In this study, hyperpolarized (HP) 129Xe MR ventilation and 1H anatomical images were obtained from 3 subject groups: young healthy volunteers (HV), subjects with chronic obstructive pulmonary disease (COPD), and age-matched control subjects (AMC). Ventilation images were quantified by 2 methods: an expert reader-based ventilation defect score percentage (VDS%) and a semi-automatic segmentation-based ventilation defect percentage (VDP). Reader-based values were assigned by two experienced radiologists and resolved by consensus. In the semi-automatic analysis, 1H anatomical images and 129Xe ventilation images were both segmented following registration, to obtain the thoracic cavity volume (TCV) and ventilated volume (VV), respectively, which were then expressed as a ratio to obtain the VDP. Ventilation images were also characterized by generating signal intensity histograms from voxels within the TCV, and heterogeneity was analyzed using the coefficient of variation (CV). The reader-based VDS% correlated strongly with the semi-automatically generated VDP (r = 0.97, p < 0.0001), and with CV (r = 0.82, p < 0.0001). Both 129Xe ventilation defect scoring metrics readily separated the 3 groups from one another and correlated significantly with FEV1 (VDS%: r = -0.78, p = 0.0002; VDP: r = -0.79, p = 0.0003; CV: r = -0.66, p = 0.0059) and other pulmonary function tests. In the healthy subject groups (HV and AMC), the prevalence of ventilation defects also increased with age (VDS%: r = 0.61, p = 0.0002; VDP: r = 0.63, p = 0.0002). Moreover, ventilation histograms and their associated CVs distinguished between COPD subjects with similar ventilation defect scores but visibly different ventilation patterns. PMID:23065808
Graph Kernels for Molecular Similarity.

PubMed

Rupp, Matthias; Schneider, Gisbert

2010-04-12

Molecular similarity measures are important for many cheminformatics applications like ligand-based virtual screening and quantitative structure-property relationships. Graph kernels are formal similarity measures defined directly on graphs, such as the (annotated) molecular structure graph. Graph kernels are positive semi-definite functions, i.e., they correspond to inner products. This property makes them suitable for use with kernel-based machine learning algorithms such as support vector machines and Gaussian processes. We review the major types of kernels between graphs (based on random walks, subgraphs, and optimal assignments, respectively), and discuss their advantages, limitations, and successful applications in cheminformatics. Copyright © 2010 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Automated hierarchical classification of protein domain subfamilies based on functionally-divergent residue signatures

PubMed Central

2012-01-01

Background The NCBI Conserved Domain Database (CDD) consists of a collection of multiple sequence alignments of protein domains that are at various stages of being manually curated into evolutionary hierarchies based on conserved and divergent sequence and structural features. These domain models are annotated to provide insights into the relationships between sequence, structure and function via web-based BLAST searches. Results Here we automate the generation of conserved domain (CD) hierarchies using a combination of heuristic and Markov chain Monte Carlo (MCMC) sampling procedures and starting from a (typically very large) multiple sequence alignment. This procedure relies on statistical criteria to define each hierarchy based on the conserved and divergent sequence patterns associated with protein functional-specialization. At the same time this facilitates the sequence and structural annotation of residues that are functionally important. These statistical criteria also provide a means to objectively assess the quality of CD hierarchies, a non-trivial task considering that the protein subgroups are often very distantly related—a situation in which standard phylogenetic methods can be unreliable. Our aim here is to automatically generate (typically sub-optimal) hierarchies that, based on statistical criteria and visual comparisons, are comparable to manually curated hierarchies; this serves as the first step toward the ultimate goal of obtaining optimal hierarchical classifications. A plot of runtimes for the most time-intensive (non-parallelizable) part of the algorithm indicates a nearly linear time complexity so that, even for the extremely large Rossmann fold protein class, results were obtained in about a day. Conclusions This approach automates the rapid creation of protein domain hierarchies and thus will eliminate one of the most time consuming aspects of conserved domain database curation. At the same time, it also facilitates protein domain annotation by identifying those pattern residues that most distinguish each protein domain subgroup from other related subgroups. PMID:22726767
Semi-automatic volume measurement for orbital fat and total extraocular muscles based on Cube FSE-flex sequence in patients with thyroid-associated ophthalmopathy.

PubMed

Tang, X; Liu, H; Chen, L; Wang, Q; Luo, B; Xiang, N; He, Y; Zhu, W; Zhang, J

2018-05-24

To investigate the accuracy of two semi-automatic segmentation measurements based on magnetic resonance imaging (MRI) three-dimensional (3D) Cube fast spin echo (FSE)-flex sequence in phantoms, and to evaluate the feasibility of determining the volumetric alterations of orbital fat (OF) and total extraocular muscles (TEM) in patients with thyroid-associated ophthalmopathy (TAO) by semi-automatic segmentation. Forty-four fatty (n=22) and lean (n=22) phantoms were scanned by using Cube FSE-flex sequence with a 3 T MRI system. Their volumes were measured by manual segmentation (MS) and two semi-automatic segmentation algorithms (regional growing [RG], multi-dimensional threshold [MDT]). Pearson correlation and Bland-Altman analysis were used to evaluate the measuring accuracy of MS, RG, and MDT in phantoms as compared with the true volume. Then, OF and TEM volumes of 15 TAO patients and 15 normal controls were measured using MDT. Paired-sample t-tests were used to compare the volumes and volume ratios of different orbital tissues between TAO patients and controls. Each segmentation (MS RG, MDT) has a significant correlation (p<0.01) with true volume. There was a minimal bias for MS, and a stronger agreement between MDT and the true volume than RG and the true volume both in fatty and lean phantoms. The reproducibility of Cube FSE-flex determined MDT was adequate. The volumetric ratios of OF/globe (p<0.01), TEM/globe (p<0.01), whole orbit/globe (p<0.01) and bone orbit/globe (p<0.01) were significantly greater in TAO patients than those in healthy controls. MRI Cube FSE-flex determined MDT is a relatively accurate semi-automatic segmentation that can be used to evaluate OF and TEM volumes in clinic. Copyright © 2018 The Royal College of Radiologists. Published by Elsevier Ltd. All rights reserved.
Compiled MPI: Cost-Effective Exascale Applications Development

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bronevetsky, G; Quinlan, D; Lumsdaine, A

2012-04-10

The complexity of petascale and exascale machines makes it increasingly difficult to develop applications that can take advantage of them. Future systems are expected to feature billion-way parallelism, complex heterogeneous compute nodes and poor availability of memory (Peter Kogge, 2008). This new challenge for application development is motivating a significant amount of research and development on new programming models and runtime systems designed to simplify large-scale application development. Unfortunately, DoE has significant multi-decadal investment in a large family of mission-critical scientific applications. Scaling these applications to exascale machines will require a significant investment that will dwarf the costs of hardwaremore » procurement. A key reason for the difficulty in transitioning today's applications to exascale hardware is their reliance on explicit programming techniques, such as the Message Passing Interface (MPI) programming model to enable parallelism. MPI provides a portable and high performance message-passing system that enables scalable performance on a wide variety of platforms. However, it also forces developers to lock the details of parallelization together with application logic, making it very difficult to adapt the application to significant changes in the underlying system. Further, MPI's explicit interface makes it difficult to separate the application's synchronization and communication structure, reducing the amount of support that can be provided by compiler and run-time tools. This is in contrast to the recent research on more implicit parallel programming models such as Chapel, OpenMP and OpenCL, which promise to provide significantly more flexibility at the cost of reimplementing significant portions of the application. We are developing CoMPI, a novel compiler-driven approach to enable existing MPI applications to scale to exascale systems with minimal modifications that can be made incrementally over the application's lifetime. It includes: (1) New set of source code annotations, inserted either manually or automatically, that will clarify the application's use of MPI to the compiler infrastructure, enabling greater accuracy where needed; (2) A compiler transformation framework that leverages these annotations to transform the original MPI source code to improve its performance and scalability; (3) Novel MPI runtime implementation techniques that will provide a rich set of functionality extensions to be used by applications that have been transformed by our compiler; and (4) A novel compiler analysis that leverages simple user annotations to automatically extract the application's communication structure and synthesize most complex code annotations.« less
Inferring transposons activity chronology by TRANScendence - TEs database and de-novo mining tool.

PubMed

Startek, Michał Piotr; Nogły, Jakub; Gromadka, Agnieszka; Grzebelus, Dariusz; Gambin, Anna

2017-10-16

The constant progress in sequencing technology leads to ever increasing amounts of genomic data. In the light of current evidence transposable elements (TEs for short) are becoming useful tools for learning about the evolution of host genome. Therefore the software for genome-wide detection and analysis of TEs is of great interest. Here we describe the computational tool for mining, classifying and storing TEs from newly sequenced genomes. This is an online, web-based, user-friendly service, enabling users to upload their own genomic data, and perform de-novo searches for TEs. The detected TEs are automatically analyzed, compared to reference databases, annotated, clustered into families, and stored in TEs repository. Also, the genome-wide nesting structure of found elements are detected and analyzed by new method for inferring evolutionary history of TEs. We illustrate the functionality of our tool by performing a full-scale analyses of TE landscape in Medicago truncatula genome. TRANScendence is an effective tool for the de-novo annotation and classification of transposable elements in newly-acquired genomes. Its streamlined interface makes it well-suited for evolutionary studies.
A Benchmark of Vehicle Maintenance Training Between the U.S. Air Force and a Civilian Industry Leader

DTIC Science & Technology

1992-09-01

Aas :nosen to 113entlfy tasKs oerformed Dv reczcnizeo :omoe:ent automotive serv’ce Personnel :intry level o"ersonnei 4ere iot , ,ic udec i n tie sirve...Diagnose the cause of poor, intermittent, or no electric door and hatch/trunk lock operation. 10. Repair or replace switches, relays, actuators ...Semi-4utomative Temoerature Controls i. Cnecx ooeration of automatic ana semi-automatic neating, HP ventalation ana air-conaitioning ( HVAC ) control
Semi-automatic version of the potentiometric titration method for characterization of uranium compounds.

PubMed

Cristiano, Bárbara F G; Delgado, José Ubiratan; da Silva, José Wanderley S; de Barros, Pedro D; de Araújo, Radier M S; Dias, Fábio C; Lopes, Ricardo T

2012-09-01

The potentiometric titration method was used for characterization of uranium compounds to be applied in intercomparison programs. The method is applied with traceability assured using a potassium dichromate primary standard. A semi-automatic version was developed to reduce the analysis time and the operator variation. The standard uncertainty in determining the total concentration of uranium was around 0.01%, which is suitable for uranium characterization and compatible with those obtained by manual techniques. Copyright © 2012 Elsevier Ltd. All rights reserved.
Social Media Mining for Toxicovigilance: Automatic Monitoring of Prescription Medication Abuse from Twitter.

PubMed

Sarker, Abeed; O'Connor, Karen; Ginn, Rachel; Scotch, Matthew; Smith, Karen; Malone, Dan; Gonzalez, Graciela

2016-03-01

Prescription medication overdose is the fastest growing drug-related problem in the USA. The growing nature of this problem necessitates the implementation of improved monitoring strategies for investigating the prevalence and patterns of abuse of specific medications. Our primary aims were to assess the possibility of utilizing social media as a resource for automatic monitoring of prescription medication abuse and to devise an automatic classification technique that can identify potentially abuse-indicating user posts. We collected Twitter user posts (tweets) associated with three commonly abused medications (Adderall(®), oxycodone, and quetiapine). We manually annotated 6400 tweets mentioning these three medications and a control medication (metformin) that is not the subject of abuse due to its mechanism of action. We performed quantitative and qualitative analyses of the annotated data to determine whether posts on Twitter contain signals of prescription medication abuse. Finally, we designed an automatic supervised classification technique to distinguish posts containing signals of medication abuse from those that do not and assessed the utility of Twitter in investigating patterns of abuse over time. Our analyses show that clear signals of medication abuse can be drawn from Twitter posts and the percentage of tweets containing abuse signals are significantly higher for the three case medications (Adderall(®): 23 %, quetiapine: 5.0 %, oxycodone: 12 %) than the proportion for the control medication (metformin: 0.3 %). Our automatic classification approach achieves 82 % accuracy overall (medication abuse class recall: 0.51, precision: 0.41, F measure: 0.46). To illustrate the utility of automatic classification, we show how the classification data can be used to analyze abuse patterns over time. Our study indicates that social media can be a crucial resource for obtaining abuse-related information for medications, and that automatic approaches involving supervised classification and natural language processing hold promises for essential future monitoring and intervention tasks.
Tools and Equipment Modeling for Automobile Interactive Assembling Operating Simulation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wu Dianliang; Zhu Hongmin; Shanghai Key Laboratory of Advance Manufacturing Environment

Tools and equipment play an important role in the simulation of virtual assembly, especially in the assembly process simulation and plan. Because of variety in function and complexity in structure and manipulation, the simulation of tools and equipments remains to be a challenge for interactive assembly operation. Based on analysis of details and characteristics of interactive operations for automobile assembly, the functional requirement for tools and equipments of automobile assembly is given. Then, a unified modeling method for information expression and function realization of general tools and equipments is represented, and the handling methods of manual, semi-automatic, automatic tools andmore » equipments are discussed. Finally, the application in assembly simulation of rear suspension and front suspension of Roewe 750 automobile is given. The result shows that the modeling and handling methods are applicable in the interactive simulation of various tools and equipments, and can also be used for supporting assembly process planning in virtual environment.« less
Semi-automatic assessment of skin capillary density: proof of principle and validation.

PubMed

Gronenschild, E H B M; Muris, D M J; Schram, M T; Karaca, U; Stehouwer, C D A; Houben, A J H M

2013-11-01

Skin capillary density and recruitment have been proven to be relevant measures of microvascular function. Unfortunately, the assessment of skin capillary density from movie files is very time-consuming, since this is done manually. This impedes the use of this technique in large-scale studies. We aimed to develop a (semi-) automated assessment of skin capillary density. CapiAna (Capillary Analysis) is a newly developed semi-automatic image analysis application. The technique involves four steps: 1) movement correction, 2) selection of the frame range and positioning of the region of interest (ROI), 3) automatic detection of capillaries, and 4) manual correction of detected capillaries. To gain insight into the performance of the technique, skin capillary density was measured in twenty participants (ten women; mean age 56.2 [42-72] years). To investigate the agreement between CapiAna and the classic manual counting procedure, we used weighted Deming regression and Bland-Altman analyses. In addition, intra- and inter-observer coefficients of variation (CVs), and differences in analysis time were assessed. We found a good agreement between CapiAna and the classic manual method, with a Pearson's correlation coefficient (r) of 0.95 (P<0.001) and a Deming regression coefficient of 1.01 (95%CI: 0.91; 1.10). In addition, we found no significant differences between the two methods, with an intercept of the Deming regression of 1.75 (-6.04; 9.54), while the Bland-Altman analysis showed a mean difference (bias) of 2.0 (-13.5; 18.4) capillaries/mm(2). The intra- and inter-observer CVs of CapiAna were 2.5% and 5.6% respectively, while for the classic manual counting procedure these were 3.2% and 7.2%, respectively. Finally, the analysis time for CapiAna ranged between 25 and 35min versus 80 and 95min for the manual counting procedure. We have developed a semi-automatic image analysis application (CapiAna) for the assessment of skin capillary density, which agrees well with the classic manual counting procedure, is time-saving, and has a better reproducibility as compared to the classic manual counting procedure. As a result, the use of skin capillaroscopy is feasible in large-scale studies, which importantly extends the possibilities to perform microcirculation research in humans. © 2013.
Automatic segmentation of MR brain images of preterm infants using supervised classification.

PubMed

Moeskops, Pim; Benders, Manon J N L; Chiţ, Sabina M; Kersbergen, Karina J; Groenendaal, Floris; de Vries, Linda S; Viergever, Max A; Išgum, Ivana

2015-09-01

Preterm birth is often associated with impaired brain development. The state and expected progression of preterm brain development can be evaluated using quantitative assessment of MR images. Such measurements require accurate segmentation of different tissue types in those images. This paper presents an algorithm for the automatic segmentation of unmyelinated white matter (WM), cortical grey matter (GM), and cerebrospinal fluid in the extracerebral space (CSF). The algorithm uses supervised voxel classification in three subsequent stages. In the first stage, voxels that can easily be assigned to one of the three tissue types are labelled. In the second stage, dedicated analysis of the remaining voxels is performed. The first and the second stages both use two-class classification for each tissue type separately. Possible inconsistencies that could result from these tissue-specific segmentation stages are resolved in the third stage, which performs multi-class classification. A set of T1- and T2-weighted images was analysed, but the optimised system performs automatic segmentation using a T2-weighted image only. We have investigated the performance of the algorithm when using training data randomly selected from completely annotated images as well as when using training data from only partially annotated images. The method was evaluated on images of preterm infants acquired at 30 and 40weeks postmenstrual age (PMA). When the method was trained using random selection from the completely annotated images, the average Dice coefficients were 0.95 for WM, 0.81 for GM, and 0.89 for CSF on an independent set of images acquired at 30weeks PMA. When the method was trained using only the partially annotated images, the average Dice coefficients were 0.95 for WM, 0.78 for GM and 0.87 for CSF for the images acquired at 30weeks PMA, and 0.92 for WM, 0.80 for GM and 0.85 for CSF for the images acquired at 40weeks PMA. Even though the segmentations obtained using training data from the partially annotated images resulted in slightly lower Dice coefficients, the performance in all experiments was close to that of a second human expert (0.93 for WM, 0.79 for GM and 0.86 for CSF for the images acquired at 30weeks, and 0.94 for WM, 0.76 for GM and 0.87 for CSF for the images acquired at 40weeks). These results show that the presented method is robust to age and acquisition protocol and that it performs accurate segmentation of WM, GM, and CSF when the training data is extracted from complete annotations as well as when the training data is extracted from partial annotations only. This extends the applicability of the method by reducing the time and effort necessary to create training data in a population with different characteristics. Copyright © 2015 Elsevier Inc. All rights reserved.

Comparison of the effects of model-based iterative reconstruction and filtered back projection algorithms on software measurements in pulmonary subsolid nodules.

PubMed

Cohen, Julien G; Kim, Hyungjin; Park, Su Bin; van Ginneken, Bram; Ferretti, Gilbert R; Lee, Chang Hyun; Goo, Jin Mo; Park, Chang Min

2017-08-01

To evaluate the differences between filtered back projection (FBP) and model-based iterative reconstruction (MBIR) algorithms on semi-automatic measurements in subsolid nodules (SSNs). Unenhanced CT scans of 73 SSNs obtained using the same protocol and reconstructed with both FBP and MBIR algorithms were evaluated by two radiologists. Diameter, mean attenuation, mass and volume of whole nodules and their solid components were measured. Intra- and interobserver variability and differences between FBP and MBIR were then evaluated using Bland-Altman method and Wilcoxon tests. Longest diameter, volume and mass of nodules and those of their solid components were significantly higher using MBIR (p < 0.05) with mean differences of 1.1% (limits of agreement, -6.4 to 8.5%), 3.2% (-20.9 to 27.3%) and 2.9% (-16.9 to 22.7%) and 3.2% (-20.5 to 27%), 6.3% (-51.9 to 64.6%), 6.6% (-50.1 to 63.3%), respectively. The limits of agreement between FBP and MBIR were within the range of intra- and interobserver variability for both algorithms with respect to the diameter, volume and mass of nodules and their solid components. There were no significant differences in intra- or interobserver variability between FBP and MBIR (p > 0.05). Semi-automatic measurements of SSNs significantly differed between FBP and MBIR; however, the differences were within the range of measurement variability. • Intra- and interobserver reproducibility of measurements did not differ between FBP and MBIR. • Differences in SSNs' semi-automatic measurement induced by reconstruction algorithms were not clinically significant. • Semi-automatic measurement may be conducted regardless of reconstruction algorithm. • SSNs' semi-automated classification agreement (pure vs. part-solid) did not significantly differ between algorithms.
Towards the Real-Time Evaluation of Collaborative Activities: Integration of an Automatic Rater of Collaboration Quality in the Classroom from the Teacher's Perspective

ERIC Educational Resources Information Center

Chounta, Irene-Angelica; Avouris, Nikolaos

2016-01-01

This paper presents the integration of a real time evaluation method of collaboration quality in a monitoring application that supports teachers in class orchestration. The method is implemented as an automatic rater of collaboration quality and studied in a real time scenario of use. We argue that automatic and semi-automatic methods which…
MyWEST: my Web Extraction Software Tool for effective mining of annotations from web-based databanks.

PubMed

Masseroli, Marco; Stella, Andrea; Meani, Natalia; Alcalay, Myriam; Pinciroli, Francesco

2004-12-12

High-throughput technologies create the necessity to mine large amounts of gene annotations from diverse databanks, and to integrate the resulting data. Most databanks can be interrogated only via Web, for a single gene at a time, and query results are generally available only in the HTML format. Although some databanks provide batch retrieval of data via FTP, this requires expertise and resources for locally reimplementing the databank. We developed MyWEST, a tool aimed at researchers without extensive informatics skills or resources, which exploits user-defined templates to easily mine selected annotations from different Web-interfaced databanks, and aggregates and structures results in an automatically updated database. Using microarray results from a model system of retinoic acid-induced differentiation, MyWEST effectively gathered relevant annotations from various biomolecular databanks, highlighted significant biological characteristics and supported a global approach to the understanding of complex cellular mechanisms. MyWEST is freely available for non-profit use at http://www.medinfopoli.polimi.it/MyWEST/
Datasets of Odontocete Sounds Annotated for Developing Automatic Detection Methods

DTIC Science & Technology

2007-09-01

for each of the following species: 1. Mesoplodon densirostris (Blainville’s beaked whale) 2. Steno bredanensis (rough-toothed dolphin) 3...39 files, 70200s total Risso’s Dolphins (Grampus griseus) AUTEC (NUWC): 10 files; 17100s total Rough-toothed Dolphins ( Steno bredanensis
A System for the Semantic Multimodal Analysis of News Audio-Visual Content

NASA Astrophysics Data System (ADS)

Mezaris, Vasileios; Gidaros, Spyros; Papadopoulos, GeorgiosTh; Kasper, Walter; Steffen, Jörg; Ordelman, Roeland; Huijbregts, Marijn; de Jong, Franciska; Kompatsiaris, Ioannis; Strintzis, MichaelG

2010-12-01

News-related content is nowadays among the most popular types of content for users in everyday applications. Although the generation and distribution of news content has become commonplace, due to the availability of inexpensive media capturing devices and the development of media sharing services targeting both professional and user-generated news content, the automatic analysis and annotation that is required for supporting intelligent search and delivery of this content remains an open issue. In this paper, a complete architecture for knowledge-assisted multimodal analysis of news-related multimedia content is presented, along with its constituent components. The proposed analysis architecture employs state-of-the-art methods for the analysis of each individual modality (visual, audio, text) separately and proposes a novel fusion technique based on the particular characteristics of news-related content for the combination of the individual modality analysis results. Experimental results on news broadcast video illustrate the usefulness of the proposed techniques in the automatic generation of semantic annotations.
Automatic Compound Annotation from Mass Spectrometry Data Using MAGMa

PubMed Central

Ridder, Lars; van der Hooft, Justin J. J.; Verhoeven, Stefan

2014-01-01

The MAGMa software for automatic annotation of mass spectrometry based fragmentation data was applied to 16 MS/MS datasets of the CASMI 2013 contest. Eight solutions were submitted in category 1 (molecular formula assignments) and twelve in category 2 (molecular structure assignment). The MS/MS peaks of each challenge were matched with in silico generated substructures of candidate molecules from PubChem, resulting in penalty scores that were used for candidate ranking. In 6 of the 12 submitted solutions in category 2, the correct chemical structure obtained the best score, whereas 3 molecules were ranked outside the top 5. All top ranked molecular formulas submitted in category 1 were correct. In addition, we present MAGMa results generated retrospectively for the remaining challenges. Successful application of the MAGMa algorithm required inclusion of the relevant candidate molecules, application of the appropriate mass tolerance and a sufficient degree of in silico fragmentation of the candidate molecules. Furthermore, the effect of the exhaustiveness of the candidate lists and limitations of substructure based scoring are discussed. PMID:26819876
Automatic Compound Annotation from Mass Spectrometry Data Using MAGMa.

PubMed

Ridder, Lars; van der Hooft, Justin J J; Verhoeven, Stefan

2014-01-01

The MAGMa software for automatic annotation of mass spectrometry based fragmentation data was applied to 16 MS/MS datasets of the CASMI 2013 contest. Eight solutions were submitted in category 1 (molecular formula assignments) and twelve in category 2 (molecular structure assignment). The MS/MS peaks of each challenge were matched with in silico generated substructures of candidate molecules from PubChem, resulting in penalty scores that were used for candidate ranking. In 6 of the 12 submitted solutions in category 2, the correct chemical structure obtained the best score, whereas 3 molecules were ranked outside the top 5. All top ranked molecular formulas submitted in category 1 were correct. In addition, we present MAGMa results generated retrospectively for the remaining challenges. Successful application of the MAGMa algorithm required inclusion of the relevant candidate molecules, application of the appropriate mass tolerance and a sufficient degree of in silico fragmentation of the candidate molecules. Furthermore, the effect of the exhaustiveness of the candidate lists and limitations of substructure based scoring are discussed.
Development of numerical phantoms by MRI for RF electromagnetic dosimetry: a female model.

PubMed

Mazzurana, M; Sandrini, L; Vaccari, A; Malacarne, C; Cristoforetti, L; Pontalti, R

2004-01-01

Numerical human models for electromagnetic dosimetry are commonly obtained by segmentation of CT or MRI images and complex permittivity values are ascribed to each issue according to literature values. The aim of this study is to provide an alternative semi-automatic method by which non-segmented images, obtained by a MRI tomographer, can be automatically related to the complex permittivity values through two frequency dependent transfer functions. In this way permittivity and conductivity vary with continuity--even in the same tissue--reflecting the intrinsic realistic spatial dispersion of such parameters. A female human model impinged by a plane wave is tested using finite-difference time-domain algorithm and the results of the total body and layer-averaged specific absorption rate are reported.
Semi-automatic computerized approach to radiological quantification in rheumatoid arthritis

NASA Astrophysics Data System (ADS)

Steiner, Wolfgang; Schoeffmann, Sylvia; Prommegger, Andrea; Boegl, Karl; Klinger, Thomas; Peloschek, Philipp; Kainberger, Franz

2004-04-01

Rheumatoid Arthritis (RA) is a common systemic disease predominantly involving the joints. Precise diagnosis and follow-up therapy requires objective quantification. For this purpose, radiological analyses using standardized scoring systems are considered to be the most appropriate method. The aim of our study is to develop a semi-automatic image analysis software, especially applicable for scoring of joints in rheumatic disorders. The X-Ray RheumaCoach software delivers various scoring systems (Larsen-Score and Ratingen-Rau-Score) which can be applied by the scorer. In addition to the qualitative assessment of joints performed by the radiologist, a semi-automatic image analysis for joint detection and measurements of bone diameters and swollen tissue supports the image assessment process. More than 3000 radiographs from hands and feet of more than 200 RA patients were collected, analyzed, and statistically evaluated. Radiographs were quantified using conventional paper-based Larsen score and the X-Ray RheumaCoach software. The use of the software shortened the scoring time by about 25 percent and reduced the rate of erroneous scorings in all our studies. Compared to paper-based scoring methods, the X-Ray RheumaCoach software offers several advantages: (i) Structured data analysis and input that minimizes variance by standardization, (ii) faster and more precise calculation of sum scores and indices, (iii) permanent data storing and fast access to the software"s database, (iv) the possibility of cross-calculation to other scores, (v) semi-automatic assessment of images, and (vii) reliable documentation of results in the form of graphical printouts.
A new method of automatic landmark tagging for shape model construction via local curvature scale

NASA Astrophysics Data System (ADS)

Rueda, Sylvia; Udupa, Jayaram K.; Bai, Li

2008-03-01

Segmentation of organs in medical images is a difficult task requiring very often the use of model-based approaches. To build the model, we need an annotated training set of shape examples with correspondences indicated among shapes. Manual positioning of landmarks is a tedious, time-consuming, and error prone task, and almost impossible in the 3D space. To overcome some of these drawbacks, we devised an automatic method based on the notion of c-scale, a new local scale concept. For each boundary element b, the arc length of the largest homogeneous curvature region connected to b is estimated as well as the orientation of the tangent at b. With this shape description method, we can automatically locate mathematical landmarks selected at different levels of detail. The method avoids the use of landmarks for the generation of the mean shape. The selection of landmarks on the mean shape is done automatically using the c-scale method. Then, these landmarks are propagated to each shape in the training set, defining this way the correspondences among the shapes. Altogether 12 strategies are described along these lines. The methods are evaluated on 40 MRI foot data sets, the object of interest being the talus bone. The results show that, for the same number of landmarks, the proposed methods are more compact than manual and equally spaced annotations. The approach is applicable to spaces of any dimensionality, although we have focused in this paper on 2D shapes.
Automatic and semi-automatic approaches for arteriolar-to-venular computation in retinal photographs

NASA Astrophysics Data System (ADS)

Mendonça, Ana Maria; Remeseiro, Beatriz; Dashtbozorg, Behdad; Campilho, Aurélio

2017-03-01

The Arteriolar-to-Venular Ratio (AVR) is a popular dimensionless measure which allows the assessment of patients' condition for the early diagnosis of different diseases, including hypertension and diabetic retinopathy. This paper presents two new approaches for AVR computation in retinal photographs which include a sequence of automated processing steps: vessel segmentation, caliber measurement, optic disc segmentation, artery/vein classification, region of interest delineation, and AVR calculation. Both approaches have been tested on the INSPIRE-AVR dataset, and compared with a ground-truth provided by two medical specialists. The obtained results demonstrate the reliability of the fully automatic approach which provides AVR ratios very similar to at least one of the observers. Furthermore, the semi-automatic approach, which includes the manual modification of the artery/vein classification if needed, allows to significantly reduce the error to a level below the human error.
Interobserver variability in identification of breast tumors in MRI and its implications for prognostic biomarkers and radiogenomics

DOE Office of Scientific and Technical Information (OSTI.GOV)

Saha, Ashirbani, E-mail: as698@duke.edu; Grimm, La

Purpose: To assess the interobserver variability of readers when outlining breast tumors in MRI, study the reasons behind the variability, and quantify the effect of the variability on algorithmic imaging features extracted from breast MRI. Methods: Four readers annotated breast tumors from the MRI examinations of 50 patients from one institution using a bounding box to indicate a tumor. All of the annotated tumors were biopsy proven cancers. The similarity of bounding boxes was analyzed using Dice coefficients. An automatic tumor segmentation algorithm was used to segment tumors from the readers’ annotations. The segmented tumors were then compared between readersmore » using Dice coefficients as the similarity metric. Cases showing high interobserver variability (average Dice coefficient <0.8) after segmentation were analyzed by a panel of radiologists to identify the reasons causing the low level of agreement. Furthermore, an imaging feature, quantifying tumor and breast tissue enhancement dynamics, was extracted from each segmented tumor for a patient. Pearson’s correlation coefficients were computed between the features for each pair of readers to assess the effect of the annotation on the feature values. Finally, the authors quantified the extent of variation in feature values caused by each of the individual reasons for low agreement. Results: The average agreement between readers in terms of the overlap (Dice coefficient) of the bounding box was 0.60. Automatic segmentation of tumor improved the average Dice coefficient for 92% of the cases to the average value of 0.77. The mean agreement between readers expressed by the correlation coefficient for the imaging feature was 0.96. Conclusions: There is a moderate variability between readers when identifying the rectangular outline of breast tumors on MRI. This variability is alleviated by the automatic segmentation of the tumors. Furthermore, the moderate interobserver variability in terms of the bounding box does not translate into a considerable variability in terms of assessment of enhancement dynamics. The authors propose some additional ways to further reduce the interobserver variability.« less
Quadrature formula for evaluating left bounded Hadamard type hypersingular integrals

NASA Astrophysics Data System (ADS)

Bichi, Sirajo Lawan; Eshkuvatov, Z. K.; Nik Long, N. M. A.; Okhunov, Abdurahim

2014-12-01

Left semi-bounded Hadamard type Hypersingular integral (HSI) of the form H(h,x) = 1/π √{1+x/1-x }∫-1 **1√{1-t/1+t }h(t)/(t-x)2 dt,x∈(-1.1), Where h(t) is a smooth function is considered. The automatic quadrature scheme (AQS) is constructed by approximating the density function h(t) by the truncated Chebyshev polynomials of the fourth kind. Numerical results revealed that the proposed AQS is highly accurate when h(t) is choosing to be the polynomial and rational functions. The results are in line with the theoretical findings.
An ontology-based framework for bioinformatics workflows.

PubMed

Digiampietri, Luciano A; Perez-Alcazar, Jose de J; Medeiros, Claudia Bauzer

2007-01-01

The proliferation of bioinformatics activities brings new challenges - how to understand and organise these resources, how to exchange and reuse successful experimental procedures, and to provide interoperability among data and tools. This paper describes an effort toward these directions. It is based on combining research on ontology management, AI and scientific workflows to design, reuse and annotate bioinformatics experiments. The resulting framework supports automatic or interactive composition of tasks based on AI planning techniques and takes advantage of ontologies to support the specification and annotation of bioinformatics workflows. We validate our proposal with a prototype running on real data.
Wide coverage biomedical event extraction using multiple partially overlapping corpora

PubMed Central

2013-01-01

Background Biomedical events are key to understanding physiological processes and disease, and wide coverage extraction is required for comprehensive automatic analysis of statements describing biomedical systems in the literature. In turn, the training and evaluation of extraction methods requires manually annotated corpora. However, as manual annotation is time-consuming and expensive, any single event-annotated corpus can only cover a limited number of semantic types. Although combined use of several such corpora could potentially allow an extraction system to achieve broad semantic coverage, there has been little research into learning from multiple corpora with partially overlapping semantic annotation scopes. Results We propose a method for learning from multiple corpora with partial semantic annotation overlap, and implement this method to improve our existing event extraction system, EventMine. An evaluation using seven event annotated corpora, including 65 event types in total, shows that learning from overlapping corpora can produce a single, corpus-independent, wide coverage extraction system that outperforms systems trained on single corpora and exceeds previously reported results on two established event extraction tasks from the BioNLP Shared Task 2011. Conclusions The proposed method allows the training of a wide-coverage, state-of-the-art event extraction system from multiple corpora with partial semantic annotation overlap. The resulting single model makes broad-coverage extraction straightforward in practice by removing the need to either select a subset of compatible corpora or semantic types, or to merge results from several models trained on different individual corpora. Multi-corpus learning also allows annotation efforts to focus on covering additional semantic types, rather than aiming for exhaustive coverage in any single annotation effort, or extending the coverage of semantic types annotated in existing corpora. PMID:23731785
Palm: Easing the Burden of Analytical Performance Modeling

DOE Office of Scientific and Technical Information (OSTI.GOV)

Tallent, Nathan R.; Hoisie, Adolfy

2014-06-01

Analytical (predictive) application performance models are critical for diagnosing performance-limiting resources, optimizing systems, and designing machines. Creating models, however, is difficult because they must be both accurate and concise. To ease the burden of performance modeling, we developed Palm, a modeling tool that combines top-down (human-provided) semantic insight with bottom-up static and dynamic analysis. To express insight, Palm defines a source code modeling annotation language. By coordinating models and source code, Palm's models are `first-class' and reproducible. Unlike prior work, Palm formally links models, functions, and measurements. As a result, Palm (a) uses functions to either abstract or express complexitymore » (b) generates hierarchical models (representing an application's static and dynamic structure); and (c) automatically incorporates measurements to focus attention, represent constant behavior, and validate models. We discuss generating models for three different applications.« less
Geodesic Distance Algorithm for Extracting the Ascending Aorta from 3D CT Images

PubMed Central

Jang, Yeonggul; Jung, Ho Yub; Hong, Youngtaek; Cho, Iksung; Shim, Hackjoon; Chang, Hyuk-Jae

2016-01-01

This paper presents a method for the automatic 3D segmentation of the ascending aorta from coronary computed tomography angiography (CCTA). The segmentation is performed in three steps. First, the initial seed points are selected by minimizing a newly proposed energy function across the Hough circles. Second, the ascending aorta is segmented by geodesic distance transformation. Third, the seed points are effectively transferred through the next axial slice by a novel transfer function. Experiments are performed using a database composed of 10 patients' CCTA images. For the experiment, the ground truths are annotated manually on the axial image slices by a medical expert. A comparative evaluation with state-of-the-art commercial aorta segmentation algorithms shows that our approach is computationally more efficient and accurate under the DSC (Dice Similarity Coefficient) measurements. PMID:26904151
New developments in FeynCalc 9.0

NASA Astrophysics Data System (ADS)

Shtabovenko, Vladyslav; Mertig, Rolf; Orellana, Frederik

2016-10-01

In this note we report on the new version of FEYNCALC, a MATHEMATICA package for symbolic semi-automatic evaluation of Feynman diagrams and algebraic expressions in quantum field theory. The main features of version 9.0 are: improved tensor reduction and partial fractioning of loop integrals, new functions for using FEYNCALC together with tools for reduction of scalar loop integrals using integration-by-parts (IBP) identities, better interface to FEYNARTS and support for SU(N) generators with explicit fundamental indices.
Design for Manufacturing and Assembly in Apparel. Part 1. Handbook

DTIC Science & Technology

1994-02-01

reduced and the inverted pleat was eliminated to take advantage of the automatic seam stitcher . The shape and size of the side back section seam...coin pocket. The size and shape of the pocket would be designed to best utilize the equipment. An automatic dart stitcher may be utilized to stitch the...with stacker Semi-automatic serging units with stacker Automatic seaming units/profile stitchers Programmable seaming units for various operations
MAISTAS: a tool for automatic structural evaluation of alternative splicing products.

PubMed

Floris, Matteo; Raimondo, Domenico; Leoni, Guido; Orsini, Massimiliano; Marcatili, Paolo; Tramontano, Anna

2011-06-15

Analysis of the human genome revealed that the amount of transcribed sequence is an order of magnitude greater than the number of predicted and well-characterized genes. A sizeable fraction of these transcripts is related to alternatively spliced forms of known protein coding genes. Inspection of the alternatively spliced transcripts identified in the pilot phase of the ENCODE project has clearly shown that often their structure might substantially differ from that of other isoforms of the same gene, and therefore that they might perform unrelated functions, or that they might even not correspond to a functional protein. Identifying these cases is obviously relevant for the functional assignment of gene products and for the interpretation of the effect of variations in the corresponding proteins. Here we describe a publicly available tool that, given a gene or a protein, retrieves and analyses all its annotated isoforms, provides users with three-dimensional models of the isoform(s) of his/her interest whenever possible and automatically assesses whether homology derived structural models correspond to plausible structures. This information is clearly relevant. When the homology model of some isoforms of a gene does not seem structurally plausible, the implications are that either they assume a structure unrelated to that of the other isoforms of the same gene with presumably significant functional differences, or do not correspond to functional products. We provide indications that the second hypothesis is likely to be true for a substantial fraction of the cases. http://maistas.bioinformatica.crs4.it/.

Extracting genetic alteration information for personalized cancer therapy from ClinicalTrials.gov

PubMed Central

Xu, Jun; Lee, Hee-Jin; Zeng, Jia; Wu, Yonghui; Zhang, Yaoyun; Huang, Liang-Chin; Johnson, Amber; Holla, Vijaykumar; Bailey, Ann M; Cohen, Trevor; Meric-Bernstam, Funda; Bernstam, Elmer V

2016-01-01

Objective: Clinical trials investigating drugs that target specific genetic alterations in tumors are important for promoting personalized cancer therapy. The goal of this project is to create a knowledge base of cancer treatment trials with annotations about genetic alterations from ClinicalTrials.gov. Methods: We developed a semi-automatic framework that combines advanced text-processing techniques with manual review to curate genetic alteration information in cancer trials. The framework consists of a document classification system to identify cancer treatment trials from ClinicalTrials.gov and an information extraction system to extract gene and alteration pairs from the Title and Eligibility Criteria sections of clinical trials. By applying the framework to trials at ClinicalTrials.gov, we created a knowledge base of cancer treatment trials with genetic alteration annotations. We then evaluated each component of the framework against manually reviewed sets of clinical trials and generated descriptive statistics of the knowledge base. Results and Discussion: The automated cancer treatment trial identification system achieved a high precision of 0.9944. Together with the manual review process, it identified 20 193 cancer treatment trials from ClinicalTrials.gov. The automated gene-alteration extraction system achieved a precision of 0.8300 and a recall of 0.6803. After validation by manual review, we generated a knowledge base of 2024 cancer trials that are labeled with specific genetic alteration information. Analysis of the knowledge base revealed the trend of increased use of targeted therapy for cancer, as well as top frequent gene-alteration pairs of interest. We expect this knowledge base to be a valuable resource for physicians and patients who are seeking information about personalized cancer therapy. PMID:27013523
Extracting genetic alteration information for personalized cancer therapy from ClinicalTrials.gov.

PubMed

Xu, Jun; Lee, Hee-Jin; Zeng, Jia; Wu, Yonghui; Zhang, Yaoyun; Huang, Liang-Chin; Johnson, Amber; Holla, Vijaykumar; Bailey, Ann M; Cohen, Trevor; Meric-Bernstam, Funda; Bernstam, Elmer V; Xu, Hua

2016-07-01

Clinical trials investigating drugs that target specific genetic alterations in tumors are important for promoting personalized cancer therapy. The goal of this project is to create a knowledge base of cancer treatment trials with annotations about genetic alterations from ClinicalTrials.gov. We developed a semi-automatic framework that combines advanced text-processing techniques with manual review to curate genetic alteration information in cancer trials. The framework consists of a document classification system to identify cancer treatment trials from ClinicalTrials.gov and an information extraction system to extract gene and alteration pairs from the Title and Eligibility Criteria sections of clinical trials. By applying the framework to trials at ClinicalTrials.gov, we created a knowledge base of cancer treatment trials with genetic alteration annotations. We then evaluated each component of the framework against manually reviewed sets of clinical trials and generated descriptive statistics of the knowledge base. The automated cancer treatment trial identification system achieved a high precision of 0.9944. Together with the manual review process, it identified 20 193 cancer treatment trials from ClinicalTrials.gov. The automated gene-alteration extraction system achieved a precision of 0.8300 and a recall of 0.6803. After validation by manual review, we generated a knowledge base of 2024 cancer trials that are labeled with specific genetic alteration information. Analysis of the knowledge base revealed the trend of increased use of targeted therapy for cancer, as well as top frequent gene-alteration pairs of interest. We expect this knowledge base to be a valuable resource for physicians and patients who are seeking information about personalized cancer therapy. © The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
RysannMD: A biomedical semantic annotator balancing speed and accuracy.

PubMed

Cuzzola, John; Jovanović, Jelena; Bagheri, Ebrahim

2017-07-01

Recently, both researchers and practitioners have explored the possibility of semantically annotating large and continuously evolving collections of biomedical texts such as research papers, medical reports, and physician notes in order to enable their efficient and effective management and use in clinical practice or research laboratories. Such annotations can be automatically generated by biomedical semantic annotators - tools that are specifically designed for detecting and disambiguating biomedical concepts mentioned in text. The biomedical community has already presented several solid automated semantic annotators. However, the existing tools are either strong in their disambiguation capacity, i.e., the ability to identify the correct biomedical concept for a given piece of text among several candidate concepts, or they excel in their processing time, i.e., work very efficiently, but none of the semantic annotation tools reported in the literature has both of these qualities. In this paper, we present RysannMD (Ryerson Semantic Annotator for Medical Domain), a biomedical semantic annotation tool that strikes a balance between processing time and performance while disambiguating biomedical terms. In other words, RysannMD provides reasonable disambiguation performance when choosing the right sense for a biomedical term in a given context, and does that in a reasonable time. To examine how RysannMD stands with respect to the state of the art biomedical semantic annotators, we have conducted a series of experiments using standard benchmarking corpora, including both gold and silver standards, and four modern biomedical semantic annotators, namely cTAKES, MetaMap, NOBLE Coder, and Neji. The annotators were compared with respect to the quality of the produced annotations measured against gold and silver standards using precision, recall, and F 1 measure and speed, i.e., processing time. In the experiments, RysannMD achieved the best median F 1 measure across the benchmarking corpora, independent of the standard used (silver/gold), biomedical subdomain, and document size. In terms of the annotation speed, RysannMD scored the second best median processing time across all the experiments. The obtained results indicate that RysannMD offers the best performance among the examined semantic annotators when both quality of annotation and speed are considered simultaneously. Copyright © 2017 Elsevier Inc. All rights reserved.
A procedural method for the efficient implementation of full-custom VLSI designs

NASA Technical Reports Server (NTRS)

Belk, P.; Hickey, N.

1987-01-01

An imbedded language system for the layout of very large scale integration (VLSI) circuits is examined. It is shown that through the judicious use of this system, a large variety of circuits can be designed with circuit density and performance comparable to traditional full-custom design methods, but with design costs more comparable to semi-custom design methods. The high performance of this methodology is attributable to the flexibility of procedural descriptions of VLSI layouts and to a number of automatic and semi-automatic tools within the system.
DeTEXT: A Database for Evaluating Text Extraction from Biomedical Literature Figures

PubMed Central

Yin, Xu-Cheng; Yang, Chun; Pei, Wei-Yi; Man, Haixia; Zhang, Jun; Learned-Miller, Erik; Yu, Hong

2015-01-01

Hundreds of millions of figures are available in biomedical literature, representing important biomedical experimental evidence. Since text is a rich source of information in figures, automatically extracting such text may assist in the task of mining figure information. A high-quality ground truth standard can greatly facilitate the development of an automated system. This article describes DeTEXT: A database for evaluating text extraction from biomedical literature figures. It is the first publicly available, human-annotated, high quality, and large-scale figure-text dataset with 288 full-text articles, 500 biomedical figures, and 9308 text regions. This article describes how figures were selected from open-access full-text biomedical articles and how annotation guidelines and annotation tools were developed. We also discuss the inter-annotator agreement and the reliability of the annotations. We summarize the statistics of the DeTEXT data and make available evaluation protocols for DeTEXT. Finally we lay out challenges we observed in the automated detection and recognition of figure text and discuss research directions in this area. DeTEXT is publicly available for downloading at http://prir.ustb.edu.cn/DeTEXT/. PMID:25951377
Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus.

PubMed

Stubbs, Amber; Uzuner, Özlem

2015-12-01

The 2014 i2b2/UTHealth natural language processing shared task featured a track focused on the de-identification of longitudinal medical records. For this track, we de-identified a set of 1304 longitudinal medical records describing 296 patients. This corpus was de-identified under a broad interpretation of the HIPAA guidelines using double-annotation followed by arbitration, rounds of sanity checking, and proof reading. The average token-based F1 measure for the annotators compared to the gold standard was 0.927. The resulting annotations were used both to de-identify the data and to set the gold standard for the de-identification track of the 2014 i2b2/UTHealth shared task. All annotated private health information were replaced with realistic surrogates automatically and then read over and corrected manually. The resulting corpus is the first of its kind made available for de-identification research. This corpus was first used for the 2014 i2b2/UTHealth shared task, during which the systems achieved a mean F-measure of 0.872 and a maximum F-measure of 0.964 using entity-based micro-averaged evaluations. Copyright © 2015 Elsevier Inc. All rights reserved.
Cross-organism learning method to discover new gene functionalities.

PubMed

Domeniconi, Giacomo; Masseroli, Marco; Moro, Gianluca; Pinoli, Pietro

2016-04-01

Knowledge of gene and protein functions is paramount for the understanding of physiological and pathological biological processes, as well as in the development of new drugs and therapies. Analyses for biomedical knowledge discovery greatly benefit from the availability of gene and protein functional feature descriptions expressed through controlled terminologies and ontologies, i.e., of gene and protein biomedical controlled annotations. In the last years, several databases of such annotations have become available; yet, these valuable annotations are incomplete, include errors and only some of them represent highly reliable human curated information. Computational techniques able to reliably predict new gene or protein annotations with an associated likelihood value are thus paramount. Here, we propose a novel cross-organisms learning approach to reliably predict new functionalities for the genes of an organism based on the known controlled annotations of the genes of another, evolutionarily related and better studied, organism. We leverage a new representation of the annotation discovery problem and a random perturbation of the available controlled annotations to allow the application of supervised algorithms to predict with good accuracy unknown gene annotations. Taking advantage of the numerous gene annotations available for a well-studied organism, our cross-organisms learning method creates and trains better prediction models, which can then be applied to predict new gene annotations of a target organism. We tested and compared our method with the equivalent single organism approach on different gene annotation datasets of five evolutionarily related organisms (Homo sapiens, Mus musculus, Bos taurus, Gallus gallus and Dictyostelium discoideum). Results show both the usefulness of the perturbation method of available annotations for better prediction model training and a great improvement of the cross-organism models with respect to the single-organism ones, without influence of the evolutionary distance between the considered organisms. The generated ranked lists of reliably predicted annotations, which describe novel gene functionalities and have an associated likelihood value, are very valuable both to complement available annotations, for better coverage in biomedical knowledge discovery analyses, and to quicken the annotation curation process, by focusing it on the prioritized novel annotations predicted. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
Evaluating a variety of text-mined features for automatic protein function prediction with GOstruct.

PubMed

Funk, Christopher S; Kahanda, Indika; Ben-Hur, Asa; Verspoor, Karin M

2015-01-01

Most computational methods that predict protein function do not take advantage of the large amount of information contained in the biomedical literature. In this work we evaluate both ontology term co-mention and bag-of-words features mined from the biomedical literature and analyze their impact in the context of a structured output support vector machine model, GOstruct. We find that even simple literature based features are useful for predicting human protein function (F-max: Molecular Function =0.408, Biological Process =0.461, Cellular Component =0.608). One advantage of using literature features is their ability to offer easy verification of automated predictions. We find through manual inspection of misclassifications that some false positive predictions could be biologically valid predictions based upon support extracted from the literature. Additionally, we present a "medium-throughput" pipeline that was used to annotate a large subset of co-mentions; we suggest that this strategy could help to speed up the rate at which proteins are curated.
Evaluating Functional Annotations of Enzymes Using the Gene Ontology.

PubMed

Holliday, Gemma L; Davidson, Rebecca; Akiva, Eyal; Babbitt, Patricia C

2017-01-01

The Gene Ontology (GO) (Ashburner et al., Nat Genet 25(1):25-29, 2000) is a powerful tool in the informatics arsenal of methods for evaluating annotations in a protein dataset. From identifying the nearest well annotated homologue of a protein of interest to predicting where misannotation has occurred to knowing how confident you can be in the annotations assigned to those proteins is critical. In this chapter we explore what makes an enzyme unique and how we can use GO to infer aspects of protein function based on sequence similarity. These can range from identification of misannotation or other errors in a predicted function to accurate function prediction for an enzyme of entirely unknown function. Although GO annotation applies to any gene products, we focus here a describing our approach for hierarchical classification of enzymes in the Structure-Function Linkage Database (SFLD) (Akiva et al., Nucleic Acids Res 42(Database issue):D521-530, 2014) as a guide for informed utilisation of annotation transfer based on GO terms.
Cloud-Based Evaluation of Anatomical Structure Segmentation and Landmark Detection Algorithms: VISCERAL Anatomy Benchmarks.

PubMed

Jimenez-Del-Toro, Oscar; Muller, Henning; Krenn, Markus; Gruenberg, Katharina; Taha, Abdel Aziz; Winterstein, Marianne; Eggel, Ivan; Foncubierta-Rodriguez, Antonio; Goksel, Orcun; Jakab, Andras; Kontokotsios, Georgios; Langs, Georg; Menze, Bjoern H; Salas Fernandez, Tomas; Schaer, Roger; Walleyo, Anna; Weber, Marc-Andre; Dicente Cid, Yashin; Gass, Tobias; Heinrich, Mattias; Jia, Fucang; Kahl, Fredrik; Kechichian, Razmig; Mai, Dominic; Spanier, Assaf B; Vincent, Graham; Wang, Chunliang; Wyeth, Daniel; Hanbury, Allan

2016-11-01

Variations in the shape and appearance of anatomical structures in medical images are often relevant radiological signs of disease. Automatic tools can help automate parts of this manual process. A cloud-based evaluation framework is presented in this paper including results of benchmarking current state-of-the-art medical imaging algorithms for anatomical structure segmentation and landmark detection: the VISCERAL Anatomy benchmarks. The algorithms are implemented in virtual machines in the cloud where participants can only access the training data and can be run privately by the benchmark administrators to objectively compare their performance in an unseen common test set. Overall, 120 computed tomography and magnetic resonance patient volumes were manually annotated to create a standard Gold Corpus containing a total of 1295 structures and 1760 landmarks. Ten participants contributed with automatic algorithms for the organ segmentation task, and three for the landmark localization task. Different algorithms obtained the best scores in the four available imaging modalities and for subsets of anatomical structures. The annotation framework, resulting data set, evaluation setup, results and performance analysis from the three VISCERAL Anatomy benchmarks are presented in this article. Both the VISCERAL data set and Silver Corpus generated with the fusion of the participant algorithms on a larger set of non-manually-annotated medical images are available to the research community.
Automatic blood detection in capsule endoscopy video

NASA Astrophysics Data System (ADS)

Novozámský, Adam; Flusser, Jan; Tachecí, Ilja; Sulík, Lukáš; Bureš, Jan; Krejcar, Ondřej

2016-12-01

We propose two automatic methods for detecting bleeding in wireless capsule endoscopy videos of the small intestine. The first one uses solely the color information, whereas the second one incorporates the assumptions about the blood spot shape and size. The original idea is namely the definition of a new color space that provides good separability of blood pixels and intestinal wall. Both methods can be applied either individually or their results can be fused together for the final decision. We evaluate their individual performance and various fusion rules on real data, manually annotated by an endoscopist.
Ecological adaptation determines functional mammalian olfactory subgenomes

PubMed Central

Hayden, Sara; Bekaert, Michaël; Crider, Tess A.; Mariani, Stefano; Murphy, William J.; Teeling, Emma C.

2010-01-01

The ability to smell is governed by the largest gene family in mammalian genomes, the olfactory receptor (OR) genes. Although these genes are well annotated in the finished human and mouse genomes, we still do not understand which receptors bind specific odorants or how they fully function. Previous comparative studies have been taxonomically limited and mostly focused on the percentage of OR pseudogenes within species. No study has investigated the adaptive changes of functional OR gene families across phylogenetically and ecologically diverse mammals. To determine the extent to which OR gene repertoires have been influenced by habitat, sensory specialization, and other ecological traits, to better understand the functional importance of specific OR gene families and thus the odorants they bind, we compared the functional OR gene repertoires from 50 mammalian genomes. We amplified more than 2000 OR genes in aquatic, semi-aquatic, and flying mammals and coupled these data with 48,000 OR genes from mostly terrestrial mammals, extracted from genomic projects. Phylogenomic, Bayesian assignment, and principle component analyses partitioned species by ecotype (aquatic, semi-aquatic, terrestrial, flying) rather than phylogenetic relatedness, and identified OR families important for each habitat. Functional OR gene repertoires were reduced independently in the multiple origins of aquatic mammals and were significantly divergent in bats. We reject recent neutralist views of olfactory subgenome evolution and correlate specific OR gene families with physiological requirements, a preliminary step toward unraveling the relationship between specific odors and respective OR gene families. PMID:19952139
A method for automatically extracting infectious disease-related primers and probes from the literature

PubMed Central

2010-01-01

Background Primer and probe sequences are the main components of nucleic acid-based detection systems. Biologists use primers and probes for different tasks, some related to the diagnosis and prescription of infectious diseases. The biological literature is the main information source for empirically validated primer and probe sequences. Therefore, it is becoming increasingly important for researchers to navigate this important information. In this paper, we present a four-phase method for extracting and annotating primer/probe sequences from the literature. These phases are: (1) convert each document into a tree of paper sections, (2) detect the candidate sequences using a set of finite state machine-based recognizers, (3) refine problem sequences using a rule-based expert system, and (4) annotate the extracted sequences with their related organism/gene information. Results We tested our approach using a test set composed of 297 manuscripts. The extracted sequences and their organism/gene annotations were manually evaluated by a panel of molecular biologists. The results of the evaluation show that our approach is suitable for automatically extracting DNA sequences, achieving precision/recall rates of 97.98% and 95.77%, respectively. In addition, 76.66% of the detected sequences were correctly annotated with their organism name. The system also provided correct gene-related information for 46.18% of the sequences assigned a correct organism name. Conclusions We believe that the proposed method can facilitate routine tasks for biomedical researchers using molecular methods to diagnose and prescribe different infectious diseases. In addition, the proposed method can be expanded to detect and extract other biological sequences from the literature. The extracted information can also be used to readily update available primer/probe databases or to create new databases from scratch. PMID:20682041
Toward automated biochemotype annotation for large compound libraries.

PubMed

Chen, Xian; Liang, Yizeng; Xu, Jun

2006-08-01

Combinatorial chemistry allows scientists to probe large synthetically accessible chemical space. However, identifying the sub-space which is selectively associated with an interested biological target, is crucial to drug discovery and life sciences. This paper describes a process to automatically annotate biochemotypes of compounds in a library and thus to identify bioactivity related chemotypes (biochemotypes) from a large library of compounds. The process consists of two steps: (1) predicting all possible bioactivities for each compound in a library, and (2) deriving possible biochemotypes based on predictions. The Prediction of Activity Spectra for Substances program (PASS) was used in the first step. In second step, structural similarity and scaffold-hopping technologies are employed. These technologies are used to derive biochemotypes from bioactivity predictions and the corresponding annotated biochemotypes from MDL Drug Data Report (MDDR) database. About a one million (982,889) commercially available compound library (CACL) has been tested using this process. This paper demonstrates the feasibility of automatically annotating biochemotypes for large libraries of compounds. Nevertheless, some issues need to be considered in order to improve the process. First, the prediction accuracy of PASS program has no significant correlation with the number of compounds in a training set. Larger training sets do not necessarily increase the maximal error of prediction (MEP), nor do they increase the hit structural diversity. Smaller training sets do not necessarily decrease MEP, nor do they decrease the hit structural diversity. Second, the success of systematic bioactivity prediction relies on modeling, training data, and the definition of bioactivities (biochemotype ontology). Unfortunately, the biochemotype ontology was not well developed in the PASS program. Consequently, "ill-defined" bioactivities can reduce the quality of predictions. This paper suggests the ways in which the systematic bioactivities prediction program should be improved.
Standards-based curation of a decade-old digital repository dataset of molecular information.

PubMed

Harvey, Matthew J; Mason, Nicholas J; McLean, Andrew; Murray-Rust, Peter; Rzepa, Henry S; Stewart, James J P

2015-01-01

The desirable curation of 158,122 molecular geometries derived from the NCI set of reference molecules together with associated properties computed using the MOPAC semi-empirical quantum mechanical method and originally deposited in 2005 into the Cambridge DSpace repository as a data collection is reported. The procedures involved in the curation included annotation of the original data using new MOPAC methods, updating the syntax of the CML documents used to express the data to ensure schema conformance and adding new metadata describing the entries together with a XML schema transformation to map the metadata schema to that used by the DataCite organisation. We have adopted a granularity model in which a DataCite persistent identifier (DOI) is created for each individual molecule to enable data discovery and data metrics at this level using DataCite tools. We recommend that the future research data management (RDM) of the scientific and chemical data components associated with journal articles (the "supporting information") should be conducted in a manner that facilitates automatic periodic curation. Graphical abstractStandards and metadata-based curation of a decade-old digital repository dataset of molecular information.
10 CFR Appendix J to Subpart B of... - Uniform Test Method for Measuring the Energy Consumption of Automatic and Semi-Automatic Clothes...

Code of Federal Regulations, 2010 CFR

2010-01-01

...'s true energy consumption characteristics as to provide materially inaccurate comparative data... clothes washers should be totally representative of the design, construction, and control system that will...
10 CFR Appendix J to Subpart B of... - Uniform Test Method for Measuring the Energy Consumption of Automatic and Semi-Automatic Clothes...

Code of Federal Regulations, 2012 CFR

2012-01-01

...'s true energy consumption characteristics as to provide materially inaccurate comparative data... clothes washers should be totally representative of the design, construction, and control system that will...
10 CFR Appendix J to Subpart B of... - Uniform Test Method for Measuring the Energy Consumption of Automatic and Semi-Automatic Clothes...

Code of Federal Regulations, 2011 CFR

2011-01-01

...'s true energy consumption characteristics as to provide materially inaccurate comparative data... clothes washers should be totally representative of the design, construction, and control system that will...
Semi-Supervised Recurrent Neural Network for Adverse Drug Reaction mention extraction.

PubMed

Gupta, Shashank; Pawar, Sachin; Ramrakhiyani, Nitin; Palshikar, Girish Keshav; Varma, Vasudeva

2018-06-13

Social media is a useful platform to share health-related information due to its vast reach. This makes it a good candidate for public-health monitoring tasks, specifically for pharmacovigilance. We study the problem of extraction of Adverse-Drug-Reaction (ADR) mentions from social media, particularly from Twitter. Medical information extraction from social media is challenging, mainly due to short and highly informal nature of text, as compared to more technical and formal medical reports. Current methods in ADR mention extraction rely on supervised learning methods, which suffer from labeled data scarcity problem. The state-of-the-art method uses deep neural networks, specifically a class of Recurrent Neural Network (RNN) which is Long-Short-Term-Memory network (LSTM). Deep neural networks, due to their large number of free parameters rely heavily on large annotated corpora for learning the end task. But in the real-world, it is hard to get large labeled data, mainly due to the heavy cost associated with the manual annotation. To this end, we propose a novel semi-supervised learning based RNN model, which can leverage unlabeled data also present in abundance on social media. Through experiments we demonstrate the effectiveness of our method, achieving state-of-the-art performance in ADR mention extraction. In this study, we tackle the problem of labeled data scarcity for Adverse Drug Reaction mention extraction from social media and propose a novel semi-supervised learning based method which can leverage large unlabeled corpus available in abundance on the web. Through empirical study, we demonstrate that our proposed method outperforms fully supervised learning based baseline which relies on large manually annotated corpus for a good performance.
The TREC Interactive Track: An Annotated Bibliography.

ERIC Educational Resources Information Center

Over, Paul

2001-01-01

Discussion of the study of interactive information retrieval (IR) at the Text Retrieval Conferences (TREC) focuses on summaries of the Interactive Track at each conference. Describes evolution of the track, which has changed from comparing human-machine systems with fully automatic systems to comparing interactive systems that focus on the search…

PANNZER2: a rapid functional annotation web server.

PubMed

Törönen, Petri; Medlar, Alan; Holm, Liisa

2018-05-08

The unprecedented growth of high-throughput sequencing has led to an ever-widening annotation gap in protein databases. While computational prediction methods are available to make up the shortfall, a majority of public web servers are hindered by practical limitations and poor performance. Here, we introduce PANNZER2 (Protein ANNotation with Z-scoRE), a fast functional annotation web server that provides both Gene Ontology (GO) annotations and free text description predictions. PANNZER2 uses SANSparallel to perform high-performance homology searches, making bulk annotation based on sequence similarity practical. PANNZER2 can output GO annotations from multiple scoring functions, enabling users to see which predictions are robust across predictors. Finally, PANNZER2 predictions scored within the top 10 methods for molecular function and biological process in the CAFA2 NK-full benchmark. The PANNZER2 web server is updated on a monthly schedule and is accessible at http://ekhidna2.biocenter.helsinki.fi/sanspanz/. The source code is available under the GNU Public Licence v3.
Biases in the Experimental Annotations of Protein Function and Their Effect on Our Understanding of Protein Function Space

PubMed Central

Schnoes, Alexandra M.; Ream, David C.; Thorman, Alexander W.; Babbitt, Patricia C.; Friedberg, Iddo

2013-01-01

The ongoing functional annotation of proteins relies upon the work of curators to capture experimental findings from scientific literature and apply them to protein sequence and structure data. However, with the increasing use of high-throughput experimental assays, a small number of experimental studies dominate the functional protein annotations collected in databases. Here, we investigate just how prevalent is the “few articles - many proteins” phenomenon. We examine the experimentally validated annotation of proteins provided by several groups in the GO Consortium, and show that the distribution of proteins per published study is exponential, with 0.14% of articles providing the source of annotations for 25% of the proteins in the UniProt-GOA compilation. Since each of the dominant articles describes the use of an assay that can find only one function or a small group of functions, this leads to substantial biases in what we know about the function of many proteins. Mass-spectrometry, microscopy and RNAi experiments dominate high throughput experiments. Consequently, the functional information derived from these experiments is mostly of the subcellular location of proteins, and of the participation of proteins in embryonic developmental pathways. For some organisms, the information provided by different studies overlap by a large amount. We also show that the information provided by high throughput experiments is less specific than those provided by low throughput experiments. Given the experimental techniques available, certain biases in protein function annotation due to high-throughput experiments are unavoidable. Knowing that these biases exist and understanding their characteristics and extent is important for database curators, developers of function annotation programs, and anyone who uses protein function annotation data to plan experiments. PMID:23737737
A multiresolution prostate representation for automatic segmentation in magnetic resonance images.

PubMed

Alvarez, Charlens; Martínez, Fabio; Romero, Eduardo

2017-04-01

Accurate prostate delineation is necessary in radiotherapy processes for concentrating the dose onto the prostate and reducing side effects in neighboring organs. Currently, manual delineation is performed over magnetic resonance imaging (MRI) taking advantage of its high soft tissue contrast property. Nevertheless, as human intervention is a consuming task with high intra- and interobserver variability rates, (semi)-automatic organ delineation tools have emerged to cope with these challenges, reducing the time spent for these tasks. This work presents a multiresolution representation that defines a novel metric and allows to segment a new prostate by combining a set of most similar prostates in a dataset. The proposed method starts by selecting the set of most similar prostates with respect to a new one using the proposed multiresolution representation. This representation characterizes the prostate through a set of salient points, extracted from a region of interest (ROI) that encloses the organ and refined using structural information, allowing to capture main relevant features of the organ boundary. Afterward, the new prostate is automatically segmented by combining the nonrigidly registered expert delineations associated to the previous selected similar prostates using a weighted patch-based strategy. Finally, the prostate contour is smoothed based on morphological operations. The proposed approach was evaluated with respect to the expert manual segmentation under a leave-one-out scheme using two public datasets, obtaining averaged Dice coefficients of 82% ± 0.07 and 83% ± 0.06, and demonstrating a competitive performance with respect to atlas-based state-of-the-art methods. The proposed multiresolution representation provides a feature space that follows a local salient point criteria and a global rule of the spatial configuration among these points to find out the most similar prostates. This strategy suggests an easy adaptation in the clinical routine, as supporting tool for annotation. © 2017 American Association of Physicists in Medicine.
To do it or to let an automatic tool do it? The priority of control over effort.

PubMed

Osiurak, François; Wagner, Clara; Djerbi, Sara; Navarro, Jordan

2013-01-01

The aim of the present study is to provide experimental data relevant to the issue of what leads humans to use automatic tools. Two answers can be offered. The first is that humans strive to minimize physical and/or cognitive effort (principle of least effort). The second is that humans tend to keep their perceived control over the environment (principle of more control). These two factors certainly play a role, but the question raised here is to what do people give priority in situations wherein both manual and automatic actions take the same time - minimizing effort or keeping perceived control? To answer that question, we built four experiments in which participants were confronted with a recurring choice between performing a task manually (physical effort) or in a semi-automatic way (cognitive effort) versus using an automatic tool that completes the task for them (no effort). In this latter condition, participants were required to follow the progression of the automatic tool step by step. Our results showed that participants favored the manual or semi-automatic condition over the automatic condition. However, when they were offered the opportunity to perform recreational tasks in parallel, the shift toward manual condition disappeared. The findings give support to the idea that people give priority to keeping control over minimizing effort.
Validation of a semi-automatic protocol for the assessment of the tear meniscus central area based on open-source software

NASA Astrophysics Data System (ADS)

Pena-Verdeal, Hugo; Garcia-Resua, Carlos; Yebra-Pimentel, Eva; Giraldez, Maria J.

2017-08-01

Purpose: Different lower tear meniscus parameters can be clinical assessed on dry eye diagnosis. The aim of this study was to propose and analyse the variability of a semi-automatic method for measuring lower tear meniscus central area (TMCA) by using open source software. Material and methods: On a group of 105 subjects, one video of the lower tear meniscus after fluorescein instillation was generated by a digital camera attached to a slit-lamp. A short light beam (3x5 mm) with moderate illumination in the central portion of the meniscus (6 o'clock) was used. Images were extracted from each video by a masked observer. By using an open source software based on Java (NIH ImageJ), a further observer measured in a masked and randomized order the TMCA in the short light beam illuminated area by two methods: (1) manual method, where TMCA images was "manually" measured; (2) semi-automatic method, where TMCA images were transformed in an 8-bit-binary image, then holes inside this shape were filled and on the isolated shape, the area size was obtained. Finally, both measurements, manual and semi-automatic, were compared. Results: Paired t-test showed no statistical difference between both techniques results (p = 0.102). Pearson correlation between techniques show a significant positive near to perfect correlation (r = 0.99; p < 0.001). Conclusions: This study showed a useful tool to objectively measure the frontal central area of the meniscus in photography by free open source software.
Semi-automatic mapping of cultural heritage from airborne laser scanning using deep learning

NASA Astrophysics Data System (ADS)

Due Trier, Øivind; Salberg, Arnt-Børre; Holger Pilø, Lars; Tonning, Christer; Marius Johansen, Hans; Aarsten, Dagrun

2016-04-01

This paper proposes to use deep learning to improve semi-automatic mapping of cultural heritage from airborne laser scanning (ALS) data. Automatic detection methods, based on traditional pattern recognition, have been applied in a number of cultural heritage mapping projects in Norway for the past five years. Automatic detection of pits and heaps have been combined with visual interpretation of the ALS data for the mapping of deer hunting systems, iron production sites, grave mounds and charcoal kilns. However, the performance of the automatic detection methods varies substantially between ALS datasets. For the mapping of deer hunting systems on flat gravel and sand sediment deposits, the automatic detection results were almost perfect. However, some false detections appeared in the terrain outside of the sediment deposits. These could be explained by other pit-like landscape features, like parts of river courses, spaces between boulders, and modern terrain modifications. However, these were easy to spot during visual interpretation, and the number of missed individual pitfall traps was still low. For the mapping of grave mounds, the automatic method produced a large number of false detections, reducing the usefulness of the semi-automatic approach. The mound structure is a very common natural terrain feature, and the grave mounds are less distinct in shape than the pitfall traps. Still, applying automatic mound detection on an entire municipality did lead to a new discovery of an Iron Age grave field with more than 15 individual mounds. Automatic mound detection also proved to be useful for a detailed re-mapping of Norway's largest Iron Age grave yard, which contains almost 1000 individual graves. Combined pit and mound detection has been applied to the mapping of more than 1000 charcoal kilns that were used by an iron work 350-200 years ago. The majority of charcoal kilns were indirectly detected as either pits on the circumference, a central mound, or both. However, kilns with a flat interior and a shallow ditch along the circumference were often missed by the automatic detection method. The successfulness of automatic detection seems to depend on two factors: (1) the density of ALS ground hits on the cultural heritage structures being sought, and (2) to what extent these structures stand out from natural terrain structures. The first factor may, to some extent, be improved by using a higher number of ALS pulses per square meter. The second factor is difficult to change, and also highlights another challenge: how to make a general automatic method that is applicable in all types of terrain within a country. The mixed experience with traditional pattern recognition for semi-automatic mapping of cultural heritage led us to consider deep learning as an alternative approach. The main principle is that a general feature detector has been trained on a large image database. The feature detector is then tailored to a specific task by using a modest number of images of true and false examples of the features being sought. Results of using deep learning are compared with previous results using traditional pattern recognition.
Enhanced functionalities for annotating and indexing clinical text with the NCBO Annotator.

PubMed

Tchechmedjiev, Andon; Abdaoui, Amine; Emonet, Vincent; Melzi, Soumia; Jonnagaddala, Jitendra; Jonquet, Clement

2018-06-01

Second use of clinical data commonly involves annotating biomedical text with terminologies and ontologies. The National Center for Biomedical Ontology Annotator is a frequently used annotation service, originally designed for biomedical data, but not very suitable for clinical text annotation. In order to add new functionalities to the NCBO Annotator without hosting or modifying the original Web service, we have designed a proxy architecture that enables seamless extensions by pre-processing of the input text and parameters, and post processing of the annotations. We have then implemented enhanced functionalities for annotating and indexing free text such as: scoring, detection of context (negation, experiencer, temporality), new output formats and coarse-grained concept recognition (with UMLS Semantic Groups). In this paper, we present the NCBO Annotator+, a Web service which incorporates these new functionalities as well as a small set of evaluation results for concept recognition and clinical context detection on two standard evaluation tasks (Clef eHealth 2017, SemEval 2014). The Annotator+ has been successfully integrated into the SIFR BioPortal platform-an implementation of NCBO BioPortal for French biomedical terminologies and ontologies-to annotate English text. A Web user interface is available for testing and ontology selection (http://bioportal.lirmm.fr/ncbo_annotatorplus); however the Annotator+ is meant to be used through the Web service application programming interface (http://services.bioportal.lirmm.fr/ncbo_annotatorplus). The code is openly available, and we also provide a Docker packaging to enable easy local deployment to process sensitive (e.g. clinical) data in-house (https://github.com/sifrproject). andon.tchechmedjiev@lirmm.fr. Supplementary data are available at Bioinformatics online.
dbWFA: a web-based database for functional annotation of Triticum aestivum transcripts

PubMed Central

Vincent, Jonathan; Dai, Zhanwu; Ravel, Catherine; Choulet, Frédéric; Mouzeyar, Said; Bouzidi, M. Fouad; Agier, Marie; Martre, Pierre

2013-01-01

The functional annotation of genes based on sequence homology with genes from model species genomes is time-consuming because it is necessary to mine several unrelated databases. The aim of the present work was to develop a functional annotation database for common wheat Triticum aestivum (L.). The database, named dbWFA, is based on the reference NCBI UniGene set, an expressed gene catalogue built by expressed sequence tag clustering, and on full-length coding sequences retrieved from the TriFLDB database. Information from good-quality heterogeneous sources, including annotations for model plant species Arabidopsis thaliana (L.) Heynh. and Oryza sativa L., was gathered and linked to T. aestivum sequences through BLAST-based homology searches. Even though the complexity of the transcriptome cannot yet be fully appreciated, we developed a tool to easily and promptly obtain information from multiple functional annotation systems (Gene Ontology, MapMan bin codes, MIPS Functional Categories, PlantCyc pathway reactions and TAIR gene families). The use of dbWFA is illustrated here with several query examples. We were able to assign a putative function to 45% of the UniGenes and 81% of the full-length coding sequences from TriFLDB. Moreover, comparison of the annotation of the whole T. aestivum UniGene set along with curated annotations of the two model species assessed the accuracy of the annotation provided by dbWFA. To further illustrate the use of dbWFA, genes specifically expressed during the early cell division or late storage polymer accumulation phases of T. aestivum grain development were identified using a clustering analysis and then annotated using dbWFA. The annotation of these two sets of genes was consistent with previous analyses of T. aestivum grain transcriptomes and proteomes. Database URL: urgi.versailles.inra.fr/dbWFA/ PMID:23660284
HMMerThread: detecting remote, functional conserved domains in entire genomes by combining relaxed sequence-database searches with fold recognition.

PubMed

Bradshaw, Charles Richard; Surendranath, Vineeth; Henschel, Robert; Mueller, Matthias Stefan; Habermann, Bianca Hermine

2011-03-10

Conserved domains in proteins are one of the major sources of functional information for experimental design and genome-level annotation. Though search tools for conserved domain databases such as Hidden Markov Models (HMMs) are sensitive in detecting conserved domains in proteins when they share sufficient sequence similarity, they tend to miss more divergent family members, as they lack a reliable statistical framework for the detection of low sequence similarity. We have developed a greatly improved HMMerThread algorithm that can detect remotely conserved domains in highly divergent sequences. HMMerThread combines relaxed conserved domain searches with fold recognition to eliminate false positive, sequence-based identifications. With an accuracy of 90%, our software is able to automatically predict highly divergent members of conserved domain families with an associated 3-dimensional structure. We give additional confidence to our predictions by validation across species. We have run HMMerThread searches on eight proteomes including human and present a rich resource of remotely conserved domains, which adds significantly to the functional annotation of entire proteomes. We find ∼4500 cross-species validated, remotely conserved domain predictions in the human proteome alone. As an example, we find a DNA-binding domain in the C-terminal part of the A-kinase anchor protein 10 (AKAP10), a PKA adaptor that has been implicated in cardiac arrhythmias and premature cardiac death, which upon stress likely translocates from mitochondria to the nucleus/nucleolus. Based on our prediction, we propose that with this HLH-domain, AKAP10 is involved in the transcriptional control of stress response. Further remotely conserved domains we discuss are examples from areas such as sporulation, chromosome segregation and signalling during immune response. The HMMerThread algorithm is able to automatically detect the presence of remotely conserved domains in proteins based on weak sequence similarity. Our predictions open up new avenues for biological and medical studies. Genome-wide HMMerThread domains are available at http://vm1-hmmerthread.age.mpg.de.
HMMerThread: Detecting Remote, Functional Conserved Domains in Entire Genomes by Combining Relaxed Sequence-Database Searches with Fold Recognition

PubMed Central

Bradshaw, Charles Richard; Surendranath, Vineeth; Henschel, Robert; Mueller, Matthias Stefan; Habermann, Bianca Hermine

2011-01-01

Conserved domains in proteins are one of the major sources of functional information for experimental design and genome-level annotation. Though search tools for conserved domain databases such as Hidden Markov Models (HMMs) are sensitive in detecting conserved domains in proteins when they share sufficient sequence similarity, they tend to miss more divergent family members, as they lack a reliable statistical framework for the detection of low sequence similarity. We have developed a greatly improved HMMerThread algorithm that can detect remotely conserved domains in highly divergent sequences. HMMerThread combines relaxed conserved domain searches with fold recognition to eliminate false positive, sequence-based identifications. With an accuracy of 90%, our software is able to automatically predict highly divergent members of conserved domain families with an associated 3-dimensional structure. We give additional confidence to our predictions by validation across species. We have run HMMerThread searches on eight proteomes including human and present a rich resource of remotely conserved domains, which adds significantly to the functional annotation of entire proteomes. We find ∼4500 cross-species validated, remotely conserved domain predictions in the human proteome alone. As an example, we find a DNA-binding domain in the C-terminal part of the A-kinase anchor protein 10 (AKAP10), a PKA adaptor that has been implicated in cardiac arrhythmias and premature cardiac death, which upon stress likely translocates from mitochondria to the nucleus/nucleolus. Based on our prediction, we propose that with this HLH-domain, AKAP10 is involved in the transcriptional control of stress response. Further remotely conserved domains we discuss are examples from areas such as sporulation, chromosome segregation and signalling during immune response. The HMMerThread algorithm is able to automatically detect the presence of remotely conserved domains in proteins based on weak sequence similarity. Our predictions open up new avenues for biological and medical studies. Genome-wide HMMerThread domains are available at http://vm1-hmmerthread.age.mpg.de. PMID:21423752
Unified Sequence-Based Association Tests Allowing for Multiple Functional Annotations and Meta-analysis of Noncoding Variation in Metabochip Data.

PubMed

He, Zihuai; Xu, Bin; Lee, Seunggeun; Ionita-Laza, Iuliana

2017-09-07

Substantial progress has been made in the functional annotation of genetic variation in the human genome. Integrative analysis that incorporates such functional annotations into sequencing studies can aid the discovery of disease-associated genetic variants, especially those with unknown function and located outside protein-coding regions. Direct incorporation of one functional annotation as weight in existing dispersion and burden tests can suffer substantial loss of power when the functional annotation is not predictive of the risk status of a variant. Here, we have developed unified tests that can utilize multiple functional annotations simultaneously for integrative association analysis with efficient computational techniques. We show that the proposed tests significantly improve power when variant risk status can be predicted by functional annotations. Importantly, when functional annotations are not predictive of risk status, the proposed tests incur only minimal loss of power in relation to existing dispersion and burden tests, and under certain circumstances they can even have improved power by learning a weight that better approximates the underlying disease model in a data-adaptive manner. The tests can be constructed with summary statistics of existing dispersion and burden tests for sequencing data, therefore allowing meta-analysis of multiple studies without sharing individual-level data. We applied the proposed tests to a meta-analysis of noncoding rare variants in Metabochip data on 12,281 individuals from eight studies for lipid traits. By incorporating the Eigen functional score, we detected significant associations between noncoding rare variants in SLC22A3 and low-density lipoprotein and total cholesterol, associations that are missed by standard dispersion and burden tests. Copyright © 2017 American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
Optimizing the 3D-reconstruction technique for serial block-face scanning electron microscopy.

PubMed

Wernitznig, Stefan; Sele, Mariella; Urschler, Martin; Zankel, Armin; Pölt, Peter; Rind, F Claire; Leitinger, Gerd

2016-05-01

Elucidating the anatomy of neuronal circuits and localizing the synaptic connections between neurons, can give us important insights in how the neuronal circuits work. We are using serial block-face scanning electron microscopy (SBEM) to investigate the anatomy of a collision detection circuit including the Lobula Giant Movement Detector (LGMD) neuron in the locust, Locusta migratoria. For this, thousands of serial electron micrographs are produced that allow us to trace the neuronal branching pattern. The reconstruction of neurons was previously done manually by drawing cell outlines of each cell in each image separately. This approach was very time consuming and troublesome. To make the process more efficient a new interactive software was developed. It uses the contrast between the neuron under investigation and its surrounding for semi-automatic segmentation. For segmentation the user sets starting regions manually and the algorithm automatically selects a volume within the neuron until the edges corresponding to the neuronal outline are reached. Internally the algorithm optimizes a 3D active contour segmentation model formulated as a cost function taking the SEM image edges into account. This reduced the reconstruction time, while staying close to the manual reference segmentation result. Our algorithm is easy to use for a fast segmentation process, unlike previous methods it does not require image training nor an extended computing capacity. Our semi-automatic segmentation algorithm led to a dramatic reduction in processing time for the 3D-reconstruction of identified neurons. Copyright © 2016 Elsevier B.V. All rights reserved.
Semi-automated CCTV surveillance: the effects of system confidence, system accuracy and task complexity on operator vigilance, reliance and workload.

PubMed

Dadashi, N; Stedmon, A W; Pridmore, T P

2013-09-01

Recent advances in computer vision technology have lead to the development of various automatic surveillance systems, however their effectiveness is adversely affected by many factors and they are not completely reliable. This study investigated the potential of a semi-automated surveillance system to reduce CCTV operator workload in both detection and tracking activities. A further focus of interest was the degree of user reliance on the automated system. A simulated prototype was developed which mimicked an automated system that provided different levels of system confidence information. Dependent variable measures were taken for secondary task performance, reliance and subjective workload. When the automatic component of a semi-automatic CCTV surveillance system provided reliable system confidence information to operators, workload significantly decreased and spare mental capacity significantly increased. Providing feedback about system confidence and accuracy appears to be one important way of making the status of the automated component of the surveillance system more 'visible' to users and hence more effective to use. Copyright © 2012 Elsevier Ltd and The Ergonomics Society. All rights reserved.
New in protein structure and function annotation: hotspots, single nucleotide polymorphisms and the 'Deep Web'.

PubMed

Bromberg, Yana; Yachdav, Guy; Ofran, Yanay; Schneider, Reinhard; Rost, Burkhard

2009-05-01

The rapidly increasing quantity of protein sequence data continues to widen the gap between available sequences and annotations. Comparative modeling suggests some aspects of the 3D structures of approximately half of all known proteins; homology- and network-based inferences annotate some aspect of function for a similar fraction of the proteome. For most known protein sequences, however, there is detailed knowledge about neither their function nor their structure. Comprehensive efforts towards the expert curation of sequence annotations have failed to meet the demand of the rapidly increasing number of available sequences. Only the automated prediction of protein function in the absence of homology can close the gap between available sequences and annotations in the foreseeable future. This review focuses on two novel methods for automated annotation, and briefly presents an outlook on how modern web software may revolutionize the field of protein sequence annotation. First, predictions of protein binding sites and functional hotspots, and the evolution of these into the most successful type of prediction of protein function from sequence will be discussed. Second, a new tool, comprehensive in silico mutagenesis, which contributes important novel predictions of function and at the same time prepares for the onset of the next sequencing revolution, will be described. While these two new sub-fields of protein prediction represent the breakthroughs that have been achieved methodologically, it will then be argued that a different development might further change the way biomedical researchers benefit from annotations: modern web software can connect the worldwide web in any browser with the 'Deep Web' (ie, proprietary data resources). The availability of this direct connection, and the resulting access to a wealth of data, may impact drug discovery and development more than any existing method that contributes to protein annotation.
NoGOA: predicting noisy GO annotations using evidences and sparse representation.

PubMed

Yu, Guoxian; Lu, Chang; Wang, Jun

2017-07-21

Gene Ontology (GO) is a community effort to represent functional features of gene products. GO annotations (GOA) provide functional associations between GO terms and gene products. Due to resources limitation, only a small portion of annotations are manually checked by curators, and the others are electronically inferred. Although quality control techniques have been applied to ensure the quality of annotations, the community consistently report that there are still considerable noisy (or incorrect) annotations. Given the wide application of annotations, however, how to identify noisy annotations is an important but yet seldom studied open problem. We introduce a novel approach called NoGOA to predict noisy annotations. NoGOA applies sparse representation on the gene-term association matrix to reduce the impact of noisy annotations, and takes advantage of sparse representation coefficients to measure the semantic similarity between genes. Secondly, it preliminarily predicts noisy annotations of a gene based on aggregated votes from semantic neighborhood genes of that gene. Next, NoGOA estimates the ratio of noisy annotations for each evidence code based on direct annotations in GOA files archived on different periods, and then weights entries of the association matrix via estimated ratios and propagates weights to ancestors of direct annotations using GO hierarchy. Finally, it integrates evidence-weighted association matrix and aggregated votes to predict noisy annotations. Experiments on archived GOA files of six model species (H. sapiens, A. thaliana, S. cerevisiae, G. gallus, B. Taurus and M. musculus) demonstrate that NoGOA achieves significantly better results than other related methods and removing noisy annotations improves the performance of gene function prediction. The comparative study justifies the effectiveness of integrating evidence codes with sparse representation for predicting noisy GO annotations. Codes and datasets are available at http://mlda.swu.edu.cn/codes.php?name=NoGOA .
Automatic, semi-automatic and manual validation of urban drainage data.

PubMed

Branisavljević, N; Prodanović, D; Pavlović, D

2010-01-01

Advances in sensor technology and the possibility of automated long distance data transmission have made continuous measurements the preferable way of monitoring urban drainage processes. Usually, the collected data have to be processed by an expert in order to detect and mark the wrong data, remove them and replace them with interpolated data. In general, the first step in detecting the wrong, anomaly data is called the data quality assessment or data validation. Data validation consists of three parts: data preparation, validation scores generation and scores interpretation. This paper will present the overall framework for the data quality improvement system, suitable for automatic, semi-automatic or manual operation. The first two steps of the validation process are explained in more detail, using several validation methods on the same set of real-case data from the Belgrade sewer system. The final part of the validation process, which is the scores interpretation, needs to be further investigated on the developed system.
Semi-automatic brain tumor segmentation by constrained MRFs using structural trajectories.

PubMed

Zhao, Liang; Wu, Wei; Corso, Jason J

2013-01-01

Quantifying volume and growth of a brain tumor is a primary prognostic measure and hence has received much attention in the medical imaging community. Most methods have sought a fully automatic segmentation, but the variability in shape and appearance of brain tumor has limited their success and further adoption in the clinic. In reaction, we present a semi-automatic brain tumor segmentation framework for multi-channel magnetic resonance (MR) images. This framework does not require prior model construction and only requires manual labels on one automatically selected slice. All other slices are labeled by an iterative multi-label Markov random field optimization with hard constraints. Structural trajectories-the medical image analog to optical flow and 3D image over-segmentation are used to capture pixel correspondences between consecutive slices for pixel labeling. We show robustness and effectiveness through an evaluation on the 2012 MICCAI BRATS Challenge Dataset; our results indicate superior performance to baselines and demonstrate the utility of the constrained MRF formulation.
An independent software system for the analysis of dynamic MR images.

PubMed

Torheim, G; Lombardi, M; Rinck, P A

1997-01-01

A computer system for the manual, semi-automatic, and automatic analysis of dynamic MR images was to be developed on UNIX and personal computer platforms. The system was to offer an integrated and standardized way of performing both image processing and analysis that was independent of the MR unit used. The system consists of modules that are easily adaptable to special needs. Data from MR units or other diagnostic imaging equipment in techniques such as CT, ultrasonography, or nuclear medicine can be processed through the ACR-NEMA/DICOM standard file formats. A full set of functions is available, among them cine-loop visual analysis, and generation of time-intensity curves. Parameters such as cross-correlation coefficients, area under the curve, peak/maximum intensity, wash-in and wash-out slopes, time to peak, and relative signal intensity/contrast enhancement can be calculated. Other parameters can be extracted by fitting functions like the gamma-variate function. Region-of-interest data and parametric values can easily be exported. The system has been successfully tested in animal and patient examinations.
In silico analysis of expressed sequence tags from Trichostrongylus vitrinus (Nematoda): comparison of the automated ESTExplorer workflow platform with conventional database searches.

PubMed

Nagaraj, Shivashankar H; Gasser, Robin B; Nisbet, Alasdair J; Ranganathan, Shoba

2008-01-01

The analysis of expressed sequence tags (EST) offers a rapid and cost effective approach to elucidate the transcriptome of an organism, but requires several computational methods for assembly and annotation. Researchers frequently analyse each step manually, which is laborious and time consuming. We have recently developed ESTExplorer, a semi-automated computational workflow system, in order to achieve the rapid analysis of EST datasets. In this study, we evaluated EST data analysis for the parasitic nematode Trichostrongylus vitrinus (order Strongylida) using ESTExplorer, compared with database matching alone. We functionally annotated 1776 ESTs obtained via suppressive-subtractive hybridisation from T. vitrinus, an important parasitic trichostrongylid of small ruminants. Cluster and comparative genomic analyses of the transcripts using ESTExplorer indicated that 290 (41%) sequences had homologues in Caenorhabditis elegans, 329 (42%) in parasitic nematodes, 202 (28%) in organisms other than nematodes, and 218 (31%) had no significant match to any sequence in the current databases. Of the C. elegans homologues, 90 were associated with 'non-wildtype' double-stranded RNA interference (RNAi) phenotypes, including embryonic lethality, maternal sterility, sterile progeny, larval arrest and slow growth. We could functionally classify 267 (38%) sequences using the Gene Ontologies (GO) and establish pathway associations for 230 (33%) sequences using the Kyoto Encyclopedia of Genes and Genomes (KEGG). Further examination of this EST dataset revealed a number of signalling molecules, proteases, protease inhibitors, enzymes, ion channels and immune-related genes. In addition, we identified 40 putative secreted proteins that could represent potential candidates for developing novel anthelmintics or vaccines. We further compared the automated EST sequence annotations, using ESTExplorer, with database search results for individual T. vitrinus ESTs. ESTExplorer reliably and rapidly annotated 301 ESTs, with pathway and GO information, eliminating 60 low quality hits from database searches. We evaluated the efficacy of ESTExplorer in analysing EST data, and demonstrate that computational tools can be used to accelerate the process of gene discovery in EST sequencing projects. The present study has elucidated sets of relatively conserved and potentially novel genes for biological investigation, and the annotated EST set provides further insight into the molecular biology of T. vitrinus, towards the identification of novel drug targets.
Annotating Protein Functional Residues by Coupling High-Throughput Fitness Profile and Homologous-Structure Analysis

PubMed Central

Du, Yushen; Wu, Nicholas C.; Jiang, Lin; Zhang, Tianhao; Gong, Danyang; Shu, Sara; Wu, Ting-Ting

2016-01-01

ABSTRACT Identification and annotation of functional residues are fundamental questions in protein sequence analysis. Sequence and structure conservation provides valuable information to tackle these questions. It is, however, limited by the incomplete sampling of sequence space in natural evolution. Moreover, proteins often have multiple functions, with overlapping sequences that present challenges to accurate annotation of the exact functions of individual residues by conservation-based methods. Using the influenza A virus PB1 protein as an example, we developed a method to systematically identify and annotate functional residues. We used saturation mutagenesis and high-throughput sequencing to measure the replication capacity of single nucleotide mutations across the entire PB1 protein. After predicting protein stability upon mutations, we identified functional PB1 residues that are essential for viral replication. To further annotate the functional residues important to the canonical or noncanonical functions of viral RNA-dependent RNA polymerase (vRdRp), we performed a homologous-structure analysis with 16 different vRdRp structures. We achieved high sensitivity in annotating the known canonical polymerase functional residues. Moreover, we identified a cluster of noncanonical functional residues located in the loop region of the PB1 β-ribbon. We further demonstrated that these residues were important for PB1 protein nuclear import through the interaction with Ran-binding protein 5. In summary, we developed a systematic and sensitive method to identify and annotate functional residues that are not restrained by sequence conservation. Importantly, this method is generally applicable to other proteins about which homologous-structure information is available. PMID:27803181

Towards comprehensive structural motif mining for better fold annotation in the "twilight zone" of sequence dissimilarity

PubMed Central

Jia, Yi; Huan, Jun; Buhr, Vincent; Zhang, Jintao; Carayannopoulos, Leonidas N

2009-01-01

Background Automatic identification of structure fingerprints from a group of diverse protein structures is challenging, especially for proteins whose divergent amino acid sequences may fall into the "twilight-" or "midnight-" zones where pair-wise sequence identities to known sequences fall below 25% and sequence-based functional annotations often fail. Results Here we report a novel graph database mining method and demonstrate its application to protein structure pattern identification and structure classification. The biologic motivation of our study is to recognize common structure patterns in "immunoevasins", proteins mediating virus evasion of host immune defense. Our experimental study, using both viral and non-viral proteins, demonstrates the efficiency and efficacy of the proposed method. Conclusion We present a theoretic framework, offer a practical software implementation for incorporating prior domain knowledge, such as substitution matrices as studied here, and devise an efficient algorithm to identify approximate matched frequent subgraphs. By doing so, we significantly expanded the analytical power of sophisticated data mining algorithms in dealing with large volume of complicated and noisy protein structure data. And without loss of generality, choice of appropriate compatibility matrices allows our method to be easily employed in domains where subgraph labels have some uncertainty. PMID:19208148
Automated computer-based detection of encounter behaviours in groups of honeybees.

PubMed

Blut, Christina; Crespi, Alessandro; Mersch, Danielle; Keller, Laurent; Zhao, Linlin; Kollmann, Markus; Schellscheidt, Benjamin; Fülber, Carsten; Beye, Martin

2017-12-15

Honeybees form societies in which thousands of members integrate their behaviours to act as a single functional unit. We have little knowledge on how the collaborative features are regulated by workers' activities because we lack methods that enable collection of simultaneous and continuous behavioural information for each worker bee. In this study, we introduce the Bee Behavioral Annotation System (BBAS), which enables the automated detection of bees' behaviours in small observation hives. Continuous information on position and orientation were obtained by marking worker bees with 2D barcodes in a small observation hive. We computed behavioural and social features from the tracking information to train a behaviour classifier for encounter behaviours (interaction of workers via antennation) using a machine learning-based system. The classifier correctly detected 93% of the encounter behaviours in a group of bees, whereas 13% of the falsely classified behaviours were unrelated to encounter behaviours. The possibility of building accurate classifiers for automatically annotating behaviours may allow for the examination of individual behaviours of worker bees in the social environments of small observation hives. We envisage that BBAS will be a powerful tool for detecting the effects of experimental manipulation of social attributes and sub-lethal effects of pesticides on behaviour.
Digging into the low molecular weight peptidome with the OligoNet web server.

PubMed

Liu, Youzhong; Forcisi, Sara; Lucio, Marianna; Harir, Mourad; Bahut, Florian; Deleris-Bou, Magali; Krieger-Weber, Sibylle; Gougeon, Régis D; Alexandre, Hervé; Schmitt-Kopplin, Philippe

2017-09-15

Bioactive peptides play critical roles in regulating many biological processes. Recently, natural short peptides biomarkers are drawing significant attention and are considered as "hidden treasure" of drug candidates. High resolution and high mass accuracy provided by mass spectrometry (MS)-based untargeted metabolomics would enable the rapid detection and wide coverage of the low-molecular-weight peptidome. However, translating unknown masses (<1 500 Da) into putative peptides is often limited due to the lack of automatic data processing tools and to the limit of peptide databases. The web server OligoNet responds to this challenge by attempting to decompose each individual mass into a combination of amino acids out of metabolomics datasets. It provides an additional network-based data interpretation named "Peptide degradation network" (PDN), which unravels interesting relations between annotated peptides and generates potential functional patterns. The ab initio PDN built from yeast metabolic profiling data shows a great similarity with well-known metabolic networks, and could aid biological interpretation. OligoNet allows also an easy evaluation and interpretation of annotated peptides in systems biology, and is freely accessible at https://daniellyz200608105.shinyapps.io/OligoNet/ .
Year 2 Report: Protein Function Prediction Platform

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zhou, C E

2012-04-27

Upon completion of our second year of development in a 3-year development cycle, we have completed a prototype protein structure-function annotation and function prediction system: Protein Function Prediction (PFP) platform (v.0.5). We have met our milestones for Years 1 and 2 and are positioned to continue development in completion of our original statement of work, or a reasonable modification thereof, in service to DTRA Programs involved in diagnostics and medical countermeasures research and development. The PFP platform is a multi-scale computational modeling system for protein structure-function annotation and function prediction. As of this writing, PFP is the only existing fullymore » automated, high-throughput, multi-scale modeling, whole-proteome annotation platform, and represents a significant advance in the field of genome annotation (Fig. 1). PFP modules perform protein functional annotations at the sequence, systems biology, protein structure, and atomistic levels of biological complexity (Fig. 2). Because these approaches provide orthogonal means of characterizing proteins and suggesting protein function, PFP processing maximizes the protein functional information that can currently be gained by computational means. Comprehensive annotation of pathogen genomes is essential for bio-defense applications in pathogen characterization, threat assessment, and medical countermeasure design and development in that it can short-cut the time and effort required to select and characterize protein biomarkers.« less
Measuring semantic similarities by combining gene ontology annotations and gene co-function networks

DOE PAGES

Peng, Jiajie; Uygun, Sahra; Kim, Taehyong; ...

2015-02-14

Background: Gene Ontology (GO) has been used widely to study functional relationships between genes. The current semantic similarity measures rely only on GO annotations and GO structure. This limits the power of GO-based similarity because of the limited proportion of genes that are annotated to GO in most organisms. Results: We introduce a novel approach called NETSIM (network-based similarity measure) that incorporates information from gene co-function networks in addition to using the GO structure and annotations. Using metabolic reaction maps of yeast, Arabidopsis, and human, we demonstrate that NETSIM can improve the accuracy of GO term similarities. We also demonstratemore » that NETSIM works well even for genomes with sparser gene annotation data. We applied NETSIM on large Arabidopsis gene families such as cytochrome P450 monooxygenases to group the members functionally and show that this grouping could facilitate functional characterization of genes in these families. Conclusions: Using NETSIM as an example, we demonstrated that the performance of a semantic similarity measure could be significantly improved after incorporating genome-specific information. NETSIM incorporates both GO annotations and gene co-function network data as a priori knowledge in the model. Therefore, functional similarities of GO terms that are not explicitly encoded in GO but are relevant in a taxon-specific manner become measurable when GO annotations are limited.« less
The Schistosoma mansoni phylome: using evolutionary genomics to gain insight into a parasite's biology.

PubMed

Silva, Larissa Lopes; Marcet-Houben, Marina; Nahum, Laila Alves; Zerlotini, Adhemar; Gabaldón, Toni; Oliveira, Guilherme

2012-11-13

Schistosoma mansoni is one of the causative agents of schistosomiasis, a neglected tropical disease that affects about 237 million people worldwide. Despite recent efforts, we still lack a general understanding of the relevant host-parasite interactions, and the possible treatments are limited by the emergence of resistant strains and the absence of a vaccine. The S. mansoni genome was completely sequenced and still under continuous annotation. Nevertheless, more than 45% of the encoded proteins remain without experimental characterization or even functional prediction. To improve our knowledge regarding the biology of this parasite, we conducted a proteome-wide evolutionary analysis to provide a broad view of the S. mansoni's proteome evolution and to improve its functional annotation. Using a phylogenomic approach, we reconstructed the S. mansoni phylome, which comprises the evolutionary histories of all parasite proteins and their homologs across 12 other organisms. The analysis of a total of 7,964 phylogenies allowed a deeper understanding of genomic complexity and evolutionary adaptations to a parasitic lifestyle. In particular, the identification of lineage-specific gene duplications pointed to the diversification of several protein families that are relevant for host-parasite interaction, including proteases, tetraspanins, fucosyltransferases, venom allergen-like proteins, and tegumental-allergen-like proteins. In addition to the evolutionary knowledge, the phylome data enabled us to automatically re-annotate 3,451 proteins through a phylogenetic-based approach rather than solely sequence similarity searches. To allow further exploitation of this valuable data, all information has been made available at PhylomeDB (http://www.phylomedb.org). In this study, we used an evolutionary approach to assess S. mansoni parasite biology, improve genome/proteome functional annotation, and provide insights into host-parasite interactions. Taking advantage of a proteome-wide perspective rather than focusing on individual proteins, we identified that this parasite has experienced specific gene duplication events, particularly affecting genes that are potentially related to the parasitic lifestyle. These innovations may be related to the mechanisms that protect S. mansoni against host immune responses being important adaptations for the parasite survival in a potentially hostile environment. Continuing this work, a comparative analysis involving genomic, transcriptomic, and proteomic data from other helminth parasites, other parasites, and vectors will supply more information regarding parasite's biology as well as host-parasite interactions.
Functional Annotations of Paralogs: A Blessing and a Curse

PubMed Central

Zallot, Rémi; Harrison, Katherine J.; Kolaczkowski, Bryan; de Crécy-Lagard, Valérie

2016-01-01

Gene duplication followed by mutation is a classic mechanism of neofunctionalization, producing gene families with functional diversity. In some cases, a single point mutation is sufficient to change the substrate specificity and/or the chemistry performed by an enzyme, making it difficult to accurately separate enzymes with identical functions from homologs with different functions. Because sequence similarity is often used as a basis for assigning functional annotations to genes, non-isofunctional gene families pose a great challenge for genome annotation pipelines. Here we describe how integrating evolutionary and functional information such as genome context, phylogeny, metabolic reconstruction and signature motifs may be required to correctly annotate multifunctional families. These integrative analyses can also lead to the discovery of novel gene functions, as hints from specific subgroups can guide the functional characterization of other members of the family. We demonstrate how careful manual curation processes using comparative genomics can disambiguate subgroups within large multifunctional families and discover their functions. We present the COG0720 protein family as a case study. We also discuss strategies to automate this process to improve the accuracy of genome functional annotation pipelines. PMID:27618105
Software Tool for Researching Annotations of Proteins (STRAP): Open-Source Protein Annotation Software with Data Visualization

PubMed Central

Bhatia, Vivek N.; Perlman, David H.; Costello, Catherine E.; McComb, Mark E.

2009-01-01

In order that biological meaning may be derived and testable hypotheses may be built from proteomics experiments, assignments of proteins identified by mass spectrometry or other techniques must be supplemented with additional notation, such as information on known protein functions, protein-protein interactions, or biological pathway associations. Collecting, organizing, and interpreting this data often requires the input of experts in the biological field of study, in addition to the time-consuming search for and compilation of information from online protein databases. Furthermore, visualizing this bulk of information can be challenging due to the limited availability of easy-to-use and freely available tools for this process. In response to these constraints, we have undertaken the design of software to automate annotation and visualization of proteomics data in order to accelerate the pace of research. Here we present the Software Tool for Researching Annotations of Proteins (STRAP) – a user-friendly, open-source C# application. STRAP automatically obtains gene ontology (GO) terms associated with proteins in a proteomics results ID list using the freely accessible UniProtKB and EBI GOA databases. Summarized in an easy-to-navigate tabular format, STRAP includes meta-information on the protein in addition to complimentary GO terminology. Additionally, this information can be edited by the user so that in-house expertise on particular proteins may be integrated into the larger dataset. STRAP provides a sortable tabular view for all terms, as well as graphical representations of GO-term association data in pie (biological process, cellular component and molecular function) and bar charts (cross comparison of sample sets) to aid in the interpretation of large datasets and differential analyses experiments. Furthermore, proteins of interest may be exported as a unique FASTA-formatted file to allow for customizable re-searching of mass spectrometry data, and gene names corresponding to the proteins in the lists may be encoded in the Gaggle microformat for further characterization, including pathway analysis. STRAP, a tutorial, and the C# source code are freely available from http://cpctools.sourceforge.net. PMID:19839595
Development of Semi-Automatic Lathe by using Intelligent Soft Computing Technique

NASA Astrophysics Data System (ADS)

Sakthi, S.; Niresh, J.; Vignesh, K.; Anand Raj, G.

2018-03-01

This paper discusses the enhancement of conventional lathe machine to semi-automated lathe machine by implementing a soft computing method. In the present scenario, lathe machine plays a vital role in the engineering division of manufacturing industry. While the manual lathe machines are economical, the accuracy and efficiency are not up to the mark. On the other hand, CNC machine provide the desired accuracy and efficiency, but requires a huge capital. In order to over come this situation, a semi-automated approach towards the conventional lathe machine is developed by employing stepper motors to the horizontal and vertical drive, that can be controlled by Arduino UNO -microcontroller. Based on the input parameters of the lathe operation the arduino coding is been generated and transferred to the UNO board. Thus upgrading from manual to semi-automatic lathe machines can significantly increase the accuracy and efficiency while, at the same time, keeping a check on investment cost and consequently provide a much needed escalation to the manufacturing industry.
iLOG: A Framework for Automatic Annotation of Learning Objects with Empirical Usage Metadata

ERIC Educational Resources Information Center

Miller, L. D.; Soh, Leen-Kiat; Samal, Ashok; Nugent, Gwen

2012-01-01

Learning objects (LOs) are digital or non-digital entities used for learning, education or training commonly stored in repositories searchable by their associated metadata. Unfortunately, based on the current standards, such metadata is often missing or incorrectly entered making search difficult or impossible. In this paper, we investigate…
Semantic photo synthesis

NASA Astrophysics Data System (ADS)

Johnson, Matthew; Brostow, G. J.; Shotton, J.; Kwatra, V.; Cipolla, R.

2007-02-01

Composite images are synthesized from existing photographs by artists who make concept art, e.g. storyboards for movies or architectural planning. Current techniques allow an artist to fabricate such an image by digitally splicing parts of stock photographs. While these images serve mainly to "quickly" convey how a scene should look, their production is laborious. We propose a technique that allows a person to design a new photograph with substantially less effort. This paper presents a method that generates a composite image when a user types in nouns, such as "boat" and "sand." The artist can optionally design an intended image by specifying other constraints. Our algorithm formulates the constraints as queries to search an automatically annotated image database. The desired photograph, not a collage, is then synthesized using graph-cut optimization, optionally allowing for further user interaction to edit or choose among alternative generated photos. Our results demonstrate our contributions of (1) a method of creating specific images with minimal human effort, and (2) a combined algorithm for automatically building an image library with semantic annotations from any photo collection.
Virus Particle Detection by Convolutional Neural Network in Transmission Electron Microscopy Images.

PubMed

Ito, Eisuke; Sato, Takaaki; Sano, Daisuke; Utagawa, Etsuko; Kato, Tsuyoshi

2018-06-01

A new computational method for the detection of virus particles in transmission electron microscopy (TEM) images is presented. Our approach is to use a convolutional neural network that transforms a TEM image to a probabilistic map that indicates where virus particles exist in the image. Our proposed approach automatically and simultaneously learns both discriminative features and classifier for virus particle detection by machine learning, in contrast to existing methods that are based on handcrafted features that yield many false positives and require several postprocessing steps. The detection performance of the proposed method was assessed against a dataset of TEM images containing feline calicivirus particles and compared with several existing detection methods, and the state-of-the-art performance of the developed method for detecting virus was demonstrated. Since our method is based on supervised learning that requires both the input images and their corresponding annotations, it is basically used for detection of already-known viruses. However, the method is highly flexible, and the convolutional networks can adapt themselves to any virus particles by learning automatically from an annotated dataset.
Automatic cerebrospinal fluid segmentation in non-contrast CT images using a 3D convolutional network

NASA Astrophysics Data System (ADS)

Patel, Ajay; van de Leemput, Sil C.; Prokop, Mathias; van Ginneken, Bram; Manniesing, Rashindra

2017-03-01

Segmentation of anatomical structures is fundamental in the development of computer aided diagnosis systems for cerebral pathologies. Manual annotations are laborious, time consuming and subject to human error and observer variability. Accurate quantification of cerebrospinal fluid (CSF) can be employed as a morphometric measure for diagnosis and patient outcome prediction. However, segmenting CSF in non-contrast CT images is complicated by low soft tissue contrast and image noise. In this paper we propose a state-of-the-art method using a multi-scale three-dimensional (3D) fully convolutional neural network (CNN) to automatically segment all CSF within the cranial cavity. The method is trained on a small dataset comprised of four manually annotated cerebral CT images. Quantitative evaluation of a separate test dataset of four images shows a mean Dice similarity coefficient of 0.87 +/- 0.01 and mean absolute volume difference of 4.77 +/- 2.70 %. The average prediction time was 68 seconds. Our method allows for fast and fully automated 3D segmentation of cerebral CSF in non-contrast CT, and shows promising results despite a limited amount of training data.
Automatic classification of radiological reports for clinical care.

PubMed

Gerevini, Alfonso Emilio; Lavelli, Alberto; Maffi, Alessandro; Maroldi, Roberto; Minard, Anne-Lyse; Serina, Ivan; Squassina, Guido

2018-06-07

Radiological reporting generates a large amount of free-text clinical narratives, a potentially valuable source of information for improving clinical care and supporting research. The use of automatic techniques to analyze such reports is necessary to make their content effectively available to radiologists in an aggregated form. In this paper we focus on the classification of chest computed tomography reports according to a classification schema proposed for this task by radiologists of the Italian hospital ASST Spedali Civili di Brescia. The proposed system is built exploiting a training data set containing reports annotated by radiologists. Each report is classified according to the schema developed by radiologists and textual evidences are marked in the report. The annotations are then used to train different machine learning based classifiers. We present in this paper a method based on a cascade of classifiers which make use of a set of syntactic and semantic features. The resulting system is a novel hierarchical classification system for the given task, that we have experimentally evaluated. Copyright © 2018 Elsevier B.V. All rights reserved.
Automatic detection of apical roots in oral radiographs

NASA Astrophysics Data System (ADS)

Wu, Yi; Xie, Fangfang; Yang, Jie; Cheng, Erkang; Megalooikonomou, Vasileios; Ling, Haibin

2012-03-01

The apical root regions play an important role in analysis and diagnosis of many oral diseases. Automatic detection of such regions is consequently the first step toward computer-aided diagnosis of these diseases. In this paper we propose an automatic method for periapical root region detection by using the state-of-theart machine learning approaches. Specifically, we have adapted the AdaBoost classifier for apical root detection. One challenge in the task is the lack of training cases especially for diseased ones. To handle this problem, we boost the training set by including more root regions that are close to the annotated ones and decompose the original images to randomly generate negative samples. Based on these training samples, the Adaboost algorithm in combination with Haar wavelets is utilized in this task to train an apical root detector. The learned detector usually generates a large amount of true and false positives. In order to reduce the number of false positives, a confidence score for each candidate detection result is calculated for further purification. We first merge the detected regions by combining tightly overlapped detected candidate regions and then we use the confidence scores from the Adaboost detector to eliminate the false positives. The proposed method is evaluated on a dataset containing 39 annotated digitized oral X-Ray images from 21 patients. The experimental results show that our approach can achieve promising detection accuracy.
Automatic Summarization of MEDLINE Citations for Evidence–Based Medical Treatment: A Topic-Oriented Evaluation

PubMed Central

Fiszman, Marcelo; Demner-Fushman, Dina; Kilicoglu, Halil; Rindflesch, Thomas C.

2009-01-01

As the number of electronic biomedical textual resources increases, it becomes harder for physicians to find useful answers at the point of care. Information retrieval applications provide access to databases; however, little research has been done on using automatic summarization to help navigate the documents returned by these systems. After presenting a semantic abstraction automatic summarization system for MEDLINE citations, we concentrate on evaluating its ability to identify useful drug interventions for fifty-three diseases. The evaluation methodology uses existing sources of evidence-based medicine as surrogates for a physician-annotated reference standard. Mean average precision (MAP) and a clinical usefulness score developed for this study were computed as performance metrics. The automatic summarization system significantly outperformed the baseline in both metrics. The MAP gain was 0.17 (p < 0.01) and the increase in the overall score of clinical usefulness was 0.39 (p < 0.05). PMID:19022398
Semi-automatic spray pyrolysis deposition of thin, transparent, titania films as blocking layers for dye-sensitized and perovskite solar cells.

PubMed

Krýsová, Hana; Krýsa, Josef; Kavan, Ladislav

2018-01-01

For proper function of the negative electrode of dye-sensitized and perovskite solar cells, the deposition of a nonporous blocking film is required on the surface of F-doped SnO 2 (FTO) glass substrates. Such a blocking film can minimise undesirable parasitic processes, for example, the back reaction of photoinjected electrons with the oxidized form of the redox mediator or with the hole-transporting medium can be avoided. In the present work, thin, transparent, blocking TiO 2 films are prepared by semi-automatic spray pyrolysis of precursors consisting of titanium diisopropoxide bis(acetylacetonate) as the main component. The variation in the layer thickness of the sprayed films is achieved by varying the number of spray cycles. The parameters investigated in this work were deposition temperature (150, 300 and 450 °C), number of spray cycles (20-200), precursor composition (with/without deliberately added acetylacetone), concentration (0.05 and 0.2 M) and subsequent post-calcination at 500 °C. The photo-electrochemical properties were evaluated in aqueous electrolyte solution under UV irradiation. The blocking properties were tested by cyclic voltammetry with a model redox probe with a simple one-electron-transfer reaction. Semi-automatic spraying resulted in the formation of transparent, homogeneous, TiO 2 films, and the technique allows for easy upscaling to large electrode areas. The deposition temperature of 450 °C was necessary for the fabrication of highly photoactive TiO 2 films. The blocking properties of the as-deposited TiO 2 films (at 450 °C) were impaired by post-calcination at 500 °C, but this problem could be addressed by increasing the number of spray cycles. The modification of the precursor by adding acetylacetone resulted in the fabrication of TiO 2 films exhibiting perfect blocking properties that were not influenced by post-calcination. These results will surely find use in the fabrication of large-scale dye-sensitized and perovskite solar cells.
Robust extraction of the aorta and pulmonary artery from 3D MDCT image data

NASA Astrophysics Data System (ADS)

Taeprasartsit, Pinyo; Higgins, William E.

2010-03-01

Accurate definition of the aorta and pulmonary artery from three-dimensional (3D) multi-detector CT (MDCT) images is important for pulmonary applications. This work presents robust methods for defining the aorta and pulmonary artery in the central chest. The methods work on both contrast enhanced and no-contrast 3D MDCT image data. The automatic methods use a common approach employing model fitting and selection and adaptive refinement. During the occasional event that more precise vascular extraction is desired or the method fails, we also have an alternate semi-automatic fail-safe method. The semi-automatic method extracts the vasculature by extending the medial axes into a user-guided direction. A ground-truth study over a series of 40 human 3D MDCT images demonstrates the efficacy, accuracy, robustness, and efficiency of the methods.
Semi-automatic motion compensation of contrast-enhanced ultrasound images from abdominal organs for perfusion analysis.

PubMed

Schäfer, Sebastian; Nylund, Kim; Sævik, Fredrik; Engjom, Trond; Mézl, Martin; Jiřík, Radovan; Dimcevski, Georg; Gilja, Odd Helge; Tönnies, Klaus

2015-08-01

This paper presents a system for correcting motion influences in time-dependent 2D contrast-enhanced ultrasound (CEUS) images to assess tissue perfusion characteristics. The system consists of a semi-automatic frame selection method to find images with out-of-plane motion as well as a method for automatic motion compensation. Translational and non-rigid motion compensation is applied by introducing a temporal continuity assumption. A study consisting of 40 clinical datasets was conducted to compare the perfusion with simulated perfusion using pharmacokinetic modeling. Overall, the proposed approach decreased the mean average difference between the measured perfusion and the pharmacokinetic model estimation. It was non-inferior for three out of four patient cohorts to a manual approach and reduced the analysis time by 41% compared to manual processing. Copyright © 2014 Elsevier Ltd. All rights reserved.
Draft genome of the gayal, Bos frontalis

PubMed Central

Wang, Ming-Shan; Zeng, Yan; Wang, Xiao; Nie, Wen-Hui; Wang, Jin-Huan; Su, Wei-Ting; Xiong, Zi-Jun; Wang, Sheng; Qu, Kai-Xing; Yan, Shou-Qing; Yang, Min-Min; Wang, Wen; Dong, Yang; Zhang, Ya-Ping

2017-01-01

Abstract Gayal (Bos frontalis), also known as mithan or mithun, is a large endangered semi-domesticated bovine that has a limited geographical distribution in the hill-forests of China, Northeast India, Bangladesh, Myanmar, and Bhutan. Many questions about the gayal such as its origin, population history, and genetic basis of local adaptation remain largely unresolved. De novo sequencing and assembly of the whole gayal genome provides an opportunity to address these issues. We report a high-depth sequencing, de novo assembly, and annotation of a female Chinese gayal genome. Based on the Illumina genomic sequencing platform, we have generated 350.38 Gb of raw data from 16 different insert-size libraries. A total of 276.86 Gb of clean data is retained after quality control. The assembled genome is about 2.85 Gb with scaffold and contig N50 sizes of 2.74 Mb and 14.41 kb, respectively. Repetitive elements account for 48.13% of the genome. Gene annotation has yielded 26 667 protein-coding genes, of which 97.18% have been functionally annotated. BUSCO assessment shows that our assembly captures 93% (3183 of 4104) of the core eukaryotic genes and 83.1% of vertebrate universal single-copy orthologs. We provide the first comprehensive de novo genome of the gayal. This genetic resource is integral for investigating the origin of the gayal and performing comparative genomic studies to improve understanding of the speciation and divergence of bovine species. The assembled genome could be used as reference in future population genetic studies of gayal. PMID:29048483

Protein Sequence Annotation Tool (PSAT): A centralized web-based meta-server for high-throughput sequence annotations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Leung, Elo; Huang, Amy; Cadag, Eithon

In this study, we introduce the Protein Sequence Annotation Tool (PSAT), a web-based, sequence annotation meta-server for performing integrated, high-throughput, genome-wide sequence analyses. Our goals in building PSAT were to (1) create an extensible platform for integration of multiple sequence-based bioinformatics tools, (2) enable functional annotations and enzyme predictions over large input protein fasta data sets, and (3) provide a web interface for convenient execution of the tools. In this paper, we demonstrate the utility of PSAT by annotating the predicted peptide gene products of Herbaspirillum sp. strain RV1423, importing the results of PSAT into EC2KEGG, and using the resultingmore » functional comparisons to identify a putative catabolic pathway, thereby distinguishing RV1423 from a well annotated Herbaspirillum species. This analysis demonstrates that high-throughput enzyme predictions, provided by PSAT processing, can be used to identify metabolic potential in an otherwise poorly annotated genome. Lastly, PSAT is a meta server that combines the results from several sequence-based annotation and function prediction codes, and is available at http://psat.llnl.gov/psat/. PSAT stands apart from other sequencebased genome annotation systems in providing a high-throughput platform for rapid de novo enzyme predictions and sequence annotations over large input protein sequence data sets in FASTA. PSAT is most appropriately applied in annotation of large protein FASTA sets that may or may not be associated with a single genome.« less
Protein Sequence Annotation Tool (PSAT): A centralized web-based meta-server for high-throughput sequence annotations

DOE PAGES

Leung, Elo; Huang, Amy; Cadag, Eithon; ...

2016-01-20

In this study, we introduce the Protein Sequence Annotation Tool (PSAT), a web-based, sequence annotation meta-server for performing integrated, high-throughput, genome-wide sequence analyses. Our goals in building PSAT were to (1) create an extensible platform for integration of multiple sequence-based bioinformatics tools, (2) enable functional annotations and enzyme predictions over large input protein fasta data sets, and (3) provide a web interface for convenient execution of the tools. In this paper, we demonstrate the utility of PSAT by annotating the predicted peptide gene products of Herbaspirillum sp. strain RV1423, importing the results of PSAT into EC2KEGG, and using the resultingmore » functional comparisons to identify a putative catabolic pathway, thereby distinguishing RV1423 from a well annotated Herbaspirillum species. This analysis demonstrates that high-throughput enzyme predictions, provided by PSAT processing, can be used to identify metabolic potential in an otherwise poorly annotated genome. Lastly, PSAT is a meta server that combines the results from several sequence-based annotation and function prediction codes, and is available at http://psat.llnl.gov/psat/. PSAT stands apart from other sequencebased genome annotation systems in providing a high-throughput platform for rapid de novo enzyme predictions and sequence annotations over large input protein sequence data sets in FASTA. PSAT is most appropriately applied in annotation of large protein FASTA sets that may or may not be associated with a single genome.« less
Automatic tissue image segmentation based on image processing and deep learning

NASA Astrophysics Data System (ADS)

Kong, Zhenglun; Luo, Junyi; Xu, Shengpu; Li, Ting

2018-02-01

Image segmentation plays an important role in multimodality imaging, especially in fusion structural images offered by CT, MRI with functional images collected by optical technologies or other novel imaging technologies. Plus, image segmentation also provides detailed structure description for quantitative visualization of treating light distribution in the human body when incorporated with 3D light transport simulation method. Here we used image enhancement, operators, and morphometry methods to extract the accurate contours of different tissues such as skull, cerebrospinal fluid (CSF), grey matter (GM) and white matter (WM) on 5 fMRI head image datasets. Then we utilized convolutional neural network to realize automatic segmentation of images in a deep learning way. We also introduced parallel computing. Such approaches greatly reduced the processing time compared to manual and semi-automatic segmentation and is of great importance in improving speed and accuracy as more and more samples being learned. Our results can be used as a criteria when diagnosing diseases such as cerebral atrophy, which is caused by pathological changes in gray matter or white matter. We demonstrated the great potential of such image processing and deep leaning combined automatic tissue image segmentation in personalized medicine, especially in monitoring, and treatments.
Framework for automatic information extraction from research papers on nanocrystal devices

PubMed Central

Yoshioka, Masaharu; Hara, Shinjiro; Newton, Marcus C

2015-01-01

Summary To support nanocrystal device development, we have been working on a computational framework to utilize information in research papers on nanocrystal devices. We developed an annotated corpus called “ NaDev” (Nanocrystal Device Development) for this purpose. We also proposed an automatic information extraction system called “NaDevEx” (Nanocrystal Device Automatic Information Extraction Framework). NaDevEx aims at extracting information from research papers on nanocrystal devices using the NaDev corpus and machine-learning techniques. However, the characteristics of NaDevEx were not examined in detail. In this paper, we conduct system evaluation experiments for NaDevEx using the NaDev corpus. We discuss three main issues: system performance, compared with human annotators; the effect of paper type (synthesis or characterization) on system performance; and the effects of domain knowledge features (e.g., a chemical named entity recognition system and list of names of physical quantities) on system performance. We found that overall system performance was 89% in precision and 69% in recall. If we consider identification of terms that intersect with correct terms for the same information category as the correct identification, i.e., loose agreement (in many cases, we can find that appropriate head nouns such as temperature or pressure loosely match between two terms), the overall performance is 95% in precision and 74% in recall. The system performance is almost comparable with results of human annotators for information categories with rich domain knowledge information (source material). However, for other information categories, given the relatively large number of terms that exist only in one paper, recall of individual information categories is not high (39–73%); however, precision is better (75–97%). The average performance for synthesis papers is better than that for characterization papers because of the lack of training examples for characterization papers. Based on these results, we discuss future research plans for improving the performance of the system. PMID:26665057
Design and pilot evaluation of a system to develop computer-based site-specific practice guidelines from decision models.

PubMed

Sanders, G D; Nease, R F; Owens, D K

2000-01-01

Local tailoring of clinical practice guidelines (CPGs) requires experts in medicine and evidence synthesis unavailable in many practice settings. The authors' computer-based system enables developers and users to create, disseminate, and tailor CPGs, using normative decision models (DMs). ALCHEMIST, a web-based system, analyzes a DM, creates a CPG in the form of an annotated algorithm, and displays for the guideline user the optimal strategy. ALCHEMIST'S interface enables remote users to tailor the guideline by changing underlying input variables and observing the new annotated algorithm that is developed automatically. In a pilot evaluation of the system, a DM was used to evaluate strategies for staging non-small-cell lung cancer. Subjects (n = 15) compared the automatically created CPG with published guidelines for this staging and critiqued both using a previously developed instrument to rate the CPGs' usability, accountability, and accuracy on a scale of 0 (worst) to 2 (best), with higher scores reflecting higher quality. The mean overall score for the ALCHEMIST CPG was 1.502, compared with the published-CPG score of 0.987 (p = 0.002). The ALCHEMIST CPG scores for usability, accountability, and accuracy were 1.683, 1.393, and 1.430, respectively; the published CPG scores were 1.192, 0.941, and 0.830 (each comparison p < 0.05). On a scale of 1 (worst) to 5 (best), users' mean ratings of ALCHEMIST'S ease of use, usefulness of content, and presentation format were 4.76, 3.98, and 4.64, respectively. The results demonstrate the feasibility of a web-based system that automatically analyzes a DM and creates a CPG as an annotated algorithm, enabling remote users to develop site-specific CPGs. In the pilot evaluation, the ALCHEMIST guidelines met established criteria for quality and compared favorably with national CPGs. The high usability and usefulness ratings suggest that such systems can be a good tool for guideline development.
Framework for automatic information extraction from research papers on nanocrystal devices.

PubMed

Dieb, Thaer M; Yoshioka, Masaharu; Hara, Shinjiro; Newton, Marcus C

2015-01-01

To support nanocrystal device development, we have been working on a computational framework to utilize information in research papers on nanocrystal devices. We developed an annotated corpus called " NaDev" (Nanocrystal Device Development) for this purpose. We also proposed an automatic information extraction system called "NaDevEx" (Nanocrystal Device Automatic Information Extraction Framework). NaDevEx aims at extracting information from research papers on nanocrystal devices using the NaDev corpus and machine-learning techniques. However, the characteristics of NaDevEx were not examined in detail. In this paper, we conduct system evaluation experiments for NaDevEx using the NaDev corpus. We discuss three main issues: system performance, compared with human annotators; the effect of paper type (synthesis or characterization) on system performance; and the effects of domain knowledge features (e.g., a chemical named entity recognition system and list of names of physical quantities) on system performance. We found that overall system performance was 89% in precision and 69% in recall. If we consider identification of terms that intersect with correct terms for the same information category as the correct identification, i.e., loose agreement (in many cases, we can find that appropriate head nouns such as temperature or pressure loosely match between two terms), the overall performance is 95% in precision and 74% in recall. The system performance is almost comparable with results of human annotators for information categories with rich domain knowledge information (source material). However, for other information categories, given the relatively large number of terms that exist only in one paper, recall of individual information categories is not high (39-73%); however, precision is better (75-97%). The average performance for synthesis papers is better than that for characterization papers because of the lack of training examples for characterization papers. Based on these results, we discuss future research plans for improving the performance of the system.
AgBase: supporting functional modeling in agricultural organisms

PubMed Central

McCarthy, Fiona M.; Gresham, Cathy R.; Buza, Teresia J.; Chouvarine, Philippe; Pillai, Lakshmi R.; Kumar, Ranjit; Ozkan, Seval; Wang, Hui; Manda, Prashanti; Arick, Tony; Bridges, Susan M.; Burgess, Shane C.

2011-01-01

AgBase (http://www.agbase.msstate.edu/) provides resources to facilitate modeling of functional genomics data and structural and functional annotation of agriculturally important animal, plant, microbe and parasite genomes. The website is redesigned to improve accessibility and ease of use, including improved search capabilities. Expanded capabilities include new dedicated pages for horse, cat, dog, cotton, rice and soybean. We currently provide 590 240 Gene Ontology (GO) annotations to 105 454 gene products in 64 different species, including GO annotations linked to transcripts represented on agricultural microarrays. For many of these arrays, this provides the only functional annotation available. GO annotations are available for download and we provide comprehensive, species-specific GO annotation files for 18 different organisms. The tools available at AgBase have been expanded and several existing tools improved based upon user feedback. One of seven new tools available at AgBase, GOModeler, supports hypothesis testing from functional genomics data. We host several associated databases and provide genome browsers for three agricultural pathogens. Moreover, we provide comprehensive training resources (including worked examples and tutorials) via links to Educational Resources at the AgBase website. PMID:21075795
Validating automatic semantic annotation of anatomy in DICOM CT images

NASA Astrophysics Data System (ADS)

Pathak, Sayan D.; Criminisi, Antonio; Shotton, Jamie; White, Steve; Robertson, Duncan; Sparks, Bobbi; Munasinghe, Indeera; Siddiqui, Khan

2011-03-01

In the current health-care environment, the time available for physicians to browse patients' scans is shrinking due to the rapid increase in the sheer number of images. This is further aggravated by mounting pressure to become more productive in the face of decreasing reimbursement. Hence, there is an urgent need to deliver technology which enables faster and effortless navigation through sub-volume image visualizations. Annotating image regions with semantic labels such as those derived from the RADLEX ontology can vastly enhance image navigation and sub-volume visualization. This paper uses random regression forests for efficient, automatic detection and localization of anatomical structures within DICOM 3D CT scans. A regression forest is a collection of decision trees which are trained to achieve direct mapping from voxels to organ location and size in a single pass. This paper focuses on comparing automated labeling with expert-annotated ground-truth results on a database of 50 highly variable CT scans. Initial investigations show that regression forest derived localization errors are smaller and more robust than those achieved by state-of-the-art global registration approaches. The simplicity of the algorithm's context-rich visual features yield typical runtimes of less than 10 seconds for a 5123 voxel DICOM CT series on a single-threaded, single-core machine running multiple trees; each tree taking less than a second. Furthermore, qualitative evaluation demonstrates that using the detected organs' locations as index into the image volume improves the efficiency of the navigational workflow in all the CT studies.
Measuring Patient Mobility in the ICU Using a Novel Noninvasive Sensor.

PubMed

Ma, Andy J; Rawat, Nishi; Reiter, Austin; Shrock, Christine; Zhan, Andong; Stone, Alex; Rabiee, Anahita; Griffin, Stephanie; Needham, Dale M; Saria, Suchi

2017-04-01

To develop and validate a noninvasive mobility sensor to automatically and continuously detect and measure patient mobility in the ICU. Prospective, observational study. Surgical ICU at an academic hospital. Three hundred sixty-two hours of sensor color and depth image data were recorded and curated into 109 segments, each containing 1,000 images, from eight patients. None. Three Microsoft Kinect sensors (Microsoft, Beijing, China) were deployed in one ICU room to collect continuous patient mobility data. We developed software that automatically analyzes the sensor data to measure mobility and assign the highest level within a time period. To characterize the highest mobility level, a validated 11-point mobility scale was collapsed into four categories: nothing in bed, in-bed activity, out-of-bed activity, and walking. Of the 109 sensor segments, the noninvasive mobility sensor was developed using 26 of these from three ICU patients and validated on 83 remaining segments from five different patients. Three physicians annotated each segment for the highest mobility level. The weighted Kappa (κ) statistic for agreement between automated noninvasive mobility sensor output versus manual physician annotation was 0.86 (95% CI, 0.72-1.00). Disagreement primarily occurred in the "nothing in bed" versus "in-bed activity" categories because "the sensor assessed movement continuously," which was significantly more sensitive to motion than physician annotations using a discrete manual scale. Noninvasive mobility sensor is a novel and feasible method for automating evaluation of ICU patient mobility.
Self-calibrating models for dynamic monitoring and diagnosis

NASA Technical Reports Server (NTRS)

Kuipers, Benjamin

1996-01-01

A method for automatically building qualitative and semi-quantitative models of dynamic systems, and using them for monitoring and fault diagnosis, is developed and demonstrated. The qualitative approach and semi-quantitative method are applied to monitoring observation streams, and to design of non-linear control systems.
APPRIS: annotation of principal and alternative splice isoforms

PubMed Central

Rodriguez, Jose Manuel; Maietta, Paolo; Ezkurdia, Iakes; Pietrelli, Alessandro; Wesselink, Jan-Jaap; Lopez, Gonzalo; Valencia, Alfonso; Tress, Michael L.

2013-01-01

Here, we present APPRIS (http://appris.bioinfo.cnio.es), a database that houses annotations of human splice isoforms. APPRIS has been designed to provide value to manual annotations of the human genome by adding reliable protein structural and functional data and information from cross-species conservation. The visual representation of the annotations provided by APPRIS for each gene allows annotators and researchers alike to easily identify functional changes brought about by splicing events. In addition to collecting, integrating and analyzing reliable predictions of the effect of splicing events, APPRIS also selects a single reference sequence for each gene, here termed the principal isoform, based on the annotations of structure, function and conservation for each transcript. APPRIS identifies a principal isoform for 85% of the protein-coding genes in the GENCODE 7 release for ENSEMBL. Analysis of the APPRIS data shows that at least 70% of the alternative (non-principal) variants would lose important functional or structural information relative to the principal isoform. PMID:23161672
Hierarchical semantic cognition for urban functional zones with VHR satellite images and POI data

NASA Astrophysics Data System (ADS)

Zhang, Xiuyuan; Du, Shihong; Wang, Qiao

2017-10-01

As the basic units of urban areas, functional zones are essential for city planning and management, but functional-zone maps are hardly available in most cities, as traditional urban investigations focus mainly on land-cover objects instead of functional zones. As a result, an automatic/semi-automatic method for mapping urban functional zones is highly required. Hierarchical semantic cognition (HSC) is presented in this study, and serves as a general cognition structure for recognizing urban functional zones. Unlike traditional classification methods, the HSC relies on geographic cognition and considers four semantic layers, i.e., visual features, object categories, spatial object patterns, and zone functions, as well as their hierarchical relations. Here, we used HSC to classify functional zones in Beijing with a very-high-resolution (VHR) satellite image and point-of-interest (POI) data. Experimental results indicate that this method can produce more accurate results than Support Vector Machine (SVM) and Latent Dirichlet Allocation (LDA) with a larger overall accuracy of 90.8%. Additionally, the contributions of diverse semantic layers are quantified: the object-category layer is the most important and makes 54% contribution to functional-zone classification; while, other semantic layers are less important but their contributions cannot be ignored. Consequently, the presented HSC is effective in classifying urban functional zones, and can further support urban planning and management.
Electrically evoked compound action potentials are different depending on the site of cochlear stimulation.

PubMed

van de Heyning, Paul; Arauz, Santiago L; Atlas, Marcus; Baumgartner, Wolf-Dieter; Caversaccio, Marco; Chester-Browne, Ronel; Estienne, Patricia; Gavilan, Javier; Godey, Benoit; Gstöttner, Wolfgang; Han, Demin; Hagen, Rudolph; Kompis, Martin; Kuzovkov, Vlad; Lassaletta, Luis; Lefevre, Franc; Li, Yongxin; Müller, Joachim; Parnes, Lorne; Kleine Punte, Andrea; Raine, Christopher; Rajan, Gunesh; Rivas, Adriana; Rivas, José Antonio; Royle, Nicola; Sprinzl, Georg; Stephan, Kurt; Walkowiak, Adam; Yanov, Yuri; Zimmermann, Kim; Zorowka, Patrick; Skarzynski, Henryk

2016-11-01

One of the many parameters that can affect cochlear implant (CI) users' performance is the site of presentation of electrical stimulation, from the CI, to the auditory nerve. Evoked compound action potential (ECAP) measurements are commonly used to verify nerve function by stimulating one electrode contact in the cochlea and recording the resulting action potentials on the other contacts of the electrode array. The present study aimed to determine if the ECAP amplitude differs between the apical, middle, and basal region of the cochlea, if double peak potentials were more likely in the apex than the basal region of the cochlea, and if there were differences in the ECAP threshold and recovery function across the cochlea. ECAP measurements were performed in the apical, middle, and basal region of the cochlea at fixed sites of stimulation with varying recording electrodes. One hundred and forty one adult subjects with severe to profound sensorineural hearing loss fitted with a Standard or FLEX SOFT electrode were included in this study. ECAP responses were captured using MAESTRO System Software (MED-EL). The ECAP amplitude, threshold, and slope were determined using amplitude growth sequences. The 50% recovery rate was assessed using independent single sequences that have two stimulation pulses (a masker and a probe pulse) separated by a variable inter-pulse interval. For all recordings, ECAP peaks were annotated semi-automatically. ECAP amplitudes were greater upon stimulation of the apical region compared to the basal region of the cochlea. ECAP slopes were steeper in the apical region compared to the basal region of the cochlea and ECAP thresholds were lower in the middle region compared to the basal region of the cochlea. The incidence of double peaks was greater upon stimulation of the apical region compared to the basal region of the cochlea. This data indicates that the site and intensity of cochlear stimulation affect ECAP properties.
Very Portable Remote Automatic Weather Stations

Treesearch

John R. Warren

1987-01-01

Remote Automatic Weather Stations (RAWS) were introduced to Forest Service and Bureau of Land Management field units in 1978 following development, test, and evaluation activities conducted jointly by the two agencies. The original configuration was designed for semi-permanent installation. Subsequently, a need for a more portable RAWS was expressed, and one was...
Research in Automatic Russian-English Scientific and Technical Lexicography. Final Report.

ERIC Educational Resources Information Center

Wayne State Univ., Detroit, MI.

Techniques of reversing English-Russian scientific and technical dictionaries into Russian-English versions through semi-automated compilation are described. Sections on manual and automatic processing discuss pre- and post-editing, the task program, updater (correction of errors and revision by specialist in a given field), the system employed…
Automated generation of patient-tailored electronic care pathways by translating computer-interpretable guidelines into hierarchical task networks.

PubMed

González-Ferrer, Arturo; ten Teije, Annette; Fdez-Olivares, Juan; Milian, Krystyna

2013-02-01

This paper describes a methodology which enables computer-aided support for the planning, visualization and execution of personalized patient treatments in a specific healthcare process, taking into account complex temporal constraints and the allocation of institutional resources. To this end, a translation from a time-annotated computer-interpretable guideline (CIG) model of a clinical protocol into a temporal hierarchical task network (HTN) planning domain is presented. The proposed method uses a knowledge-driven reasoning process to translate knowledge previously described in a CIG into a corresponding HTN Planning and Scheduling domain, taking advantage of HTNs known ability to (i) dynamically cope with temporal and resource constraints, and (ii) automatically generate customized plans. The proposed method, focusing on the representation of temporal knowledge and based on the identification of workflow and temporal patterns in a CIG, makes it possible to automatically generate time-annotated and resource-based care pathways tailored to the needs of any possible patient profile. The proposed translation is illustrated through a case study based on a 70 pages long clinical protocol to manage Hodgkin's disease, developed by the Spanish Society of Pediatric Oncology. We show that an HTN planning domain can be generated from the corresponding specification of the protocol in the Asbru language, providing a running example of this translation. Furthermore, the correctness of the translation is checked and also the management of ten different types of temporal patterns represented in the protocol. By interpreting the automatically generated domain with a state-of-art HTN planner, a time-annotated care pathway is automatically obtained, customized for the patient's and institutional needs. The generated care pathway can then be used by clinicians to plan and manage the patients long-term care. The described methodology makes it possible to automatically generate patient-tailored care pathways, leveraging an incremental knowledge-driven engineering process that starts from the expert knowledge of medical professionals. The presented approach makes the most of the strengths inherent in both CIG languages and HTN planning and scheduling techniques: for the former, knowledge acquisition and representation of the original clinical protocol, and for the latter, knowledge reasoning capabilities and an ability to deal with complex temporal and resource constraints. Moreover, the proposed approach provides immediate access to technologies such as business process management (BPM) tools, which are increasingly being used to support healthcare processes. Copyright © 2012 Elsevier B.V. All rights reserved.
Automated and Accurate Estimation of Gene Family Abundance from Shotgun Metagenomes

PubMed Central

Nayfach, Stephen; Bradley, Patrick H.; Wyman, Stacia K.; Laurent, Timothy J.; Williams, Alex; Eisen, Jonathan A.; Pollard, Katherine S.; Sharpton, Thomas J.

2015-01-01

Shotgun metagenomic DNA sequencing is a widely applicable tool for characterizing the functions that are encoded by microbial communities. Several bioinformatic tools can be used to functionally annotate metagenomes, allowing researchers to draw inferences about the functional potential of the community and to identify putative functional biomarkers. However, little is known about how decisions made during annotation affect the reliability of the results. Here, we use statistical simulations to rigorously assess how to optimize annotation accuracy and speed, given parameters of the input data like read length and library size. We identify best practices in metagenome annotation and use them to guide the development of the Shotgun Metagenome Annotation Pipeline (ShotMAP). ShotMAP is an analytically flexible, end-to-end annotation pipeline that can be implemented either on a local computer or a cloud compute cluster. We use ShotMAP to assess how different annotation databases impact the interpretation of how marine metagenome and metatranscriptome functional capacity changes across seasons. We also apply ShotMAP to data obtained from a clinical microbiome investigation of inflammatory bowel disease. This analysis finds that gut microbiota collected from Crohn’s disease patients are functionally distinct from gut microbiota collected from either ulcerative colitis patients or healthy controls, with differential abundance of metabolic pathways related to host-microbiome interactions that may serve as putative biomarkers of disease. PMID:26565399
A practical data processing workflow for multi-OMICS projects.

PubMed

Kohl, Michael; Megger, Dominik A; Trippler, Martin; Meckel, Hagen; Ahrens, Maike; Bracht, Thilo; Weber, Frank; Hoffmann, Andreas-Claudius; Baba, Hideo A; Sitek, Barbara; Schlaak, Jörg F; Meyer, Helmut E; Stephan, Christian; Eisenacher, Martin

2014-01-01

Multi-OMICS approaches aim on the integration of quantitative data obtained for different biological molecules in order to understand their interrelation and the functioning of larger systems. This paper deals with several data integration and data processing issues that frequently occur within this context. To this end, the data processing workflow within the PROFILE project is presented, a multi-OMICS project that aims on identification of novel biomarkers and the development of new therapeutic targets for seven important liver diseases. Furthermore, a software called CrossPlatformCommander is sketched, which facilitates several steps of the proposed workflow in a semi-automatic manner. Application of the software is presented for the detection of novel biomarkers, their ranking and annotation with existing knowledge using the example of corresponding Transcriptomics and Proteomics data sets obtained from patients suffering from hepatocellular carcinoma. Additionally, a linear regression analysis of Transcriptomics vs. Proteomics data is presented and its performance assessed. It was shown, that for capturing profound relations between Transcriptomics and Proteomics data, a simple linear regression analysis is not sufficient and implementation and evaluation of alternative statistical approaches are needed. Additionally, the integration of multivariate variable selection and classification approaches is intended for further development of the software. Although this paper focuses only on the combination of data obtained from quantitative Proteomics and Transcriptomics experiments, several approaches and data integration steps are also applicable for other OMICS technologies. Keeping specific restrictions in mind the suggested workflow (or at least parts of it) may be used as a template for similar projects that make use of different high throughput techniques. This article is part of a Special Issue entitled: Computational Proteomics in the Post-Identification Era. Guest Editors: Martin Eisenacher and Christian Stephan. Copyright © 2013 Elsevier B.V. All rights reserved.
SemanticSCo: A platform to support the semantic composition of services for gene expression analysis.

PubMed

Guardia, Gabriela D A; Ferreira Pires, Luís; da Silva, Eduardo G; de Farias, Cléver R G

2017-02-01

Gene expression studies often require the combined use of a number of analysis tools. However, manual integration of analysis tools can be cumbersome and error prone. To support a higher level of automation in the integration process, efforts have been made in the biomedical domain towards the development of semantic web services and supporting composition environments. Yet, most environments consider only the execution of simple service behaviours and requires users to focus on technical details of the composition process. We propose a novel approach to the semantic composition of gene expression analysis services that addresses the shortcomings of the existing solutions. Our approach includes an architecture designed to support the service composition process for gene expression analysis, and a flexible strategy for the (semi) automatic composition of semantic web services. Finally, we implement a supporting platform called SemanticSCo to realize the proposed composition approach and demonstrate its functionality by successfully reproducing a microarray study documented in the literature. The SemanticSCo platform provides support for the composition of RESTful web services semantically annotated using SAWSDL. Our platform also supports the definition of constraints/conditions regarding the order in which service operations should be invoked, thus enabling the definition of complex service behaviours. Our proposed solution for semantic web service composition takes into account the requirements of different stakeholders and addresses all phases of the service composition process. It also provides support for the definition of analysis workflows at a high-level of abstraction, thus enabling users to focus on biological research issues rather than on the technical details of the composition process. The SemanticSCo source code is available at https://github.com/usplssb/SemanticSCo. Copyright Â© 2017 Elsevier Inc. All rights reserved.
Annotating Protein Functional Residues by Coupling High-Throughput Fitness Profile and Homologous-Structure Analysis.

PubMed

Du, Yushen; Wu, Nicholas C; Jiang, Lin; Zhang, Tianhao; Gong, Danyang; Shu, Sara; Wu, Ting-Ting; Sun, Ren

2016-11-01

Identification and annotation of functional residues are fundamental questions in protein sequence analysis. Sequence and structure conservation provides valuable information to tackle these questions. It is, however, limited by the incomplete sampling of sequence space in natural evolution. Moreover, proteins often have multiple functions, with overlapping sequences that present challenges to accurate annotation of the exact functions of individual residues by conservation-based methods. Using the influenza A virus PB1 protein as an example, we developed a method to systematically identify and annotate functional residues. We used saturation mutagenesis and high-throughput sequencing to measure the replication capacity of single nucleotide mutations across the entire PB1 protein. After predicting protein stability upon mutations, we identified functional PB1 residues that are essential for viral replication. To further annotate the functional residues important to the canonical or noncanonical functions of viral RNA-dependent RNA polymerase (vRdRp), we performed a homologous-structure analysis with 16 different vRdRp structures. We achieved high sensitivity in annotating the known canonical polymerase functional residues. Moreover, we identified a cluster of noncanonical functional residues located in the loop region of the PB1 β-ribbon. We further demonstrated that these residues were important for PB1 protein nuclear import through the interaction with Ran-binding protein 5. In summary, we developed a systematic and sensitive method to identify and annotate functional residues that are not restrained by sequence conservation. Importantly, this method is generally applicable to other proteins about which homologous-structure information is available. To fully comprehend the diverse functions of a protein, it is essential to understand the functionality of individual residues. Current methods are highly dependent on evolutionary sequence conservation, which is usually limited by sampling size. Sequence conservation-based methods are further confounded by structural constraints and multifunctionality of proteins. Here we present a method that can systematically identify and annotate functional residues of a given protein. We used a high-throughput functional profiling platform to identify essential residues. Coupling it with homologous-structure comparison, we were able to annotate multiple functions of proteins. We demonstrated the method with the PB1 protein of influenza A virus and identified novel functional residues in addition to its canonical function as an RNA-dependent RNA polymerase. Not limited to virology, this method is generally applicable to other proteins that can be functionally selected and about which homologous-structure information is available. Copyright © 2016 Du et al.

A Semi-Automatic Alignment Method for Math Educational Standards Using the MP (Materialization Pattern) Model

ERIC Educational Resources Information Center

Choi, Namyoun

2010-01-01

Educational standards alignment, which matches similar or equivalent concepts of educational standards, is a necessary task for educational resource discovery and retrieval. Automated or semi-automated alignment systems for educational standards have been recently available. However, existing systems frequently result in inconsistency in…
Semi-automated identification of leopard frogs

USGS Publications Warehouse

Petrovska-Delacrétaz, Dijana; Edwards, Aaron; Chiasson, John; Chollet, Gérard; Pilliod, David S.

2014-01-01

Principal component analysis is used to implement a semi-automatic recognition system to identify recaptured northern leopard frogs (Lithobates pipiens). Results of both open set and closed set experiments are given. The presented algorithm is shown to provide accurate identification of 209 individual leopard frogs from a total set of 1386 images.
SABRE--A Novel Software Tool for Bibliographic Post-Processing.

ERIC Educational Resources Information Center

Burge, Cecil D.

1989-01-01

Describes the software architecture and application of SABRE (Semi-Automated Bibliographic Environment), which is one of the first products to provide a semi-automatic environment for relevancy ranking of citations obtained from searches of bibliographic databases. Features designed to meet the review, categorization, culling, and reporting needs…
Improving Microbial Genome Annotations in an Integrated Database Context

PubMed Central

Chen, I-Min A.; Markowitz, Victor M.; Chu, Ken; Anderson, Iain; Mavromatis, Konstantinos; Kyrpides, Nikos C.; Ivanova, Natalia N.

2013-01-01

Effective comparative analysis of microbial genomes requires a consistent and complete view of biological data. Consistency regards the biological coherence of annotations, while completeness regards the extent and coverage of functional characterization for genomes. We have developed tools that allow scientists to assess and improve the consistency and completeness of microbial genome annotations in the context of the Integrated Microbial Genomes (IMG) family of systems. All publicly available microbial genomes are characterized in IMG using different functional annotation and pathway resources, thus providing a comprehensive framework for identifying and resolving annotation discrepancies. A rule based system for predicting phenotypes in IMG provides a powerful mechanism for validating functional annotations, whereby the phenotypic traits of an organism are inferred based on the presence of certain metabolic reactions and pathways and compared to experimentally observed phenotypes. The IMG family of systems are available at http://img.jgi.doe.gov/. PMID:23424620
Protein function prediction using neighbor relativity in protein-protein interaction network.

PubMed

Moosavi, Sobhan; Rahgozar, Masoud; Rahimi, Amir

2013-04-01

There is a large gap between the number of discovered proteins and the number of functionally annotated ones. Due to the high cost of determining protein function by wet-lab research, function prediction has become a major task for computational biology and bioinformatics. Some researches utilize the proteins interaction information to predict function for un-annotated proteins. In this paper, we propose a novel approach called "Neighbor Relativity Coefficient" (NRC) based on interaction network topology which estimates the functional similarity between two proteins. NRC is calculated for each pair of proteins based on their graph-based features including distance, common neighbors and the number of paths between them. In order to ascribe function to an un-annotated protein, NRC estimates a weight for each neighbor to transfer its annotation to the unknown protein. Finally, the unknown protein will be annotated by the top score transferred functions. We also investigate the effect of using different coefficients for various types of functions. The proposed method has been evaluated on Saccharomyces cerevisiae and Homo sapiens interaction networks. The performance analysis demonstrates that NRC yields better results in comparison with previous protein function prediction approaches that utilize interaction network. Copyright © 2012 Elsevier Ltd. All rights reserved.
GO-FAANG meeting: A gathering on functional annotation of animal genomes

USDA-ARS?s Scientific Manuscript database

The FAANG (Functional Annotation of Animal Genomes) Consortium recently held a Gathering On FAANG (GO-FAANG) Workshop in Washington, DC on October 7-8, 2015. This consortium is a grass-roots organization formed to advance the annotation of newly assembled genomes of non-model organisms (www.faang.or...
Semi-automatic 3D lung nodule segmentation in CT using dynamic programming

NASA Astrophysics Data System (ADS)

Sargent, Dustin; Park, Sun Young

2017-02-01

We present a method for semi-automatic segmentation of lung nodules in chest CT that can be extended to general lesion segmentation in multiple modalities. Most semi-automatic algorithms for lesion segmentation or similar tasks use region-growing or edge-based contour finding methods such as level-set. However, lung nodules and other lesions are often connected to surrounding tissues, which makes these algorithms prone to growing the nodule boundary into the surrounding tissue. To solve this problem, we apply a 3D extension of the 2D edge linking method with dynamic programming to find a closed surface in a spherical representation of the nodule ROI. The algorithm requires a user to draw a maximal diameter across the nodule in the slice in which the nodule cross section is the largest. We report the lesion volume estimation accuracy of our algorithm on the FDA lung phantom dataset, and the RECIST diameter estimation accuracy on the lung nodule dataset from the SPIE 2016 lung nodule classification challenge. The phantom results in particular demonstrate that our algorithm has the potential to mitigate the disparity in measurements performed by different radiologists on the same lesions, which could improve the accuracy of disease progression tracking.
Automatic multi-label annotation of abdominal CT images using CBIR

NASA Astrophysics Data System (ADS)

Xue, Zhiyun; Antani, Sameer; Long, L. Rodney; Thoma, George R.

2017-03-01

We present a technique to annotate multiple organs shown in 2-D abdominal/pelvic CT images using CBIR. This annotation task is motivated by our research interests in visual question-answering (VQA). We aim to apply results from this effort in Open-iSM, a multimodal biomedical search engine developed by the National Library of Medicine (NLM). Understanding visual content of biomedical images is a necessary step for VQA. Though sufficient annotational information about an image may be available in related textual metadata, not all may be useful as descriptive tags, particularly for anatomy on the image. In this paper, we develop and evaluate a multi-label image annotation method using CBIR. We evaluate our method on two 2-D CT image datasets we generated from 3-D volumetric data obtained from a multi-organ segmentation challenge hosted in MICCAI 2015. Shape and spatial layout information is used to encode visual characteristics of the anatomy. We adapt a weighted voting scheme to assign multiple labels to the query image by combining the labels of the images identified as similar by the method. Key parameters that may affect the annotation performance, such as the number of images used in the label voting and the threshold for excluding labels that have low weights, are studied. The method proposes a coarse-to-fine retrieval strategy which integrates the classification with the nearest-neighbor search. Results from our evaluation (using the MICCAI CT image datasets as well as figures from Open-i) are presented.
Instance-Based Question Answering

DTIC Science & Technology

2006-12-01

answer clustering, composition, and scoring. Moreover, with the effort dedicated to improving monolingual system performance, system parameters are...text collections: document type, manual or automatic annotations (if any), and stylistic and notational differences in technical terms. Monolingual ...forum in which cross language retrieval systems and question answering systems are tested for various Eu- ropean languages. The CLEF QA monolingual task
Systematic analysis of snake neurotoxins' functional classification using a data warehousing approach.

PubMed

Siew, Joyce Phui Yee; Khan, Asif M; Tan, Paul T J; Koh, Judice L Y; Seah, Seng Hong; Koo, Chuay Yeng; Chai, Siaw Ching; Armugam, Arunmozhiarasi; Brusic, Vladimir; Jeyaseelan, Kandiah

2004-12-12

Sequence annotations, functional and structural data on snake venom neurotoxins (svNTXs) are scattered across multiple databases and literature sources. Sequence annotations and structural data are available in the public molecular databases, while functional data are almost exclusively available in the published articles. There is a need for a specialized svNTXs database that contains NTX entries, which are organized, well annotated and classified in a systematic manner. We have systematically analyzed svNTXs and classified them using structure-function groups based on their structural, functional and phylogenetic properties. Using conserved motifs in each phylogenetic group, we built an intelligent module for the prediction of structural and functional properties of unknown NTXs. We also developed an annotation tool to aid the functional prediction of newly identified NTXs as an additional resource for the venom research community. We created a searchable online database of NTX proteins sequences (http://research.i2r.a-star.edu.sg/Templar/DB/snake_neurotoxin). This database can also be found under Swiss-Prot Toxin Annotation Project website (http://www.expasy.org/sprot/).
Gene Ontology annotation of the rice blast fungus, Magnaporthe oryzae

PubMed Central

Meng, Shaowu; Brown, Douglas E; Ebbole, Daniel J; Torto-Alalibo, Trudy; Oh, Yeon Yee; Deng, Jixin; Mitchell, Thomas K; Dean, Ralph A

2009-01-01

Background Magnaporthe oryzae, the causal agent of blast disease of rice, is the most destructive disease of rice worldwide. The genome of this fungal pathogen has been sequenced and an automated annotation has recently been updated to Version 6 . However, a comprehensive manual curation remains to be performed. Gene Ontology (GO) annotation is a valuable means of assigning functional information using standardized vocabulary. We report an overview of the GO annotation for Version 5 of M. oryzae genome assembly. Methods A similarity-based (i.e., computational) GO annotation with manual review was conducted, which was then integrated with a literature-based GO annotation with computational assistance. For similarity-based GO annotation a stringent reciprocal best hits method was used to identify similarity between predicted proteins of M. oryzae and GO proteins from multiple organisms with published associations to GO terms. Significant alignment pairs were manually reviewed. Functional assignments were further cross-validated with manually reviewed data, conserved domains, or data determined by wet lab experiments. Additionally, biological appropriateness of the functional assignments was manually checked. Results In total, 6,286 proteins received GO term assignment via the homology-based annotation, including 2,870 hypothetical proteins. Literature-based experimental evidence, such as microarray, MPSS, T-DNA insertion mutation, or gene knockout mutation, resulted in 2,810 proteins being annotated with GO terms. Of these, 1,673 proteins were annotated with new terms developed for Plant-Associated Microbe Gene Ontology (PAMGO). In addition, 67 experiment-determined secreted proteins were annotated with PAMGO terms. Integration of the two data sets resulted in 7,412 proteins (57%) being annotated with 1,957 distinct and specific GO terms. Unannotated proteins were assigned to the 3 root terms. The Version 5 GO annotation is publically queryable via the GO site . Additionally, the genome of M. oryzae is constantly being refined and updated as new information is incorporated. For the latest GO annotation of Version 6 genome, please visit our website . The preliminary GO annotation of Version 6 genome is placed at a local MySql database that is publically queryable via a user-friendly interface Adhoc Query System. Conclusion Our analysis provides comprehensive and robust GO annotations of the M. oryzae genome assemblies that will be solid foundations for further functional interrogation of M. oryzae. PMID:19278556
Automatically monitoring driftwood in large rivers: preliminary results

NASA Astrophysics Data System (ADS)

Piegay, H.; Lemaire, P.; MacVicar, B.; Mouquet-Noppe, C.; Tougne, L.

2014-12-01

Driftwood in rivers impact sediment transport, riverine habitat and human infrastructures. Quantifying it, in particular large woods on fairly large rivers where it can move easily, would allow us to improve our knowledge on fluvial transport processes. There are several means of studying this phenomenon, amongst which RFID sensors tracking, photo and video monitoring. In this abstract, we are interested in the latter, being easier and cheaper to deploy. However, video monitoring of driftwood generates a huge amount of images and manually labeling it is tedious. It is essential to automate such a monitoring process, which is a difficult task in the field of computer vision, and more specifically automatic video analysis. Detecting foreground into dynamic background remains an open problem to date. We installed a video camera at the riverside of a gauging station on the Ain River, a 3500 km² Piedmont River in France. Several floods were manually annotated by a human operator. We developed software that automatically extracts and characterizes wood blocks within a video stream. This algorithm is based upon a statistical model and combines static, dynamic and spatial data. Segmented wood objects are further described with the help of a skeleton-based approach that helps us to automatically determine its shape, diameter and length. The first detailed comparisons between manual annotations and automatically extracted data show that we can fairly well detect large wood until a given size (approximately 120 cm in length or 15 cm in diameter) whereas smaller ones are difficult to detect and tend to be missed by either the human operator, either the algorithm. Detection is fairly accurate in high flow conditions where the water channel is usually brown because of suspended sediment transport. In low flow context, our algorithm still needs improvement to reduce the number of false positive so as to better distinguish shadow or turbulence structures from wood pieces.
The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4).

PubMed

Huntemann, Marcel; Ivanova, Natalia N; Mavromatis, Konstantinos; Tripp, H James; Paez-Espino, David; Palaniappan, Krishnaveni; Szeto, Ernest; Pillay, Manoj; Chen, I-Min A; Pati, Amrita; Nielsen, Torben; Markowitz, Victor M; Kyrpides, Nikos C

2015-01-01

The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. Structural annotation is followed by assignment of protein product names and functions.
Ventriculogram segmentation using boosted decision trees

NASA Astrophysics Data System (ADS)

McDonald, John A.; Sheehan, Florence H.

2004-05-01

Left ventricular status, reflected in ejection fraction or end systolic volume, is a powerful prognostic indicator in heart disease. Quantitative analysis of these and other parameters from ventriculograms (cine xrays of the left ventricle) is infrequently performed due to the labor required for manual segmentation. None of the many methods developed for automated segmentation has achieved clinical acceptance. We present a method for semi-automatic segmentation of ventriculograms based on a very accurate two-stage boosted decision-tree pixel classifier. The classifier determines which pixels are inside the ventricle at key ED (end-diastole) and ES (end-systole) frames. The test misclassification rate is about 1%. The classifier is semi-automatic, requiring a user to select 3 points in each frame: the endpoints of the aortic valve and the apex. The first classifier stage is 2 boosted decision-trees, trained using features such as gray-level statistics (e.g. median brightness) and image geometry (e.g. coordinates relative to user supplied 3 points). Second stage classifiers are trained using the same features as the first, plus the output of the first stage. Border pixels are determined from the segmented images using dilation and erosion. A curve is then fit to the border pixels, minimizing a penalty function that trades off fidelity to the border pixels with smoothness. ED and ES volumes, and ejection fraction are estimated from border curves using standard area-length formulas. On independent test data, the differences between automatic and manual volumes (and ejection fractions) are similar in size to the differences between two human observers.
Identification of widespread adenosine nucleotide binding in Mycobacterium tuberculosis

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ansong, Charles; Ortega, Corrie; Payne, Samuel H.

The annotation of protein function is almost completely performed by in silico approaches. However, computational prediction of protein function is frequently incomplete and error prone. In Mycobacterium tuberculosis (Mtb), ~25% of all genes have no predicted function and are annotated as hypothetical proteins. This lack of functional information severely limits our understanding of Mtb pathogenicity. Current tools for experimental functional annotation are limited and often do not scale to entire protein families. Here, we report a generally applicable chemical biology platform to functionally annotate bacterial proteins by combining activity-based protein profiling (ABPP) and quantitative LC-MS-based proteomics. As an example ofmore » this approach for high-throughput protein functional validation and discovery, we experimentally annotate the families of ATP-binding proteins in Mtb. Our data experimentally validate prior in silico predictions of >250 ATPases and adenosine nucleotide-binding proteins, and reveal 73 hypothetical proteins as novel ATP-binding proteins. We identify adenosine cofactor interactions with many hypothetical proteins containing a diversity of unrelated sequences, providing a new and expanded view of adenosine nucleotide binding in Mtb. Furthermore, many of these hypothetical proteins are both unique to Mycobacteria and essential for infection, suggesting specialized functions in mycobacterial physiology and pathogenicity. Thus, we provide a generally applicable approach for high throughput protein function discovery and validation, and highlight several ways in which application of activity-based proteomics data can improve the quality of functional annotations to facilitate novel biological insights.« less
The Proteome Folding Project: Proteome-scale prediction of structure and function

PubMed Central

Drew, Kevin; Winters, Patrick; Butterfoss, Glenn L.; Berstis, Viktors; Uplinger, Keith; Armstrong, Jonathan; Riffle, Michael; Schweighofer, Erik; Bovermann, Bill; Goodlett, David R.; Davis, Trisha N.; Shasha, Dennis; Malmström, Lars; Bonneau, Richard

2011-01-01

The incompleteness of proteome structure and function annotation is a critical problem for biologists and, in particular, severely limits interpretation of high-throughput and next-generation experiments. We have developed a proteome annotation pipeline based on structure prediction, where function and structure annotations are generated using an integration of sequence comparison, fold recognition, and grid-computing-enabled de novo structure prediction. We predict protein domain boundaries and three-dimensional (3D) structures for protein domains from 94 genomes (including human, Arabidopsis, rice, mouse, fly, yeast, Escherichia coli, and worm). De novo structure predictions were distributed on a grid of more than 1.5 million CPUs worldwide (World Community Grid). We generated significant numbers of new confident fold annotations (9% of domains that are otherwise unannotated in these genomes). We demonstrate that predicted structures can be combined with annotations from the Gene Ontology database to predict new and more specific molecular functions. PMID:21824995
snpGeneSets: An R Package for Genome-Wide Study Annotation

PubMed Central

Mei, Hao; Li, Lianna; Jiang, Fan; Simino, Jeannette; Griswold, Michael; Mosley, Thomas; Liu, Shijian

2016-01-01

Genome-wide studies (GWS) of SNP associations and differential gene expressions have generated abundant results; next-generation sequencing technology has further boosted the number of variants and genes identified. Effective interpretation requires massive annotation and downstream analysis of these genome-wide results, a computationally challenging task. We developed the snpGeneSets package to simplify annotation and analysis of GWS results. Our package integrates local copies of knowledge bases for SNPs, genes, and gene sets, and implements wrapper functions in the R language to enable transparent access to low-level databases for efficient annotation of large genomic data. The package contains functions that execute three types of annotations: (1) genomic mapping annotation for SNPs and genes and functional annotation for gene sets; (2) bidirectional mapping between SNPs and genes, and genes and gene sets; and (3) calculation of gene effect measures from SNP associations and performance of gene set enrichment analyses to identify functional pathways. We applied snpGeneSets to type 2 diabetes (T2D) results from the NHGRI genome-wide association study (GWAS) catalog, a Finnish GWAS, and a genome-wide expression study (GWES). These studies demonstrate the usefulness of snpGeneSets for annotating and performing enrichment analysis of GWS results. The package is open-source, free, and can be downloaded at: https://www.umc.edu/biostats_software/. PMID:27807048
A guide to best practices for Gene Ontology (GO) manual annotation

PubMed Central

Balakrishnan, Rama; Harris, Midori A.; Huntley, Rachael; Van Auken, Kimberly; Cherry, J. Michael

2013-01-01

The Gene Ontology Consortium (GOC) is a community-based bioinformatics project that classifies gene product function through the use of structured controlled vocabularies. A fundamental application of the Gene Ontology (GO) is in the creation of gene product annotations, evidence-based associations between GO definitions and experimental or sequence-based analysis. Currently, the GOC disseminates 126 million annotations covering >374 000 species including all the kingdoms of life. This number includes two classes of GO annotations: those created manually by experienced biocurators reviewing the literature or by examination of biological data (1.1 million annotations covering 2226 species) and those generated computationally via automated methods. As manual annotations are often used to propagate functional predictions between related proteins within and between genomes, it is critical to provide accurate consistent manual annotations. Toward this goal, we present here the conventions defined by the GOC for the creation of manual annotation. This guide represents the best practices for manual annotation as established by the GOC project over the past 12 years. We hope this guide will encourage research communities to annotate gene products of their interest to enhance the corpus of GO annotations available to all. Database URL: http://www.geneontology.org PMID:23842463
Functional annotation of regulatory pathways.

PubMed

Pandey, Jayesh; Koyutürk, Mehmet; Kim, Yohan; Szpankowski, Wojciech; Subramaniam, Shankar; Grama, Ananth

2007-07-01

Standardized annotations of biomolecules in interaction networks (e.g. Gene Ontology) provide comprehensive understanding of the function of individual molecules. Extending such annotations to pathways is a critical component of functional characterization of cellular signaling at the systems level. We propose a framework for projecting gene regulatory networks onto the space of functional attributes using multigraph models, with the objective of deriving statistically significant pathway annotations. We first demonstrate that annotations of pairwise interactions do not generalize to indirect relationships between processes. Motivated by this result, we formalize the problem of identifying statistically overrepresented pathways of functional attributes. We establish the hardness of this problem by demonstrating the non-monotonicity of common statistical significance measures. We propose a statistical model that emphasizes the modularity of a pathway, evaluating its significance based on the coupling of its building blocks. We complement the statistical model by an efficient algorithm and software, Narada, for computing significant pathways in large regulatory networks. Comprehensive results from our methods applied to the Escherichia coli transcription network demonstrate that our approach is effective in identifying known, as well as novel biological pathway annotations. Narada is implemented in Java and is available at http://www.cs.purdue.edu/homes/jpandey/narada/.
10 CFR Appendix J2 to Subpart B of... - Uniform Test Method for Measuring the Energy Consumption of Automatic and Semi-Automatic Clothes...

Code of Federal Regulations, 2014 CFR

2014-01-01

... hardness or less) using 27.0 grams + 4.0 grams per pound of cloth load of AHAM Standard detergent Formula 3... repellent finishes, such as fluoropolymer stain resistant finishes shall not be applied to the test cloth...

The Use of Opto-Electronics in Viscometry.

ERIC Educational Resources Information Center

Mazza, R. J.; Washbourn, D. H.

1982-01-01

Describes a semi-automatic viscometer which incorporates a microprocessor system and uses optoelectronics to detect flow of liquid through the capillary, flow time being displayed on a timer with accuracy of 0.01 second. The system could be made fully automatic with an additional microprocessor circuit and inclusion of a pump. (Author/JN)
Integration of wireless sensor networks into automatic irrigation scheduling of a center pivot

USDA-ARS?s Scientific Manuscript database

A six-span center pivot system was used as a platform for testing two wireless sensor networks (WSN) of infrared thermometers. The cropped field was a semi-circle, divided into six pie shaped sections of which three were irrigated manually and three were irrigated automatically based on the time tem...
Mutant phenotypes for thousands of bacterial genes of unknown function

DOE PAGES

Price, Morgan N.; Wetmore, Kelly M.; Waters, R. Jordan; ...

2018-05-16

One-third of all protein-coding genes from bacterial genomes cannot be annotated with a function. Here, to investigate the functions of these genes, we present genome-wide mutant fitness data from 32 diverse bacteria across dozens of growth conditions. We identified mutant phenotypes for 11,779 protein-coding genes that had not been annotated with a specific function. Many genes could be associated with a specific condition because the gene affected fitness only in that condition, or with another gene in the same bacterium because they had similar mutant phenotypes. Of the poorly annotated genes, 2,316 had associations that have high confidence because theymore » are conserved in other bacteria. By combining these conserved associations with comparative genomics, we identified putative DNA repair proteins; in addition, we propose specific functions for poorly annotated enzymes and transporters and for uncharacterized protein families. Lastly, our study demonstrates the scalability of microbial genetics and its utility for improving gene annotations.« less
Mutant phenotypes for thousands of bacterial genes of unknown function

DOE Office of Scientific and Technical Information (OSTI.GOV)

Price, Morgan N.; Wetmore, Kelly M.; Waters, R. Jordan

One-third of all protein-coding genes from bacterial genomes cannot be annotated with a function. Here, to investigate the functions of these genes, we present genome-wide mutant fitness data from 32 diverse bacteria across dozens of growth conditions. We identified mutant phenotypes for 11,779 protein-coding genes that had not been annotated with a specific function. Many genes could be associated with a specific condition because the gene affected fitness only in that condition, or with another gene in the same bacterium because they had similar mutant phenotypes. Of the poorly annotated genes, 2,316 had associations that have high confidence because theymore » are conserved in other bacteria. By combining these conserved associations with comparative genomics, we identified putative DNA repair proteins; in addition, we propose specific functions for poorly annotated enzymes and transporters and for uncharacterized protein families. Lastly, our study demonstrates the scalability of microbial genetics and its utility for improving gene annotations.« less
GarlicESTdb: an online database and mining tool for garlic EST sequences.

PubMed

Kim, Dae-Won; Jung, Tae-Sung; Nam, Seong-Hyeuk; Kwon, Hyuk-Ryul; Kim, Aeri; Chae, Sung-Hwa; Choi, Sang-Haeng; Kim, Dong-Wook; Kim, Ryong Nam; Park, Hong-Seog

2009-05-18

Allium sativum., commonly known as garlic, is a species in the onion genus (Allium), which is a large and diverse one containing over 1,250 species. Its close relatives include chives, onion, leek and shallot. Garlic has been used throughout recorded history for culinary, medicinal use and health benefits. Currently, the interest in garlic is highly increasing due to nutritional and pharmaceutical value including high blood pressure and cholesterol, atherosclerosis and cancer. For all that, there are no comprehensive databases available for Expressed Sequence Tags(EST) of garlic for gene discovery and future efforts of genome annotation. That is why we developed a new garlic database and applications to enable comprehensive analysis of garlic gene expression. GarlicESTdb is an integrated database and mining tool for large-scale garlic (Allium sativum) EST sequencing. A total of 21,595 ESTs collected from an in-house cDNA library were used to construct the database. The analysis pipeline is an automated system written in JAVA and consists of the following components: automatic preprocessing of EST reads, assembly of raw sequences, annotation of the assembled sequences, storage of the analyzed information into MySQL databases, and graphic display of all processed data. A web application was implemented with the latest J2EE (Java 2 Platform Enterprise Edition) software technology (JSP/EJB/JavaServlet) for browsing and querying the database, for creation of dynamic web pages on the client side, and for mapping annotated enzymes to KEGG pathways, the AJAX framework was also used partially. The online resources, such as putative annotation, single nucleotide polymorphisms (SNP) and tandem repeat data sets, can be searched by text, explored on the website, searched using BLAST, and downloaded. To archive more significant BLAST results, a curation system was introduced with which biologists can easily edit best-hit annotation information for others to view. The GarlicESTdb web application is freely available at http://garlicdb.kribb.re.kr. GarlicESTdb is the first incorporated online information database of EST sequences isolated from garlic that can be freely accessed and downloaded. It has many useful features for interactive mining of EST contigs and datasets from each library, including curation of annotated information, expression profiling, information retrieval, and summary of statistics of functional annotation. Consequently, the development of GarlicESTdb will provide a crucial contribution to biologists for data-mining and more efficient experimental studies.
Portable automatic text classification for adverse drug reaction detection via multi-corpus training.

PubMed

Sarker, Abeed; Gonzalez, Graciela

2015-02-01

Automatic detection of adverse drug reaction (ADR) mentions from text has recently received significant interest in pharmacovigilance research. Current research focuses on various sources of text-based information, including social media-where enormous amounts of user posted data is available, which have the potential for use in pharmacovigilance if collected and filtered accurately. The aims of this study are: (i) to explore natural language processing (NLP) approaches for generating useful features from text, and utilizing them in optimized machine learning algorithms for automatic classification of ADR assertive text segments; (ii) to present two data sets that we prepared for the task of ADR detection from user posted internet data; and (iii) to investigate if combining training data from distinct corpora can improve automatic classification accuracies. One of our three data sets contains annotated sentences from clinical reports, and the two other data sets, built in-house, consist of annotated posts from social media. Our text classification approach relies on generating a large set of features, representing semantic properties (e.g., sentiment, polarity, and topic), from short text nuggets. Importantly, using our expanded feature sets, we combine training data from different corpora in attempts to boost classification accuracies. Our feature-rich classification approach performs significantly better than previously published approaches with ADR class F-scores of 0.812 (previously reported best: 0.770), 0.538 and 0.678 for the three data sets. Combining training data from multiple compatible corpora further improves the ADR F-scores for the in-house data sets to 0.597 (improvement of 5.9 units) and 0.704 (improvement of 2.6 units) respectively. Our research results indicate that using advanced NLP techniques for generating information rich features from text can significantly improve classification accuracies over existing benchmarks. Our experiments illustrate the benefits of incorporating various semantic features such as topics, concepts, sentiments, and polarities. Finally, we show that integration of information from compatible corpora can significantly improve classification performance. This form of multi-corpus training may be particularly useful in cases where data sets are heavily imbalanced (e.g., social media data), and may reduce the time and costs associated with the annotation of data in the future. Copyright © 2014 The Authors. Published by Elsevier Inc. All rights reserved.
Portable Automatic Text Classification for Adverse Drug Reaction Detection via Multi-corpus Training

PubMed Central

Gonzalez, Graciela

2014-01-01

Objective Automatic detection of Adverse Drug Reaction (ADR) mentions from text has recently received significant interest in pharmacovigilance research. Current research focuses on various sources of text-based information, including social media — where enormous amounts of user posted data is available, which have the potential for use in pharmacovigilance if collected and filtered accurately. The aims of this study are: (i) to explore natural language processing approaches for generating useful features from text, and utilizing them in optimized machine learning algorithms for automatic classification of ADR assertive text segments; (ii) to present two data sets that we prepared for the task of ADR detection from user posted internet data; and (iii) to investigate if combining training data from distinct corpora can improve automatic classification accuracies. Methods One of our three data sets contains annotated sentences from clinical reports, and the two other data sets, built in-house, consist of annotated posts from social media. Our text classification approach relies on generating a large set of features, representing semantic properties (e.g., sentiment, polarity, and topic), from short text nuggets. Importantly, using our expanded feature sets, we combine training data from different corpora in attempts to boost classification accuracies. Results Our feature-rich classification approach performs significantly better than previously published approaches with ADR class F-scores of 0.812 (previously reported best: 0.770), 0.538 and 0.678 for the three data sets. Combining training data from multiple compatible corpora further improves the ADR F-scores for the in-house data sets to 0.597 (improvement of 5.9 units) and 0.704 (improvement of 2.6 units) respectively. Conclusions Our research results indicate that using advanced NLP techniques for generating information rich features from text can significantly improve classification accuracies over existing benchmarks. Our experiments illustrate the benefits of incorporating various semantic features such as topics, concepts, sentiments, and polarities. Finally, we show that integration of information from compatible corpora can significantly improve classification performance. This form of multi-corpus training may be particularly useful in cases where data sets are heavily imbalanced (e.g., social media data), and may reduce the time and costs associated with the annotation of data in the future. PMID:25451103
Accurate and reproducible functional maps in 127 human cell types via 2D genome segmentation

PubMed Central

Hardison, Ross C.

2017-01-01

Abstract The Roadmap Epigenomics Consortium has published whole-genome functional annotation maps in 127 human cell types by integrating data from studies of multiple epigenetic marks. These maps have been widely used for studying gene regulation in cell type-specific contexts and predicting the functional impact of DNA mutations on disease. Here, we present a new map of functional elements produced by applying a method called IDEAS on the same data. The method has several unique advantages and outperforms existing methods, including that used by the Roadmap Epigenomics Consortium. Using five categories of independent experimental datasets, we compared the IDEAS and Roadmap Epigenomics maps. While the overall concordance between the two maps is high, the maps differ substantially in the prediction details and in their consistency of annotation of a given genomic position across cell types. The annotation from IDEAS is uniformly more accurate than the Roadmap Epigenomics annotation and the improvement is substantial based on several criteria. We further introduce a pipeline that improves the reproducibility of functional annotation maps. Thus, we provide a high-quality map of candidate functional regions across 127 human cell types and compare the quality of different annotation methods in order to facilitate biomedical research in epigenomics. PMID:28973456
CommWalker: correctly evaluating modules in molecular networks in light of annotation bias.

PubMed

Luecken, M D; Page, M J T; Crosby, A J; Mason, S; Reinert, G; Deane, C M

2018-03-15

Detecting novel functional modules in molecular networks is an important step in biological research. In the absence of gold standard functional modules, functional annotations are often used to verify whether detected modules/communities have biological meaning. However, as we show, the uneven distribution of functional annotations means that such evaluation methods favor communities of well-studied proteins. We propose a novel framework for the evaluation of communities as functional modules. Our proposed framework, CommWalker, takes communities as inputs and evaluates them in their local network environment by performing short random walks. We test CommWalker's ability to overcome annotation bias using input communities from four community detection methods on two protein interaction networks. We find that modules accepted by CommWalker are similarly co-expressed as those accepted by current methods. Crucially, CommWalker performs well not only in well-annotated regions, but also in regions otherwise obscured by poor annotation. CommWalker community prioritization both faithfully captures well-validated communities and identifies functional modules that may correspond to more novel biology. The CommWalker algorithm is freely available at opig.stats.ox.ac.uk/resources or as a docker image on the Docker Hub at hub.docker.com/r/lueckenmd/commwalker/. deane@stats.ox.ac.uk. Supplementary data are available at Bioinformatics online.
Plant genome and transcriptome annotations: from misconceptions to simple solutions

PubMed Central

Bolger, Marie E; Arsova, Borjana; Usadel, Björn

2018-01-01

Abstract Next-generation sequencing has triggered an explosion of available genomic and transcriptomic resources in the plant sciences. Although genome and transcriptome sequencing has become orders of magnitudes cheaper and more efficient, often the functional annotation process is lagging behind. This might be hampered by the lack of a comprehensive enumeration of simple-to-use tools available to the plant researcher. In this comprehensive review, we present (i) typical ontologies to be used in the plant sciences, (ii) useful databases and resources used for functional annotation, (iii) what to expect from an annotated plant genome, (iv) an automated annotation pipeline and (v) a recipe and reference chart outlining typical steps used to annotate plant genomes/transcriptomes using publicly available resources. PMID:28062412
A conceptual study of automatic and semi-automatic quality assurance techniques for round image processing

NASA Technical Reports Server (NTRS)

1983-01-01

This report summarizes the results of a study conducted by Engineering and Economics Research (EER), Inc. under NASA Contract Number NAS5-27513. The study involved the development of preliminary concepts for automatic and semiautomatic quality assurance (QA) techniques for ground image processing. A distinction is made between quality assessment and the more comprehensive quality assurance which includes decision making and system feedback control in response to quality assessment.
GANESH: software for customized annotation of genome regions.

PubMed

Huntley, Derek; Hummerich, Holger; Smedley, Damian; Kittivoravitkul, Sasivimol; McCarthy, Mark; Little, Peter; Sergot, Marek

2003-09-01

GANESH is a software package designed to support the genetic analysis of regions of human and other genomes. It provides a set of components that may be assembled to construct a self-updating database of DNA sequence, mapping data, and annotations of possible genome features. Once one or more remote sources of data for the target region have been identified, all sequences for that region are downloaded, assimilated, and subjected to a (configurable) set of standard database-searching and genome-analysis packages. The results are stored in compressed form in a relational database, and are updated automatically on a regular schedule so that they are always immediately available in their most up-to-date versions. A Java front-end, executed as a stand alone application or web applet, provides a graphical interface for navigating the database and for viewing the annotations. There are facilities for importing and exporting data in the format of the Distributed Annotation System (DAS), enabling a GANESH database to be used as a component of a DAS configuration. The system has been used to construct databases for about a dozen regions of human chromosomes and for three regions of mouse chromosomes.
Automatic ultrasound image enhancement for 2D semi-automatic breast-lesion segmentation

NASA Astrophysics Data System (ADS)

Lu, Kongkuo; Hall, Christopher S.

2014-03-01

Breast cancer is the fastest growing cancer, accounting for 29%, of new cases in 2012, and second leading cause of cancer death among women in the United States and worldwide. Ultrasound (US) has been used as an indispensable tool for breast cancer detection/diagnosis and treatment. In computer-aided assistance, lesion segmentation is a preliminary but vital step, but the task is quite challenging in US images, due to imaging artifacts that complicate detection and measurement of the suspect lesions. The lesions usually present with poor boundary features and vary significantly in size, shape, and intensity distribution between cases. Automatic methods are highly application dependent while manual tracing methods are extremely time consuming and have a great deal of intra- and inter- observer variability. Semi-automatic approaches are designed to counterbalance the advantage and drawbacks of the automatic and manual methods. However, considerable user interaction might be necessary to ensure reasonable segmentation for a wide range of lesions. This work proposes an automatic enhancement approach to improve the boundary searching ability of the live wire method to reduce necessary user interaction while keeping the segmentation performance. Based on the results of segmentation of 50 2D breast lesions in US images, less user interaction is required to achieve desired accuracy, i.e. < 80%, when auto-enhancement is applied for live-wire segmentation.
A semi-automatic 2D-to-3D video conversion with adaptive key-frame selection

NASA Astrophysics Data System (ADS)

Ju, Kuanyu; Xiong, Hongkai

2014-11-01

To compensate the deficit of 3D content, 2D to 3D video conversion (2D-to-3D) has recently attracted more attention from both industrial and academic communities. The semi-automatic 2D-to-3D conversion which estimates corresponding depth of non-key-frames through key-frames is more desirable owing to its advantage of balancing labor cost and 3D effects. The location of key-frames plays a role on quality of depth propagation. This paper proposes a semi-automatic 2D-to-3D scheme with adaptive key-frame selection to keep temporal continuity more reliable and reduce the depth propagation errors caused by occlusion. The potential key-frames would be localized in terms of clustered color variation and motion intensity. The distance of key-frame interval is also taken into account to keep the accumulated propagation errors under control and guarantee minimal user interaction. Once their depth maps are aligned with user interaction, the non-key-frames depth maps would be automatically propagated by shifted bilateral filtering. Considering that depth of objects may change due to the objects motion or camera zoom in/out effect, a bi-directional depth propagation scheme is adopted where a non-key frame is interpolated from two adjacent key frames. The experimental results show that the proposed scheme has better performance than existing 2D-to-3D scheme with fixed key-frame interval.
Comparison of computer systems and ranking criteria for automatic melanoma detection in dermoscopic images.

PubMed

Møllersen, Kajsa; Zortea, Maciel; Schopf, Thomas R; Kirchesch, Herbert; Godtliebsen, Fred

2017-01-01

Melanoma is the deadliest form of skin cancer, and early detection is crucial for patient survival. Computer systems can assist in melanoma detection, but are not widespread in clinical practice. In 2016, an open challenge in classification of dermoscopic images of skin lesions was announced. A training set of 900 images with corresponding class labels and semi-automatic/manual segmentation masks was released for the challenge. An independent test set of 379 images, of which 75 were of melanomas, was used to rank the participants. This article demonstrates the impact of ranking criteria, segmentation method and classifier, and highlights the clinical perspective. We compare five different measures for diagnostic accuracy by analysing the resulting ranking of the computer systems in the challenge. Choice of performance measure had great impact on the ranking. Systems that were ranked among the top three for one measure, dropped to the bottom half when changing performance measure. Nevus Doctor, a computer system previously developed by the authors, was used to participate in the challenge, and investigate the impact of segmentation and classifier. The diagnostic accuracy when using an automatic versus the semi-automatic/manual segmentation is investigated. The unexpected small impact of segmentation method suggests that improvements of the automatic segmentation method w.r.t. resemblance to semi-automatic/manual segmentation will not improve diagnostic accuracy substantially. A small set of similar classification algorithms are used to investigate the impact of classifier on the diagnostic accuracy. The variability in diagnostic accuracy for different classifier algorithms was larger than the variability for segmentation methods, and suggests a focus for future investigations. From a clinical perspective, the misclassification of a melanoma as benign has far greater cost than the misclassification of a benign lesion. For computer systems to have clinical impact, their performance should be ranked by a high-sensitivity measure.
Comparison Of Semi-Automatic And Automatic Slick Detection Algorithms For Jiyeh Power Station Oil Spill, Lebanon

NASA Astrophysics Data System (ADS)

Osmanoglu, B.; Ozkan, C.; Sunar, F.

2013-10-01

After air strikes on July 14 and 15, 2006 the Jiyeh Power Station started leaking oil into the eastern Mediterranean Sea. The power station is located about 30 km south of Beirut and the slick covered about 170 km of coastline threatening the neighboring countries Turkey and Cyprus. Due to the ongoing conflict between Israel and Lebanon, cleaning efforts could not start immediately resulting in 12 000 to 15 000 tons of fuel oil leaking into the sea. In this paper we compare results from automatic and semi-automatic slick detection algorithms. The automatic detection method combines the probabilities calculated for each pixel from each image to obtain a joint probability, minimizing the adverse effects of atmosphere on oil spill detection. The method can readily utilize X-, C- and L-band data where available. Furthermore wind and wave speed observations can be used for a more accurate analysis. For this study, we utilize Envisat ASAR ScanSAR data. A probability map is generated based on the radar backscatter, effect of wind and dampening value. The semi-automatic algorithm is based on supervised classification. As a classifier, Artificial Neural Network Multilayer Perceptron (ANN MLP) classifier is used since it is more flexible and efficient than conventional maximum likelihood classifier for multisource and multi-temporal data. The learning algorithm for ANN MLP is chosen as the Levenberg-Marquardt (LM). Training and test data for supervised classification are composed from the textural information created from SAR images. This approach is semiautomatic because tuning the parameters of classifier and composing training data need a human interaction. We point out the similarities and differences between the two methods and their results as well as underlining their advantages and disadvantages. Due to the lack of ground truth data, we compare obtained results to each other, as well as other published oil slick area assessments.
A methodology for the semi-automatic digital image analysis of fragmental impactites

NASA Astrophysics Data System (ADS)

Chanou, A.; Osinski, G. R.; Grieve, R. A. F.

2014-04-01

A semi-automated digital image analysis method is developed for the comparative textural study of impact melt-bearing breccias. This method uses the freeware software ImageJ developed by the National Institute of Health (NIH). Digital image analysis is performed on scans of hand samples (10-15 cm across), based on macroscopic interpretations of the rock components. All image processing and segmentation are done semi-automatically, with the least possible manual intervention. The areal fraction of components is estimated and modal abundances can be deduced, where the physical optical properties (e.g., contrast, color) of the samples allow it. Other parameters that can be measured include, for example, clast size, clast-preferred orientations, average box-counting dimension or fragment shape complexity, and nearest neighbor distances (NnD). This semi-automated method allows the analysis of a larger number of samples in a relatively short time. Textures, granulometry, and shape descriptors are of considerable importance in rock characterization. The methodology is used to determine the variations of the physical characteristics of some examples of fragmental impactites.
Functional Annotation of the Arabidopsis Genome Using Controlled Vocabularies1

PubMed Central

Berardini, Tanya Z.; Mundodi, Suparna; Reiser, Leonore; Huala, Eva; Garcia-Hernandez, Margarita; Zhang, Peifen; Mueller, Lukas A.; Yoon, Jungwoon; Doyle, Aisling; Lander, Gabriel; Moseyko, Nick; Yoo, Danny; Xu, Iris; Zoeckler, Brandon; Montoya, Mary; Miller, Neil; Weems, Dan; Rhee, Seung Y.

2004-01-01

Controlled vocabularies are increasingly used by databases to describe genes and gene products because they facilitate identification of similar genes within an organism or among different organisms. One of The Arabidopsis Information Resource's goals is to associate all Arabidopsis genes with terms developed by the Gene Ontology Consortium that describe the molecular function, biological process, and subcellular location of a gene product. We have also developed terms describing Arabidopsis anatomy and developmental stages and use these to annotate published gene expression data. As of March 2004, we used computational and manual annotation methods to make 85,666 annotations representing 26,624 unique loci. We focus on associating genes to controlled vocabulary terms based on experimental data from the literature and use The Arabidopsis Information Resource-developed PubSearch software to facilitate this process. Each annotation is tagged with a combination of evidence codes, evidence descriptions, and references that provide a robust means to assess data quality. Annotation of all Arabidopsis genes will allow quantitative comparisons between sets of genes derived from sources such as microarray experiments. The Arabidopsis annotation data will also facilitate annotation of newly sequenced plant genomes by using sequence similarity to transfer annotations to homologous genes. In addition, complete and up-to-date annotations will make unknown genes easy to identify and target for experimentation. Here, we describe the process of Arabidopsis functional annotation using a variety of data sources and illustrate several ways in which this information can be accessed and used to infer knowledge about Arabidopsis and other plant species. PMID:15173566
A New Semi-Automatic Approach to Find Suitable Virtual Electrodes in Arrays Using an Interpolation Strategy.

PubMed

Salchow, Christina; Valtin, Markus; Seel, Thomas; Schauer, Thomas

2016-06-13

Functional Electrical Stimulation via electrode arrays enables the user to form virtual electrodes (VEs) of dynamic shape, size, and position. We developed a feedback-control-assisted manual search strategy which allows the therapist to conveniently and continuously modify VEs to find a good stimulation area. This works for applications in which the desired movement consists of at least two degrees of freedom. The virtual electrode can be moved to arbitrary locations within the array, and each involved element is stimulated with an individual intensity. Meanwhile, the applied global stimulation intensity is controlled automatically to meet a predefined angle for one degree of freedom. This enables the therapist to concentrate on the remaining degree(s) of freedom while changing the VE position. This feedback-control-assisted approach aims to integrate the user's opinion and the patient's sensation. Therefore, our method bridges the gap between manual search and fully automatic identification procedures for array electrodes. Measurements in four healthy volunteers were performed to demonstrate the usefulness of our concept, using a 24-element array to generate wrist and hand extension.
Incorporating Functional Annotations for Fine-Mapping Causal Variants in a Bayesian Framework Using Summary Statistics.

PubMed

Chen, Wenan; McDonnell, Shannon K; Thibodeau, Stephen N; Tillmans, Lori S; Schaid, Daniel J

2016-11-01

Functional annotations have been shown to improve both the discovery power and fine-mapping accuracy in genome-wide association studies. However, the optimal strategy to incorporate the large number of existing annotations is still not clear. In this study, we propose a Bayesian framework to incorporate functional annotations in a systematic manner. We compute the maximum a posteriori solution and use cross validation to find the optimal penalty parameters. By extending our previous fine-mapping method CAVIARBF into this framework, we require only summary statistics as input. We also derived an exact calculation of Bayes factors using summary statistics for quantitative traits, which is necessary when a large proportion of trait variance is explained by the variants of interest, such as in fine mapping expression quantitative trait loci (eQTL). We compared the proposed method with PAINTOR using different strategies to combine annotations. Simulation results show that the proposed method achieves the best accuracy in identifying causal variants among the different strategies and methods compared. We also find that for annotations with moderate effects from a large annotation pool, screening annotations individually and then combining the top annotations can produce overly optimistic results. We applied these methods on two real data sets: a meta-analysis result of lipid traits and a cis-eQTL study of normal prostate tissues. For the eQTL data, incorporating annotations significantly increased the number of potential causal variants with high probabilities. Copyright © 2016 by the Genetics Society of America.

Methods for automatic detection of artifacts in microelectrode recordings.

PubMed

Bakštein, Eduard; Sieger, Tomáš; Wild, Jiří; Novák, Daniel; Schneider, Jakub; Vostatek, Pavel; Urgošík, Dušan; Jech, Robert

2017-10-01

Extracellular microelectrode recording (MER) is a prominent technique for studies of extracellular single-unit neuronal activity. In order to achieve robust results in more complex analysis pipelines, it is necessary to have high quality input data with a low amount of artifacts. We show that noise (mainly electromagnetic interference and motion artifacts) may affect more than 25% of the recording length in a clinical MER database. We present several methods for automatic detection of noise in MER signals, based on (i) unsupervised detection of stationary segments, (ii) large peaks in the power spectral density, and (iii) a classifier based on multiple time- and frequency-domain features. We evaluate the proposed methods on a manually annotated database of 5735 ten-second MER signals from 58 Parkinson's disease patients. The existing methods for artifact detection in single-channel MER that have been rigorously tested, are based on unsupervised change-point detection. We show on an extensive real MER database that the presented techniques are better suited for the task of artifact identification and achieve much better results. The best-performing classifiers (bagging and decision tree) achieved artifact classification accuracy of up to 89% on an unseen test set and outperformed the unsupervised techniques by 5-10%. This was close to the level of agreement among raters using manual annotation (93.5%). We conclude that the proposed methods are suitable for automatic MER denoising and may help in the efficient elimination of undesirable signal artifacts. Copyright © 2017 Elsevier B.V. All rights reserved.
Toward knowledge-enhanced viewing using encyclopedias and model-based segmentation

NASA Astrophysics Data System (ADS)

Kneser, Reinhard; Lehmann, Helko; Geller, Dieter; Qian, Yue-Chen; Weese, Jürgen

2009-02-01

To make accurate decisions based on imaging data, radiologists must associate the viewed imaging data with the corresponding anatomical structures. Furthermore, given a disease hypothesis possible image findings which verify the hypothesis must be considered and where and how they are expressed in the viewed images. If rare anatomical variants, rare pathologies, unfamiliar protocols, or ambiguous findings are present, external knowledge sources such as medical encyclopedias are consulted. These sources are accessed using keywords typically describing anatomical structures, image findings, pathologies. In this paper we present our vision of how a patient's imaging data can be automatically enhanced with anatomical knowledge as well as knowledge about image findings. On one hand, we propose the automatic annotation of the images with labels from a standard anatomical ontology. These labels are used as keywords for a medical encyclopedia such as STATdx to access anatomical descriptions, information about pathologies and image findings. On the other hand we envision encyclopedias to contain links to region- and finding-specific image processing algorithms. Then a finding is evaluated on an image by applying the respective algorithm in the associated anatomical region. Towards realization of our vision, we present our method and results of automatic annotation of anatomical structures in 3D MRI brain images. Thereby we develop a complex surface mesh model incorporating major structures of the brain and a model-based segmentation method. We demonstrate the validity by analyzing the results of several training and segmentation experiments with clinical data focusing particularly on the visual pathway.
Temporal Cyber Attack Detection.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ingram, Joey Burton; Draelos, Timothy J.; Galiardi, Meghan

Rigorous characterization of the performance and generalization ability of cyber defense systems is extremely difficult, making it hard to gauge uncertainty, and thus, confidence. This difficulty largely stems from a lack of labeled attack data that fully explores the potential adversarial space. Currently, performance of cyber defense systems is typically evaluated in a qualitative manner by manually inspecting the results of the system on live data and adjusting as needed. Additionally, machine learning has shown promise in deriving models that automatically learn indicators of compromise that are more robust than analyst-derived detectors. However, to generate these models, most algorithms requiremore » large amounts of labeled data (i.e., examples of attacks). Algorithms that do not require annotated data to derive models are similarly at a disadvantage, because labeled data is still necessary when evaluating performance. In this work, we explore the use of temporal generative models to learn cyber attack graph representations and automatically generate data for experimentation and evaluation. Training and evaluating cyber systems and machine learning models requires significant, annotated data, which is typically collected and labeled by hand for one-off experiments. Automatically generating such data helps derive/evaluate detection models and ensures reproducibility of results. Experimentally, we demonstrate the efficacy of generative sequence analysis techniques on learning the structure of attack graphs, based on a realistic example. These derived models can then be used to generate more data. Additionally, we provide a roadmap for future research efforts in this area.« less
Semi-automated quantitative Drosophila wings measurements.

PubMed

Loh, Sheng Yang Michael; Ogawa, Yoshitaka; Kawana, Sara; Tamura, Koichiro; Lee, Hwee Kuan

2017-06-28

Drosophila melanogaster is an important organism used in many fields of biological research such as genetics and developmental biology. Drosophila wings have been widely used to study the genetics of development, morphometrics and evolution. Therefore there is much interest in quantifying wing structures of Drosophila. Advancement in technology has increased the ease in which images of Drosophila can be acquired. However such studies have been limited by the slow and tedious process of acquiring phenotypic data. We have developed a system that automatically detects and measures key points and vein segments on a Drosophila wing. Key points are detected by performing image transformations and template matching on Drosophila wing images while vein segments are detected using an Active Contour algorithm. The accuracy of our key point detection was compared against key point annotations of users. We also performed key point detection using different training data sets of Drosophila wing images. We compared our software with an existing automated image analysis system for Drosophila wings and showed that our system performs better than the state of the art. Vein segments were manually measured and compared against the measurements obtained from our system. Our system was able to detect specific key points and vein segments from Drosophila wing images with high accuracy.
MICRA: an automatic pipeline for fast characterization of microbial genomes from high-throughput sequencing data.

PubMed

Caboche, Ségolène; Even, Gaël; Loywick, Alexandre; Audebert, Christophe; Hot, David

2017-12-19

The increase in available sequence data has advanced the field of microbiology; however, making sense of these data without bioinformatics skills is still problematic. We describe MICRA, an automatic pipeline, available as a web interface, for microbial identification and characterization through reads analysis. MICRA uses iterative mapping against reference genomes to identify genes and variations. Additional modules allow prediction of antibiotic susceptibility and resistance and comparing the results of several samples. MICRA is fast, producing few false-positive annotations and variant calls compared to current methods, making it a tool of great interest for fully exploiting sequencing data.
The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4)

DOE PAGES

Huntemann, Marcel; Ivanova, Natalia N.; Mavromatis, Konstantinos; ...

2015-10-26

The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. In conclusion, structural annotation is followed by assignment of protein product names and functions.
The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4)

DOE Office of Scientific and Technical Information (OSTI.GOV)

Huntemann, Marcel; Ivanova, Natalia N.; Mavromatis, Konstantinos

The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. In conclusion, structural annotation is followed by assignment of protein product names and functions.
The coffee genome hub: a resource for coffee genomes

PubMed Central

Dereeper, Alexis; Bocs, Stéphanie; Rouard, Mathieu; Guignon, Valentin; Ravel, Sébastien; Tranchant-Dubreuil, Christine; Poncet, Valérie; Garsmeur, Olivier; Lashermes, Philippe; Droc, Gaëtan

2015-01-01

The whole genome sequence of Coffea canephora, the perennial diploid species known as Robusta, has been recently released. In the context of the C. canephora genome sequencing project and to support post-genomics efforts, we developed the Coffee Genome Hub (http://coffee-genome.org/), an integrative genome information system that allows centralized access to genomics and genetics data and analysis tools to facilitate translational and applied research in coffee. We provide the complete genome sequence of C. canephora along with gene structure, gene product information, metabolism, gene families, transcriptomics, syntenic blocks, genetic markers and genetic maps. The hub relies on generic software (e.g. GMOD tools) for easy querying, visualizing and downloading research data. It includes a Genome Browser enhanced by a Community Annotation System, enabling the improvement of automatic gene annotation through an annotation editor. In addition, the hub aims at developing interoperability among other existing South Green tools managing coffee data (phylogenomics resources, SNPs) and/or supporting data analyses with the Galaxy workflow manager. PMID:25392413
Workspace definition for navigated control functional endoscopic sinus surgery

NASA Astrophysics Data System (ADS)

Gessat, Michael; Hofer, Mathias; Audette, Michael; Dietz, Andreas; Meixensberger, Jürgen; Stauß, Gero; Burgert, Oliver

2007-03-01

For the pre-operative definition of a surgical workspace for Navigated Control ® Functional Endoscopic Sinus Surgery (FESS), we developed a semi-automatic image processing system. Based on observations of surgeons using a manual system, we implemented a workflow-based engineering process that led us to the development of a system reducing time and workload spent during the workspace definition. The system uses a feature based on local curvature to align vertices of a polygonal outline along the bone structures defining the cavities of the inner nose. An anisotropic morphologic operator was developed solve problems arising from artifacts from noise and partial volume effects. We used time measurements and NASA's TLX questionnaire to evaluate our system.
A way toward analyzing high-content bioimage data by means of semantic annotation and visual data mining

NASA Astrophysics Data System (ADS)

Herold, Julia; Abouna, Sylvie; Zhou, Luxian; Pelengaris, Stella; Epstein, David B. A.; Khan, Michael; Nattkemper, Tim W.

2009-02-01

In the last years, bioimaging has turned from qualitative measurements towards a high-throughput and highcontent modality, providing multiple variables for each biological sample analyzed. We present a system which combines machine learning based semantic image annotation and visual data mining to analyze such new multivariate bioimage data. Machine learning is employed for automatic semantic annotation of regions of interest. The annotation is the prerequisite for a biological object-oriented exploration of the feature space derived from the image variables. With the aid of visual data mining, the obtained data can be explored simultaneously in the image as well as in the feature domain. Especially when little is known of the underlying data, for example in the case of exploring the effects of a drug treatment, visual data mining can greatly aid the process of data evaluation. We demonstrate how our system is used for image evaluation to obtain information relevant to diabetes study and screening of new anti-diabetes treatments. Cells of the Islet of Langerhans and whole pancreas in pancreas tissue samples are annotated and object specific molecular features are extracted from aligned multichannel fluorescence images. These are interactively evaluated for cell type classification in order to determine the cell number and mass. Only few parameters need to be specified which makes it usable also for non computer experts and allows for high-throughput analysis.
Neurocarta: aggregating and sharing disease-gene relations for the neurosciences.

PubMed

Portales-Casamar, Elodie; Ch'ng, Carolyn; Lui, Frances; St-Georges, Nicolas; Zoubarev, Anton; Lai, Artemis Y; Lee, Mark; Kwok, Cathy; Kwok, Willie; Tseng, Luchia; Pavlidis, Paul

2013-02-26

Understanding the genetic basis of diseases is key to the development of better diagnoses and treatments. Unfortunately, only a small fraction of the existing data linking genes to phenotypes is available through online public resources and, when available, it is scattered across multiple access tools. Neurocarta is a knowledgebase that consolidates information on genes and phenotypes across multiple resources and allows tracking and exploring of the associations. The system enables automatic and manual curation of evidence supporting each association, as well as user-enabled entry of their own annotations. Phenotypes are recorded using controlled vocabularies such as the Disease Ontology to facilitate computational inference and linking to external data sources. The gene-to-phenotype associations are filtered by stringent criteria to focus on the annotations most likely to be relevant. Neurocarta is constantly growing and currently holds more than 30,000 lines of evidence linking over 7,000 genes to 2,000 different phenotypes. Neurocarta is a one-stop shop for researchers looking for candidate genes for any disorder of interest. In Neurocarta, they can review the evidence linking genes to phenotypes and filter out the evidence they're not interested in. In addition, researchers can enter their own annotations from their experiments and analyze them in the context of existing public annotations. Neurocarta's in-depth annotation of neurodevelopmental disorders makes it a unique resource for neuroscientists working on brain development.
AnnotCompute: annotation-based exploration and meta-analysis of genomics experiments

PubMed Central

Zheng, Jie; Stoyanovich, Julia; Manduchi, Elisabetta; Liu, Junmin; Stoeckert, Christian J.

2011-01-01

The ever-increasing scale of biological data sets, particularly those arising in the context of high-throughput technologies, requires the development of rich data exploration tools. In this article, we present AnnotCompute, an information discovery platform for repositories of functional genomics experiments such as ArrayExpress. Our system leverages semantic annotations of functional genomics experiments with controlled vocabulary and ontology terms, such as those from the MGED Ontology, to compute conceptual dissimilarities between pairs of experiments. These dissimilarities are then used to support two types of exploratory analysis—clustering and query-by-example. We show that our proposed dissimilarity measures correspond to a user's intuition about conceptual dissimilarity, and can be used to support effective query-by-example. We also evaluate the quality of clustering based on these measures. While AnnotCompute can support a richer data exploration experience, its effectiveness is limited in some cases, due to the quality of available annotations. Nonetheless, tools such as AnnotCompute may provide an incentive for richer annotations of experiments. Code is available for download at http://www.cbil.upenn.edu/downloads/AnnotCompute. Database URL: http://www.cbil.upenn.edu/annotCompute/ PMID:22190598
Modeling semantic aspects for cross-media image indexing.

PubMed

Monay, Florent; Gatica-Perez, Daniel

2007-10-01

To go beyond the query-by-example paradigm in image retrieval, there is a need for semantic indexing of large image collections for intuitive text-based image search. Different models have been proposed to learn the dependencies between the visual content of an image set and the associated text captions, then allowing for the automatic creation of semantic indices for unannotated images. The task, however, remains unsolved. In this paper, we present three alternatives to learn a Probabilistic Latent Semantic Analysis model (PLSA) for annotated images, and evaluate their respective performance for automatic image indexing. Under the PLSA assumptions, an image is modeled as a mixture of latent aspects that generates both image features and text captions, and we investigate three ways to learn the mixture of aspects. We also propose a more discriminative image representation than the traditional Blob histogram, concatenating quantized local color information and quantized local texture descriptors. The first learning procedure of a PLSA model for annotated images is a standard EM algorithm, which implicitly assumes that the visual and the textual modalities can be treated equivalently. The other two models are based on an asymmetric PLSA learning, allowing to constrain the definition of the latent space on the visual or on the textual modality. We demonstrate that the textual modality is more appropriate to learn a semantically meaningful latent space, which translates into improved annotation performance. A comparison of our learning algorithms with respect to recent methods on a standard dataset is presented, and a detailed evaluation of the performance shows the validity of our framework.
ODMSummary: A Tool for Automatic Structured Comparison of Multiple Medical Forms Based on Semantic Annotation with the Unified Medical Language System.

PubMed

Storck, Michael; Krumm, Rainer; Dugas, Martin

2016-01-01

Medical documentation is applied in various settings including patient care and clinical research. Since procedures of medical documentation are heterogeneous and developed further, secondary use of medical data is complicated. Development of medical forms, merging of data from different sources and meta-analyses of different data sets are currently a predominantly manual process and therefore difficult and cumbersome. Available applications to automate these processes are limited. In particular, tools to compare multiple documentation forms are missing. The objective of this work is to design, implement and evaluate the new system ODMSummary for comparison of multiple forms with a high number of semantically annotated data elements and a high level of usability. System requirements are the capability to summarize and compare a set of forms, enable to estimate the documentation effort, track changes in different versions of forms and find comparable items in different forms. Forms are provided in Operational Data Model format with semantic annotations from the Unified Medical Language System. 12 medical experts were invited to participate in a 3-phase evaluation of the tool regarding usability. ODMSummary (available at https://odmtoolbox.uni-muenster.de/summary/summary.html) provides a structured overview of multiple forms and their documentation fields. This comparison enables medical experts to assess multiple forms or whole datasets for secondary use. System usability was optimized based on expert feedback. The evaluation demonstrates that feedback from domain experts is needed to identify usability issues. In conclusion, this work shows that automatic comparison of multiple forms is feasible and the results are usable for medical experts.
Measuring Patient Mobility in the ICU Using a Novel Noninvasive Sensor

PubMed Central

Ma, Andy J.; Rawat, Nishi; Reiter, Austin; Shrock, Christine; Zhan, Andong; Stone, Alex; Rabiee, Anahita; Griffin, Stephanie; Needham, Dale M.; Saria, Suchi

2017-01-01

Objectives To develop and validate a noninvasive mobility sensor to automatically and continuously detect and measure patient mobility in the ICU. Design Prospective, observational study. Setting Surgical ICU at an academic hospital. Patients Three hundred sixty-two hours of sensor color and depth image data were recorded and curated into 109 segments, each containing 1,000 images, from eight patients. Interventions None. Measurements and Main Results Three Microsoft Kinect sensors (Microsoft, Beijing, China) were deployed in one ICU room to collect continuous patient mobility data. We developed software that automatically analyzes the sensor data to measure mobility and assign the highest level within a time period. To characterize the highest mobility level, a validated 11-point mobility scale was collapsed into four categories: nothing in bed, in-bed activity, out-of-bed activity, and walking. Of the 109 sensor segments, the noninvasive mobility sensor was developed using 26 of these from three ICU patients and validated on 83 remaining segments from five different patients. Three physicians annotated each segment for the highest mobility level. The weighted Kappa (κ) statistic for agreement between automated noninvasive mobility sensor output versus manual physician annotation was 0.86 (95% CI, 0.72–1.00). Disagreement primarily occurred in the “nothing in bed” versus “in-bed activity” categories because “the sensor assessed movement continuously,” which was significantly more sensitive to motion than physician annotations using a discrete manual scale. Conclusions Noninvasive mobility sensor is a novel and feasible method for automating evaluation of ICU patient mobility. PMID:28291092
Phenotyping for patient safety: algorithm development for electronic health record based automated adverse event and medical error detection in neonatal intensive care.

PubMed

Li, Qi; Melton, Kristin; Lingren, Todd; Kirkendall, Eric S; Hall, Eric; Zhai, Haijun; Ni, Yizhao; Kaiser, Megan; Stoutenborough, Laura; Solti, Imre

2014-01-01

Although electronic health records (EHRs) have the potential to provide a foundation for quality and safety algorithms, few studies have measured their impact on automated adverse event (AE) and medical error (ME) detection within the neonatal intensive care unit (NICU) environment. This paper presents two phenotyping AE and ME detection algorithms (ie, IV infiltrations, narcotic medication oversedation and dosing errors) and describes manual annotation of airway management and medication/fluid AEs from NICU EHRs. From 753 NICU patient EHRs from 2011, we developed two automatic AE/ME detection algorithms, and manually annotated 11 classes of AEs in 3263 clinical notes. Performance of the automatic AE/ME detection algorithms was compared to trigger tool and voluntary incident reporting results. AEs in clinical notes were double annotated and consensus achieved under neonatologist supervision. Sensitivity, positive predictive value (PPV), and specificity are reported. Twelve severe IV infiltrates were detected. The algorithm identified one more infiltrate than the trigger tool and eight more than incident reporting. One narcotic oversedation was detected demonstrating 100% agreement with the trigger tool. Additionally, 17 narcotic medication MEs were detected, an increase of 16 cases over voluntary incident reporting. Automated AE/ME detection algorithms provide higher sensitivity and PPV than currently used trigger tools or voluntary incident-reporting systems, including identification of potential dosing and frequency errors that current methods are unequipped to detect. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.
Automatic pelvis segmentation from x-ray images of a mouse model

NASA Astrophysics Data System (ADS)

Al Okashi, Omar M.; Du, Hongbo; Al-Assam, Hisham

2017-05-01

The automatic detection and quantification of skeletal structures has a variety of different applications for biological research. Accurate segmentation of the pelvis from X-ray images of mice in a high-throughput project such as the Mouse Genomes Project not only saves time and cost but also helps achieving an unbiased quantitative analysis within the phenotyping pipeline. This paper proposes an automatic solution for pelvis segmentation based on structural and orientation properties of the pelvis in X-ray images. The solution consists of three stages including pre-processing image to extract pelvis area, initial pelvis mask preparation and final pelvis segmentation. Experimental results on a set of 100 X-ray images showed consistent performance of the algorithm. The automated solution overcomes the weaknesses of a manual annotation procedure where intra- and inter-observer variations cannot be avoided.
Supporting Semantic Annotation of Educational Content by Automatic Extraction of Hierarchical Domain Relationships

ERIC Educational Resources Information Center

Vrablecová, Petra; Šimko, Marián

2016-01-01

The domain model is an essential part of an adaptive learning system. For each educational course, it involves educational content and semantics, which is also viewed as a form of conceptual metadata about educational content. Due to the size of a domain model, manual domain model creation is a challenging and demanding task for teachers or…
Datasets of Odontocete Sounds Annotated for Developing Automatic Detection Methods

DTIC Science & Technology

2010-12-01

Passive acoustic detection of Minke whales (Balaenoptera acutorostrata) off the West Coast of Kauai, HI. Book of abstracts, Fourth International...Workshop on Detection , Classification and Localization of Marine Mammals using Passive Acoustics , Pavia, Italy, Sept. 10- 13, 2009, p. 57. Roch, M., Y...Mellinger, and D. Gillespie. 2010. Comparison of beaked whale detection algorithms. Applied Acoustics 71:1043-1049. 8 References
Information Retrieval and Text Mining Technologies for Chemistry.

PubMed

Krallinger, Martin; Rabal, Obdulia; Lourenço, Anália; Oyarzabal, Julen; Valencia, Alfonso

2017-06-28

Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.

Zone analysis in biology articles as a basis for information extraction.

PubMed

Mizuta, Yoko; Korhonen, Anna; Mullen, Tony; Collier, Nigel

2006-06-01

In the field of biomedicine, an overwhelming amount of experimental data has become available as a result of the high throughput of research in this domain. The amount of results reported has now grown beyond the limits of what can be managed by manual means. This makes it increasingly difficult for the researchers in this area to keep up with the latest developments. Information extraction (IE) in the biological domain aims to provide an effective automatic means to dynamically manage the information contained in archived journal articles and abstract collections and thus help researchers in their work. However, while considerable advances have been made in certain areas of IE, pinpointing and organizing factual information (such as experimental results) remains a challenge. In this paper we propose tackling this task by incorporating into IE information about rhetorical zones, i.e. classification of spans of text in terms of argumentation and intellectual attribution. As the first step towards this goal, we introduce a scheme for annotating biological texts for rhetorical zones and provide a qualitative and quantitative analysis of the data annotated according to this scheme. We also discuss our preliminary research on automatic zone analysis, and its incorporation into our IE framework.
Effectiveness of image features and similarity measures in cluster-based approaches for content-based image retrieval

NASA Astrophysics Data System (ADS)

Du, Hongbo; Al-Jubouri, Hanan; Sellahewa, Harin

2014-05-01

Content-based image retrieval is an automatic process of retrieving images according to image visual contents instead of textual annotations. It has many areas of application from automatic image annotation and archive, image classification and categorization to homeland security and law enforcement. The key issues affecting the performance of such retrieval systems include sensible image features that can effectively capture the right amount of visual contents and suitable similarity measures to find similar and relevant images ranked in a meaningful order. Many different approaches, methods and techniques have been developed as a result of very intensive research in the past two decades. Among many existing approaches, is a cluster-based approach where clustering methods are used to group local feature descriptors into homogeneous regions, and search is conducted by comparing the regions of the query image against those of the stored images. This paper serves as a review of works in this area. The paper will first summarize the existing work reported in the literature and then present the authors' own investigations in this field. The paper intends to highlight not only achievements made by recent research but also challenges and difficulties still remaining in this area.
SURE reliability analysis: Program and mathematics

NASA Technical Reports Server (NTRS)

Butler, Ricky W.; White, Allan L.

1988-01-01

The SURE program is a new reliability analysis tool for ultrareliable computer system architectures. The computational methods on which the program is based provide an efficient means for computing accurate upper and lower bounds for the death state probabilities of a large class of semi-Markov models. Once a semi-Markov model is described using a simple input language, the SURE program automatically computes the upper and lower bounds on the probability of system failure. A parameter of the model can be specified as a variable over a range of values directing the SURE program to perform a sensitivity analysis automatically. This feature, along with the speed of the program, makes it especially useful as a design tool.
Spectroscopic imaging ellipsometry for automated search of flakes of mono- and n-layers of 2D-materials

NASA Astrophysics Data System (ADS)

Funke, S.; Wurstbauer, U.; Miller, B.; Matković, A.; Green, A.; Diebold, A.; Röling, C.; Thiesen, P. H.

2017-11-01

Spectroscopic imaging ellipsometry (SIE) is used to localize and characterize flakes of conducting, semi-conducting and insulating 2D-materials. Although the research in the field of monolayers of 2D-materials increased the last years, it is still challenging to look for small flakes and distinguish between different layer numbers. Special substrates are used to enhance optical contrast for the conventional light microscopy (LM). In case when other functional support from the substrate is essential, an additional transfer step needs to be employed, bringing the drawbacks as contamination, cracking and wrinkling of the 2D materials. Furthermore it is time-consuming and not yet fully automatically to search for monolayers by contrast with the LM. Here we present a method, that is able to automatically localize regions with desired thicknesses, e.g. monolayers, of the different materials on arbitrary substrates.
A Hybrid Hierarchical Approach for Brain Tissue Segmentation by Combining Brain Atlas and Least Square Support Vector Machine

PubMed Central

Kasiri, Keyvan; Kazemi, Kamran; Dehghani, Mohammad Javad; Helfroush, Mohammad Sadegh

2013-01-01

In this paper, we present a new semi-automatic brain tissue segmentation method based on a hybrid hierarchical approach that combines a brain atlas as a priori information and a least-square support vector machine (LS-SVM). The method consists of three steps. In the first two steps, the skull is removed and the cerebrospinal fluid (CSF) is extracted. These two steps are performed using the toolbox FMRIB's automated segmentation tool integrated in the FSL software (FSL-FAST) developed in Oxford Centre for functional MRI of the brain (FMRIB). Then, in the third step, the LS-SVM is used to segment grey matter (GM) and white matter (WM). The training samples for LS-SVM are selected from the registered brain atlas. The voxel intensities and spatial positions are selected as the two feature groups for training and test. SVM as a powerful discriminator is able to handle nonlinear classification problems; however, it cannot provide posterior probability. Thus, we use a sigmoid function to map the SVM output into probabilities. The proposed method is used to segment CSF, GM and WM from the simulated magnetic resonance imaging (MRI) using Brainweb MRI simulator and real data provided by Internet Brain Segmentation Repository. The semi-automatically segmented brain tissues were evaluated by comparing to the corresponding ground truth. The Dice and Jaccard similarity coefficients, sensitivity and specificity were calculated for the quantitative validation of the results. The quantitative results show that the proposed method segments brain tissues accurately with respect to corresponding ground truth. PMID:24696800
Assessing the Three-North Shelter Forest Program in China by a novel framework for characterizing vegetation changes

NASA Astrophysics Data System (ADS)

Qiu, Bingwen; Chen, Gong; Tang, Zhenghong; Lu, Difei; Wang, Zhuangzhuang; Chen, Chongchen

2017-11-01

The Three-North Shelter Forest Program (TNSFP) in China has been intensely invested for approximately 40 years. However, the efficacy of the TNSFP has been debatable due to the spatiotemporal complexity of vegetation changes. A novel framework was proposed for characterizing vegetation changes in the TNSFP region through Combining Trend and Temporal Similarity trajectory (COTTS). This framework could automatically and continuously address the fundamental questions on where, what, how and when vegetation changes have occurred. Vegetation trend was measured by a non-parametric method. The temporal similarity trajectory was tracked by the Jeffries-Matusita (JM) distance of the inter-annual vegetation indices temporal profiles and modeled using the logistic function. The COTTS approach was applied to examine the afforestation efforts of the TNSFP using 500 m 8-day composites MODIS datasets from 2001 to 2015. Accuracy assessment from the 1109 reference sites reveals that the COTTS is capable of automatically determining vegetation dynamic patterns, with an overall accuracy of 90.08% and a kappa coefficient of 0.8688. The efficacy of the TNSFP was evaluated through comprehensive considerations of vegetation, soil and wetness. Around 45.78% areas obtained increasing vegetation trend, 2.96% areas achieved bare soil decline and 4.50% areas exhibited increasing surface wetness. There were 4.49% areas under vegetation degradation & desertification. Spatiotemporal heterogeneity of efficacy of the TNSFP was revealed: great vegetation gain through the abrupt dynamic pattern in the semi-humid and humid regions, bare soil decline & potential efficacy in the semi-arid region and remarkable efficacy in functional region of Eastern Ordos.
NCBI disease corpus: a resource for disease name recognition and concept normalization.

PubMed

Doğan, Rezarta Islamaj; Leaman, Robert; Lu, Zhiyong

2014-02-01

Information encoded in natural language in biomedical literature publications is only useful if efficient and reliable ways of accessing and analyzing that information are available. Natural language processing and text mining tools are therefore essential for extracting valuable information, however, the development of powerful, highly effective tools to automatically detect central biomedical concepts such as diseases is conditional on the availability of annotated corpora. This paper presents the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). Manual curation was performed using PubTator, which allowed the use of pre-annotations as a pre-step to manual annotations. Fourteen annotators were randomly paired and differing annotations were discussed for reaching a consensus in two annotation phases. In this setting, a high inter-annotator agreement was observed. Finally, all results were checked against annotations of the rest of the corpus to assure corpus-wide consistency. The public release of the NCBI disease corpus contains 6892 disease mentions, which are mapped to 790 unique disease concepts. Of these, 88% link to a MeSH identifier, while the rest contain an OMIM identifier. We were able to link 91% of the mentions to a single disease concept, while the rest are described as a combination of concepts. In order to help researchers use the corpus to design and test disease identification methods, we have prepared the corpus as training, testing and development sets. To demonstrate its utility, we conducted a benchmarking experiment where we compared three different knowledge-based disease normalization methods with a best performance in F-measure of 63.7%. These results show that the NCBI disease corpus has the potential to significantly improve the state-of-the-art in disease name recognition and normalization research, by providing a high-quality gold standard thus enabling the development of machine-learning based approaches for such tasks. The NCBI disease corpus, guidelines and other associated resources are available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/. Published by Elsevier Inc.
Mining of relations between proteins over biomedical scientific literature using a deep-linguistic approach.

PubMed

Rinaldi, Fabio; Schneider, Gerold; Kaljurand, Kaarel; Hess, Michael; Andronis, Christos; Konstandi, Ourania; Persidis, Andreas

2007-02-01

The amount of new discoveries (as published in the scientific literature) in the biomedical area is growing at an exponential rate. This growth makes it very difficult to filter the most relevant results, and thus the extraction of the core information becomes very expensive. Therefore, there is a growing interest in text processing approaches that can deliver selected information from scientific publications, which can limit the amount of human intervention normally needed to gather those results. This paper presents and evaluates an approach aimed at automating the process of extracting functional relations (e.g. interactions between genes and proteins) from scientific literature in the biomedical domain. The approach, using a novel dependency-based parser, is based on a complete syntactic analysis of the corpus. We have implemented a state-of-the-art text mining system for biomedical literature, based on a deep-linguistic, full-parsing approach. The results are validated on two different corpora: the manually annotated genomics information access (GENIA) corpus and the automatically annotated arabidopsis thaliana circadian rhythms (ATCR) corpus. We show how a deep-linguistic approach (contrary to common belief) can be used in a real world text mining application, offering high-precision relation extraction, while at the same time retaining a sufficient recall.
proGenomes: a resource for consistent functional and taxonomic annotations of prokaryotic genomes.

PubMed

Mende, Daniel R; Letunic, Ivica; Huerta-Cepas, Jaime; Li, Simone S; Forslund, Kristoffer; Sunagawa, Shinichi; Bork, Peer

2017-01-04

The availability of microbial genomes has opened many new avenues of research within microbiology. This has been driven primarily by comparative genomics approaches, which rely on accurate and consistent characterization of genomic sequences. It is nevertheless difficult to obtain consistent taxonomic and integrated functional annotations for defined prokaryotic clades. Thus, we developed proGenomes, a resource that provides user-friendly access to currently 25 038 high-quality genomes whose sequences and consistent annotations can be retrieved individually or by taxonomic clade. These genomes are assigned to 5306 consistent and accurate taxonomic species clusters based on previously established methodology. proGenomes also contains functional information for almost 80 million protein-coding genes, including a comprehensive set of general annotations and more focused annotations for carbohydrate-active enzymes and antibiotic resistance genes. Additionally, broad habitat information is provided for many genomes. All genomes and associated information can be downloaded by user-selected clade or multiple habitat-specific sets of representative genomes. We expect that the availability of high-quality genomes with comprehensive functional annotations will promote advances in clinical microbial genomics, functional evolution and other subfields of microbiology. proGenomes is available at http://progenomes.embl.de. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
TnpPred: A Web Service for the Robust Prediction of Prokaryotic Transposases

PubMed Central

Riadi, Gonzalo; Medina-Moenne, Cristobal; Holmes, David S.

2012-01-01

Transposases (Tnps) are enzymes that participate in the movement of insertion sequences (ISs) within and between genomes. Genes that encode Tnps are amongst the most abundant and widely distributed genes in nature. However, they are difficult to predict bioinformatically and given the increasing availability of prokaryotic genomes and metagenomes, it is incumbent to develop rapid, high quality automatic annotation of ISs. This need prompted us to develop a web service, termed TnpPred for Tnp discovery. It provides better sensitivity and specificity for Tnp predictions than given by currently available programs as determined by ROC analysis. TnpPred should be useful for improving genome annotation. The TnpPred web service is freely available for noncommercial use. PMID:23251097
A collaborative platform for consensus sessions in pathology over Internet.

PubMed

Zapletal, Eric; Le Bozec, Christel; Degoulet, Patrice; Jaulent, Marie-Christine

2003-01-01

The design of valid databases in pathology faces the problem of diagnostic disagreement between pathologists. Organizing consensus sessions between experts to reduce the variability is a difficult task. The TRIDEM platform addresses the issue to organize consensus sessions in pathology over the Internet. In this paper, we present the basis to achieve such collaborative platform. On the one hand, the platform integrates the functionalities of the IDEM consensus module that alleviates the consensus task by presenting to pathologists preliminary computed consensus through ergonomic interfaces (automatic step). On the other hand, a set of lightweight interaction tools such as vocal annotations are implemented to ease the communication between experts as they discuss a case (interactive step). The architecture of the TRIDEM platform is based on a Java-Server-Page web server that communicate with the ObjectStore PSE/PRO database used for the object storage. The HTML pages generated by the web server run Java applets to perform the different steps (automatic and interactive) of the consensus. The current limitations of the platform is to only handle a synchronous process. Moreover, improvements like re-writing the consensus workflow with a protocol such as BPML are already forecast.
Tracing the boundaries of Cenozoic volcanic edifices from Sardinia (Italy): a geomorphometric contribution

NASA Astrophysics Data System (ADS)

Melis, M. T.; Mundula, F.; DessÌ, F.; Cioni, R.; Funedda, A.

2014-09-01

Unequivocal delimitation of landforms is an important issue for different purposes, from science-driven morphometric analysis to legal issues related to land conservation. This study is aimed at giving a new contribution to the morphometric approach for the delineation of the boundaries of volcanic edifices, applied to 13 monogenetic volcanoes (scoria cones) related to the Pliocene-Pleistocene volcanic cycle in Sardinia (Italy). External boundary delimitation of the edifices is discussed based on an integrated methodology using automatic elaboration of digital elevation models together with geomorphological and geological observations. Different elaborations of surface slope and profile curvature have been proposed and discussed; among them, two algorithms based on simple mathematical functions combining slope and profile curvature well fit the requirements of this study. One of theses algorithms is a modification of a function introduced by Grosse et al. (2011), which better performs for recognizing and tracing the boundary between the volcanic scoria cone and its basement. Although the geological constraints still drive the final decision, the proposed method improves the existing tools for a semi-automatic tracing of the boundaries.
Tracing the boundaries of Cenozoic volcanic edifices from Sardinia (Italy): a geomorphometric contribution

NASA Astrophysics Data System (ADS)

Melis, M. T.; Mundula, F.; Dessì, F.; Cioni, R.; Funedda, A.

2014-05-01

Unequivocal delimitation of landforms is an important issue for different purposes, from science-driven morphometric analysis to legal issues related to land conservation. This study is aimed at giving a new contribution to the morphometric approach for the delineation of the boundaries of volcanic edifices, applied to 13 monogenetic volcanoes (scoria cones) related to the Pliocene-Pleistocene volcanic cycle in Sardinia (Italy). External boundary delimitation of the edifices is discussed based on an integrated methodology using automatic elaboration of digital elevation models together with geomorphological and geological observations. Different elaborations of surface slope and profile curvature have been proposed and discussed; among them, two algorithms based on simple mathematical functions combining slope and profile curvature well fit the requirements of this study. One of theses algorithms is a modification of a function already discussed by Grosse et al. (2011), which better perform for recognizing and tracing the boundary between the volcanic scoria cone and its basement. Although the geological constraints still drive the final decision, the proposed method improves the existing tools for a semi-automatic tracing of the boundaries.
Automated contour detection in X-ray left ventricular angiograms using multiview active appearance models and dynamic programming.

PubMed

Oost, Elco; Koning, Gerhard; Sonka, Milan; Oemrawsingh, Pranobe V; Reiber, Johan H C; Lelieveldt, Boudewijn P F

2006-09-01

This paper describes a new approach to the automated segmentation of X-ray left ventricular (LV) angiograms, based on active appearance models (AAMs) and dynamic programming. A coupling of shape and texture information between the end-diastolic (ED) and end-systolic (ES) frame was achieved by constructing a multiview AAM. Over-constraining of the model was compensated for by employing dynamic programming, integrating both intensity and motion features in the cost function. Two applications are compared: a semi-automatic method with manual model initialization, and a fully automatic algorithm. The first proved to be highly robust and accurate, demonstrating high clinical relevance. Based on experiments involving 70 patient data sets, the algorithm's success rate was 100% for ED and 99% for ES, with average unsigned border positioning errors of 0.68 mm for ED and 1.45 mm for ES. Calculated volumes were accurate and unbiased. The fully automatic algorithm, with intrinsically less user interaction was less robust, but showed a high potential, mostly due to a controlled gradient descent in updating the model parameters. The success rate of the fully automatic method was 91% for ED and 83% for ES, with average unsigned border positioning errors of 0.79 mm for ED and 1.55 mm for ES.
An efficient annotation and gene-expression derivation tool for Illumina Solexa datasets.

PubMed

Hosseini, Parsa; Tremblay, Arianne; Matthews, Benjamin F; Alkharouf, Nadim W

2010-07-02

The data produced by an Illumina flow cell with all eight lanes occupied, produces well over a terabyte worth of images with gigabytes of reads following sequence alignment. The ability to translate such reads into meaningful annotation is therefore of great concern and importance. Very easily, one can get flooded with such a great volume of textual, unannotated data irrespective of read quality or size. CASAVA, a optional analysis tool for Illumina sequencing experiments, enables the ability to understand INDEL detection, SNP information, and allele calling. To not only extract from such analysis, a measure of gene expression in the form of tag-counts, but furthermore to annotate such reads is therefore of significant value. We developed TASE (Tag counting and Analysis of Solexa Experiments), a rapid tag-counting and annotation software tool specifically designed for Illumina CASAVA sequencing datasets. Developed in Java and deployed using jTDS JDBC driver and a SQL Server backend, TASE provides an extremely fast means of calculating gene expression through tag-counts while annotating sequenced reads with the gene's presumed function, from any given CASAVA-build. Such a build is generated for both DNA and RNA sequencing. Analysis is broken into two distinct components: DNA sequence or read concatenation, followed by tag-counting and annotation. The end result produces output containing the homology-based functional annotation and respective gene expression measure signifying how many times sequenced reads were found within the genomic ranges of functional annotations. TASE is a powerful tool to facilitate the process of annotating a given Illumina Solexa sequencing dataset. Our results indicate that both homology-based annotation and tag-count analysis are achieved in very efficient times, providing researchers to delve deep in a given CASAVA-build and maximize information extraction from a sequencing dataset. TASE is specially designed to translate sequence data in a CASAVA-build into functional annotations while producing corresponding gene expression measurements. Achieving such analysis is executed in an ultrafast and highly efficient manner, whether the analysis be a single-read or paired-end sequencing experiment. TASE is a user-friendly and freely available application, allowing rapid analysis and annotation of any given Illumina Solexa sequencing dataset with ease.
Text Mining Improves Prediction of Protein Functional Sites

PubMed Central

Cohn, Judith D.; Ravikumar, Komandur E.

2012-01-01

We present an approach that integrates protein structure analysis and text mining for protein functional site prediction, called LEAP-FS (Literature Enhanced Automated Prediction of Functional Sites). The structure analysis was carried out using Dynamics Perturbation Analysis (DPA), which predicts functional sites at control points where interactions greatly perturb protein vibrations. The text mining extracts mentions of residues in the literature, and predicts that residues mentioned are functionally important. We assessed the significance of each of these methods by analyzing their performance in finding known functional sites (specifically, small-molecule binding sites and catalytic sites) in about 100,000 publicly available protein structures. The DPA predictions recapitulated many of the functional site annotations and preferentially recovered binding sites annotated as biologically relevant vs. those annotated as potentially spurious. The text-based predictions were also substantially supported by the functional site annotations: compared to other residues, residues mentioned in text were roughly six times more likely to be found in a functional site. The overlap of predictions with annotations improved when the text-based and structure-based methods agreed. Our analysis also yielded new high-quality predictions of many functional site residues that were not catalogued in the curated data sources we inspected. We conclude that both DPA and text mining independently provide valuable high-throughput protein functional site predictions, and that integrating the two methods using LEAP-FS further improves the quality of these predictions. PMID:22393388
The Function of Annotations in the Comprehension of Scientific Texts: Cognitive Load Effects and the Impact of Verbal Ability

ERIC Educational Resources Information Center

Wallen, Erik; Plass, Jan L.; Brunken, Roland

2005-01-01

Students participated in a study (n = 98) investigating the effectiveness of three types of annotations on three learning outcome measures. The annotations were designed to support the cognitive processes in the comprehension of scientific texts, with a function to aid either the process of selecting relevant information, organizing the…
20 CFR 220.133 - Skill requirements.

Code of Federal Regulations, 2011 CFR

2011-04-01

... needs little or no judgment to do simple duties that can be learned on the job in a short period of time... claimant can usually learn to do the job in 30 days, and little job training and judgment are needed. The... machines which are automatic or operated by others); or (4) Machine tending. (c) Semi-skilled work. Semi...
20 CFR 220.133 - Skill requirements.

Code of Federal Regulations, 2010 CFR

2010-04-01

... needs little or no judgment to do simple duties that can be learned on the job in a short period of time... claimant can usually learn to do the job in 30 days, and little job training and judgment are needed. The... machines which are automatic or operated by others); or (4) Machine tending. (c) Semi-skilled work. Semi...
20 CFR 220.133 - Skill requirements.

Code of Federal Regulations, 2014 CFR

2014-04-01

... needs little or no judgment to do simple duties that can be learned on the job in a short period of time... claimant can usually learn to do the job in 30 days, and little job training and judgment are needed. The... machines which are automatic or operated by others); or (4) Machine tending. (c) Semi-skilled work. Semi...

Some links on this page may take you to non-federal websites. Their policies may differ from this site.