Sample records for large text corpora

  1. Extracting Useful Semantic Information from Large Scale Corpora of Text

    ERIC Educational Resources Information Center

    Mendoza, Ray Padilla, Jr.

    2012-01-01

    Extracting and representing semantic information from large scale corpora is at the crux of computer-assisted knowledge generation. Semantic information depends on collocation extraction methods, mathematical models used to represent distributional information, and weighting functions which transform the space. This dissertation provides a…

  2. Mining Quality Phrases from Massive Text Corpora

    PubMed Central

    Liu, Jialu; Shang, Jingbo; Wang, Chi; Ren, Xiang; Han, Jiawei

    2015-01-01

    Text data are ubiquitous and play an essential role in big data applications. However, text data are mostly unstructured. Transforming unstructured text into structured units (e.g., semantically meaningful phrases) will substantially reduce semantic ambiguity and enhance the power and efficiency at manipulating such data using database technology. Thus mining quality phrases is a critical research problem in the field of databases. In this paper, we propose a new framework that extracts quality phrases from text corpora integrated with phrasal segmentation. The framework requires only limited training but the quality of phrases so generated is close to human judgment. Moreover, the method is scalable: both computation time and required space grow linearly as corpus size increases. Our experiments on large text corpora demonstrate the quality and efficiency of the new method. PMID:26705375

  3. Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies.

    PubMed

    Cohen, Raphael; Elhadad, Michael; Elhadad, Noémie

    2013-01-16

    The increasing availability of Electronic Health Record (EHR) data and specifically free-text patient notes presents opportunities for phenotype extraction. Text-mining methods in particular can help disease modeling by mapping named-entities mentions to terminologies and clustering semantically related terms. EHR corpora, however, exhibit specific statistical and linguistic characteristics when compared with corpora in the biomedical literature domain. We focus on copy-and-paste redundancy: clinicians typically copy and paste information from previous notes when documenting a current patient encounter. Thus, within a longitudinal patient record, one expects to observe heavy redundancy. In this paper, we ask three research questions: (i) How can redundancy be quantified in large-scale text corpora? (ii) Conventional wisdom is that larger corpora yield better results in text mining. But how does the observed EHR redundancy affect text mining? Does such redundancy introduce a bias that distorts learned models? Or does the redundancy introduce benefits by highlighting stable and important subsets of the corpus? (iii) How can one mitigate the impact of redundancy on text mining? We analyze a large-scale EHR corpus and quantify redundancy both in terms of word and semantic concept repetition. We observe redundancy levels of about 30% and non-standard distribution of both words and concepts. We measure the impact of redundancy on two standard text-mining applications: collocation identification and topic modeling. We compare the results of these methods on synthetic data with controlled levels of redundancy and observe significant performance variation. Finally, we compare two mitigation strategies to avoid redundancy-induced bias: (i) a baseline strategy, keeping only the last note for each patient in the corpus; (ii) removing redundant notes with an efficient fingerprinting-based algorithm. 
    For text mining, preprocessing the EHR corpus with fingerprinting yields significantly better results. Before applying text-mining techniques, one must pay careful attention to the structure of the analyzed corpora. While the importance of data cleaning has been known for low-level text characteristics (e.g., encoding and spelling), high-level and difficult-to-quantify corpus characteristics, such as naturally occurring redundancy, can also hurt text mining. Fingerprinting enables text-mining techniques to leverage available data in the EHR corpus, while avoiding the bias introduced by redundancy.
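
    The abstract above does not spell out its fingerprinting algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions: each note is reduced to a set of hashed k-word shingles, and a note is dropped when most of its shingles were already seen in earlier notes. The function names, the shingle size `k=5`, and the `threshold=0.5` cutoff are all hypothetical choices for illustration, not values taken from the paper.

```python
import hashlib


def fingerprints(text, k=5):
    """Return the set of hashed k-word shingles for one note (k is an assumed value)."""
    words = text.lower().split()
    if len(words) < k:
        # Short notes are treated as a single shingle.
        shingles = [" ".join(words)] if words else []
    else:
        shingles = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    return {hashlib.sha256(s.encode("utf-8")).hexdigest() for s in shingles}


def redundancy(note, seen):
    """Fraction of the note's fingerprints that already appear in `seen`."""
    fp = fingerprints(note)
    if not fp:
        return 0.0
    return len(fp & seen) / len(fp)


def filter_redundant(notes, threshold=0.5):
    """Keep a note only if its overlap with previously kept notes is below `threshold`."""
    seen, kept = set(), []
    for note in notes:
        if redundancy(note, seen) < threshold:
            kept.append(note)
            seen |= fingerprints(note)
    return kept
```

    With this sketch, a note that copies an earlier note and appends a single word would be filtered out, while an unrelated note would be kept. A production system would likely need a more careful tokenizer and a tuned threshold.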

  4. Computational methods to extract meaning from text and advance theories of human cognition.

    PubMed

    McNamara, Danielle S

    2011-01-01

    Over the past two decades, researchers have made great advances in the area of computational methods for extracting meaning from text. This research has to a large extent been spurred by the development of latent semantic analysis (LSA), a method for extracting and representing the meaning of words using statistical computations applied to large corpora of text. Since the advent of LSA, researchers have developed and tested alternative statistical methods designed to detect and analyze meaning in text corpora. This research exemplifies how statistical models of semantics play an important role in our understanding of cognition and contribute to the field of cognitive science. Importantly, these models afford large-scale representations of human knowledge and allow researchers to explore various questions regarding knowledge, discourse processing, text comprehension, and language. This topic includes the latest progress by the leading researchers in the endeavor to go beyond LSA. Copyright © 2010 Cognitive Science Society, Inc.

  5. A token centric part-of-speech tagger for biomedical text.

    PubMed

    Barrett, Neil; Weber-Jahnke, Jens

    2014-05-01

    Difficulties with part-of-speech (POS) tagging of biomedical text include accessing and annotating appropriate training corpora. These difficulties may result in POS taggers trained on corpora that differ from the tagger's target biomedical text (cross-domain tagging). In such cases, where training and target corpora differ, tagging accuracy decreases. This paper presents a POS tagger for cross-domain tagging called TcT. TcT estimates a tag's likelihood for a given token by combining token collocation probabilities and the token's tag probabilities calculated using a Naive Bayes classifier. We compared TcT to three POS taggers used in the biomedical domain (mxpost, Brill and TnT). We trained each tagger on a non-biomedical corpus and evaluated it on biomedical corpora. TcT was more accurate in cross-domain tagging than mxpost, Brill and TnT (respective averages 83.9, 81.0, 79.5 and 78.8). Our analysis of tagger performance suggests that lexical differences between corpora have more effect on tagging accuracy than originally considered by previous research work. Biomedical POS tagging algorithms may be modified to improve their cross-domain tagging accuracy without requiring extra training or large training data sets. Future work should reexamine POS tagging methods for biomedical text. This differs from the work to date that has focused on retraining existing POS taggers. Copyright © 2014 Elsevier B.V. All rights reserved.

  6. Building gold standard corpora for medical natural language processing tasks.

    PubMed

    Deleger, Louise; Li, Qi; Lingren, Todd; Kaiser, Megan; Molnar, Katalin; Stoutenborough, Laura; Kouril, Michal; Marsolo, Keith; Solti, Imre

    2012-01-01

    We present the construction of three annotated corpora to serve as gold standards for medical natural language processing (NLP) tasks. Clinical notes from the medical record, clinical trial announcements, and FDA drug labels are annotated. We report high inter-annotator agreements (overall F-measures between 0.8467 and 0.9176) for the annotation of Personal Health Information (PHI) elements for a de-identification task and of medications, diseases/disorders, and signs/symptoms for an information extraction (IE) task. The annotated corpora of clinical trials and FDA labels will be publicly released and, to facilitate translational NLP tasks that require cross-corpora interoperability (e.g. clinical trial eligibility screening), their annotation schemas are aligned with a large scale, NIH-funded clinical text annotation project.

  7. A linear-RBF multikernel SVM to classify big text corpora.

    PubMed

    Romero, R; Iglesias, E L; Borrajo, L

    2015-01-01

    Support vector machine (SVM) is a powerful technique for classification. However, SVM is not suitable for classification of large datasets or text corpora, because the training complexity of SVMs is highly dependent on the input size. Recent developments in the literature on the SVM and other kernel methods emphasize the need to consider multiple kernels or parameterizations of kernels because they provide greater flexibility. This paper shows a multikernel SVM to manage highly dimensional data, providing an automatic parameterization with low computational cost and improving results against SVMs parameterized under a brute-force search. The model consists in spreading the dataset into cohesive term slices (clusters) to construct a defined structure (multikernel). The new approach is tested on different text corpora. Experimental results show that the new classifier has good accuracy compared with the classic SVM, while the training is significantly faster than several other SVM classifiers.

  8. Taming Big Data: An Information Extraction Strategy for Large Clinical Text Corpora.

    PubMed

    Gundlapalli, Adi V; Divita, Guy; Carter, Marjorie E; Redd, Andrew; Samore, Matthew H; Gupta, Kalpana; Trautner, Barbara

    2015-01-01

    Concepts of interest for clinical and research purposes are not uniformly distributed in clinical text available in electronic medical records. The purpose of our study was to identify filtering techniques to select 'high yield' documents for increased efficacy and throughput. Using two large corpora of clinical text, we demonstrate the identification of 'high yield' document sets in two unrelated domains: homelessness and indwelling urinary catheters. For homelessness, the high yield set includes homeless program and social work notes. For urinary catheters, concepts were more prevalent in notes from hospitalized patients; nursing notes accounted for a majority of the high yield set. This filtering will enable customization and refining of information extraction pipelines to facilitate extraction of relevant concepts for clinical decision support and other uses.

  9. The Case of Perrin and Thomson: An Example of the Use of a Mini-Corpus

    ERIC Educational Resources Information Center

    Banks, David

    2005-01-01

    Although recent trends have been towards large corpora, there is a valid place for the study of small corpora. This article is an example of one such study using a corpus of late 19th century texts, consisting of 1783 words in French by Perrin, and 2824 words in English by Thomson. Perrin uses more first person pronouns in a wider range of…

  10. HierarchicalTopics: visually exploring large text collections using topic hierarchies.

    PubMed

    Dou, Wenwen; Yu, Li; Wang, Xiaoyu; Ma, Zhiqiang; Ribarsky, William

    2013-12-01

    Analyzing large textual collections has become increasingly challenging given the size of the data available and the rate that more data is being generated. Topic-based text summarization methods coupled with interactive visualizations have presented promising approaches to address the challenge of analyzing large text corpora. As the text corpora and vocabulary grow larger, more topics need to be generated in order to capture the meaningful latent themes and nuances in the corpora. However, it is difficult for most of current topic-based visualizations to represent large number of topics without being cluttered or illegible. To facilitate the representation and navigation of a large number of topics, we propose a visual analytics system--HierarchicalTopic (HT). HT integrates a computational algorithm, Topic Rose Tree, with an interactive visual interface. The Topic Rose Tree constructs a topic hierarchy based on a list of topics. The interactive visual interface is designed to present the topic content as well as temporal evolution of topics in a hierarchical fashion. User interactions are provided for users to make changes to the topic hierarchy based on their mental model of the topic space. To qualitatively evaluate HT, we present a case study that showcases how HierarchicalTopics aid expert users in making sense of a large number of topics and discovering interesting patterns of topic groups. We have also conducted a user study to quantitatively evaluate the effect of hierarchical topic structure. The study results reveal that the HT leads to faster identification of large number of relevant topics. We have also solicited user feedback during the experiments and incorporated some suggestions into the current version of HierarchicalTopics.

  11. Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text

    PubMed Central

    2013-01-01

    Background: Most of the institutional and research information in the biomedical domain is available in the form of English text. Even in countries where English is an official language, such as the United States, language can be a barrier for accessing biomedical information for non-native speakers. Recent progress in machine translation suggests that this technique could help make English texts accessible to speakers of other languages. However, the lack of adequate specialized corpora needed to train statistical models currently limits the quality of automatic translations in the biomedical domain. Results: We show how a large-sized parallel corpus can automatically be obtained for the biomedical domain, using the MEDLINE database. The corpus generated in this work comprises article titles obtained from MEDLINE and abstract text automatically retrieved from journal websites, which substantially extends the corpora used in previous work. After assessing the quality of the corpus for two language pairs (English/French and English/Spanish) we use the Moses package to train a statistical machine translation model that outperforms previous models for automatic translation of biomedical text. Conclusions: We have built translation data sets in the biomedical domain that can easily be extended to other languages available in MEDLINE. These sets can successfully be applied to train statistical machine translation models. While further progress should be made by incorporating out-of-domain corpora and domain-specific lexicons, we believe that this work improves the automatic translation of biomedical texts. PMID:23631733

  12. Visualizing the semantic content of large text databases using text maps

    NASA Technical Reports Server (NTRS)

    Combs, Nathan

    1993-01-01

    A methodology for generating text map representations of the semantic content of text databases is presented. Text maps provide a graphical metaphor for conceptualizing and visualizing the contents and data interrelationships of large text databases. Described are a set of experiments conducted against the TIPSTER corpora of Wall Street Journal articles. These experiments provide an introduction to current work in the representation and visualization of documents by way of their semantic content.

  13. Automatic extraction of property norm-like data from large text corpora.

    PubMed

    Kelly, Colin; Devereux, Barry; Korhonen, Anna

    2014-01-01

    Traditional methods for deriving property-based representations of concepts from text have focused on either extracting only a subset of possible relation types, such as hyponymy/hypernymy (e.g., car is-a vehicle) or meronymy/metonymy (e.g., car has wheels), or unspecified relations (e.g., car--petrol). We propose a system for the challenging task of automatic, large-scale acquisition of unconstrained, human-like property norms from large text corpora, and discuss the theoretical implications of such a system. We employ syntactic, semantic, and encyclopedic information to guide our extraction, yielding concept-relation-feature triples (e.g., car be fast, car require petrol, car cause pollution), which approximate property-based conceptual representations. Our novel method extracts candidate triples from parsed corpora (Wikipedia and the British National Corpus) using syntactically and grammatically motivated rules, then reweights triples with a linear combination of their frequency and four statistical metrics. We assess our system output in three ways: lexical comparison with norms derived from human-generated property norm data, direct evaluation by four human judges, and a semantic distance comparison with both WordNet similarity data and human-judged concept similarity ratings. Our system offers a viable and performant method of plausible triple extraction: Our lexical comparison shows comparable performance to the current state-of-the-art, while subsequent evaluations exhibit the human-like character of our generated properties.

  14. Metaphor Identification in Large Texts Corpora

    PubMed Central

    Neuman, Yair; Assaf, Dan; Cohen, Yohai; Last, Mark; Argamon, Shlomo; Howard, Newton; Frieder, Ophir

    2013-01-01

    Identifying metaphorical language-use (e.g., sweet child) is one of the challenges facing natural language processing. This paper describes three novel algorithms for automatic metaphor identification. The algorithms are variations of the same core algorithm. We evaluate the algorithms on two corpora of Reuters and the New York Times articles. The paper presents the most comprehensive study of metaphor identification in terms of scope of metaphorical phrases and annotated corpora size. Algorithms’ performance in identifying linguistic phrases as metaphorical or literal has been compared to human judgment. Overall, the algorithms outperform the state-of-the-art algorithm with 71% precision and 27% averaged improvement in prediction over the base-rate of metaphors in the corpus. PMID:23658625

  15. Encoding Standards for Linguistic Corpora.

    ERIC Educational Resources Information Center

    Ide, Nancy

    The demand for extensive reusability of large language text collections for natural languages processing research requires development of standardized encoding formats. Such formats must be capable of representing different kinds of information across the spectrum of text types and languages, capable of representing different levels of…

  16. Text mixing shapes the anatomy of rank-frequency distributions

    NASA Astrophysics Data System (ADS)

    Williams, Jake Ryland; Bagrow, James P.; Danforth, Christopher M.; Dodds, Peter Sheridan

    2015-05-01

    Natural languages are full of rules and exceptions. One of the most famous quantitative rules is Zipf's law, which states that the frequency of occurrence of a word is approximately inversely proportional to its rank. Though this "law" of ranks has been found to hold across disparate texts and forms of data, analyses of increasingly large corpora since the late 1990s have revealed the existence of two scaling regimes. These regimes have thus far been explained by a hypothesis suggesting a separability of languages into core and noncore lexica. Here we present and defend an alternative hypothesis that the two scaling regimes result from the act of aggregating texts. We observe that text mixing leads to an effective decay of word introduction, which we show provides accurate predictions of the location and severity of breaks in scaling. Upon examining large corpora from 10 languages in the Project Gutenberg eBooks collection, we find emphatic empirical support for the universality of our claim.
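
    As a small illustration of the rank-frequency relationship the abstract refers to, the sketch below counts token frequencies, ranks them, and estimates a single Zipf exponent by ordinary least squares on the log-log values. This is only an assumed, simplified procedure: the function names are hypothetical, and a single global fit does not capture the two scaling regimes that the article analyzes.

```python
from collections import Counter
import math


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent token."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def zipf_exponent(rank_freq):
    """Estimate the exponent a in f ~ r**(-a) by least squares on log-log data.

    Requires at least two ranks; Zipf's law corresponds to a close to 1.
    """
    xs = [math.log(r) for r, _ in rank_freq]
    ys = [math.log(f) for _, f in rank_freq]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx
```

    On real corpora one would typically fit separate exponents below and above a break point rather than one exponent for the whole distribution.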

  17. Text mixing shapes the anatomy of rank-frequency distributions.

    PubMed

    Williams, Jake Ryland; Bagrow, James P; Danforth, Christopher M; Dodds, Peter Sheridan

    2015-05-01

    Natural languages are full of rules and exceptions. One of the most famous quantitative rules is Zipf's law, which states that the frequency of occurrence of a word is approximately inversely proportional to its rank. Though this "law" of ranks has been found to hold across disparate texts and forms of data, analyses of increasingly large corpora since the late 1990s have revealed the existence of two scaling regimes. These regimes have thus far been explained by a hypothesis suggesting a separability of languages into core and noncore lexica. Here we present and defend an alternative hypothesis that the two scaling regimes result from the act of aggregating texts. We observe that text mixing leads to an effective decay of word introduction, which we show provides accurate predictions of the location and severity of breaks in scaling. Upon examining large corpora from 10 languages in the Project Gutenberg eBooks collection, we find emphatic empirical support for the universality of our claim.

  18. Block-suffix shifting: fast, simultaneous medical concept set identification in large medical record corpora.

    PubMed

    Liu, Ying; Lita, Lucian Vlad; Niculescu, Radu Stefan; Mitra, Prasenjit; Giles, C Lee

    2008-11-06

    Owing to new advances in computer hardware, large text databases have become more prevalent than ever. Automatically mining information from these databases proves to be a challenge due to slow pattern/string matching techniques. In this paper we present a new, fast multi-string pattern matching method based on the well known Aho-Corasick algorithm. Advantages of our algorithm include: the ability to exploit the natural structure of text, the ability to perform significant character shifting, avoiding backtracking jumps that are not useful, efficiency in terms of matching time and avoiding the typical "sub-string" false positive errors. Our algorithm is applicable to many fields with free text, such as the health care domain and the scientific document field. In this paper, we apply the BSS algorithm to health care data and mine hundreds of thousands of medical concepts from a large Electronic Medical Record (EMR) corpus simultaneously and efficiently. Experimental results show the superiority of our algorithm when compared with the top of the line multi-string matching algorithms.
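
    The abstract does not give the details of the BSS algorithm, so the sketch below shows only the classic multi-pattern automaton it builds on, plus a simple whole-word filter that illustrates what a "sub-string" false positive is. The function names and the alphanumeric word-boundary rule are assumptions made for this example; none of the block-suffix shifting optimizations are reproduced here.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure and output tables for a set of patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # patterns recognized on reaching each state
    for pat in patterns:
        if not pat:
            continue  # ignore empty patterns
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first traversal so shallower states are finished first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence, in one pass over text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits


def whole_word_hits(text, hits):
    """Drop hits embedded in longer words (assumed rule: alphanumeric neighbors)."""
    def boundary(i):
        return i < 0 or i >= len(text) or not text[i].isalnum()
    return [(s, p) for s, p in hits if boundary(s - 1) and boundary(s + len(p))]
```

    For instance, searching "ushers" for the patterns he, she, his and hers finds three raw matches, all of which the whole-word filter rejects because each sits inside the longer word.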

  19. Using machine learning to disentangle homonyms in large text corpora.

    PubMed

    Roll, Uri; Correia, Ricardo A; Berger-Tal, Oded

    2018-06-01

    Systematic reviews are an increasingly popular decision-making tool that provides an unbiased summary of evidence to support conservation action. These reviews bridge the gap between researchers and managers by presenting a comprehensive overview of all studies relating to a particular topic and identify specifically where and under which conditions an effect is present. However, several technical challenges can severely hinder the feasibility and applicability of systematic reviews, for example, homonyms (terms that share spelling but differ in meaning). Homonyms add noise to search results and cannot be easily identified or removed. We developed a semiautomated approach that can aid in the classification of homonyms among narratives. We used a combination of automated content analysis and artificial neural networks to quickly and accurately sift through large corpora of academic texts and classify them to distinct topics. As an example, we explored the use of the word reintroduction in academic texts. Reintroduction is used within the conservation context to indicate the release of organisms to their former native habitat; however, a Web of Science search for this word returned thousands of publications in which the term has other meanings and contexts. Using our method, we automatically classified a sample of 3000 of these publications with over 99% accuracy, relative to a manual classification. Our approach can be used easily with other homonyms and can greatly facilitate systematic reviews or similar work in which homonyms hinder the harnessing of large text corpora. Beyond homonyms we see great promise in combining automated content analysis and machine-learning methods to handle and screen big data for relevant information in conservation science. © 2017 Society for Conservation Biology.

  20. Automatic Dictionary Expansion Using Non-parallel Corpora

    NASA Astrophysics Data System (ADS)

    Rapp, Reinhard; Zock, Michael

    Automatically generating bilingual dictionaries from parallel, manually translated texts is a well established technique that works well in practice. However, parallel texts are a scarce resource. Therefore, it is desirable also to be able to generate dictionaries from pairs of comparable monolingual corpora. For most languages, such corpora are much easier to acquire, and often in considerably larger quantities. In this paper we present the implementation of an algorithm which exploits such corpora with good success. Based on the assumption that the co-occurrence patterns between different languages are related, it expands a small base lexicon. For improved performance, it also realizes a novel interlingua approach. That is, if corpora of more than two languages are available, the translations from one language to another can be determined not only directly, but also indirectly via a pivot language.

  1. Three-Dimensional Display Of Document Set

    DOEpatents

    Lantrip, David B.; Pennock, Kelly A.; Pottier, Marc C.; Schur, Anne; Thomas, James J.; Wise, James A.

    2003-06-24

    A method for spatializing text content for enhanced visual browsing and analysis. The invention is applied to large text document corpora such as digital libraries, regulations and procedures, archived reports, and the like. The text content from these sources may be transformed to a spatial representation that preserves informational characteristics from the documents. The three-dimensional representation may then be visually browsed and analyzed in ways that avoid language processing and that reduce the analysts' effort.

  2. Three-dimensional display of document set

    DOEpatents

    Lantrip, David B [Oxnard, CA]; Pennock, Kelly A [Richland, WA]; Pottier, Marc C [Richland, WA]; Schur, Anne [Richland, WA]; Thomas, James J [Richland, WA]; Wise, James A [Richland, WA]

    2006-09-26

    A method for spatializing text content for enhanced visual browsing and analysis. The invention is applied to large text document corpora such as digital libraries, regulations and procedures, archived reports, and the like. The text content from these sources may be transformed to a spatial representation that preserves informational characteristics from the documents. The three-dimensional representation may then be visually browsed and analyzed in ways that avoid language processing and that reduce the analysts' effort.

  3. Three-dimensional display of document set

    DOEpatents

    Lantrip, David B [Oxnard, CA]; Pennock, Kelly A [Richland, WA]; Pottier, Marc C [Richland, WA]; Schur, Anne [Richland, WA]; Thomas, James J [Richland, WA]; Wise, James A [Richland, WA]

    2001-10-02

    A method for spatializing text content for enhanced visual browsing and analysis. The invention is applied to large text document corpora such as digital libraries, regulations and procedures, archived reports, and the like. The text content from these sources may be transformed to a spatial representation that preserves informational characteristics from the documents. The three-dimensional representation may then be visually browsed and analyzed in ways that avoid language processing and that reduce the analysts' effort.

  4. Three-dimensional display of document set

    DOEpatents

    Lantrip, David B [Oxnard, CA]; Pennock, Kelly A [Richland, WA]; Pottier, Marc C [Richland, WA]; Schur, Anne [Richland, WA]; Thomas, James J [Richland, WA]; Wise, James A [Richland, WA]; York, Jeremy [Bothell, WA]

    2009-06-30

    A method for spatializing text content for enhanced visual browsing and analysis. The invention is applied to large text document corpora such as digital libraries, regulations and procedures, archived reports, and the like. The text content from these sources may be transformed to a spatial representation that preserves informational characteristics from the documents. The three-dimensional representation may then be visually browsed and analyzed in ways that avoid language processing and that reduce the analysts' effort.

  5. Use of English Corpora as a Primary Resource to Teach English to the Bengali Learners

    ERIC Educational Resources Information Center

    Dash, Niladri Sekhar

    2011-01-01

    In this paper we argue in favour of teaching English as a second language to the Bengali learners with direct utilisation of English corpora. The proposed strategy is meant to be assisted with computer and is based on data, information, and examples retrieved from the present-day English corpora developed with various text samples composed by…

  6. The interpretation of dream meaning: Resolving ambiguity using Latent Semantic Analysis in a small corpus of text.

    PubMed

    Altszyler, Edgar; Ribeiro, Sidarta; Sigman, Mariano; Fernández Slezak, Diego

    2017-11-01

    Computer-based dream content analysis relies on word frequencies within predefined categories in order to identify different elements in text. As a complementary approach, we explored the capabilities and limitations of word-embedding techniques to identify word usage patterns among dream reports. These tools allow us to quantify word associations in text and to identify the meaning of target words. Word-embeddings have been extensively studied in large datasets, but only a few studies analyze semantic representations in small corpora. To fill this gap, we compared Skip-gram and Latent Semantic Analysis (LSA) capabilities to extract semantic associations from dream reports. LSA showed better performance than Skip-gram in small size corpora in two tests. Furthermore, LSA captured relevant word associations in the dream collection, even in cases with low-frequency words or small numbers of dreams. Word associations in dream reports can thus be quantified by LSA, which opens new avenues for dream interpretation and decoding. Copyright © 2017 Elsevier Inc. All rights reserved.

  7. Proficiency Level--A Fuzzy Variable in Computer Learner Corpora

    ERIC Educational Resources Information Center

    Carlsen, Cecilie

    2012-01-01

    This article focuses on the proficiency level of texts in Computer Learner Corpora (CLCs). A claim is made that proficiency levels are often poorly defined in CLC design, and that the methods used for level assignment of corpus texts are not always adequate. Proficiency level can, therefore, best be described as a fuzzy variable in CLCs,…

  8. Long-range correlations and burstiness in written texts: Universal and language-specific aspects

    NASA Astrophysics Data System (ADS)

    Constantoudis, Vassilios; Kalimeri, Maria; Diakonos, Fotis; Karamanos, Konstantinos; Papadimitriou, Constantinos; Chatzigeorgiou, Manolis; Papageorgiou, Harris

    2016-08-01

    Recently, methods from the statistical physics of complex systems have been applied successfully to identify universal features in the long-range correlations (LRCs) of written texts. However, in real texts, these universal features are intermingled with language-specific influences. This paper aims at the characterization and further understanding of the interplay between universal and language-specific effects on the LRCs in texts. To this end, we apply the language-sensitive mapping of written texts to word-length series (wls) and analyse large parallel (of same content) corpora from 10 languages classified into four families (Romanic, Germanic, Greek and Uralic). The autocorrelation functions of the wls reveal tiny but persistent LRCs decaying at large scales following a power-law with a language-independent exponent ~0.60-0.65. The impact of language is displayed in the amplitude of correlations where a relative standard deviation >40% among the analyzed languages is observed. The classification into language families seems to play a significant role since the Finnish and Germanic languages exhibit more correlations than the Greek and Roman families. To reveal the origins of the LRCs, we focus on the long words and perform burst and correlation analysis in their positions along the corpora. We find that the universal features are linked more to the correlations of the inter-long word distances while the language-specific aspects are related more to their distributions.
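
    To make the mapping from text to a word-length series concrete, the following minimal sketch converts a text into the sequence of its word lengths and computes the normalized autocorrelation at a given lag. The function names, the whitespace tokenization (which counts attached punctuation as part of a word) and the particular normalization are assumptions for illustration, not the authors' exact procedure.

```python
def word_length_series(text):
    """Map a text to the sequence of its word lengths (whitespace tokenization)."""
    return [len(word) for word in text.split()]


def autocorrelation(series, lag):
    """Normalized autocorrelation of `series` at the given non-negative lag."""
    n = len(series)
    if lag < 0 or lag >= n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series) / n
    if variance == 0:
        return 0.0  # a constant series carries no correlation information
    covariance = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    ) / (n - lag)
    return covariance / variance
```

    Estimating a power-law decay exponent would then amount to evaluating this function over a range of lags on a long text and fitting the result on log-log axes.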

  9. Portable automatic text classification for adverse drug reaction detection via multi-corpus training.

    PubMed

    Sarker, Abeed; Gonzalez, Graciela

    2015-02-01

    Automatic detection of adverse drug reaction (ADR) mentions from text has recently received significant interest in pharmacovigilance research. Current research focuses on various sources of text-based information, including social media, where enormous amounts of user-posted data are available, which have the potential for use in pharmacovigilance if collected and filtered accurately. The aims of this study are: (i) to explore natural language processing (NLP) approaches for generating useful features from text, and utilizing them in optimized machine learning algorithms for automatic classification of ADR assertive text segments; (ii) to present two data sets that we prepared for the task of ADR detection from user posted internet data; and (iii) to investigate if combining training data from distinct corpora can improve automatic classification accuracies. One of our three data sets contains annotated sentences from clinical reports, and the two other data sets, built in-house, consist of annotated posts from social media. Our text classification approach relies on generating a large set of features, representing semantic properties (e.g., sentiment, polarity, and topic), from short text nuggets. Importantly, using our expanded feature sets, we combine training data from different corpora in attempts to boost classification accuracies. Our feature-rich classification approach performs significantly better than previously published approaches with ADR class F-scores of 0.812 (previously reported best: 0.770), 0.538 and 0.678 for the three data sets. Combining training data from multiple compatible corpora further improves the ADR F-scores for the in-house data sets to 0.597 (improvement of 5.9 units) and 0.704 (improvement of 2.6 units) respectively. Our research results indicate that using advanced NLP techniques for generating information rich features from text can significantly improve classification accuracies over existing benchmarks. Our experiments illustrate the benefits of incorporating various semantic features such as topics, concepts, sentiments, and polarities. Finally, we show that integration of information from compatible corpora can significantly improve classification performance. This form of multi-corpus training may be particularly useful in cases where data sets are heavily imbalanced (e.g., social media data), and may reduce the time and costs associated with the annotation of data in the future. Copyright © 2014 The Authors. Published by Elsevier Inc. All rights reserved.
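
    The "units" quoted for the F-score gains read as percentage points: 0.538 to 0.597 is 5.9 points and 0.678 to 0.704 is 2.6 points. The short sketch below checks that arithmetic and shows the usual balanced F-score formula; the helper names are hypothetical, and the abstract does not state which F-score variant was used, so the F1 definition here is an assumption.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def improvement_in_points(before, after):
    """Difference between two scores on a 0-1 scale, in percentage points."""
    return round((after - before) * 100, 1)


# F-scores for the two in-house data sets, as quoted in the abstract.
print(improvement_in_points(0.538, 0.597))
print(improvement_in_points(0.678, 0.704))
```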

  10. Portable Automatic Text Classification for Adverse Drug Reaction Detection via Multi-corpus Training

    PubMed Central

    Gonzalez, Graciela

    2014-01-01

    Objective: Automatic detection of Adverse Drug Reaction (ADR) mentions from text has recently received significant interest in pharmacovigilance research. Current research focuses on various sources of text-based information, including social media, where enormous amounts of user-posted data are available, which have the potential for use in pharmacovigilance if collected and filtered accurately. The aims of this study are: (i) to explore natural language processing approaches for generating useful features from text, and utilizing them in optimized machine learning algorithms for automatic classification of ADR assertive text segments; (ii) to present two data sets that we prepared for the task of ADR detection from user posted internet data; and (iii) to investigate if combining training data from distinct corpora can improve automatic classification accuracies. Methods: One of our three data sets contains annotated sentences from clinical reports, and the two other data sets, built in-house, consist of annotated posts from social media. Our text classification approach relies on generating a large set of features, representing semantic properties (e.g., sentiment, polarity, and topic), from short text nuggets. Importantly, using our expanded feature sets, we combine training data from different corpora in attempts to boost classification accuracies. Results: Our feature-rich classification approach performs significantly better than previously published approaches with ADR class F-scores of 0.812 (previously reported best: 0.770), 0.538 and 0.678 for the three data sets. Combining training data from multiple compatible corpora further improves the ADR F-scores for the in-house data sets to 0.597 (improvement of 5.9 units) and 0.704 (improvement of 2.6 units) respectively. Conclusions: Our research results indicate that using advanced NLP techniques for generating information rich features from text can significantly improve classification accuracies over existing benchmarks. Our experiments illustrate the benefits of incorporating various semantic features such as topics, concepts, sentiments, and polarities. Finally, we show that integration of information from compatible corpora can significantly improve classification performance. This form of multi-corpus training may be particularly useful in cases where data sets are heavily imbalanced (e.g., social media data), and may reduce the time and costs associated with the annotation of data in the future. PMID:25451103

  11. Using the Longman Mini-concordancer on Tagged and Parsed Corpora, with Special Reference to Their Use as an Aid to Grammar Learning.

    ERIC Educational Resources Information Center

    Qiao, Hong Liang; Sussex, Roland

    1996-01-01

    Presents methods for using the Longman Mini-Concordancer on tagged and parsed corpora rather than plain text corpora. The article discusses several aspects with models to be applied in the classroom as an aid to grammar learning. This paper suggests exercises suitable for teaching English to both native and nonnative speakers. (13 references)…

  12. Human language reveals a universal positivity bias

    PubMed Central

    Dodds, Peter Sheridan; Clark, Eric M.; Desu, Suma; Frank, Morgan R.; Reagan, Andrew J.; Williams, Jake Ryland; Mitchell, Lewis; Harris, Kameron Decker; Kloumann, Isabel M.; Bagrow, James P.; Megerdoomian, Karine; McMahon, Matthew T.; Tivnan, Brian F.; Danforth, Christopher M.

    2015-01-01

    Using human evaluation of 100,000 words spread across 24 corpora in 10 languages diverse in origin and culture, we present evidence of a deep imprint of human sociality in language, observing that (i) the words of natural human language possess a universal positivity bias, (ii) the estimated emotional content of words is consistent between languages under translation, and (iii) this positivity bias is strongly independent of frequency of word use. Alongside these general regularities, we describe interlanguage variations in the emotional spectrum of languages that allow us to rank corpora. We also show how our word evaluations can be used to construct physical-like instruments for both real-time and offline measurement of the emotional content of large-scale texts. PMID:25675475

  13. How Can We Use Corpus Wordlists for Language Learning? Interfaces between Computer Corpora and Expert Intervention

    ERIC Educational Resources Information Center

    Chen, Yu-Hua; Bruncak, Radovan

    2015-01-01

    With the advances in technology, wordlists retrieved from computer corpora have become increasingly popular in recent years. The lexical items in those wordlists are usually selected, according to a set of robust frequency and dispersion criteria, from large corpora of authentic and naturally occurring language. Corpus wordlists are of great value…

  14. Benchmarking infrastructure for mutation text mining

    PubMed Central

    2014-01-01

    Background Experimental research on the automatic extraction of information about mutations from texts is greatly hindered by the lack of consensus evaluation infrastructure for the testing and benchmarking of mutation text mining systems. Results We propose a community-oriented annotation and benchmarking infrastructure to support development, testing, benchmarking, and comparison of mutation text mining systems. The design is based on semantic standards, where RDF is used to represent annotations, an OWL ontology provides an extensible schema for the data and SPARQL is used to compute various performance metrics, so that in many cases no programming is needed to analyze results from a text mining system. While large benchmark corpora for biological entity and relation extraction are focused mostly on genes, proteins, diseases, and species, our benchmarking infrastructure fills the gap for mutation information. The core infrastructure comprises (1) an ontology for modelling annotations, (2) SPARQL queries for computing performance metrics, and (3) a sizeable collection of manually curated documents, that can support mutation grounding and mutation impact extraction experiments. Conclusion We have developed the principal infrastructure for the benchmarking of mutation text mining tasks. The use of RDF and OWL as the representation for corpora ensures extensibility. The infrastructure is suitable for out-of-the-box use in several important scenarios and is ready, in its current state, for initial community adoption. PMID:24568600

  15. Benchmarking infrastructure for mutation text mining.

    PubMed

    Klein, Artjom; Riazanov, Alexandre; Hindle, Matthew M; Baker, Christopher Jo

    2014-02-25

    Experimental research on the automatic extraction of information about mutations from texts is greatly hindered by the lack of consensus evaluation infrastructure for the testing and benchmarking of mutation text mining systems. We propose a community-oriented annotation and benchmarking infrastructure to support development, testing, benchmarking, and comparison of mutation text mining systems. The design is based on semantic standards, where RDF is used to represent annotations, an OWL ontology provides an extensible schema for the data and SPARQL is used to compute various performance metrics, so that in many cases no programming is needed to analyze results from a text mining system. While large benchmark corpora for biological entity and relation extraction are focused mostly on genes, proteins, diseases, and species, our benchmarking infrastructure fills the gap for mutation information. The core infrastructure comprises (1) an ontology for modelling annotations, (2) SPARQL queries for computing performance metrics, and (3) a sizeable collection of manually curated documents, that can support mutation grounding and mutation impact extraction experiments. We have developed the principal infrastructure for the benchmarking of mutation text mining tasks. The use of RDF and OWL as the representation for corpora ensures extensibility. The infrastructure is suitable for out-of-the-box use in several important scenarios and is ready, in its current state, for initial community adoption.

  16. How Hierarchical Topics Evolve in Large Text Corpora.

    PubMed

    Cui, Weiwei; Liu, Shixia; Wu, Zhuofeng; Wei, Hao

    2014-12-01

    Using a sequence of topic trees to organize documents is a popular way to represent hierarchical and evolving topics in text corpora. However, following evolving topics in the context of topic trees remains difficult for users. To address this issue, we present an interactive visual text analysis approach to allow users to progressively explore and analyze the complex evolutionary patterns of hierarchical topics. The key idea behind our approach is to exploit a tree cut to approximate each tree and allow users to interactively modify the tree cuts based on their interests. In particular, we propose an incremental evolutionary tree cut algorithm with the goal of balancing 1) the fitness of each tree cut and the smoothness between adjacent tree cuts; 2) the historical and new information related to user interests. A time-based visualization is designed to illustrate the evolving topics over time. To preserve the mental map, we develop a stable layout algorithm. As a result, our approach can quickly guide users to progressively gain profound insights into evolving hierarchical topics. We evaluate the effectiveness of the proposed method on Amazon's Mechanical Turk and real-world news data. The results show that users are able to successfully analyze evolving topics in text data.

  17. Sentence alignment using feed forward neural network.

    PubMed

    Fattah, Mohamed Abdel; Ren, Fuji; Kuroiwa, Shingo

    2006-12-01

    Parallel corpora have become an essential resource for work in multilingual natural language processing. However, sentence-aligned parallel corpora are more efficient than non-aligned parallel corpora for cross-language information retrieval and machine translation applications. In this paper, we present a new approach to align sentences in bilingual parallel corpora based on a feed-forward neural network classifier. A feature parameter vector is extracted from the text pair under consideration. This vector contains text features such as length, punctuate score, and cognate score values. A set of manually prepared training data has been assigned to train the feed-forward neural network. Another set of data was used for testing. Using this new approach, we could achieve an error reduction of 60% over the length-based approach when applied to English-Arabic parallel documents. Moreover, this new approach is valid for any language pair and is quite flexible, since the feature parameter vector may contain more, fewer, or different features than those we used in our system, such as a lexical match feature.
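
    A feature vector of the kind described (length, punctuation, and cognate features for a candidate sentence pair) might be built as in the sketch below. The abstract does not define the three scores, so every formula here is an assumption chosen for illustration: a character-length ratio, a multiset overlap of ASCII punctuation marks, and a Jaccard overlap of digit-bearing tokens standing in for cognates. Script-specific punctuation such as the Arabic comma is not covered by `string.punctuation`.

```python
import string
from collections import Counter


def length_ratio(src, tgt):
    """Shorter length divided by longer length, in characters."""
    longer = max(len(src), len(tgt))
    return min(len(src), len(tgt)) / longer if longer else 1.0


def punctuation_score(src, tgt):
    """Share of punctuation marks that the two sentences have in common."""
    ps = Counter(c for c in src if c in string.punctuation)
    pt = Counter(c for c in tgt if c in string.punctuation)
    longer = max(sum(ps.values()), sum(pt.values()))
    if longer == 0:
        return 1.0
    return sum((ps & pt).values()) / longer


def cognate_score(src, tgt):
    """Jaccard overlap of digit-bearing tokens (a stand-in for cognates)."""
    def numeric_tokens(sentence):
        return {
            tok.strip(string.punctuation)
            for tok in sentence.split()
            if any(ch.isdigit() for ch in tok)
        }
    a, b = numeric_tokens(src), numeric_tokens(tgt)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def feature_vector(src, tgt):
    """Feature vector for one candidate sentence pair."""
    return [
        length_ratio(src, tgt),
        punctuation_score(src, tgt),
        cognate_score(src, tgt),
    ]
```

    Vectors like these, labeled as aligned or not aligned, would then serve as training input for a classifier; the network itself is omitted here.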

  18. Lexical bundles in an advanced INTOCSU writing class and engineering texts: A functional analysis

    NASA Astrophysics Data System (ADS)

    Alquraishi, Mohammed Abdulrahman

    The purpose of this study is to investigate the functions of lexical bundles in two corpora: a corpus of engineering academic texts and a corpus of IEP advanced writing class texts. This study is concerned with the nature of formulaic language in Pathway IEPs and engineering texts, and whether those types of texts show similar or distinctive formulaic functions. Moreover, the study looked into lexical bundles found in an engineering 1.26 million-word corpus and an ESL 65000-word corpus using a concordancing program. The study then analyzed the functions of those lexical bundles and compared them statistically using chi-square tests. Additionally, the results of this investigation showed 236 unique frequent lexical bundles in the engineering corpus and 37 bundles in the pathway corpus. Also, the study identified several differences between the density and functions of lexical bundles in the two corpora. These differences were evident in the distribution of functions of lexical bundles and the minimal overlap of lexical bundles found in the two corpora. The results of this study call for more attention to formulaic language at ESP and EAP programs.
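
    Lexical bundles are commonly operationalized as word sequences that recur above a frequency threshold. A minimal sketch of that extraction step follows; the bundle length of four words, the threshold, and the simple lowercase tokenizer are assumptions for the example and are not taken from the study, which used a concordancing program.

```python
import re
from collections import Counter


def lexical_bundles(text, n=4, min_count=2):
    """Return n-word sequences occurring at least min_count times.

    Assumption: tokens are lowercase runs of letters and apostrophes, and
    sequences may span sentence boundaries.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    grams = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return {" ".join(g): c for g, c in grams.items() if c >= min_count}
```

    A study comparing corpora of very different sizes would normally normalize these counts (for example per million words) before comparing them.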

  19. Using Teacher-Developed Corpora in the CBI Classroom

    ERIC Educational Resources Information Center

    Salsbury, Tom; Crummer, Crista

    2008-01-01

    This article argues for the use of teacher-generated corpora in content-based courses. Using a content course for engineering and architecture students as an example, the article explains how a corpus consisting of texts from textbooks and journal articles helped students learn grammar, vocabulary, and writing. The article explains how the corpus…

  20. A system for de-identifying medical message board text.

    PubMed

    Benton, Adrian; Hill, Shawndra; Ungar, Lyle; Chung, Annie; Leonard, Charles; Freeman, Cristin; Holmes, John H

    2011-06-09

    There are millions of public posts to medical message boards by users seeking support and information on a wide range of medical conditions. It has been shown that these posts can be used to gain a greater understanding of patients' experiences and concerns. As investigators continue to explore large corpora of medical discussion board data for research purposes, protecting the privacy of the members of these online communities becomes an important challenge that needs to be met. Extant entity recognition methods used for more structured text are not sufficient because message posts present additional challenges: the posts contain many typographical errors, larger variety of possible names, terms and abbreviations specific to Internet posts or a particular message board, and mentions of the authors' personal lives. The main contribution of this paper is a system to de-identify the authors of message board posts automatically, taking into account the aforementioned challenges. We demonstrate our system on two different message board corpora, one on breast cancer and another on arthritis. We show that our approach significantly outperforms other publicly available named entity recognition and de-identification systems, which have been tuned for more structured text like operative reports, pathology reports, discharge summaries, or newswire.

  1. nala: text mining natural language mutation mentions

    PubMed Central

    Cejuela, Juan Miguel; Bojchevski, Aleksandar; Uhlig, Carsten; Bekmukhametov, Rustem; Kumar Karn, Sanjeev; Mahmuti, Shpend; Baghudana, Ashish; Dubey, Ankit; Satagopam, Venkata P.; Rost, Burkhard

    2017-01-01

    Abstract Motivation: The extraction of sequence variants from the literature remains an important task. Existing methods primarily target standard (ST) mutation mentions (e.g. ‘E6V’), leaving relevant mentions natural language (NL) largely untapped (e.g. ‘glutamic acid was substituted by valine at residue 6’). Results: We introduced three new corpora suggesting named-entity recognition (NER) to be more challenging than anticipated: 28–77% of all articles contained mentions only available in NL. Our new method nala captured NL and ST by combining conditional random fields with word embedding features learned unsupervised from the entire PubMed. In our hands, nala substantially outperformed the state-of-the-art. For instance, we compared all unique mentions in new discoveries correctly detected by any of three methods (SETH, tmVar, or nala). Neither SETH nor tmVar discovered anything missed by nala, while nala uniquely tagged 33% mentions. For NL mentions the corresponding value shot up to 100% nala-only. Availability and Implementation: Source code, API and corpora freely available at: http://tagtog.net/-corpora/IDP4+. Contact: nala@rostlab.org Supplementary information: Supplementary data are available at Bioinformatics online. PMID:28200120

  2. The ACODEA Framework: Developing Segmentation and Classification Schemes for Fully Automatic Analysis of Online Discussions

    ERIC Educational Resources Information Center

    Mu, Jin; Stegmann, Karsten; Mayfield, Elijah; Rose, Carolyn; Fischer, Frank

    2012-01-01

    Research related to online discussions frequently faces the problem of analyzing huge corpora. Natural Language Processing (NLP) technologies may allow automating this analysis. However, the state-of-the-art in machine learning and text mining approaches yields models that do not transfer well between corpora related to different topics. Also,…

  3. Task Effects on Linguistic Complexity and Accuracy: A Large-Scale Learner Corpus Analysis Employing Natural Language Processing Techniques

    ERIC Educational Resources Information Center

    Alexopoulou, Theodora; Michel, Marije; Murakami, Akira; Meurers, Detmar

    2017-01-01

    Large-scale learner corpora collected from online language learning platforms, such as the EF-Cambridge Open Language Database (EFCAMDAT), provide opportunities to analyze learner data at an unprecedented scale. However, interpreting the learner language in such corpora requires a precise understanding of tasks: How does the prompt and input of a…

  4. Megastudies, crowdsourcing, and large datasets in psycholinguistics: An overview of recent developments.

    PubMed

    Keuleers, Emmanuel; Balota, David A

    2015-01-01

    This paper introduces and summarizes the special issue on megastudies, crowdsourcing, and large datasets in psycholinguistics. We provide a brief historical overview and show how the papers in this issue have extended the field by compiling new databases and making important theoretical contributions. In addition, we discuss several studies that use text corpora to build distributional semantic models to tackle various interesting problems in psycholinguistics. Finally, as is the case across the papers, we highlight some methodological issues that are brought forth via the analyses of such datasets.

  5. Mining Consumer Health Vocabulary from Community-Generated Text

    PubMed Central

    Vydiswaran, V.G. Vinod; Mei, Qiaozhu; Hanauer, David A.; Zheng, Kai

    2014-01-01

    Community-generated text corpora can be a valuable resource to extract consumer health vocabulary (CHV) and link them to professional terminologies and alternative variants. In this research, we propose a pattern-based text-mining approach to identify pairs of CHV and professional terms from Wikipedia, a large text corpus created and maintained by the community. A novel measure, leveraging the ratio of frequency of occurrence, was used to differentiate consumer terms from professional terms. We empirically evaluated the applicability of this approach using a large data sample consisting of MedLine abstracts and all posts from an online health forum, MedHelp. The results show that the proposed approach is able to identify synonymous pairs and label the terms as either consumer or professional term with high accuracy. We conclude that the proposed approach provides great potential to produce a high quality CHV to improve the performance of computational applications in processing consumer-generated health text. PMID:25954426
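
    One way to read the frequency-ratio idea is to compare how often a term occurs in consumer-written text with how often it occurs in professional text. The sketch below is an assumption-laden illustration rather than the measure defined in the paper: the function names, the add-one smoothing, and the toy counts are all hypothetical.

```python
def frequency_ratio(term, consumer_counts, professional_counts,
                    consumer_total, professional_total):
    """Smoothed relative frequency in consumer text over professional text."""
    consumer = (consumer_counts.get(term, 0) + 1) / (consumer_total + 1)
    professional = (
        (professional_counts.get(term, 0) + 1) / (professional_total + 1)
    )
    return consumer / professional


def label_pair(term_a, term_b, consumer_counts, professional_counts,
               consumer_total, professional_total):
    """Label the term with the higher ratio 'consumer', the other 'professional'."""
    ratio_a = frequency_ratio(term_a, consumer_counts, professional_counts,
                              consumer_total, professional_total)
    ratio_b = frequency_ratio(term_b, consumer_counts, professional_counts,
                              consumer_total, professional_total)
    if ratio_a >= ratio_b:
        return {term_a: "consumer", term_b: "professional"}
    return {term_a: "professional", term_b: "consumer"}


# Toy counts, invented for illustration only.
forum = {"heart attack": 90, "myocardial infarction": 10}
abstracts = {"heart attack": 20, "myocardial infarction": 80}
print(label_pair("heart attack", "myocardial infarction",
                 forum, abstracts, 1000, 1000))
```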

  6. Incorporating linguistic knowledge for learning distributed word representations.

    PubMed

    Wang, Yan; Liu, Zhiyuan; Sun, Maosong

    2015-01-01

    Combined with neural language models, distributed word representations achieve significant advantages in computational linguistics and text mining. Most existing models estimate distributed word vectors from large-scale data in an unsupervised fashion, which, however, do not take rich linguistic knowledge into consideration. Linguistic knowledge can be represented as either link-based knowledge or preference-based knowledge, and we propose knowledge regularized word representation models (KRWR) to incorporate these prior knowledge for learning distributed word representations. Experiment results demonstrate that our estimated word representation achieves better performance in task of semantic relatedness ranking. This indicates that our methods can efficiently encode both prior knowledge from knowledge bases and statistical knowledge from large-scale text corpora into a unified word representation model, which will benefit many tasks in text mining.

  7. Incorporating Linguistic Knowledge for Learning Distributed Word Representations

    PubMed Central

    Wang, Yan; Liu, Zhiyuan; Sun, Maosong

    2015-01-01

    Combined with neural language models, distributed word representations achieve significant advantages in computational linguistics and text mining. Most existing models estimate distributed word vectors from large-scale data in an unsupervised fashion, which, however, do not take rich linguistic knowledge into consideration. Linguistic knowledge can be represented as either link-based knowledge or preference-based knowledge, and we propose knowledge regularized word representation models (KRWR) to incorporate these prior knowledge for learning distributed word representations. Experiment results demonstrate that our estimated word representation achieves better performance in task of semantic relatedness ranking. This indicates that our methods can efficiently encode both prior knowledge from knowledge bases and statistical knowledge from large-scale text corpora into a unified word representation model, which will benefit many tasks in text mining. PMID:25874581

  8. Primary diffuse large B-cell lymphoma of the corpora cavernosa presented as a perineal mass

    PubMed Central

    Carlos, González-Satué; Ivanna, Valverde Vilamala; Gustavo, Tapia Melendo; Joan, Areal Calama; Javier, Sanchez Macias; Luis, Ibarz Servio

    2012-01-01

    Primary male genital lymphomas may appear rarely in testis, and exceptionally in the penis and prostate, but there is not previous evidence of a lymphoma arising from the corpora cavernosa. We report the first case in the literature of a primary diffuse cell B lymphoma of the corpora cavernosa presented with low urinary tract symptoms, perineal pain and palpable mass. Diagnosis was based on trucut biopsy, histopathological studies and computed tomographic images. PMID:22919138

  9. Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora.

    PubMed

    Islamaj Doğan, Rezarta; Comeau, Donald C; Yeganova, Lana; Wilbur, W John

    2014-01-01

    BioC is a recently created XML format to share text data and annotations, and an accompanying input/output library to promote interoperability of data and tools for natural language processing of biomedical text. This article reports the use of BioC to address a common challenge in processing biomedical text information: that of frequent entity name abbreviation. We selected three different abbreviation definition identification modules, and used the publicly available BioC code to convert these independent modules into BioC-compatible components that interact seamlessly with BioC-formatted data, and other BioC-compatible modules. In addition, we consider four manually annotated corpora of abbreviations in biomedical text: the Ab3P corpus of 1250 PubMed abstracts, the BIOADI corpus of 1201 PubMed abstracts, the old MEDSTRACT corpus of 199 PubMed® citations and the Schwartz and Hearst corpus of 1000 PubMed abstracts. Annotations in these corpora have been re-evaluated by four annotators and their consistency and quality levels have been improved. We converted them to BioC-format and described the representation of the annotations. These corpora are used to measure the three abbreviation-finding algorithms and the results are given. The BioC-compatible modules, when compared with their original form, have no difference in their efficiency, running time or any other comparable aspects. They can be conveniently used as a common pre-processing step for larger multi-layered text-mining endeavors. Database URL: Code and data are available for download at the BioC site: http://bioc.sourceforge.net. Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US.

  10. Use of Co-occurrences for Temporal Expressions Annotation

    NASA Astrophysics Data System (ADS)

    Craveiro, Olga; Macedo, Joaquim; Madeira, Henrique

    The annotation or extraction of temporal information from text documents is becoming increasingly important in many natural language processing applications such as text summarization, information retrieval, question answering, etc. This paper presents an original method for easy recognition of temporal expressions in text documents. The method creates semantically classified temporal patterns, using word co-occurrences obtained from training corpora and a pre-defined seed keywords set, derived from the used language temporal references. Participation in a Portuguese named entity evaluation contest showed promising effectiveness and efficiency results. This approach can be adapted to recognize other types of expressions or languages, within other contexts, by defining the suitable word sets and training corpora.

  11. Corpora of Vietnamese texts: lexical effects of intended audience and publication place.

    PubMed

    Pham, Giang; Kohnert, Kathryn; Carney, Edward

    2008-02-01

    This article has two primary aims. The first is to introduce a new Vietnamese text-based corpus. The Corpora of Vietnamese Texts (CVT; Tang, 2006a) consists of approximately 1 million words drawn from newspapers and children's literature, and is available online at www.vnspeechtherapy.com/vi/CVT. The second aim is to investigate potential differences in lexical frequency and distributional characteristics in the CVT on the basis of place of publication (Vietnam or Western countries) and intended audience: adult-directed texts (newspapers) or child-directed texts (children's literature). We found clear differences between adult- and child-directed texts, particularly in the distributional frequencies of pronouns or kinship terms, which were more frequent in children's literature. Within child- and adult-directed texts, lexical characteristics did not differ on the basis of place of publication. Implications of these findings for future research are discussed.

  12. Automatic Extraction of Destinations, Origins and Route Parts from Human Generated Route Directions

    NASA Astrophysics Data System (ADS)

    Zhang, Xiao; Mitra, Prasenjit; Klippel, Alexander; Maceachren, Alan

    Researchers from the cognitive and spatial sciences are studying text descriptions of movement patterns in order to examine how humans communicate and understand spatial information. In particular, route directions offer a rich source of information on how cognitive systems conceptualize movement patterns by segmenting them into meaningful parts. Route directions are composed using a plethora of cognitive spatial organization principles: changing levels of granularity, hierarchical organization, incorporation of cognitively and perceptually salient elements, and so forth. Identifying such information in text documents automatically is crucial for enabling machine-understanding of human spatial language. The benefits are: a) creating opportunities for large-scale studies of human linguistic behavior; b) extracting and georeferencing salient entities (landmarks) that are used by human route direction providers; c) developing methods to translate route directions to sketches and maps; and d) enabling queries on large corpora of crawled/analyzed movement data. In this paper, we introduce our approach and implementations that bring us closer to the goal of automatically processing linguistic route directions. We report on research directed at one part of the larger problem, that is, extracting the three most critical parts of route directions and movement patterns in general: origin, destination, and route parts. We use machine-learning based algorithms to extract these parts of routes, including, for example, destination names and types. We prove the effectiveness of our approach in several experiments using hand-tagged corpora.

  13. Combining Language Corpora with Experimental and Computational Approaches for Language Acquisition Research

    ERIC Educational Resources Information Center

    Monaghan, Padraic; Rowland, Caroline F.

    2017-01-01

    Historically, first language acquisition research was a painstaking process of observation, requiring the laborious hand coding of children's linguistic productions, followed by the generation of abstract theoretical proposals for how the developmental process unfolds. Recently, the ability to collect large-scale corpora of children's language…

  14. Gonadotropin binding sites in human ovarian follicles and corpora lutea during the menstrual cycle

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shima, K.; Kitayama, S.; Nakano, R.

    Gonadotropin binding sites were localized by autoradiography after incubation of human ovarian sections with ¹²⁵I-labeled gonadotropins. The binding sites for ¹²⁵I-labeled human follicle-stimulating hormone (¹²⁵I-hFSH) were identified in the granulosa cells and in the newly formed corpora lutea. The ¹²⁵I-labeled human luteinizing hormone (¹²⁵I-hLH) binding to the thecal cells increased during follicular maturation, and a dramatic increase was preferentially observed in the granulosa cells of the large preovulatory follicle. In the corpora lutea, the binding of ¹²⁵I-hLH increased from the early luteal phase and decreased toward the late luteal phase. The changes in 3 beta-hydroxysteroid dehydrogenase activity in the corpora lutea corresponded to the ¹²⁵I-hLH binding. Thus, the changes in gonadotropin binding sites in the follicles and corpora lutea during the menstrual cycle may help in some important way to regulate human ovarian function.

  15. Transfer learning for biomedical named entity recognition with neural networks.

    PubMed

    Giorgi, John M; Bader, Gary D

    2018-06-01

    The explosive increase of biomedical literature has made information extraction an increasingly important tool for biomedical research. A fundamental task is the recognition of biomedical named entities in text (BNER) such as genes/proteins, diseases, and species. Recently, a domain-independent method based on deep learning and statistical word embeddings, called long short-term memory network-conditional random field (LSTM-CRF), has been shown to outperform state-of-the-art entity-specific BNER tools. However, this method is dependent on gold-standard corpora (GSCs) consisting of hand-labeled entities, which tend to be small but highly reliable. An alternative to GSCs are silver-standard corpora (SSCs), which are generated by harmonizing the annotations made by several automatic annotation systems. SSCs typically contain more noise than GSCs but have the advantage of containing many more training examples. Ideally, these corpora could be combined to achieve the benefits of both, which is an opportunity for transfer learning. In this work, we analyze to what extent transfer learning improves upon state-of-the-art results for BNER. We demonstrate that transferring a deep neural network (DNN) trained on a large, noisy SSC to a smaller, but more reliable GSC significantly improves upon state-of-the-art results for BNER. Compared to a state-of-the-art baseline evaluated on 23 GSCs covering four different entity classes, transfer learning results in an average reduction in error of approximately 11%. We found transfer learning to be especially beneficial for target data sets with a small number of labels (approximately 6000 or less). Source code for the LSTM-CRF is available at https://github.com/Franck-Dernoncourt/NeuroNER/ and links to the corpora are available at https://github.com/BaderLab/Transfer-Learning-BNER-Bioinformatics-2018/. Contact: john.giorgi@utoronto.ca. Supplementary data are available at Bioinformatics online.

  16. Inducing Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora

    DTIC Science & Technology

    2001-01-01

    monolingual dictionary-derived list of canonical roots would resolve ambiguity regarding which is the appropriate target. Many of the errors are...system and set of algorithms for automatically inducing stand-alone monolingual part-of-speech taggers, base noun-phrase bracketers, named-entity...corpora has tended to focus on their use in translation model training for MT rather than on monolingual applications. One exception is bilingual parsing

  17. Synonym extraction and abbreviation expansion with ensembles of semantic spaces.

    PubMed

    Henriksson, Aron; Moen, Hans; Skeppstedt, Maria; Daudaravičius, Vidas; Duneld, Martin

    2014-02-05

    Terminologies that account for variation in language use by linking synonyms and abbreviations to their corresponding concept are important enablers of high-quality information extraction from medical texts. Due to the use of specialized sub-languages in the medical domain, manual construction of semantic resources that accurately reflect language use is both costly and challenging, often resulting in low coverage. Although models of distributional semantics applied to large corpora provide a potential means of supporting development of such resources, their ability to isolate synonymy from other semantic relations is limited. Their application in the clinical domain has also only recently begun to be explored. Combining distributional models and applying them to different types of corpora may lead to enhanced performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs. A combination of two distributional models - Random Indexing and Random Permutation - employed in conjunction with a single corpus outperforms using either of the models in isolation. Furthermore, combining semantic spaces induced from different types of corpora - a corpus of clinical text and a corpus of medical journal articles - further improves results, outperforming a combination of semantic spaces induced from a single source, as well as a single semantic space induced from the conjoint corpus. A combination strategy that simply sums the cosine similarity scores of candidate terms is generally the most profitable out of the ones explored. Finally, applying simple post-processing filtering rules yields substantial performance gains on the tasks of extracting abbreviation-expansion pairs, but not synonyms. The best results, measured as recall in a list of ten candidate terms, for the three tasks are: 0.39 for abbreviations to long forms, 0.33 for long forms to abbreviations, and 0.47 for synonyms. 
This study demonstrates that ensembles of semantic spaces can yield improved performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs. This notion, which merits further exploration, allows different distributional models - with different model parameters - and different types of corpora to be combined, potentially allowing enhanced performance to be obtained on a wide range of natural language processing tasks.
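    The combination strategy mentioned in this abstract (summing the cosine similarity scores of candidate terms across semantic spaces) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the dictionary-of-vectors representation of a semantic space, and the default list length of ten are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(query, spaces, top_k=10):
    """Rank candidate terms for `query` by summed cosine similarity.

    `spaces` is a list of hypothetical semantic spaces, each a dict
    mapping a term to its context vector. A space that does not
    contain the query contributes nothing to the sum.
    """
    scores = {}
    for space in spaces:
        if query not in space:
            continue
        query_vec = space[query]
        for term, vec in space.items():
            if term == query:
                continue
            scores[term] = scores.get(term, 0.0) + cosine(query_vec, vec)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:top_k]]
```

    Under these assumptions, a term that is moderately similar to the query in both a clinical space and a journal-article space can outrank a term that scores highly in only one of them, which is the intuition behind combining spaces.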

  18. Synonym extraction and abbreviation expansion with ensembles of semantic spaces

    PubMed Central

    2014-01-01

    Background: Terminologies that account for variation in language use by linking synonyms and abbreviations to their corresponding concept are important enablers of high-quality information extraction from medical texts. Due to the use of specialized sub-languages in the medical domain, manual construction of semantic resources that accurately reflect language use is both costly and challenging, often resulting in low coverage. Although models of distributional semantics applied to large corpora provide a potential means of supporting development of such resources, their ability to isolate synonymy from other semantic relations is limited. Their application in the clinical domain has also only recently begun to be explored. Combining distributional models and applying them to different types of corpora may lead to enhanced performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs. Results: A combination of two distributional models – Random Indexing and Random Permutation – employed in conjunction with a single corpus outperforms using either of the models in isolation. Furthermore, combining semantic spaces induced from different types of corpora – a corpus of clinical text and a corpus of medical journal articles – further improves results, outperforming a combination of semantic spaces induced from a single source, as well as a single semantic space induced from the conjoint corpus. A combination strategy that simply sums the cosine similarity scores of candidate terms is generally the most profitable out of the ones explored. Finally, applying simple post-processing filtering rules yields substantial performance gains on the tasks of extracting abbreviation-expansion pairs, but not synonyms. The best results, measured as recall in a list of ten candidate terms, for the three tasks are: 0.39 for abbreviations to long forms, 0.33 for long forms to abbreviations, and 0.47 for synonyms.
Conclusions: This study demonstrates that ensembles of semantic spaces can yield improved performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs. This notion, which merits further exploration, allows different distributional models – with different model parameters – and different types of corpora to be combined, potentially allowing enhanced performance to be obtained on a wide range of natural language processing tasks. PMID:24499679

  19. Morphometric Correlates of the Ovary and Ovulatory Corpora in the Bowhead Whale, Balaena mysticetus.

    PubMed

    Tarpley, Raymond J; Hillmann, Daniel J; George, John C; Zeh, Judith E; Suydam, Robert S

    2016-06-01

    Gross morphology and morphometry of the bowhead whale ovary, including ovulatory corpora, were investigated in 50 whales from the Chukchi and Beaufort seas off the coast of Alaska. Using the presence of ovarian corpora to define sexual maturity, 23 sexually immature whales (7.6-14.2 m total body length) and 27 sexually mature whales (14.2-17.7 m total body length) were identified. Ovary pair weights ranged from 0.38 to 2.45 kg and 2.92 to 12.02 kg for sexually immature and sexually mature whales, respectively. In sexually mature whales, corpora lutea (CLs) and/or large corpora albicantia (CAs) projected beyond ovary surfaces. CAs became increasingly less interruptive of the surface contour as they regressed, while remaining identifiable within transverse sections of the ovarian cortex. CLs formed large globular bodies, often with a central lumen, featuring golden parenchymas enfolded within radiating fibrous cords. CAs, sometimes vesicular, featured a dense fibrous core with outward fibrous projections through the former luteal tissue. CLs (never more than one per ovary pair) ranged from 6.7 to 15.0 cm in diameter in 13 whales. Fetuses were confirmed in nine of the 13 whales, with the associated CLs ranging from 8.3 to 15.0 cm in diameter. CLs from four whales where a fetus was not detected ranged from 6.7 to 10.6 cm in diameter. CA totals ranged from 0 to 22 for any single ovary, and from 1 to 41 for an ovary pair. CAs measured from 0.3 to 6.3 cm in diameter, and smaller corpora were more numerous, suggesting an accumulating record of ovulation. Neither the left nor the right ovary dominated in the production of corpora. Anat Rec, 299:769-797, 2016. © 2016 Wiley Periodicals, Inc.

  20. Co-occurrence frequency evaluated with large language corpora boosts semantic priming effects.

    PubMed

    Brunellière, Angèle; Perre, Laetitia; Tran, ThiMai; Bonnotte, Isabelle

    2017-09-01

    In recent decades, many computational techniques have been developed to analyse the contextual usage of words in large language corpora. The present study examined whether the co-occurrence frequency obtained from large language corpora might boost purely semantic priming effects. Two experiments were conducted: one with conscious semantic priming, the other with subliminal semantic priming. Both experiments contrasted three semantic priming contexts: an unrelated priming context and two related priming contexts with word pairs that are semantically related and that co-occur either frequently or infrequently. In the conscious priming presentation (166-ms stimulus-onset asynchrony, SOA), a semantic priming effect was recorded in both related priming contexts, which was greater with higher co-occurrence frequency. In the subliminal priming presentation (66-ms SOA), no significant priming effect was shown, regardless of the related priming context. These results show that co-occurrence frequency boosts pure semantic priming effects and are discussed with reference to models of semantic network.
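    The abstract does not say how co-occurrence frequency was extracted from the corpora. As a rough illustration of the general idea, the sketch below counts how often two words appear within a fixed window of each other in tokenized sentences. The function name, the window size of five tokens, and the unordered-pair representation are all assumptions, not details from the study.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=5):
    """Count unordered word pairs that occur within `window` tokens of each other.

    `sentences` is an iterable of token lists. Pairs are stored as
    frozensets so that (a, b) and (b, a) share one count.
    """
    counts = Counter()
    for tokens in sentences:
        for i, left in enumerate(tokens):
            # Look ahead at most `window` tokens within the same sentence.
            for right in tokens[i + 1 : i + 1 + window]:
                if left != right:
                    counts[frozenset((left, right))] += 1
    return counts
```

    With counts of this kind, semantically related prime-target pairs could be split into frequently and infrequently co-occurring sets by comparing their counts against a chosen threshold.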

  1. On the unsupervised analysis of domain-specific Chinese texts

    PubMed Central

    Deng, Ke; Bol, Peter K.; Li, Kate J.; Liu, Jun S.

    2016-01-01

    With the growing availability of digitized text data both publicly and privately, there is a great need for effective computational tools to automatically extract information from texts. Because the Chinese language differs most significantly from alphabet-based languages in not specifying word boundaries, most existing Chinese text-mining methods require a prespecified vocabulary and/or a large relevant training corpus, which may not be available in some applications. We introduce an unsupervised method, top-down word discovery and segmentation (TopWORDS), for simultaneously discovering and segmenting words and phrases from large volumes of unstructured Chinese texts, and propose ways to order discovered words and conduct higher-level context analyses. TopWORDS is particularly useful for mining online and domain-specific texts where the underlying vocabulary is unknown or the texts of interest differ significantly from available training corpora. When outputs from TopWORDS are fed into context analysis tools such as topic modeling, word embedding, and association pattern finding, the results are as good as or better than that from using outputs of a supervised segmentation method. PMID:27185919

  2. The CHEMDNER corpus of chemicals and drugs and its annotation principles.

    PubMed

    Krallinger, Martin; Rabal, Obdulia; Leitner, Florian; Vazquez, Miguel; Salgado, David; Lu, Zhiyong; Leaman, Robert; Lu, Yanan; Ji, Donghong; Lowe, Daniel M; Sayle, Roger A; Batista-Navarro, Riza Theresa; Rak, Rafal; Huber, Torsten; Rocktäschel, Tim; Matos, Sérgio; Campos, David; Tang, Buzhou; Xu, Hua; Munkhdalai, Tsendsuren; Ryu, Keun Ho; Ramanan, S V; Nathan, Senthil; Žitnik, Slavko; Bajec, Marko; Weber, Lutz; Irmer, Matthias; Akhondi, Saber A; Kors, Jan A; Xu, Shuo; An, Xin; Sikdar, Utpal Kumar; Ekbal, Asif; Yoshioka, Masaharu; Dieb, Thaer M; Choi, Miji; Verspoor, Karin; Khabsa, Madian; Giles, C Lee; Liu, Hongfang; Ravikumar, Komandur Elayavilli; Lamurias, Andre; Couto, Francisco M; Dai, Hong-Jie; Tsai, Richard Tzong-Han; Ata, Caglar; Can, Tolga; Usié, Anabel; Alves, Rui; Segura-Bedmar, Isabel; Martínez, Paloma; Oyarzabal, Julen; Valencia, Alfonso

    2015-01-01

    The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative for all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text was measured using an agreement study between annotators, obtaining a percentage agreement of 91. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/.
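    The abstract reports a percentage agreement of 91 between annotators but does not give the exact computation. A simplified sketch of percentage agreement is shown below, assuming both annotators labeled the same items in the same order; the function name and the item-level comparison are assumptions, and the agreement study in the paper may have compared mention spans instead.

```python
def percent_agreement(labels_a, labels_b):
    """Percentage of items on which two annotators assigned the same label.

    Both arguments are sequences of labels for the same items, in the
    same order. This simple measure is not corrected for chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same number of items")
    if not labels_a:
        raise ValueError("at least one labeled item is required")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)
```

    For example, two annotators who agree on three of four mention classes would score 75.0 under this definition.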

  3. The CHEMDNER corpus of chemicals and drugs and its annotation principles

    PubMed Central

    2015-01-01

    The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative for all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text was measured using an agreement study between annotators, obtaining a percentage agreement of 91. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/ PMID:25810773

  4. The Nature and Scope of Student Search Strategies in Using a Web Derived Corpus for Writing

    ERIC Educational Resources Information Center

    Franken, Margaret

    2014-01-01

    The use of online language corpora in L2 teaching and learning is gaining momentum largely because corpora are an easily accessed source of language input that potentially provide rich and authentic lexico-grammatical data. This can be of particular use for students' writing as its incorporation can enhance the appearance of native-like fluency.…

  5. Concept annotation in the CRAFT corpus.

    PubMed

    Bada, Michael; Eckert, Miriam; Evans, Donald; Garcia, Kristin; Shipley, Krista; Sitnikov, Dmitry; Baumgartner, William A; Cohen, K Bretonnel; Verspoor, Karin; Blake, Judith A; Hunter, Lawrence E

    2012-07-09

    Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement. As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. 
The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.

  6. Concept annotation in the CRAFT corpus

    PubMed Central

    2012-01-01

    Background: Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. Results: This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement. Conclusions: As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems.
The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml. PMID:22776079

  7. Contribution to terminology internationalization by word alignment in parallel corpora.

    PubMed

    Deléger, Louise; Merkel, Magnus; Zweigenbaum, Pierre

    2006-01-01

    Creating a complete translation of a large vocabulary is a time-consuming task, which requires skilled and knowledgeable medical translators. Our goal is to examine to which extent such a task can be alleviated by a specific natural language processing technique, word alignment in parallel corpora. We experiment with translation from English to French. Build a large corpus of parallel, English-French documents, and automatically align it at the document, sentence and word levels using state-of-the-art alignment methods and tools. Then project English terms from existing controlled vocabularies to the aligned word pairs, and examine the number and quality of the putative French translations obtained thereby. We considered three American vocabularies present in the UMLS with three different translation statuses: the MeSH, SNOMED CT, and the MedlinePlus Health Topics. We obtained several thousand new translations of our input terms, this number being closely linked to the number of terms in the input vocabularies. Our study shows that alignment methods can extract a number of new term translations from large bodies of text with a moderate human reviewing effort, and thus contribute to help a human translator obtain better translation coverage of an input vocabulary. Short-term perspectives include their application to a corpus 20 times larger than that used here, together with more focused methods for term extraction.
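    The projection step described in this abstract (mapping English terms from a controlled vocabulary onto aligned word pairs to collect putative French translations) can be sketched as follows. This is only an illustration for single-word terms: the function name and the representation of alignment output as (English, French) tuples are assumptions, and the alignment itself is taken as given.

```python
from collections import Counter, defaultdict


def project_terms(terms, aligned_pairs):
    """Collect candidate translations for vocabulary terms from aligned word pairs.

    `terms` is an iterable of English terms; `aligned_pairs` is an
    iterable of hypothetical (english_word, french_word) tuples produced
    by some word aligner. Returns a dict mapping each matched term to a
    Counter of French candidates and how often each was aligned.
    """
    vocabulary = set(terms)
    candidates = defaultdict(Counter)
    for english, french in aligned_pairs:
        if english in vocabulary:
            candidates[english][french] += 1
    return dict(candidates)
```

    The counts give a reviewer a simple ordering of candidates, so that the most frequently aligned translation for each term can be checked first.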

  8. Contribution to Terminology Internationalization by Word Alignment in Parallel Corpora

    PubMed Central

    Deléger, Louise; Merkel, Magnus; Zweigenbaum, Pierre

    2006-01-01

    Background and objectives: Creating a complete translation of a large vocabulary is a time-consuming task, which requires skilled and knowledgeable medical translators. Our goal is to examine to which extent such a task can be alleviated by a specific natural language processing technique, word alignment in parallel corpora. We experiment with translation from English to French. Methods: Build a large corpus of parallel, English-French documents, and automatically align it at the document, sentence and word levels using state-of-the-art alignment methods and tools. Then project English terms from existing controlled vocabularies to the aligned word pairs, and examine the number and quality of the putative French translations obtained thereby. We considered three American vocabularies present in the UMLS with three different translation statuses: the MeSH, SNOMED CT, and the MedlinePlus Health Topics. Results: We obtained several thousand new translations of our input terms, this number being closely linked to the number of terms in the input vocabularies. Conclusion: Our study shows that alignment methods can extract a number of new term translations from large bodies of text with a moderate human reviewing effort, and thus contribute to help a human translator obtain better translation coverage of an input vocabulary. Short-term perspectives include their application to a corpus 20 times larger than that used here, together with more focused methods for term extraction. PMID:17238328

  9. A Linguistic Inquiry and Word Count Analysis of the Adult Attachment Interview in Two Large Corpora.

    PubMed

    Waters, Theodore E A; Steele, Ryan D; Roisman, Glenn I; Haydon, Katherine C; Booth-LaForce, Cathryn

    2016-01-01

    An emerging literature suggests that variation in Adult Attachment Interview (AAI; George, Kaplan, & Main, 1985) states of mind about childhood experiences with primary caregivers is reflected in specific linguistic features captured by the Linguistic Inquiry Word Count automated text analysis program (LIWC; Pennebaker, Booth, & Francis, 2007). The current report addressed limitations of prior studies in this literature by using two large AAI corpora (Ns = 826 and 857) and a broader range of linguistic variables, as well as examining associations of LIWC-derived AAI dimensions with key developmental antecedents. First, regression analyses revealed that dismissing states of mind were associated with transcripts that were more truncated and deemphasized discussion of the attachment relationship whereas preoccupied states of mind were associated with longer, more conflicted, and angry narratives. Second, in aggregate, LIWC variables accounted for over a third of the variation in AAI dismissing and preoccupied states of mind, with regression weights cross-validating across samples. Third, LIWC-derived dismissing and preoccupied state of mind dimensions were associated with direct observations of maternal and paternal sensitivity as well as infant attachment security in childhood, replicating the pattern of results reported in Haydon, Roisman, Owen, Booth-LaForce, and Cox (2014) using coder-derived dismissing and preoccupation scores in the same sample.

  10. An annotated corpus with nanomedicine and pharmacokinetic parameters

    PubMed Central

    Lewinski, Nastassja A; Jimenez, Ivan; McInnes, Bridget T

    2017-01-01

    A vast amount of data on nanomedicines is being generated and published, and natural language processing (NLP) approaches can automate the extraction of unstructured text-based data. Annotated corpora are a key resource for NLP and information extraction methods which employ machine learning. Although corpora are available for pharmaceuticals, resources for nanomedicines and nanotechnology are still limited. To foster nanotechnology text mining (NanoNLP) efforts, we have constructed a corpus of annotated drug product inserts taken from the US Food and Drug Administration’s Drugs@FDA online database. In this work, we present the development of the Engineered Nanomedicine Database corpus to support the evaluation of nanomedicine entity extraction. The data were manually annotated for 21 entity mentions consisting of nanomedicine physicochemical characterization, exposure, and biologic response information of 41 Food and Drug Administration-approved nanomedicines. We evaluate the reliability of the manual annotations and demonstrate the use of the corpus by evaluating two state-of-the-art named entity extraction systems, OpenNLP and Stanford NER. The annotated corpus is available open source and, based on these results, guidelines and suggestions for future development of additional nanomedicine corpora are provided. PMID:29066897

  11. Menzerath-Altmann law for distinct word distribution analysis in a large text

    NASA Astrophysics Data System (ADS)

    Eroglu, Sertac

    2013-06-01

    The empirical law uncovered by Menzerath and formulated by Altmann, known as the Menzerath-Altmann law (henceforth the MA law), reveals the statistical distribution behavior of human language in various organizational levels. Building on previous studies relating organizational regularities in a language, we propose that the distribution of distinct (or different) words in a large text can effectively be described by the MA law. The validity of the proposition is demonstrated by examining two text corpora written in different languages not belonging to the same language family (English and Turkish). The results show not only that distinct word distribution behavior can accurately be predicted by the MA law, but that this result appears to be language-independent. This result is important not only for quantitative linguistic studies, but also may have significance for other naturally occurring organizations that display analogous organizational behavior. We also deliberately demonstrate that the MA law is a special case of the probability function of the generalized gamma distribution.
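    The abstract does not reproduce the formula. One commonly quoted form of the Menzerath-Altmann law is y = a * x^b * exp(-c * x), where x is the size of a construct and y the size of its constituents; a gamma-type density has the same power-times-exponential shape, which is consistent with the special-case remark in the abstract. The sketch below simply evaluates that form. The function name, the parameter names, and the sign convention for c are assumptions, not taken from the paper.

```python
import math


def menzerath_altmann(x, a, b, c):
    """Evaluate y = a * x**b * exp(-c * x) for a positive size x.

    a, b and c are free parameters that would normally be fitted to
    data; nothing here fixes their values.
    """
    if x <= 0:
        raise ValueError("x must be positive")
    return a * (x ** b) * math.exp(-c * x)
```

    Taking logarithms gives ln y = ln a + b ln x - c x, which is linear in ln a, b and c, so under these assumptions the parameters could be estimated by ordinary least squares on log-transformed data.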

  12. NASA's online machine aided indexing system

    NASA Technical Reports Server (NTRS)

    Silvester, June P.; Genuardi, Michael T.; Klingbiel, Paul H.

    1993-01-01

    This report describes the NASA Lexical Dictionary, a machine aided indexing system used online at the National Aeronautics and Space Administration's Center for Aerospace Information (CASI). This system is comprised of a text processor that is based on the computational, non-syntactic analysis of input text, and an extensive 'knowledge base' that serves to recognize and translate text-extracted concepts. The structure and function of the various NLD system components are described in detail. Methods used for the development of the knowledge base are discussed. Particular attention is given to a statistically-based text analysis program that provides the knowledge base developer with a list of concept-specific phrases extracted from large textual corpora. Production and quality benefits resulting from the integration of machine aided indexing at CASI are discussed along with a number of secondary applications of NLD-derived systems including on-line spell checking and machine aided lexicography.

  13. Hong Kong Papers in Linguistics and Language Teaching, 16.

    ERIC Educational Resources Information Center

    Nakhoul, Liz, Ed.; And Others

    1993-01-01

    Articles and reports in this issue include the following: "Co-text or No Text: A Study of an Adapted Cloze Technique" (Dave Coniam); "Small-Corpora Concordancing in ESL Teaching and Learning" (Bruce K.C. Ma); "Interdisciplinary Dimensions of Debate" (S. Byron, L. Goldstein, D. Murphy, E. Roberts); "Can English…

  14. TELLTALE: Experiments in a Dynamic Hypertext Environment for Degraded and Multilingual Data.

    ERIC Educational Resources Information Center

    Pearce, Claudia; Nicholas, Charles

    1996-01-01

    Presents experimentation results for the TELLTALE system, a dynamic hypertext environment that provides full-text search from a hypertext-style user interface for text corpora that may be garbled by OCR (optical character recognition) or transmission errors, and that may contain languages other than English. (Author/LRW)

  15. Abstracts versus Full Texts and Patents: A Quantitative Analysis of Biomedical Entities

    NASA Astrophysics Data System (ADS)

    Müller, Bernd; Klinger, Roman; Gurulingappa, Harsha; Mevissen, Heinz-Theodor; Hofmann-Apitius, Martin; Fluck, Juliane; Friedrich, Christoph M.

    In information retrieval, named entity recognition gives the opportunity to apply semantic search in domain specific corpora. Recently, more full text patents and journal articles became freely available. As the information distribution amongst the different sections is unknown, an analysis of the diversity is of interest.

  16. Ontology design patterns to disambiguate relations between genes and gene products in GENIA

    PubMed Central

    2011-01-01

    Motivation Annotated reference corpora play an important role in biomedical information extraction. A semantic annotation of the natural language texts in these reference corpora using formal ontologies is challenging due to the inherent ambiguity of natural language. The provision of formal definitions and axioms for semantic annotations offers the means for ensuring consistency as well as enables the development of verifiable annotation guidelines. Consistent semantic annotations facilitate the automatic discovery of new information through deductive inferences. Results We provide a formal characterization of the relations used in the recent GENIA corpus annotations. For this purpose, we both select existing axiom systems based on the desired properties of the relations within the domain and develop new axioms for several relations. To apply this ontology of relations to the semantic annotation of text corpora, we implement two ontology design patterns. In addition, we provide a software application to convert annotated GENIA abstracts into OWL ontologies by combining both the ontology of relations and the design patterns. As a result, the GENIA abstracts become available as OWL ontologies and are amenable for automated verification, deductive inferences and other knowledge-based applications. Availability Documentation, implementation and examples are available from http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/. PMID:22166341

  17. Looking at Citations: Using Corpora in English for Academic Purposes.

    ERIC Educational Resources Information Center

    Thompson, Paul; Tribble, Chris

    2001-01-01

    Presents a classification scheme and the results of applying this scheme to the coding of academic texts in a corpus. The texts are doctoral theses from agricultural botany and agricultural economics departments. Results lead to a comparison of the citation practices of writers in different disciplines and the different rhetorical practices of…

  18. Unsupervised Medical Entity Recognition and Linking in Chinese Online Medical Text

    PubMed Central

    Gan, Liang; Cheng, Mian; Wu, Quanyuan

    2018-01-01

    Online medical text is full of references to medical entities (MEs), which are valuable in many applications, including medical knowledge base (KB) construction, decision support systems, and the treatment of diseases. However, the diverse and ambiguous nature of the surface forms makes ME identification very difficult. Many existing solutions have focused on supervised approaches, which are often task-dependent. In other words, applying them to different kinds of corpora or identifying new entity categories requires major effort in data annotation and feature definition. In this paper, we propose unMERL, an unsupervised framework for recognizing and linking medical entities mentioned in Chinese online medical text. For ME recognition, unMERL first exploits a knowledge-driven approach to extract candidate entities from free text. Then, the categories of the candidate entities are determined using a distributed semantic-based approach. For ME linking, we propose a collaborative inference approach which takes full advantage of heterogeneous entity knowledge and unstructured information in the KB. Experimental results on real corpora demonstrate significant benefits compared to recent approaches with respect to both ME recognition and linking. PMID:29849994

  19. Sentiment analysis of political communication: combining a dictionary approach with crowdcoding.

    PubMed

    Haselmayer, Martin; Jenny, Marcelo

    2017-01-01

    Sentiment is important in studies of news values, public opinion, negative campaigning or political polarization and an explosive expansion of digital textual data and fast progress in automated text analysis provide vast opportunities for innovative social science research. Unfortunately, tools currently available for automated sentiment analysis are mostly restricted to English texts and require considerable contextual adaption to produce valid results. We present a procedure for collecting fine-grained sentiment scores through crowdcoding to build a negative sentiment dictionary in a language and for a domain of choice. The dictionary enables the analysis of large text corpora that resource-intensive hand-coding struggles to cope with. We calculate the tonality of sentences from dictionary words and we validate these estimates with results from manual coding. The results show that the crowdbased dictionary provides efficient and valid measurement of sentiment. Empirical examples illustrate its use by analyzing the tonality of party statements and media reports.
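
    The dictionary-based tonality calculation described above can be illustrated with a minimal sketch. The dictionary entries, the 0 to 4 negativity scale, and the choice of a plain mean over matched words are all assumptions made for illustration; the abstract does not specify the aggregation rule the authors use.

```python
import re

# Hypothetical negative-sentiment dictionary: word -> mean crowd-coded score
# (0 = not negative ... 4 = very strongly negative). Entries are invented.
NEGATIVE_SCORES = {"scandal": 3.5, "failure": 3.0, "crisis": 2.5, "weak": 1.5}


def sentence_tonality(sentence, scores=NEGATIVE_SCORES):
    """Return the mean negativity of dictionary words found in the sentence.

    Sentences that contain no dictionary word are scored 0.0 (assumption).
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    hits = [scores[t] for t in tokens if t in scores]
    return sum(hits) / len(hits) if hits else 0.0
```

    Scores computed this way can then be compared against manually coded sentences, which is the kind of validation the abstract describes.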

  20. Deep learning with word embeddings improves biomedical named entity recognition.

    PubMed

    Habibi, Maryam; Weber, Leon; Neves, Mariana; Wiegandt, David Luis; Leser, Ulf

    2017-07-15

    Text mining has become an important tool for biomedical research. The most fundamental text-mining task is the recognition of biomedical named entities (NER), such as genes, chemicals and diseases. Current NER methods rely on pre-defined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult. We show that a completely generic method based on deep learning and statistical word embeddings [called long short-term memory network-conditional random field (LSTM-CRF)] outperforms state-of-the-art entity-specific NER tools, and often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall. The source code for LSTM-CRF is available at https://github.com/glample/tagger and the links to the corpora are available at https://corposaurus.github.io/corpora/ . habibima@informatik.hu-berlin.de. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

  1. Deep learning with word embeddings improves biomedical named entity recognition

    PubMed Central

    Habibi, Maryam; Weber, Leon; Neves, Mariana; Wiegandt, David Luis; Leser, Ulf

    2017-01-01

    Abstract Motivation: Text mining has become an important tool for biomedical research. The most fundamental text-mining task is the recognition of biomedical named entities (NER), such as genes, chemicals and diseases. Current NER methods rely on pre-defined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult. Results: We show that a completely generic method based on deep learning and statistical word embeddings [called long short-term memory network-conditional random field (LSTM-CRF)] outperforms state-of-the-art entity-specific NER tools, and often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall. Availability and implementation: The source code for LSTM-CRF is available at https://github.com/glample/tagger and the links to the corpora are available at https://corposaurus.github.io/corpora/. Contact: habibima@informatik.hu-berlin.de PMID:28881963

  2. Exploiting Language Models to Classify Events from Twitter

    PubMed Central

    Vo, Duc-Thuan; Hai, Vo Thuan; Ock, Cheol-Young

    2015-01-01

    Classifying events on Twitter is challenging because tweet texts contain a large amount of temporal data with a lot of noise and a wide variety of topics. In this paper, we propose a method to classify events from Twitter. We first find the distinguishing terms between tweets in events and measure their similarities using language models such as ConceptNet and a latent Dirichlet allocation method for selectional preferences (LDA-SP), which have been widely studied on large text corpora in computational linguistics. The relationships among term words in tweets are discovered by checking them under each model. We then propose a method to compute the similarity between tweets based on tweet features, including common term words and the relationships among their distinguishing term words. The resulting similarity is explicit and convenient to use with k-nearest neighbor techniques for classification. Experiments on the Edinburgh Twitter Corpus show that our method achieves competitive results for classifying events. PMID:26451139

  3. Recognizing chemicals in patents: a comparative analysis.

    PubMed

    Habibi, Maryam; Wiegandt, David Luis; Schmedding, Florian; Leser, Ulf

    2016-01-01

    Recently, methods for Chemical Named Entity Recognition (NER) have gained substantial interest, driven by the need to automatically analyze today's ever-growing collections of biomedical text. Chemical NER for patents is particularly essential due to the high economic importance of pharmaceutical findings. However, NER on patents has long been essentially neglected by the research community, mostly because of the lack of sufficient annotated corpora. A recent international competition specifically targeted this task, but evaluated tools only on gold standard patent abstracts instead of full patents; furthermore, results from such competitions are often difficult to extrapolate to real-life settings due to the relatively high homogeneity of training and test data. Here, we evaluate the two state-of-the-art chemical NER tools, tmChem and ChemSpot, on four different annotated patent corpora, two of which consist of full texts. We study the overall performance of the tools, compare their results at the instance level, report on high-recall and high-precision ensembles, and perform cross-corpus and intra-corpus evaluations. Our findings indicate that full patents are considerably harder to analyze than patent abstracts and clearly confirm the common wisdom that using the same text genre (patent vs. scientific) and text type (abstract vs. full text) for training and testing is a prerequisite for achieving high-quality text mining results.

  4. Empirical data on corpus design and usage in biomedical natural language processing.

    PubMed

    Cohen, K Bretonnel; Fox, Lynne; Ogren, Philip V; Hunter, Lawrence

    2005-01-01

    This paper describes the design of six publicly available biomedical corpora. We then present usage data for the six corpora. We show that corpora that are carefully annotated with respect to structural and linguistic characteristics and that are distributed in standard formats are more widely used than corpora that are not. These findings have implications for the design of the next generation of biomedical corpora.

  5. (Text) Mining the LANDscape: Themes and Trends over 40 years of Landscape and Urban Planning

    Treesearch

    Paul H. Gobster

    2014-01-01

    In commemoration of the journal's 40th anniversary, the co-editor explores themes and trends covered by Landscape and Urban Planning and its parent journals through a qualitative comparison of co-occurrence term maps generated from the text corpora of its abstracts across the four decadal periods of publication. Cluster maps generated from the...

  6. U-Compare: share and compare text mining tools with UIMA.

    PubMed

    Kano, Yoshinobu; Baumgartner, William A; McCrohon, Luke; Ananiadou, Sophia; Cohen, K Bretonnel; Hunter, Lawrence; Tsujii, Jun'ichi

    2009-08-01

    Due to the increasing number of text mining resources (tools and corpora) available to biologists, interoperability issues between these resources are becoming significant obstacles to using them effectively. UIMA, the Unstructured Information Management Architecture, is an open framework designed to aid in the construction of more interoperable tools. U-Compare is built on top of the UIMA framework, and provides both a concrete framework for out-of-the-box text mining and a sophisticated evaluation platform allowing users to run specific tools on any target text, generating both detailed statistics and instance-based visualizations of outputs. U-Compare is a joint project, providing the world's largest, and still growing, collection of UIMA-compatible resources. These resources, originally developed by different groups for a variety of domains, include many famous tools and corpora. U-Compare can be launched straight from the web, without needing to be manually installed. All U-Compare components are provided ready-to-use and can be combined easily via a drag-and-drop interface without any programming. External UIMA components can also simply be mixed with U-Compare components, without distinguishing between locally and remotely deployed resources. http://u-compare.org/

  7. Wide coverage biomedical event extraction using multiple partially overlapping corpora

    PubMed Central

    2013-01-01

    Background Biomedical events are key to understanding physiological processes and disease, and wide coverage extraction is required for comprehensive automatic analysis of statements describing biomedical systems in the literature. In turn, the training and evaluation of extraction methods requires manually annotated corpora. However, as manual annotation is time-consuming and expensive, any single event-annotated corpus can only cover a limited number of semantic types. Although combined use of several such corpora could potentially allow an extraction system to achieve broad semantic coverage, there has been little research into learning from multiple corpora with partially overlapping semantic annotation scopes. Results We propose a method for learning from multiple corpora with partial semantic annotation overlap, and implement this method to improve our existing event extraction system, EventMine. An evaluation using seven event annotated corpora, including 65 event types in total, shows that learning from overlapping corpora can produce a single, corpus-independent, wide coverage extraction system that outperforms systems trained on single corpora and exceeds previously reported results on two established event extraction tasks from the BioNLP Shared Task 2011. Conclusions The proposed method allows the training of a wide-coverage, state-of-the-art event extraction system from multiple corpora with partial semantic annotation overlap. The resulting single model makes broad-coverage extraction straightforward in practice by removing the need to either select a subset of compatible corpora or semantic types, or to merge results from several models trained on different individual corpora. Multi-corpus learning also allows annotation efforts to focus on covering additional semantic types, rather than aiming for exhaustive coverage in any single annotation effort, or extending the coverage of semantic types annotated in existing corpora. PMID:23731785

  8. The language of gene ontology: a Zipf's law analysis.

    PubMed

    Kalankesh, Leila Ranandeh; Stevens, Robert; Brass, Andy

    2012-06-07

    Most major genome projects and sequence databases provide a GO annotation of their data, either automatically or through human annotators, creating a large corpus of data written in the language of GO. Texts written in natural language show a statistical power law behaviour, Zipf's law, the exponent of which can provide useful information on the nature of the language being used. We have therefore explored the hypothesis that collections of GO annotations will show similar statistical behaviours to natural language. Annotations from the Gene Ontology Annotation project were found to follow Zipf's law. Surprisingly, the measured power law exponents were consistently different between annotation captured using the three GO sub-ontologies in the corpora (function, process and component). On filtering the corpora using GO evidence codes we found that the value of the measured power law exponent responded in a predictable way as a function of the evidence codes used to support the annotation. Techniques from computational linguistics can provide new insights into the annotation process. GO annotations show similar statistical behaviours to those seen in natural language with measured exponents that provide a signal which correlates with the nature of the evidence codes used to support the annotations, suggesting that the measured exponent might provide a signal regarding the information content of the annotation.
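
    As a rough illustration of how a power law exponent can be measured from a corpus of annotations, the sketch below fits a straight line to log frequency against log rank. This is an assumed, simplified estimator (ordinary least squares on the rank-frequency curve); the abstract does not say which fitting procedure the authors used, and maximum-likelihood fits are often preferred in practice. The function name and the idea of treating each annotation term as a token are illustrative assumptions.

```python
import math
from collections import Counter


def zipf_exponent(tokens):
    """Estimate s in f(r) ~ r**(-s) by least squares in log-log space.

    `tokens` is any iterable of hashable items, for example GO term
    identifiers taken one per annotation.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct tokens")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The fitted slope is negative for a decaying curve; report s as positive.
    return -sxy / sxx
```

    Running the same function on subsets of annotations (for example, grouped by sub-ontology or evidence code) would give the kind of exponent comparison the abstract reports.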

  9. Linguistic measures of chemical diversity and the "keywords" of molecular collections.

    PubMed

    Woźniak, Michał; Wołos, Agnieszka; Modrzyk, Urszula; Górski, Rafał L; Winkowski, Jan; Bajczyk, Michał; Szymkuć, Sara; Grzybowski, Bartosz A; Eder, Maciej

    2018-05-15

    Computerized linguistic analyses have proven of immense value in comparing and searching through large text collections ("corpora"), including those deposited on the Internet - indeed, it would nowadays be hard to imagine browsing the Web without, for instance, search algorithms extracting most appropriate keywords from documents. This paper describes how such corpus-linguistic concepts can be extended to chemistry based on characteristic "chemical words" that span more than traditional functional groups and, instead, look at common structural fragments molecules share. Using these words, it is possible to quantify the diversity of chemical collections/databases in new ways and to define molecular "keywords" by which such collections are best characterized and annotated.

  10. [Single and combining effects of Calculus Bovis and zolpidem on inhibitive neurotransmitter of rat striatum corpora].

    PubMed

    Liu, Ping; He, Xinrong; Guo, Mei

    2010-04-01

    To investigate the effects of single or combined administration of Calculus Bovis and zolpidem on changes in inhibitive neurotransmitters in rat striatum corpora. Sampling from rat striatum corpora was carried out through microdialysis. The content of two inhibitive neurotransmitters in rat corpus striatum, glycine (Gly) and gamma-aminobutyric acid (GABA), was determined by HPLC, which involved pre-column derivatization with orthophthalaldehyde, reversed-phase gradient elution and fluorescence detection. GABA content of rat striatum corpora in the Calculus Bovis group was significantly increased compared with the saline group (P < 0.01). GABA content in the zolpidem group and the Calculus Bovis plus zolpidem group was also increased compared with the saline group (P < 0.05). GABA content in the Calculus Bovis group was higher than in the combination group (P < 0.05). GABA content in the zolpidem group was not significantly different from the combination group. Gly content in the Calculus Bovis or zolpidem group was markedly increased compared with the saline group or the combination group (P < 0.05). The contents of the two inhibitive neurotransmitters in rat striatum corpora were all significantly increased in the Calculus Bovis group, the zolpidem group and the combination group. The magnitude of increase was lower in the combination group than in the Calculus Bovis group and the zolpidem group, suggesting that Calculus Bovis promotes encephalon inhibition more powerfully than zolpidem. The increase in the two inhibitive neurotransmitters did not show a reinforcing effect in the combination group, suggesting that Calculus Bovis and zolpidem may compete for the same receptors. Therefore, the combination of drugs containing Calculus Bovis with zolpidem has no clinical significance. Calculus Bovis should not be used as an aperture-opening drug for resuscitation therapy.

  11. Proposed Framework for the Evaluation of Standalone Corpora Processing Systems: An Application to Arabic Corpora

    PubMed Central

    Al-Thubaity, Abdulmohsen; Alqifari, Reem

    2014-01-01

    Despite the accessibility of numerous online corpora, students and researchers engaged in the fields of Natural Language Processing (NLP), corpus linguistics, and language learning and teaching may encounter situations in which they need to develop their own corpora. Several commercial and free standalone corpora processing systems are available to process such corpora. In this study, we first propose a framework for the evaluation of standalone corpora processing systems and then use it to evaluate seven freely available systems. The proposed framework considers the usability, functionality, and performance of the evaluated systems while taking into consideration their suitability for Arabic corpora. While the results show that most of the evaluated systems exhibited comparable usability scores, the scores for functionality and performance were substantially different with respect to support for the Arabic language and N-grams profile generation. The results of our evaluation will help potential users of the evaluated systems to choose the system that best meets their needs. More importantly, the results will help the developers of the evaluated systems to enhance their systems and developers of new corpora processing systems by providing them with a reference framework. PMID:25610910

  12. Proposed framework for the evaluation of standalone corpora processing systems: an application to Arabic corpora.

    PubMed

    Al-Thubaity, Abdulmohsen; Al-Khalifa, Hend; Alqifari, Reem; Almazrua, Manal

    2014-01-01

    Despite the accessibility of numerous online corpora, students and researchers engaged in the fields of Natural Language Processing (NLP), corpus linguistics, and language learning and teaching may encounter situations in which they need to develop their own corpora. Several commercial and free standalone corpora processing systems are available to process such corpora. In this study, we first propose a framework for the evaluation of standalone corpora processing systems and then use it to evaluate seven freely available systems. The proposed framework considers the usability, functionality, and performance of the evaluated systems while taking into consideration their suitability for Arabic corpora. While the results show that most of the evaluated systems exhibited comparable usability scores, the scores for functionality and performance were substantially different with respect to support for the Arabic language and N-grams profile generation. The results of our evaluation will help potential users of the evaluated systems to choose the system that best meets their needs. More importantly, the results will help the developers of the evaluated systems to enhance their systems and developers of new corpora processing systems by providing them with a reference framework.

  13. Evaluating gold standard corpora against gene/protein tagging solutions and lexical resources

    PubMed Central

    2013-01-01

    Motivation The identification of protein and gene names (PGNs) from the scientific literature requires semantic resources: Terminological and lexical resources deliver the term candidates into PGN tagging solutions and the gold standard corpora (GSC) train them to identify term parameters and contextual features. Ideally all three resources, i.e. corpora, lexica and taggers, cover the same domain knowledge, and thus support identification of the same types of PGNs and cover all of them. Unfortunately, none of the three serves as a predominant standard and for this reason it is worth exploring, how these three resources comply with each other. We systematically compare different PGN taggers against publicly available corpora and analyze the impact of the included lexical resource in their performance. In particular, we determine the performance gains through false positive filtering, which contributes to the disambiguation of identified PGNs. Results In general, machine learning approaches (ML-Tag) for PGN tagging show higher F1-measure performance against the BioCreative-II and Jnlpba GSCs (exact matching), whereas the lexicon based approaches (LexTag) in combination with disambiguation methods show better results on FsuPrge and PennBio. The ML-Tag solutions balance precision and recall, whereas the LexTag solutions have different precision and recall profiles at the same F1-measure across all corpora. Higher recall is achieved with larger lexical resources, which also introduce more noise (false positive results). The ML-Tag solutions certainly perform best, if the test corpus is from the same GSC as the training corpus. As expected, the false negative errors characterize the test corpora and – on the other hand – the profiles of the false positive mistakes characterize the tagging solutions. 
    LexTag solutions that are based on a large terminological resource in combination with false positive filtering produce better results, which, in addition, provide concept identifiers from a knowledge source in contrast to ML-Tag solutions. Conclusion The standard ML-Tag solutions achieve high performance, but not across all corpora, and thus should be trained using several different corpora to reduce possible biases. The LexTag solutions have different profiles for their precision and recall performance, but with similar F1-measure. This result is surprising and suggests that they cover a portion of the most common naming standards, but cope differently with the term variability across the corpora. The false positive filtering applied to LexTag solutions does improve the results by increasing their precision without compromising significantly their recall. The harmonisation of the annotation schemes in combination with standardized lexical resources in the tagging solutions will enable their comparability and will pave the way for a shared standard. PMID:24112383
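
    The observation that taggers can have different precision and recall profiles at the same F1-measure follows directly from the definition of F1 as the harmonic mean of the two. A minimal sketch, using invented precision and recall values rather than any figures from the study:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two hypothetical taggers with mirrored profiles reach the same F1:
high_precision_tagger = f1_score(0.9, 0.6)  # precise but misses mentions
high_recall_tagger = f1_score(0.6, 0.9)     # finds more, with more noise
```

    Because F1 is symmetric in its two arguments, a single F1 value cannot distinguish a high-precision tagger from a high-recall one, which is why the abstract reports the profiles separately.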

  14. A new universality class in corpus of texts; A statistical physics study

    NASA Astrophysics Data System (ADS)

    Najafi, Elham; Darooneh, Amir H.

    2018-05-01

    Text can be regarded as a complex system. There are some methods in statistical physics which can be used to study this system. In this work, by means of statistical physics methods, we reveal new universal behaviors of texts associated with the fractality values of words in a text. The fractality measure indicates the importance of words in a text by considering the distribution pattern of words throughout the text. We observed a power law relation between fractality of text and vocabulary size for texts and corpora. We also observed this behavior in studying biological data.

  15. Knowledge based word-concept model estimation and refinement for biomedical text mining.

    PubMed

    Jimeno Yepes, Antonio; Berlanga, Rafael

    2015-02-01

    Text mining of scientific literature has been essential for setting up large public biomedical databases, which are being widely used by the research community. In the biomedical domain, the existence of a large number of terminological resources and knowledge bases (KB) has enabled a myriad of machine learning methods for different text mining related tasks. Unfortunately, KBs have not been devised for text mining tasks but for human interpretation, so the performance of KB-based methods is usually lower than that of supervised machine learning methods. The disadvantage of supervised methods, though, is that they require labeled training data and are therefore not useful for large-scale biomedical text mining systems. KB-based methods do not have this limitation. In this paper, we describe a novel method to generate word-concept probabilities from a KB, which can serve as a basis for several text mining tasks. This method not only takes into account the underlying patterns within the descriptions contained in the KB but also those in texts available from large unlabeled corpora such as MEDLINE. The parameters of the model have been estimated without training data. Patterns from MEDLINE have been built using MetaMap for entity recognition and related using co-occurrences. The word-concept probabilities were evaluated on the task of word sense disambiguation (WSD). The results showed that our method obtained a higher degree of accuracy than other state-of-the-art approaches when evaluated on the MSH WSD data set. We also evaluated our method on the task of document ranking using MEDLINE citations. These results also showed an increase in performance over existing baseline retrieval approaches. Copyright © 2014 Elsevier Inc. All rights reserved.

  16. Inferring the semantic relationships of words within an ontology using random indexing: applications to pharmacogenomics.

    PubMed

    Percha, Bethany; Altman, Russ B

    2013-01-01

    The biomedical literature presents a uniquely challenging text mining problem. Sentences are long and complex, the subject matter is highly specialized with a distinct vocabulary, and producing annotated training data for this domain is time consuming and expensive. In this environment, unsupervised text mining methods that do not rely on annotated training data are valuable. Here we investigate the use of random indexing, an automated method for producing vector-space semantic representations of words from large, unlabeled corpora, to address the problem of term normalization in sentences describing drugs and genes. We show that random indexing produces similarity scores that capture some of the structure of PHARE, a manually curated ontology of pharmacogenomics concepts. We further show that random indexing can be used to identify likely word candidates for inclusion in the ontology, and can help localize these new labels among classes and roles within the ontology.
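
    The general idea of random indexing can be sketched in a few lines. Each word receives a fixed sparse random "index vector", and a word's semantic vector is the sum of the index vectors of the words that occur near it. The dimensionality, number of non-zero entries, window size, function names, and toy sentences below are all illustrative assumptions, not the settings or data used in the study.

```python
import math
import random
from collections import defaultdict


def index_vector(dim, nonzeros, rng):
    """Sparse ternary vector: `nonzeros` random positions set to +1 or -1."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1, 1))
    return vec


def random_indexing(sentences, dim=300, nonzeros=10, window=2, seed=0):
    """Return {word: context vector} for tokenized sentences."""
    rng = random.Random(seed)
    index = {}                                 # word -> fixed index vector
    context = defaultdict(lambda: [0] * dim)   # word -> accumulated contexts
    for sent in sentences:
        for word in sent:
            if word not in index:
                index[word] = index_vector(dim, nonzeros, rng)
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                neighbour = index[sent[j]]
                vec = context[word]
                for k in range(dim):
                    vec[k] += neighbour[k]
    return dict(context)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

    Words that appear in similar contexts end up with similar vectors, so cosine similarity between context vectors can be used as the kind of similarity score the abstract compares against the ontology.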

  17. Inferring the semantic relationships of words within an ontology using random indexing: applications to pharmacogenomics

    PubMed Central

    Percha, Bethany; Altman, Russ B.

    2013-01-01

    The biomedical literature presents a uniquely challenging text mining problem. Sentences are long and complex, the subject matter is highly specialized with a distinct vocabulary, and producing annotated training data for this domain is time consuming and expensive. In this environment, unsupervised text mining methods that do not rely on annotated training data are valuable. Here we investigate the use of random indexing, an automated method for producing vector-space semantic representations of words from large, unlabeled corpora, to address the problem of term normalization in sentences describing drugs and genes. We show that random indexing produces similarity scores that capture some of the structure of PHARE, a manually curated ontology of pharmacogenomics concepts. We further show that random indexing can be used to identify likely word candidates for inclusion in the ontology, and can help localize these new labels among classes and roles within the ontology. PMID:24551397
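
    The random indexing idea described in the two records above can be illustrated with a minimal sketch. It assumes a toy tokenized corpus and hypothetical parameter values (dimension 300, 10 non-zero entries, window of 2); none of the names or settings below come from the paper. Each word gets a fixed sparse random "index vector", and a word's semantic vector is the sum of the index vectors of its neighbours.

```python
import math
import random
from collections import defaultdict


def index_vector(word, dim=300, nonzero=10):
    """Sparse ternary vector for a word; seeded by the word so it is reproducible."""
    rng = random.Random(word)
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzero):
        vec[pos] = rng.choice((-1, 1))
    return vec


def random_indexing(sentences, dim=300, nonzero=10, window=2):
    """Sum the index vectors of all words within `window` positions of each word."""
    context = defaultdict(lambda: [0] * dim)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                ctx = context[word]
                for k, value in enumerate(index_vector(tokens[j], dim, nonzero)):
                    if value:
                        ctx[k] += value
    return dict(context)


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy sentences (illustrative only, not data from the study).
corpus = [
    ["warfarin", "inhibits", "clotting"],
    ["heparin", "inhibits", "clotting"],
]
vectors = random_indexing(corpus)
```

    In this toy corpus "warfarin" and "heparin" occur in identical contexts, so their vectors coincide and their cosine similarity is 1; on real text the score would only approximate distributional similarity.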

  18. 31 CFR 358.6 - What is the procedure for converting bearer corpora and detached bearer coupons to book-entry?

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... bearer corpora and detached bearer coupons to book-entry? 358.6 Section 358.6 Money and Finance: Treasury... PUBLIC DEBT REGULATIONS GOVERNING BOOK-ENTRY CONVERSION OF BEARER CORPORA AND DETACHED BEARER COUPONS § 358.6 What is the procedure for converting bearer corpora and detached bearer coupons to book-entry...

  19. U-Compare: share and compare text mining tools with UIMA

    PubMed Central

    Kano, Yoshinobu; Baumgartner, William A.; McCrohon, Luke; Ananiadou, Sophia; Cohen, K. Bretonnel; Hunter, Lawrence; Tsujii, Jun'ichi

    2009-01-01

    Summary: Due to the increasing number of text mining resources (tools and corpora) available to biologists, interoperability issues between these resources are becoming significant obstacles to using them effectively. UIMA, the Unstructured Information Management Architecture, is an open framework designed to aid in the construction of more interoperable tools. U-Compare is built on top of the UIMA framework, and provides both a concrete framework for out-of-the-box text mining and a sophisticated evaluation platform allowing users to run specific tools on any target text, generating both detailed statistics and instance-based visualizations of outputs. U-Compare is a joint project, providing the world's largest, and still growing, collection of UIMA-compatible resources. These resources, originally developed by different groups for a variety of domains, include many famous tools and corpora. U-Compare can be launched straight from the web, without needing to be manually installed. All U-Compare components are provided ready-to-use and can be combined easily via a drag-and-drop interface without any programming. External UIMA components can also simply be mixed with U-Compare components, without distinguishing between locally and remotely deployed resources. Availability: http://u-compare.org/ Contact: kano@is.s.u-tokyo.ac.jp PMID:19414535

  20. A survey on annotation tools for the biomedical literature.

    PubMed

    Neves, Mariana; Leser, Ulf

    2014-03-01

    New approaches to biomedical text mining crucially depend on the existence of comprehensive annotated corpora. Such corpora, commonly called gold standards, are important for learning patterns or models during the training phase, for evaluating and comparing the performance of algorithms and also for better understanding the information sought for by means of examples. Gold standards depend on human understanding and manual annotation of natural language text. This process is very time-consuming and expensive because it requires high intellectual effort from domain experts. Accordingly, the lack of gold standards is considered as one of the main bottlenecks for developing novel text mining methods. This situation led to the development of tools that support humans in annotating texts. Such tools should be intuitive to use, should support a range of different input formats, should include visualization of annotated texts and should generate an easy-to-parse output format. Today, a range of tools which implement some of these functionalities are available. In this survey, we present a comprehensive survey of tools for supporting annotation of biomedical texts. Altogether, we considered almost 30 tools, 13 of which were selected for an in-depth comparison. The comparison was performed using predefined criteria and was accompanied by hands-on experiences whenever possible. Our survey shows that current tools can support many of the tasks in biomedical text annotation in a satisfying manner, but also that no tool can be considered as a true comprehensive solution.

  1. Public Domain Generic Tools: An Overview.

    ERIC Educational Resources Information Center

    Erjavec, Tomaz

    This paper presents an introduction to language engineering software, especially for computerized language and text corpora. The focus of the paper is on small and relatively independent pieces of software designed for specific, often low-level language analysis tasks, and on tools in the public domain. Discussion begins with the application of…

  2. ECO: A Framework for Entity Co-Occurrence Exploration with Faceted Navigation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Halliday, K. D.

    2010-08-20

    Even as highly structured databases and semantic knowledge bases become more prevalent, a substantial amount of human knowledge is reported as written prose. Typical textual reports, such as news articles, contain information about entities (people, organizations, and locations) and their relationships. Automatically extracting such relationships from large text corpora is a key component of corporate and government knowledge bases. The primary goal of the ECO project is to develop a scalable framework for extracting and presenting these relationships for exploration using an easily navigable faceted user interface. ECO uses entity co-occurrence relationships to identify related entities. The system aggregates and indexes information on each entity pair, allowing the user to rapidly discover and mine relational information.
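
    As a rough illustration of entity co-occurrence counting in general (not of ECO's actual implementation, which the abstract does not describe), the sketch below assumes entities have already been extracted per document. The function names and the sample entities are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count, for every unordered entity pair, the documents mentioning both."""
    counts = Counter()
    for entities in docs:
        # Deduplicate and sort so each pair has one canonical (a, b) key.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


def related(counts, entity):
    """Entities that co-occur with `entity`, most frequent first."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == entity:
            scores[b] += n
        elif b == entity:
            scores[a] += n
    return scores.most_common()


# Hypothetical per-document entity lists.
docs = [
    ["Alice", "Acme", "Paris"],
    ["Alice", "Acme"],
    ["Bob", "Paris"],
]
counts = cooccurrence_counts(docs)
```

    Indexing the pair counts this way makes a lookup such as `related(counts, "Alice")` cheap, which is the kind of aggregation a faceted browsing interface could sit on top of.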

  3. Forced Alignment for Understudied Language Varieties: Testing Prosodylab-Aligner with Tongan Data

    ERIC Educational Resources Information Center

    Johnson, Lisa M.; Di Paolo, Marianna; Bell, Adrian

    2018-01-01

    Automated alignment of transcriptions to audio files expedites the process of preparing data for acoustic analysis. Unfortunately, the benefits of auto-alignment have generally been available only to researchers studying majority languages, for which large corpora exist and for which acoustic models have been created by large-scale research…

  4. Early Development of Demonstratives in Pre-Qin Chinese

    ERIC Educational Resources Information Center

    Deng, Lin

    2011-01-01

    This dissertation offers a new dynamic account of the evolution of the demonstrative system in pre-Qin Chinese based on a comprehensive linguistic analysis of the phonological, morphological, syntactic, semantic, and pragmatic aspects of demonstratives attested in two corpora of excavated texts, i.e. the oracle-bone inscriptions dated to the late…

  5. Kratylos: A Tool for Sharing Interlinearized and Lexical Data in Diverse Formats

    ERIC Educational Resources Information Center

    Kaufman, Daniel; Finkel, Raphael

    2018-01-01

    In this paper we present Kratylos, at www.kratylos.org/, a web application that creates searchable multimedia corpora from data collections in diverse formats, including collections of interlinearized glossed text (IGT) and dictionaries. There exists a crucial lacuna in the electronic ecology that supports language documentation and linguistic…

  6. Citation Matching in Sanskrit Corpora Using Local Alignment

    NASA Astrophysics Data System (ADS)

    Prasad, Abhinandan S.; Rao, Shrisha

    Citation matching is the problem of finding which citation occurs in a given textual corpus. Most existing citation matching work is done on scientific literature. The goal of this paper is to present methods for performing citation matching on Sanskrit texts. Exact matching and approximate matching are the two methods for performing citation matching. The exact matching method checks for exact occurrence of the citation with respect to the textual corpus. Approximate matching is a fuzzy string-matching method which computes a similarity score between an individual line of the textual corpus and the citation. The Smith-Waterman-Gotoh algorithm for local alignment, which is generally used in bioinformatics, is used here for calculating the similarity score. This similarity score is a measure of the closeness between the text and the citation. The exact- and approximate-matching methods are evaluated and compared. The methods presented can be easily applied to corpora in other Indic languages like Kannada, Tamil, etc. The approximate-matching method can in particular be used in the compilation of critical editions and plagiarism detection in a literary work.
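
    The approximate-matching step can be sketched as follows. This is a simplified Smith-Waterman local alignment with a linear gap penalty, not the affine-gap Smith-Waterman-Gotoh variant the abstract names; the scoring values and the normalisation to a 0 to 1 range are assumptions made for illustration.

```python
MATCH, MISMATCH, GAP = 2, -1, -1  # assumed scoring scheme


def smith_waterman_score(a, b):
    """Best local-alignment score between strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (MATCH if a[i - 1] == b[j - 1] else MISMATCH)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + GAP, curr[j - 1] + GAP)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity(citation, line):
    """Score divided by the maximum the citation could reach (assumed normalisation)."""
    if not citation:
        return 0.0
    return smith_waterman_score(citation, line) / (MATCH * len(citation))
```

    With this normalisation a citation that occurs verbatim inside a line scores 1.0, and a threshold below 1.0 would admit near matches; the paper's own scoring and thresholds may differ.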

  7. tESA: a distributional measure for calculating semantic relatedness.

    PubMed

    Rybinski, Maciej; Aldana-Montes, José Francisco

    2016-12-28

    Semantic relatedness is a measure that quantifies the strength of a semantic link between two concepts. Often, it can be efficiently approximated with methods that operate on words, which represent these concepts. Approximating semantic relatedness between texts and concepts represented by these texts is an important part of many text and knowledge processing tasks of crucial importance in the ever growing domain of biomedical informatics. The problem of most state-of-the-art methods for calculating semantic relatedness is their dependence on highly specialized, structured knowledge resources, which makes these methods poorly adaptable for many usage scenarios. On the other hand, the domain knowledge in the Life Sciences has become more and more accessible, but mostly in its unstructured form - as texts in large document collections, which makes its use more challenging for automated processing. In this paper we present tESA, an extension to the well-known Explicit Semantic Analysis (ESA) method. In our extension we use two separate sets of vectors, corresponding to different sections of the articles from the underlying corpus of documents, as opposed to the original method, which only uses a single vector space. We present an evaluation of Life Sciences domain-focused applicability of both tESA and domain-adapted Explicit Semantic Analysis. The methods are tested against a set of standard benchmarks established for the evaluation of biomedical semantic relatedness quality. Our experiments show that the proposed method achieves results comparable with or superior to the current state-of-the-art methods. Additionally, a comparative discussion of the results obtained with tESA and ESA is presented, together with a study of the adaptability of the methods to different corpora and their performance with different input parameters. Our findings suggest that combined use of the semantics from different sections (i.e. extending the original ESA methodology with the use of title vectors) of the documents of scientific corpora may be used to enhance the performance of distributional semantic relatedness measures, which can be observed in the largest reference datasets. We also present the impact of the proposed extension on the size of distributional representations.

  8. Generation of silver standard concept annotations from biomedical texts with special relevance to phenotypes.

    PubMed

    Oellrich, Anika; Collier, Nigel; Smedley, Damian; Groza, Tudor

    2015-01-01

    Electronic health records and scientific articles possess differing linguistic characteristics that may impact the performance of natural language processing tools developed for one or the other. In this paper, we investigate the performance of four extant concept recognition tools: the clinical Text Analysis and Knowledge Extraction System (cTAKES), the National Center for Biomedical Ontology (NCBO) Annotator, the Biomedical Concept Annotation System (BeCAS) and MetaMap. Each of the four concept recognition systems is applied to four different corpora: the i2b2 corpus of clinical documents, a PubMed corpus of Medline abstracts, a clinical trials corpus and the ShARe/CLEF corpus. In addition, we assess the individual system performances with respect to one gold standard annotation set, available for the ShARe/CLEF corpus. Furthermore, we built a silver standard annotation set from the individual systems' output and assess the quality as well as the contribution of individual systems to the quality of the silver standard. Our results demonstrate that mainly the NCBO annotator and cTAKES contribute to the silver standard corpora (F1-measures in the range of 21% to 74%) and their quality (best F1-measure of 33%), independent from the type of text investigated. While BeCAS and MetaMap can contribute to the precision of silver standard annotations (precision of up to 42%), the F1-measure drops when combined with NCBO Annotator and cTAKES due to a low recall. In conclusion, the performances of individual systems need to be improved independently from the text types, and the leveraging strategies to best take advantage of individual systems' annotations need to be revised. The textual content of the PubMed corpus, accession numbers for the clinical trials corpus, and assigned annotations of the four concept recognition systems as well as the generated silver standard annotation sets are available from http://purl.org/phenotype/resources. The textual content of the ShARe/CLEF (https://sites.google.com/site/shareclefehealth/data) and i2b2 (https://i2b2.org/NLP/DataSets/) corpora needs to be requested from the individual corpus providers.

  9. A Learner Corpus-Based Study on Verb Errors of Turkish EFL Learners

    ERIC Educational Resources Information Center

    Can, Cem

    2017-01-01

    As learner corpora have presently become readily accessible, it is practicable to examine interlanguage errors and carry out error analysis (EA) on learner-generated texts. The data available in a learner corpus enable researchers to investigate authentic learner errors and their respective frequencies in terms of types and tokens as well as…

  10. Helios: Understanding Solar Evolution Through Text Analytics

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Randazzese, Lucien

    This proof-of-concept project focused on developing, testing, and validating a range of bibliometric, text analytic, and machine-learning based methods to explore the evolution of three photovoltaic (PV) technologies: Cadmium Telluride (CdTe), Dye-Sensitized solar cells (DSSC), and Multi-junction solar cells. The analytical approach to the work was inspired by previous work by the same team to measure and predict the scientific prominence of terms and entities within specific research domains. The goal was to create tools that could assist domain-knowledgeable analysts in investigating the history and path of technological developments in general, with a focus on analyzing step-function changes in performance, or “breakthroughs,” in particular. The text-analytics platform developed during this project was dubbed Helios. The project relied on computational methods for analyzing large corpora of technical documents. For this project we ingested technical documents from the following sources into Helios: Thomson Scientific Web of Science (papers), the U.S. Patent & Trademark Office (patents), the U.S. Department of Energy (technical documents), the U.S. National Science Foundation (project funding summaries), and a hand curated set of full-text documents from Thomson Scientific and other sources.

  11. Working with Corpora in the Translation Classroom

    ERIC Educational Resources Information Center

    Krüger, Ralph

    2012-01-01

    This article sets out to illustrate possible applications of electronic corpora in the translation classroom. Starting with a survey of corpus use within corpus-based translation studies, the didactic value of corpora in the translation classroom and their epistemic value in translation teaching and practice will be elaborated. A typology of…

  12. Using Monolingual and Bilingual Corpora in Lexicography

    ERIC Educational Resources Information Center

    Miangah, Tayebeh Mosavi

    2009-01-01

    Constructing and exploiting different types of corpora are among computer applications exposed to the researchers in different branches of science including lexicography. In lexicography, different types of corpora may be of great help in finding the most appropriate uses of words and expressions by referring to numerous examples and citations.…

  13. CUILESS2016: a clinical corpus applying compositional normalization of text mentions.

    PubMed

    Osborne, John D; Neu, Matthew B; Danila, Maria I; Solorio, Thamar; Bethard, Steven J

    2018-01-10

    Traditionally text mention normalization corpora have normalized concepts to single ontology identifiers ("pre-coordinated concepts"). Less frequently, normalization corpora have used concepts with multiple identifiers ("post-coordinated concepts") but the additional identifiers have been restricted to a defined set of relationships to the core concept. This approach limits the ability of the normalization process to express semantic meaning. We generated a freely available corpus using post-coordinated concepts without a defined set of relationships that we term "compositional concepts" to evaluate their use in clinical text. We annotated 5397 disorder mentions from the ShARe corpus to SNOMED CT that were previously normalized as "CUI-less" in the "SemEval-2015 Task 14" shared task because they lacked a pre-coordinated mapping. Unlike the previous normalization method, we do not restrict concept mappings to a particular set of the Unified Medical Language System (UMLS) semantic types and allow normalization to occur to multiple UMLS Concept Unique Identifiers (CUIs). We computed annotator agreement and assessed semantic coverage with this method. We generated the largest clinical text normalization corpus to date with mappings to multiple identifiers and made it freely available. All but 8 of the 5397 disorder mentions were normalized using this methodology. Annotator agreement ranged from 52.4% using the strictest metric (exact matching) to 78.2% using a hierarchical agreement that measures the overlap of shared ancestral nodes. Our results provide evidence that compositional concepts can increase semantic coverage in clinical text. To our knowledge we provide the first freely available corpus of compositional concept annotation in clinical text.

  14. Complex Event Extraction using DRUM

    DTIC Science & Technology

    2015-10-01

    towards tackling these challenges. Figure 9. Evaluation results for eleven teams. The diamond ◆ represents the results of our system. The two topmost...Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000). The UniProt

  15. Multimodal Word Meaning Induction From Minimal Exposure to Natural Text.

    PubMed

    Lazaridou, Angeliki; Marelli, Marco; Baroni, Marco

    2017-04-01

    By the time they reach early adulthood, English speakers are familiar with the meaning of thousands of words. In the last decades, computational simulations known as distributional semantic models (DSMs) have demonstrated that it is possible to induce word meaning representations solely from word co-occurrence statistics extracted from a large amount of text. However, while these models learn in batch mode from large corpora, human word learning proceeds incrementally after minimal exposure to new words. In this study, we run a set of experiments investigating whether minimal distributional evidence from very short passages suffices to trigger successful word learning in subjects, testing their linguistic and visual intuitions about the concepts associated with new words. After confirming that subjects are indeed very efficient distributional learners even from small amounts of evidence, we test a DSM on the same multimodal task, finding that it behaves in a remarkable human-like way. We conclude that DSMs provide a convincing computational account of word learning even at the early stages in which a word is first encountered, and the way they build meaning representations can offer new insights into human language acquisition. Copyright © 2017 Cognitive Science Society, Inc.

  16. Learner Corpora: The Missing Link in EAP Pedagogy

    ERIC Educational Resources Information Center

    Gilquin, Gaetanelle; Granger, Sylviane; Paquot, Magali

    2007-01-01

    This article deals with the place of learner corpora, i.e. corpora containing authentic language data produced by learners of a foreign/second language, in English for academic purposes (EAP) pedagogy and sets out to demonstrate that they have a valuable contribution to make to the field. Following an initial brief introduction to corpus-based…

  17. Corpora and Language Assessment: The State of the Art

    ERIC Educational Resources Information Center

    Park, Kwanghyun

    2014-01-01

    This article outlines the current state of and recent developments in the use of corpora for language assessment and considers future directions with a special focus on computational methodology. Because corpora began to make inroads into language assessment in the 1990s, test developers have increasingly used them as a reference resource to…

  18. From Pedagogically Relevant Corpora to Authentic Language Learning Contents

    ERIC Educational Resources Information Center

    Braun, Sabine

    2005-01-01

    The potential of corpora for language learning and teaching has been widely acknowledged and their ready availability on the Web has facilitated access for a broad range of users, including language teachers and learners. However, the integration of corpora into general language learning and teaching practice has so far been disappointing. In this…

  19. The Importance of Corpora in Translation Studies: A Practical Case

    ERIC Educational Resources Information Center

    Bermúdez Bausela, Montserrat

    2016-01-01

    This paper deals with the use of corpora in Translation Studies, particularly with the so-called "'ad hoc' corpus" or "translator's corpus" as a working tool both in the classroom and for the professional translator. We believe that corpora are an inestimable source not only for terminology and phraseology extraction (cf. Maia,…

  20. Conventions for sign and speech transcription of child bimodal bilingual corpora in ELAN.

    PubMed

    Chen Pichler, Deborah; Hochgesang, Julie A; Lillo-Martin, Diane; de Quadros, Ronice Müller

    2010-01-01

    This article extends current methodologies for the linguistic analysis of sign language acquisition to cases of bimodal bilingual acquisition. Using ELAN, we are transcribing longitudinal spontaneous production data from hearing children of Deaf parents who are learning either American Sign Language (ASL) and American English (AE), or Brazilian Sign Language (Libras, also referred to as Língua de Sinais Brasileira/LSB in some texts) and Brazilian Portuguese (BP). Our goal is to construct corpora that can be mined for a wide range of investigations on various topics in acquisition. Thus, it is important that we maintain consistency in transcription for both signed and spoken languages. This article documents our transcription conventions, including the principles behind our approach. Using this document, other researchers can choose to follow similar conventions or develop new ones using our suggestions as a starting point.

  1. Conventions for sign and speech transcription of child bimodal bilingual corpora in ELAN

    PubMed Central

    Chen Pichler, Deborah; Hochgesang, Julie A.; Lillo-Martin, Diane; de Quadros, Ronice Müller

    2011-01-01

    This article extends current methodologies for the linguistic analysis of sign language acquisition to cases of bimodal bilingual acquisition. Using ELAN, we are transcribing longitudinal spontaneous production data from hearing children of Deaf parents who are learning either American Sign Language (ASL) and American English (AE), or Brazilian Sign Language (Libras, also referred to as Língua de Sinais Brasileira/LSB in some texts) and Brazilian Portuguese (BP). Our goal is to construct corpora that can be mined for a wide range of investigations on various topics in acquisition. Thus, it is important that we maintain consistency in transcription for both signed and spoken languages. This article documents our transcription conventions, including the principles behind our approach. Using this document, other researchers can choose to follow similar conventions or develop new ones using our suggestions as a starting point. PMID:21625371

  2. Large Extremity Peripheral Nerve Repair

    DTIC Science & Technology

    2016-12-01

    norbornene-2,3-dicarboxylic anhydride)/DMP-30 [2,4,6-tri (dimethylaminomethyl)phenol] (Tousimis Research Corp., Rockville, Md.); and then baked ...embedded in Epoxy resin (Tousimis Research Corporation, Rockville, MD), and then baked overnight in a 60°C oven. From each proximal and distal

  3. Reduction corporoplasty.

    PubMed

    Hakky, Tariq S; Martinez, Daniel; Yang, Christopher; Carrion, Rafael E

    2015-01-01

    Here we present the first video demonstration of reduction corporoplasty in the management of phallic disfigurement in a 17 year old man with a history of sickle cell disease and priapism. Surgical management of aneurysmal dilation of the corpora has yet to be defined in the literature. We performed bilateral elliptical incisions over the lateral corpora as management of aneurysmal dilation of the corpora to correct phallic disfigurement. The patient tolerated the procedure well and has resolution of his corporal disfigurement. Reduction corporoplasty using bilateral lateral elliptical incisions in the management of aneurysmal dilation of the corpora is a safe and feasible operation in the management of phallic disfigurement.

  4. A Large-Scale Analysis of Variance in Written Language.

    PubMed

    Johns, Brendan T; Jamieson, Randall K

    2018-01-22

    The collection of very large text sources has revolutionized the study of natural language, leading to the development of several models of language learning and distributional semantics that extract sophisticated semantic representations of words based on the statistical redundancies contained within natural language (e.g., Griffiths, Steyvers, & Tenenbaum; Jones & Mewhort; Landauer & Dumais; Mikolov, Sutskever, Chen, Corrado, & Dean). The models treat knowledge as an interaction of processing mechanisms and the structure of language experience. But language experience is often treated agnostically. We report a distributional semantic analysis that shows written language in fiction books varies appreciably between books from the different genres, books from the same genre, and even books written by the same author. Given that current theories assume that word knowledge reflects an interaction between processing mechanisms and the language environment, the analysis shows the need for the field to engage in a more deliberate consideration and curation of the corpora used in computational studies of natural language processing. Copyright © 2018 Cognitive Science Society, Inc.

  5. A Sample Corpus Integration in Language Teacher Education through Coursebook Evaluation

    ERIC Educational Resources Information Center

    Asik, Asuman

    2017-01-01

    Interest in the use of corpora in language teaching has increased over the past two decades. Many corpora have been utilized for several purposes in language classrooms directly or indirectly. In spite of the increasing awareness towards the use of corpora and the corpus tools, language teacher education programs still do not include corpus…

  6. Progestogen treatments for cycle management in a sheep model of assisted conception affect the growth patterns, the expression of luteinizing hormone receptors, and the progesterone secretion of induced corpora lutea.

    PubMed

    Letelier, Claudia; García-Fernández, Rosa Ana; Contreras-Solis, Ignacio; Sanchez, María Angeles; Garcia-Palencia, Pilar; Sanchez, Belen; Gonzalez-Bulnes, Antonio; Flores, Juana María

    2010-03-01

    To determine, in a sheep model, the effect of a short-term progestative treatment on growth dynamics and functionality of induced corpora lutea. Observational, model study. Public university. Sixty adult female sheep. Synchronization and induction of ovulation with progestogens and prostaglandin analogues; ovarian ultrasonography, blood sampling, and ovariectomy. Determination of pituitary function and morphologic characteristics, expression of luteinizing hormone (LH) receptors, and progesterone secretion of corpora lutea. The use of progestative pretreatments for assisted conception affects the growth patterns, the expression of LH receptors, and the progesterone secretion of induced corpora lutea. The current study indicates, in a sheep model, the existence of deleterious effects from progestogens on functionality of induced corpora lutea. Copyright 2010 American Society for Reproductive Medicine. Published by Elsevier Inc. All rights reserved.

  7. U.S. Ratification of the Chemical Weapons Convention

    DTIC Science & Technology

    2011-12-01

    safeguard trade secrets. Leading corporations such as DuPont, Dow, and Monsanto also supported CWC ratification to improve the public image of the...represented large chemical companies such as Dow, DuPont, and Monsanto, was highly effective at contacting senators, putting out useful information, and

  8. A Corpus-Based EAP Course for NNS Doctoral Students: Moving from Available Specialized Corpora to Self-Compiled Corpora

    ERIC Educational Resources Information Center

    Lee, David; Swales, John

    2006-01-01

    This paper presents a discussion of an experimental, innovative course in corpus-informed EAP for doctoral students. Participants were given access to specialized corpora of academic writing and speaking, instructed in the tools of the trade (web- and PC-based concordancers) and gradually inducted into the skills needed to best exploit the data…

  9. The Use of General and Specialized Corpora as Reference Sources for Academic English Writing: A Case Study

    ERIC Educational Resources Information Center

    Chang, Ji-Yeon

    2014-01-01

    Corpora have been suggested as valuable sources for teaching English for academic purposes (EAP). Since previous studies have mainly focused on corpus use in classroom settings, more research is needed to reveal how students react to using corpora on their own and what should be provided to help them become autonomous corpus users, considering…

  10. Reduction Corporoplasty

    PubMed Central

    Hakky, Tariq S.; Martinez, Daniel; Yang, Christopher; Carrion, Rafael E.

    2015-01-01

    Objective: Here we present the first video demonstration of reduction corporoplasty in the management of phallic disfigurement in a 17 year old man with a history of sickle cell disease and priapism. Introduction: Surgical management of aneurysmal dilation of the corpora has yet to be defined in the literature. Materials and Methods: We performed bilateral elliptical incisions over the lateral corpora as management of aneurysmal dilation of the corpora to correct phallic disfigurement. Results: The patient tolerated the procedure well and has resolution of his corporal disfigurement. Conclusions: Reduction corporoplasty using bilateral lateral elliptical incisions in the management of aneurysmal dilation of the corpora is a safe and feasible operation in the management of phallic disfigurement. PMID:26005988

  11. Information Extraction from Unstructured Text for the Biodefense Knowledge Center

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Samatova, N F; Park, B; Krishnamurthy, R

    2005-04-29

    The Bio-Encyclopedia at the Biodefense Knowledge Center (BKC) is being constructed to allow early detection of emerging biological threats to homeland security. It requires highly structured information extracted from a variety of data sources. However, the quantity of new and vital information available from everyday sources cannot be assimilated by hand, and therefore reliable high-throughput information extraction techniques are much anticipated. In support of the BKC, Lawrence Livermore National Laboratory and Oak Ridge National Laboratory, together with the University of Utah, are developing an information extraction system built around the bioterrorism domain. This paper reports two important pieces of our effort integrated in the system: key phrase extraction and semantic tagging. Whereas two key phrase extraction technologies developed during the course of the project help identify relevant texts, our state-of-the-art semantic tagging system can pinpoint phrases related to emerging biological threats. We are also enhancing and tailoring the Bio-Encyclopedia by augmenting semantic dictionaries and extracting details of important events, such as suspected disease outbreaks. Some of these technologies have already been applied to large corpora of free text sources vital to the BKC mission, including ProMED-mail, PubMed abstracts, and the DHS's Information Analysis and Infrastructure Protection (IAIP) news clippings. In order to address the challenges involved in incorporating such large amounts of unstructured text, the overall system is focused on precise extraction of the most relevant information for inclusion in the BKC.

  12. Data-Informed Language Learning

    ERIC Educational Resources Information Center

    Godwin-Jones, Robert

    2017-01-01

    Although data collection has been used in language learning settings for some time, it is only in recent decades that large corpora have become available, along with efficient tools for their use. Advances in natural language processing (NLP) have enabled rich tagging and annotation of corpus data, essential for their effective use in language…

  13. Lexical Link Analysis Application: Improving Web Service to Acquisition Visibility Portal

    DTIC Science & Technology

    2013-09-30

    during the Empire Challenge 2008 and 2009 (EC08/09) field experiments and for numerous other field experiments of new technologies during Trident Warrior...Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000) (pp. 63–70). Retrieved from http://nlp.stanford.edu/manning

  14. RysannMD: A biomedical semantic annotator balancing speed and accuracy.

    PubMed

    Cuzzola, John; Jovanović, Jelena; Bagheri, Ebrahim

    2017-07-01

    Recently, both researchers and practitioners have explored the possibility of semantically annotating large and continuously evolving collections of biomedical texts such as research papers, medical reports, and physician notes in order to enable their efficient and effective management and use in clinical practice or research laboratories. Such annotations can be automatically generated by biomedical semantic annotators - tools that are specifically designed for detecting and disambiguating biomedical concepts mentioned in text. The biomedical community has already presented several solid automated semantic annotators. However, the existing tools are either strong in their disambiguation capacity, i.e., the ability to identify the correct biomedical concept for a given piece of text among several candidate concepts, or they excel in their processing time, i.e., work very efficiently, but none of the semantic annotation tools reported in the literature has both of these qualities. In this paper, we present RysannMD (Ryerson Semantic Annotator for Medical Domain), a biomedical semantic annotation tool that strikes a balance between processing time and performance while disambiguating biomedical terms. In other words, RysannMD provides reasonable disambiguation performance when choosing the right sense for a biomedical term in a given context, and does that in a reasonable time. To examine how RysannMD stands with respect to the state-of-the-art biomedical semantic annotators, we have conducted a series of experiments using standard benchmarking corpora, including both gold and silver standards, and four modern biomedical semantic annotators, namely cTAKES, MetaMap, NOBLE Coder, and Neji. The annotators were compared with respect to the quality of the produced annotations measured against gold and silver standards using precision, recall, and F1 measure and speed, i.e., processing time. In the experiments, RysannMD achieved the best median F1 measure across the benchmarking corpora, independent of the standard used (silver/gold), biomedical subdomain, and document size. In terms of the annotation speed, RysannMD scored the second best median processing time across all the experiments. The obtained results indicate that RysannMD offers the best performance among the examined semantic annotators when both quality of annotation and speed are considered simultaneously. Copyright © 2017 Elsevier Inc. All rights reserved.
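The evaluation described in this record (precision, recall and F1 against a standard, then a median over corpora) can be illustrated with a minimal sketch. It is not taken from the RysannMD paper: the exact-match criterion, the `(start, end, concept_id)` tuple layout, the concept identifiers and the per-corpus scores are all assumptions made for illustration.

```python
from statistics import median


def prf1(gold, predicted):
    """Return (precision, recall, F1) for two sets of annotations.

    Assumption: an annotation counts as correct only if its span and
    concept identifier both match the gold standard exactly.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: (start offset, end offset, concept identifier).
gold = {(0, 8, "C001"), (15, 24, "C002"), (30, 41, "C003"), (50, 58, "C004")}
predicted = {(0, 8, "C001"), (15, 24, "C002"), (30, 41, "C999")}

precision, recall, f1 = prf1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")

# Hypothetical per-corpus F1 scores, summarized by their median.
f1_per_corpus = [0.62, 0.71, 0.57]
median_f1 = median(f1_per_corpus)
print(f"median F1 across corpora: {median_f1}")
```

A median is less sensitive than a mean to a single unusually easy or hard corpus, which may be why the record reports median scores.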

  15. BioC: a minimalist approach to interoperability for biomedical text processing

    PubMed Central

    Comeau, Donald C.; Islamaj Doğan, Rezarta; Ciccarese, Paolo; Cohen, Kevin Bretonnel; Krallinger, Martin; Leitner, Florian; Lu, Zhiyong; Peng, Yifan; Rinaldi, Fabio; Torii, Manabu; Valencia, Alfonso; Verspoor, Karin; Wiegers, Thomas C.; Wu, Cathy H.; Wilbur, W. John

    2013-01-01

    A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/. Database URL: http://bioc.sourceforge.net/ PMID:24048470

  16. Evaluating Corpus Literacy Training for Pre-Service Language Teachers: Six Case Studies

    ERIC Educational Resources Information Center

    Heather, Julian; Helt, Marie

    2012-01-01

    Corpus literacy is the ability to use corpora--large, principled databases of spoken and written language--for language analysis and instruction. While linguists have emphasized the importance of corpus training in teacher preparation programs, few studies have investigated the process of initiating teachers into corpus literacy with the result…

  17. Large Extremity Peripheral Nerve Repair

    DTIC Science & Technology

    2016-12-01

    baked overnight in a 60°C oven. Using a diamond blade, 1-μm sections were cut 5 mm proximal and 5 mm distal to the graft. Histologic slides were...Sciences), embedded in Epoxy resin (Tousimis Research Corporation, Rockville, MD), and then baked overnight in a 60°C oven. From each proximal and

  18. Validating a strategy for psychosocial phenotyping using a large corpus of clinical text.

    PubMed

    Gundlapalli, Adi V; Redd, Andrew; Carter, Marjorie; Divita, Guy; Shen, Shuying; Palmer, Miland; Samore, Matthew H

    2013-12-01

    To develop algorithms to improve efficiency of patient phenotyping using natural language processing (NLP) on text data. Of a large number of note titles available in our database, we sought to determine those with highest yield and precision for psychosocial concepts. From a database of over 1 billion documents from US Department of Veterans Affairs medical facilities, a random sample of 1500 documents from each of 218 enterprise note titles were chosen. Psychosocial concepts were extracted using a UIMA-AS-based NLP pipeline (v3NLP), using a lexicon of relevant concepts with negation and template format annotators. Human reviewers evaluated a subset of documents for false positives and sensitivity. High-yield documents were identified by hit rate and precision. Reasons for false positivity were characterized. A total of 58 707 psychosocial concepts were identified from 316 355 documents for an overall hit rate of 0.2 concepts per document (median 0.1, range 1.6-0). Of 6031 concepts reviewed from a high-yield set of note titles, the overall precision for all concept categories was 80%, with variability among note titles and concept categories. Reasons for false positivity included templating, negation, context, and alternate meaning of words. The sensitivity of the NLP system was noted to be 49% (95% CI 43% to 55%). Phenotyping using NLP need not involve the entire document corpus. Our methods offer a generalizable strategy for scaling NLP pipelines to large free text corpora with complex linguistic annotations in attempts to identify patients of a certain phenotype.

  19. Validating a strategy for psychosocial phenotyping using a large corpus of clinical text

    PubMed Central

    Gundlapalli, Adi V; Redd, Andrew; Carter, Marjorie; Divita, Guy; Shen, Shuying; Palmer, Miland; Samore, Matthew H

    2013-01-01

    Objective To develop algorithms to improve efficiency of patient phenotyping using natural language processing (NLP) on text data. Of a large number of note titles available in our database, we sought to determine those with highest yield and precision for psychosocial concepts. Materials and methods From a database of over 1 billion documents from US Department of Veterans Affairs medical facilities, a random sample of 1500 documents from each of 218 enterprise note titles were chosen. Psychosocial concepts were extracted using a UIMA-AS-based NLP pipeline (v3NLP), using a lexicon of relevant concepts with negation and template format annotators. Human reviewers evaluated a subset of documents for false positives and sensitivity. High-yield documents were identified by hit rate and precision. Reasons for false positivity were characterized. Results A total of 58 707 psychosocial concepts were identified from 316 355 documents for an overall hit rate of 0.2 concepts per document (median 0.1, range 1.6–0). Of 6031 concepts reviewed from a high-yield set of note titles, the overall precision for all concept categories was 80%, with variability among note titles and concept categories. Reasons for false positivity included templating, negation, context, and alternate meaning of words. The sensitivity of the NLP system was noted to be 49% (95% CI 43% to 55%). Conclusions Phenotyping using NLP need not involve the entire document corpus. Our methods offer a generalizable strategy for scaling NLP pipelines to large free text corpora with complex linguistic annotations in attempts to identify patients of a certain phenotype. PMID:24169276
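The overall hit rate quoted in this record follows directly from the two totals it reports. The sketch below reproduces that arithmetic and shows how note titles could be ranked by hit rate; the per-title counts and title names are hypothetical and only the two overall totals come from the abstract.

```python
def hit_rate(concepts_found, documents_processed):
    """Concepts identified per document processed."""
    return concepts_found / documents_processed


# Totals reported in the abstract.
overall = hit_rate(58_707, 316_355)
print(f"overall hit rate: {overall:.3f} concepts per document")

# Hypothetical per-title counts: (concepts found, documents sampled).
# The sample size of 1500 per title matches the abstract; the rest is invented.
per_title = {
    "title_a": (1200, 1500),
    "title_b": (90, 1500),
    "title_c": (0, 1500),
}

# Rank note titles from highest to lowest yield.
ranked = sorted(per_title, key=lambda t: hit_rate(*per_title[t]), reverse=True)
print(ranked)
```

Ranking by hit rate alone ignores precision, which the study measured separately through human review of a subset of documents.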

  20. Vietnamese Document Representation and Classification

    NASA Astrophysics Data System (ADS)

    Nguyen, Giang-Son; Gao, Xiaoying; Andreae, Peter

    Vietnamese is very different from English and little research has been done on Vietnamese document classification, or indeed, on any kind of Vietnamese language processing, and only a few small corpora are available for research. We created a large Vietnamese text corpus with about 18000 documents, and manually classified them based on different criteria such as topics and styles, giving several classification tasks of different difficulty levels. This paper introduces a new syllable-based document representation at the morphological level of the language for efficient classification. We tested the representation on our corpus with different classification tasks using six classification algorithms and two feature selection techniques. Our experiments show that the new representation is effective for Vietnamese categorization, and suggest that best performance can be achieved using syllable-pair document representation, an SVM with a polynomial kernel as the learning algorithm, and using Information gain and an external dictionary for feature selection.
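As a rough illustration of a syllable-pair representation, the sketch below counts adjacent syllable pairs in a document. It assumes that syllables are the whitespace-separated units of written Vietnamese and that a "syllable pair" means two adjacent syllables; the paper's actual feature definition may differ, and punctuation handling is left out.

```python
from collections import Counter


def syllable_pair_features(text):
    """Count adjacent syllable pairs in a whitespace-delimited text.

    Assumption: each whitespace-separated token is one syllable.
    """
    syllables = text.lower().split()
    return Counter(zip(syllables, syllables[1:]))


# Toy document; the sentence is illustrative only.
doc = "học sinh và sinh viên học sinh học"
features = syllable_pair_features(doc)

for pair, count in features.most_common(3):
    print(pair, count)
```

The resulting counts can serve as a sparse feature vector for a classifier such as the SVM mentioned in the abstract, without requiring a word segmenter.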

  1. Collaborative work on evaluation of ovarian toxicity. 13) Two- or four-week repeated dose studies and fertility study of PPAR alpha/gamma dual agonist in female rats.

    PubMed

    Sato, Norihiro; Uchida, Keisuke; Nakajima, Mikio; Watanabe, Atsushi; Kohira, Terutomo

    2009-01-01

    The main focus of this study was to determine the optimal dosing period in a repeated dose toxicity study based on toxic effects as assessed by ovarian morphological changes. To assess morphological and functional changes induced in the ovary by a peroxisome proliferator-activated receptor (PPAR) alpha/gamma dual agonist, the compound was administered to female rats at dose levels of 0, 4, 20, and 100 mg/kg/day in a repeated dose toxicity study for 2 or 4 weeks, and from 2 weeks prior to mating to Day 7 of pregnancy in a female fertility study. In the repeated dose toxicity study, an increase in atresia of large follicles, a decrease in corpora lutea, and an increase in stromal cells were observed in the treated groups. In addition, the granulosa cell exfoliations into antrum of large follicles and corpora lutea with retained oocyte are morphological characteristics induced by this compound, and they might be related with abnormal condition of ovulation. In the female fertility study, the pregnancy rate tended to decrease in the 100 mg/kg/day group. At necropsy, decreases in the number of corpora lutea, implantations and live embryos were noted in the 20 and 100 mg/kg/day group. No changes were observed in animals given 4 mg/kg/day. These findings indicated that histopathological changes in the ovary are important endpoints for evaluation of drugs inducing ovarian damage. In conclusion, a 2-week administration period is sufficient to detect ovarian toxicity of this test compound in the repeated dose toxicity study.

  2. Cross domains Arabic named entity recognition system

    NASA Astrophysics Data System (ADS)

    Al-Ahmari, S. Saad; Abdullatif Al-Johar, B.

    2016-07-01

    Named Entity Recognition (NER) plays an important role in many Natural Language Processing (NLP) applications such as Information Extraction (IE), Question Answering (QA), Text Clustering, Text Summarization and Word Sense Disambiguation. This paper presents the development and implementation of a domain-independent system to recognize three types of Arabic named entities. The system is based on a set of domain-independent grammar rules along with an Arabic part-of-speech tagger, in addition to gazetteers and lists of trigger words. The experimental results showed that the system performed as well as other systems, with better results in some cases on cross-domain corpora.

  3. A set of high quality colour images with Spanish norms for seven relevant psycholinguistic variables: the Nombela naming test.

    PubMed

    Moreno-Martinez, Francisco Javier; Montoro, Pedro R; Laws, Keith R

    2011-05-01

    This paper presents a new corpus of 140 high quality colour images belonging to 14 subcategories and covering a range of naming difficulty. One hundred and six Spanish speakers named the items and provided data for several psycholinguistic variables: age of acquisition, familiarity, manipulability, name agreement, typicality and visual complexity. Furthermore, we also present lexical frequency data derived from internet search hits. Apart from the large number of variables evaluated, these stimuli present an important advantage with respect to other comparable image corpora in so far as naming performance in healthy individuals is less prone to ceiling effect problems. Reliability and validity indexes showed that our items display similar psycholinguistic characteristics to those of other corpora. In sum, this set of ecologically valid stimuli provides a useful tool for scientists engaged in cognitive and neuroscience-based research.

  4. Event construal and temporal distance in natural language.

    PubMed

    Bhatia, Sudeep; Walasek, Lukasz

    2016-07-01

    Construal level theory proposes that events that are temporally proximate are represented more concretely than events that are temporally distant. We tested this prediction using two large natural language text corpora. In study 1 we examined posts on Twitter that referenced the future, and found that tweets mentioning temporally proximate dates used more concrete words than those mentioning distant dates. In study 2 we obtained all New York Times articles that referenced U.S. presidential elections between 1987 and 2007. We found that the concreteness of the words in these articles increased with the temporal proximity to their corresponding election. Additionally the reduction in concreteness after the election was much greater than the increase in concreteness leading up to the election, though both changes in concreteness were well described by an exponential function. We replicated this finding with New York Times articles referencing US public holidays. Overall, our results provide strong support for the predictions of construal level theory, and additionally illustrate how large natural language datasets can be used to inform psychological theory. Copyright © 2016 The Authors. Published by Elsevier B.V. All rights reserved.

  5. Automated Session-Quality Assessment for Human Tutoring Based on Expert Ratings of Tutoring Success

    ERIC Educational Resources Information Center

    Nye, Benjamin D.; Morrison, Donald M.; Samei, Borhan

    2015-01-01

    Archived transcripts from tens of millions of online human tutoring sessions potentially contain important knowledge about how online tutors help, or fail to help, students learn. However, without ways of automatically analyzing these large corpora, any knowledge in this data will remain buried. One way to approach this issue is to train an…

  6. Large Extremity Peripheral Nerve Repair

    DTIC Science & Technology

    2016-12-01

    Corp., Rockville, Md.); and then baked overnight in a 60°C oven. Using a diamond blade, 1-μm sections were cut 5 mm proximal and 5 mm distal to the...Epoxy resin (Tousimis Research Corporation, Rockville, MD), and then baked overnight in a 60°C oven. From each proximal and distal end, 1 µm

  7. Cell line name recognition in support of the identification of synthetic lethality in cancer from text

    PubMed Central

    Kaewphan, Suwisa; Van Landeghem, Sofie; Ohta, Tomoko; Van de Peer, Yves; Ginter, Filip; Pyysalo, Sampo

    2016-01-01

    Motivation: The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating for instance the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus. Results: We find that the best performance is achieved using NERsuite, a machine learning system based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary of cell line names. The system achieves an F-score of 88.46% on the test set of Gellus and 85.98% on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755 unique cell line database identifiers. Availability and implementation: The manually annotated datasets, the cell line dictionary, derived corpora, NERsuite models and the results of the large-scale run on unannotated texts are available under open licenses at http://turkunlp.github.io/Cell-line-recognition/. Contact: sukaew@utu.fi PMID:26428294

  8. Probing the Statistical Properties of Unknown Texts: Application to the Voynich Manuscript

    PubMed Central

    Amancio, Diego R.; Altmann, Eduardo G.; Rybski, Diego; Oliveira, Osvaldo N.; Costa, Luciano da F.

    2013-01-01

    While the use of statistical physics methods to analyze large corpora has been useful to unveil many patterns in texts, no comprehensive investigation has been performed on the interdependence between syntactic and semantic factors. In this study we propose a framework for determining whether a text (e.g., written in an unknown alphabet) is compatible with a natural language and to which language it could belong. The approach is based on three types of statistical measurements, i.e. obtained from first-order statistics of word properties in a text, from the topology of complex networks representing texts, and from intermittency concepts where text is treated as a time series. Comparative experiments were performed with the New Testament in 15 different languages and with distinct books in English and Portuguese in order to quantify the dependency of the different measurements on the language and on the story being told in the book. The metrics found to be informative in distinguishing real texts from their shuffled versions include assortativity, degree and selectivity of words. As an illustration, we analyze an undeciphered medieval manuscript known as the Voynich Manuscript. We show that it is mostly compatible with natural languages and incompatible with random texts. We also obtain candidates for keywords of the Voynich Manuscript which could be helpful in the effort of deciphering it. Because we were able to identify statistical measurements that are more dependent on the syntax than on the semantics, the framework may also serve for text analysis in language-dependent applications. PMID:23844002

  9. Probing the statistical properties of unknown texts: application to the Voynich Manuscript.

    PubMed

    Amancio, Diego R; Altmann, Eduardo G; Rybski, Diego; Oliveira, Osvaldo N; Costa, Luciano da F

    2013-01-01

    While the use of statistical physics methods to analyze large corpora has been useful to unveil many patterns in texts, no comprehensive investigation has been performed on the interdependence between syntactic and semantic factors. In this study we propose a framework for determining whether a text (e.g., written in an unknown alphabet) is compatible with a natural language and to which language it could belong. The approach is based on three types of statistical measurements, i.e. obtained from first-order statistics of word properties in a text, from the topology of complex networks representing texts, and from intermittency concepts where text is treated as a time series. Comparative experiments were performed with the New Testament in 15 different languages and with distinct books in English and Portuguese in order to quantify the dependency of the different measurements on the language and on the story being told in the book. The metrics found to be informative in distinguishing real texts from their shuffled versions include assortativity, degree and selectivity of words. As an illustration, we analyze an undeciphered medieval manuscript known as the Voynich Manuscript. We show that it is mostly compatible with natural languages and incompatible with random texts. We also obtain candidates for keywords of the Voynich Manuscript which could be helpful in the effort of deciphering it. Because we were able to identify statistical measurements that are more dependent on the syntax than on the semantics, the framework may also serve for text analysis in language-dependent applications.
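One way to picture the "intermittency" idea, where a text is treated as a time series, is to measure how unevenly a word recurs. The sketch below uses the coefficient of variation of the gaps between successive occurrences; this particular definition, the function name and the toy token sequences are assumptions for illustration and may not match the measurement used in the paper.

```python
import random
from statistics import mean, pstdev


def intermittency(tokens, word):
    """Coefficient of variation of the gaps between occurrences of `word`.

    Returns None when the word occurs fewer than three times, since
    fewer than two gaps give no meaningful spread.
    """
    positions = [i for i, token in enumerate(tokens) if token == word]
    if len(positions) < 3:
        return None
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)


# Toy sequences: "x" appears in two bursts, or at perfectly regular intervals.
bursty = ["x"] * 5 + ["a"] * 45 + ["x"] * 5 + ["a"] * 45
regular = (["x"] + ["a"] * 9) * 10

# A shuffled copy destroys the ordering while keeping word frequencies.
shuffled = bursty[:]
random.Random(0).shuffle(shuffled)

print("bursty  :", intermittency(bursty, "x"))
print("regular :", intermittency(regular, "x"))
print("shuffled:", intermittency(shuffled, "x"))
```

Comparing the value for a real ordering with the value for a shuffled copy mirrors the abstract's contrast between real texts and their shuffled versions.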

  10. [A customized method for information extraction from unstructured text data in the electronic medical records].

    PubMed

    Bao, X Y; Huang, W J; Zhang, K; Jin, M; Li, Y; Niu, C Z

    2018-04-18

    Electronic medical records (EMRs) contain a huge amount of diagnostic and treatment information, which is a concrete record of clinicians' actual diagnosis and treatment decisions. Many sections of an EMR, such as complaints, present illness, past history, differential diagnosis, diagnostic imaging, and surgical records, reflect details of diagnosis and treatment in the clinical process and are written as natural-language Chinese text. How to extract useful information from these Chinese narrative texts and organize it into tabular form for medical research, so that clinical data can be put to practical use in the real world, is a difficult problem in Chinese medical data processing. Based on the EMR narrative text data of a tertiary hospital in China, we propose a method that learns customized information extraction rules and applies rule-based information extraction. The overall method consists of three steps: (1) A random sample of 600 EMR documents (including the history of present illness, past history, personal history, family history, etc.) was extracted as the raw corpus. Using our Chinese clinical narrative text annotation platform, trained clinicians and nurses marked the tokens and phrases in the corpus to be extracted (with a history of diabetes as an example). (2) Based on the annotated clinical text, extraction templates were first summarized and induced. These templates were then rewritten as regular expressions in the Perl programming language to serve as extraction rules. Using these extraction rules as the basic knowledge base, we developed extraction packages in Perl for extracting data from the EMR text. Finally, the extracted data items were organized in tabular format for later use in clinical research or hospital surveillance. (3) The proposed methods were evaluated and validated in the National Clinical Service Data Integration Platform, where the extraction results were checked by a combination of manual and automated verification, which demonstrated the effectiveness of the method. Among all patients with a diagnosis of diabetes in the hospital's Department of Endocrinology, 1 436 patients were discharged in 2015; for their medical history sections, extraction of a history of diabetes achieved a recall of 87.6%, an accuracy of 99.5%, and an F-score of 0.93. For 10% of the patients (1 223 patients in total) with diabetes discharged from the same department by August 2017, extraction of a history of diabetes achieved a recall of 89.2%, an accuracy of 99.2%, and an F-score of 0.94. This study combines natural language processing with rule-based information extraction, and designs and implements an algorithm for extracting customized information from unstructured Chinese EMR text. It achieves better results than existing work.
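The rule-based step in this record can be sketched with a single regular expression that turns narrative text into tabular rows, followed by a check of the reported F-scores. The study wrote its rules in Perl for Chinese text; the sketch below swaps in Python and English toy notes, and the pattern, field names and notes are hypothetical. The F-score check assumes that the reported "accuracy rate" is what is usually called precision.

```python
import re

# Hypothetical extraction rule: a diabetes history with an optional duration.
RULE = re.compile(
    r"history of (?P<disease>diabetes)(?: for (?P<years>\d+) years)?",
    re.IGNORECASE,
)


def extract(note):
    """Return one tabular row for a note, or None if the rule does not match."""
    match = RULE.search(note)
    if match is None:
        return None
    years = match.group("years")
    return {
        "disease": match.group("disease").lower(),
        "years": int(years) if years else None,
    }


# Hypothetical toy notes standing in for EMR narrative text.
notes = [
    "Patient has a history of diabetes for 10 years, on insulin.",
    "History of diabetes. No hypertension.",
    "No relevant past history.",
]
rows = [extract(note) for note in notes]
print(rows)


def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract, reading "accuracy rate" as precision.
f_2015 = f_score(0.995, 0.876)
f_2017 = f_score(0.992, 0.892)
print(f"{f_2015:.2f} {f_2017:.2f}")
```

A rule this simple would also match a negated phrase such as "no history of diabetes", so a real rule set would need explicit negation handling.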

  11. Phraseology and Frequency of Occurrence on the Web: Native Speakers' Perceptions of Google-Informed Second Language Writing

    ERIC Educational Resources Information Center

    Geluso, Joe

    2013-01-01

    Usage-based theories of language learning suggest that native speakers of a language are acutely aware of formulaic language due in large part to frequency effects. Corpora and data-driven learning can offer useful insights into frequent patterns of naturally occurring language to second/foreign language learners who, unlike native speakers, are…

  12. The reduction corporoplasty: the answer to the improbable urologic question "can you make my penis smaller?".

    PubMed

    Martinez, Daniel R; Manimala, Neil J; Rafiei, Arash; Hakky, Tariq S; Yang, Chris; Carrion, Rafael

    2015-03-01

    Aneurysmal dilatation of the corpora cavernosa can occur because of recurrent priapism in the setting of sickle cell disease. We present the first case of a successful implementation of the reduction corporoplasty technique for treatment of a phallus that was "too large for intercourse." We describe the presentation of a 17-year-old male with a history of sickle cell disease with a phallus "too large for intercourse." Patient reported normal erectile function and response with masturbation but also reported inability to penetrate his partner due to the enlarged and disfigured morphology. He had three priapismic episodes since the age of 10 that progressively led to an aneurysmal morphologic deformity of his phallus. Evaluation included a magnetic resonance imaging, which revealed true aneurysmal dilatation of bilateral corpora cavernosa in the middle and distal portions, and diffusely hyperplastic tunica. The main outcome measure is the successful management of phallic disfiguration. Reduction corporoplasty was performed, and the patient reported intact erectile function without aneurysmal recurrence. Patients with significant corporal aneurysmal defects secondary to recurrent priapism can be successfully managed with reduction corporoplasty. © 2014 International Society for Sexual Medicine.

  13. Avoid violence, rioting, and outrage; approach celebration, delight, and strength: Using large text corpora to compute valence, arousal, and the basic emotions.

    PubMed

    Westbury, Chris; Keith, Jeff; Briesemeister, Benny B; Hofmann, Markus J; Jacobs, Arthur M

    2015-01-01

    Ever since Aristotle discussed the issue in Book II of his Rhetoric, humans have attempted to identify a set of "basic emotion labels". In this paper we propose an algorithmic method for evaluating sets of basic emotion labels that relies upon computed co-occurrence distances between words in a 12.7-billion-word corpus of unselected text from USENET discussion groups. Our method uses the relationship between human arousal and valence ratings collected for a large list of words, and the co-occurrence similarity between each word and emotion labels. We assess how well the words in each of 12 emotion label sets-proposed by various researchers over the past 118 years-predict the arousal and valence ratings on a test and validation dataset, each consisting of over 5970 items. We also assess how well these emotion labels predict lexical decision residuals (LDRTs), after co-varying out the effects attributable to basic lexical predictors. We then demonstrate a generalization of our method to determine the most predictive "basic" emotion labels from among all of the putative models of basic emotion that we considered. As well as contributing empirical data towards the development of a more rigorous definition of basic emotions, our method makes it possible to derive principled computational estimates of emotionality-specifically, of arousal and valence-for all words in the language.

  14. Isolation and structure elucidation of neuropeptides of the AKH/RPCH family in long-horned grasshoppers (Ensifera).

    PubMed

    Gäde, G

    1992-11-01

    An identical neuropeptide was isolated by reversed-phase high-performance liquid chromatography from the corpora cardiaca of the king cricket, Libanasidus vittatus, and the two armoured ground crickets, Heterodes namaqua and Acanthoproctus cervinus. The crude gland extracts had adipokinetic activity in migratory locusts, hypertrehalosaemic activity in American cockroaches and a slight hypertrehalosaemic, but no adipokinetic, effect in armoured ground crickets. The primary structure of this neuropeptide was determined by pulsed-liquid phase sequencing employing Edman chemistry after enzymically deblocking the N-terminal 5-oxopyrrolidine-2-carboxylic acid residue. The C-terminus was also blocked, as indicated by the lack of digestion by carboxypeptidase A. The peptide was assigned the structure [symbol: see text]Glu-Leu-Asn-Phe-Ser-Thr-Gly-TrpNH2, previously designated Scg-AKH-II. The corpora cardiaca of the cricket Gryllodes sigillatus contained a neuropeptide which differed in retention time from the one isolated from the king and armoured ground crickets. The structure was assigned as [symbol: see text]Glu-Val-Asn-Phe-Ser-Thr-Gly-TrpNH2, previously designated Grb-AKH. This octapeptide caused hyperlipaemia in its donor species. The presence of the same peptide, Scg-AKH-II, in the two primitive infraorders of Ensifera, and the different peptide, Grb-AKH, in the most advanced infraorder of Ensifera, supports the evolutionary trends assigned formerly from morphological and physiological evidence.

  15. A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC

    PubMed Central

    Clematide, Simon; Akhondi, Saber A; van Mulligen, Erik M; Rebholz-Schuhmann, Dietrich

    2015-01-01

    Objective To create a multilingual gold-standard corpus for biomedical concept recognition. Materials and methods We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified Medical Language System and covering a wide range of semantic groups. To reduce the annotation workload, automatically generated preannotations were provided. Individual annotations were automatically harmonized and then adjudicated, and cross-language consistency checks were carried out to arrive at the final annotations. Results The number of final annotations was 5530. Inter-annotator agreement scores indicate good agreement (median F-score 0.79), and are similar to those between individual annotators and the gold standard. The automatically generated harmonized annotation set for each language performed equally well as the best annotator for that language. Discussion The use of automatic preannotations, harmonized annotations, and parallel corpora helped to keep the manual annotation efforts manageable. The inter-annotator agreement scores provide a reference standard for gauging the performance of automatic annotation techniques. Conclusion To our knowledge, this is the first gold-standard corpus for biomedical concept recognition in languages other than English. Other distinguishing features are the wide variety of semantic groups that are being covered, and the diversity of text genres that were annotated. PMID:25948699
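
    The agreement figure quoted above (median F-score 0.79) can be illustrated with a small sketch. The abstract does not say how annotations were matched, so the code below assumes exact matching of hypothetical (start, end, concept) tuples and takes the median over all annotator pairs; the function names are illustrative, not the authors' tooling.

```python
from itertools import combinations
from statistics import median


def f_score(reference: set, candidate: set) -> float:
    """F1 of one annotation set scored against another (exact match assumed)."""
    if not reference and not candidate:
        return 1.0  # two empty sets agree trivially
    true_positives = len(reference & candidate)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(candidate)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def median_pairwise_f(annotators: dict) -> float:
    """Median F1 over every unordered pair of annotators."""
    scores = [
        f_score(annotators[a], annotators[b])
        for a, b in combinations(sorted(annotators), 2)
    ]
    return median(scores)


# Hypothetical annotations: (start offset, end offset, concept id).
annotations = {
    "A": {(0, 7, "C1"), (12, 20, "C2"), (25, 31, "C3")},
    "B": {(0, 7, "C1"), (12, 20, "C2")},
    "C": {(0, 7, "C1"), (25, 31, "C3"), (40, 44, "C4")},
}
print(median_pairwise_f(annotations))
```

    Because F1 is symmetric in precision and recall, it does not matter which annotator of a pair is treated as the reference.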

  16. Hiding in plain sight: use of realistic surrogates to reduce exposure of protected health information in clinical text.

    PubMed

    Carrell, David; Malin, Bradley; Aberdeen, John; Bayer, Samuel; Clark, Cheryl; Wellner, Ben; Hirschman, Lynette

    2013-01-01

    Secondary use of clinical text is impeded by a lack of highly effective, low-cost de-identification methods. Both manual and automated methods for removing protected health information are known to leave behind residual identifiers. The authors propose a novel approach for addressing the residual identifier problem based on the theory of Hiding In Plain Sight (HIPS). HIPS relies on obfuscation to conceal residual identifiers. According to this theory, replacing the detected identifiers with realistic but synthetic surrogates should collectively render the few 'leaked' identifiers difficult to distinguish from the synthetic surrogates. The authors conducted a pilot study to test this theory on clinical narrative, de-identified by an automated system. Test corpora included 31 oncology and 50 family practice progress notes read by two trained chart abstractors and an informaticist. Experimental results suggest approximately 90% of residual identifiers can be effectively concealed by the HIPS approach in text containing average and high densities of personal identifying information. This pilot test suggests HIPS is feasible, but requires further evaluation. The results need to be replicated on larger corpora of diverse origin under a range of detection scenarios. Error analyses also suggest areas where surrogate generation techniques can be refined to improve efficacy. If these results generalize to existing high-performing de-identification systems with recall rates of 94-98%, HIPS could increase the effective de-identification rates of these systems to levels above 99% without further advancements in system recall. Additional and more rigorous assessment of the HIPS approach is warranted.
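
    The "above 99%" figure follows from simple arithmetic on the numbers in the abstract. The sketch below assumes that the roughly 90% concealment rate applies uniformly to whatever identifiers a system misses; the function name is illustrative and not taken from the paper.

```python
def effective_deid_rate(recall: float, concealed_fraction: float) -> float:
    """Share of identifiers that are either removed or hidden among surrogates."""
    residual = 1.0 - recall                                  # identifiers the system missed
    still_exposed = residual * (1.0 - concealed_fraction)    # missed and not concealed
    return 1.0 - still_exposed


# Recall range quoted in the abstract, with ~90% of residual identifiers concealed.
for recall in (0.94, 0.98):
    rate = effective_deid_rate(recall, 0.90)
    print(f"recall={recall:.2f} -> effective rate={rate:.3f}")
```

    Under this assumption a system with 94% recall reaches an effective rate of about 99.4%, and one with 98% recall about 99.8%.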

  17. The Impact of Misspelled Words on Automated Computer Scoring: A Case Study of Scientific Explanations

    NASA Astrophysics Data System (ADS)

    Ha, Minsu; Nehm, Ross H.

    2016-06-01

    Automated computerized scoring systems (ACSSs) are being increasingly used to analyze text in many educational settings. Nevertheless, the impact of misspelled words (MSW) on scoring accuracy remains to be investigated in many domains, particularly jargon-rich disciplines such as the life sciences. Empirical studies confirm that MSW are a pervasive feature of human-generated text and that despite improvements, spell-check and auto-replace programs continue to be characterized by significant errors. Our study explored four research questions relating to MSW and text-based computer assessments: (1) Do English language learners (ELLs) produce equivalent magnitudes and types of spelling errors as non-ELLs? (2) To what degree do MSW impact concept-specific computer scoring rules? (3) What impact do MSW have on computer scoring accuracy? and (4) Are MSW more likely to impact false-positive or false-negative feedback to students? We found that although ELLs produced twice as many MSW as non-ELLs, MSW were relatively uncommon in our corpora. The MSW in the corpora were found to be important features of the computer scoring models. Although MSW did not significantly or meaningfully impact computer scoring efficacy across nine different computer scoring models, MSW had a greater impact on the scoring algorithms for naïve ideas than key concepts. Linguistic and concept redundancy in student responses explains the weak connection between MSW and scoring accuracy. Lastly, we found that MSW tend to have a greater impact on false-positive feedback. We discuss the implications of these findings for the development of next-generation science assessments.

  18. Approaching the Linguistic Complexity

    NASA Astrophysics Data System (ADS)

    Drożdż, Stanisław; Kwapień, Jarosław; Orczyk, Adam

    We analyze the rank-frequency distributions of words in selected English and Polish texts. We compare scaling properties of these distributions in both languages. We also study a few small corpora of Polish literary texts and find that for a corpus consisting of texts written by different authors the basic scaling regime is broken more strongly than in the case of a comparable corpus consisting of texts written by the same author. Similarly, for a corpus consisting of texts translated into Polish from other languages the scaling regime is broken more strongly than for a comparable corpus of native Polish texts. Moreover, based on the British National Corpus, we consider the rank-frequency distributions of the grammatically basic forms of words (lemmas) tagged with their proper part of speech. We find that these distributions do not scale if each part of speech is analyzed separately. The only part of speech that independently develops a trace of scaling is verbs.
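
    A rank-frequency distribution of the kind analyzed here can be computed in a few lines. The sketch below is a generic illustration, not the authors' procedure: the regex tokenization and the least-squares fit on log-log axes are assumptions, and a slope near -1 would correspond to classical Zipf scaling.

```python
import math
import re
from collections import Counter


def rank_frequency(text: str) -> list:
    """Return (rank, word, frequency) rows, most frequent word first."""
    words = re.findall(r"\w+", text.lower())  # \w is Unicode-aware, so Polish letters work
    counts = Counter(words)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ordered, start=1)]


def loglog_slope(table: list) -> float:
    """Least-squares slope of log(frequency) against log(rank); needs >= 2 rows."""
    xs = [math.log(rank) for rank, _, _ in table]
    ys = [math.log(freq) for _, _, freq in table]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


sample = "the cat and the dog and the bird saw the cat"
table = rank_frequency(sample)
print(table[:3])
print(loglog_slope(table))
```

    A single global slope hides exactly the effect the abstract describes; comparing fits over different rank ranges, or across corpora, is one way to see where the scaling regime breaks.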

  19. Event-based text mining for biology and functional genomics

    PubMed Central

    Thompson, Paul; Nawaz, Raheel; McNaught, John; Kell, Douglas B.

    2015-01-01

    The assessment of genome function requires a mapping between genome-derived entities and biochemical reactions, and the biomedical literature represents a rich source of information about reactions between biological components. However, the increasingly rapid growth in the volume of literature provides both a challenge and an opportunity for researchers to isolate information about reactions of interest in a timely and efficient manner. In response, recent text mining research in the biology domain has been largely focused on the identification and extraction of ‘events’, i.e. categorised, structured representations of relationships between biochemical entities, from the literature. Functional genomics analyses necessarily encompass events as so defined. Automatic event extraction systems facilitate the development of sophisticated semantic search applications, allowing researchers to formulate structured queries over extracted events, so as to specify the exact types of reactions to be retrieved. This article provides an overview of recent research into event extraction. We cover annotated corpora on which systems are trained, systems that achieve state-of-the-art performance and details of the community shared tasks that have been instrumental in increasing the quality, coverage and scalability of recent systems. Finally, several concrete applications of event extraction are covered, together with emerging directions of research. PMID:24907365

  20. Automatic measurement of voice onset time using discriminative structured prediction.

    PubMed

    Sonderegger, Morgan; Keshet, Joseph

    2012-12-01

    A discriminative large-margin algorithm for automatic measurement of voice onset time (VOT) is described, considered as a case of predicting structured output from speech. Manually labeled data are used to train a function that takes as input a speech segment of an arbitrary length containing a voiceless stop, and outputs its VOT. The function is explicitly trained to minimize the difference between predicted and manually measured VOT; it operates on a set of acoustic feature functions designed based on spectral and temporal cues used by human VOT annotators. The algorithm is applied to initial voiceless stops from four corpora, representing different types of speech. Using several evaluation methods, the algorithm's performance is near human intertranscriber reliability, and compares favorably with previous work. Furthermore, the algorithm's performance is minimally affected by training and testing on different corpora, and remains essentially constant as the amount of training data is reduced to 50-250 manually labeled examples, demonstrating the method's practical applicability to new datasets.

  1. Experiments in automatic word class and word sense identification for information retrieval

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gauch, S.; Futrelle, R.P.

    Automatic identification of related words and automatic detection of word senses are two long-standing goals of researchers in natural language processing. Word class information and word sense identification may enhance the performance of information retrieval systems. Large online corpora and increased computational capabilities make new techniques based on corpus linguistics feasible. Corpus-based analysis is especially needed for corpora from specialized fields for which no electronic dictionaries or thesauri exist. The methods described here use a combination of mutual information and word context to establish word similarities. Then, unsupervised classification is done using clustering in the word space, identifying word classes without pretagging. We also describe an extension of the method to handle the difficult problems of disambiguation and of determining part-of-speech and semantic information for low-frequency words. The method is powerful enough to produce high-quality results on a small corpus of 200,000 words from abstracts in a field of molecular biology.

  2. Feature-level sentiment analysis by using comparative domain corpora

    NASA Astrophysics Data System (ADS)

    Quan, Changqin; Ren, Fuji

    2016-06-01

    Feature-level sentiment analysis (SA) is able to provide more fine-grained SA on certain opinion targets and has a wider range of applications on E-business. This study proposes an approach based on comparative domain corpora for feature-level SA. The proposed approach makes use of word associations for domain-specific feature extraction. First, we assign a similarity score for each candidate feature to denote its similarity extent to a domain. Then we identify domain features based on their similarity scores on different comparative domain corpora. After that, dependency grammar and a general sentiment lexicon are applied to extract and expand feature-oriented opinion words. Lastly, the semantic orientation of a domain-specific feature is determined based on the feature-oriented opinion lexicons. In evaluation, we compare the proposed method with several state-of-the-art methods (including unsupervised and semi-supervised) using a standard product review test collection. The experimental results demonstrate the effectiveness of using comparative domain corpora.

  3. [Erectile function and ablative surgery of penile tumors].

    PubMed

    Pisani, E; Austoni, E; Trinchieri, A; Ceresoli, A; Mantovani, F; Colombo, F; Mastromarino, G; Vecchio, D; Canclini, L; Fenice, O

    1994-02-01

    The authors aim to show that radical excision can be combined with minimal invasiveness in the surgery of penile cancer. The focal point of every therapeutic decision is correct clinical staging. Unfortunately, there is some confusion in the two international staging systems (TNM and Jackson's classification). In particular, the anatomical difference between epithelioma of the glans infiltrating the corpus spongiosum and subcoronary epithelioma of the shaft infiltrating the corpora cavernosa is not made clear. Infiltration of the corpora cavernosa is a far more aggressive oncological manifestation than infiltration of the corpus spongiosum, so we consider Jackson's classification more suitable. In terms of surgery, this anatomical independence makes it easy to treat the corpora cavernosa as a distinct entity: they remain perfectly functional when separated from the glandulo-spongio-urethral unit with its vasculo-nervous bundle. This makes it possible to preserve erectile function when clinical staging shows that the tumour is not infiltrating the corpora cavernosa. The authors present their results, which appear rather good.

  4. Hemodynamics of erection in man

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shirai, M.; Ishii, N.

    1981-02-01

    Inquiry was made into the theory that closure of the efferent vein from the corpora cavernosa is essential for erection of the human penis. To determine whether the venous closure is indeed a prerequisite to human penile erection, two tests were carried out in men: (1) direct infusion of 133Xe into corpora cavernosa and (2) performance of cavernosography. In each case, penile erection was induced by providing the subject with sexual stimulation. The behavioral changes were studied through the 133Xe clearance curve and the contrast medium, respectively. When the penis remained flaccid, the 133Xe clearance curve followed a gentle path and the contrast medium could be noted within the penis for a relatively long period. However, on erection with sexual stimulation, the 133Xe clearance curve fell rapidly instead of following the gentle course expected in the case of venous closure. Also, the contrast medium quickly flowed out of the corpora cavernosa. The human penis therefore can well erect without closure of the efferent vein from the corpora cavernosa.

  5. Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD.

    PubMed

    Bullinaria, John A; Levy, Joseph P

    2012-09-01

    In a previous article, we presented a systematic computational study of the extraction of semantic representations from the word-word co-occurrence statistics of large text corpora. The conclusion was that semantic vectors of pointwise mutual information values from very small co-occurrence windows, together with a cosine distance measure, consistently resulted in the best representations across a range of psychologically relevant semantic tasks. This article extends that study by investigating the use of three further factors--namely, the application of stop-lists, word stemming, and dimensionality reduction using singular value decomposition (SVD)--that have been used to provide improved performance elsewhere. It also introduces an additional semantic task and explores the advantages of using a much larger corpus. This leads to the discovery and analysis of improved SVD-based methods for generating semantic representations (that provide new state-of-the-art performance on a standard TOEFL task) and the identification and discussion of problems and misleading results that can arise without a full systematic study.
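
    The core representation described above (pointwise mutual information vectors from a very small co-occurrence window, compared by cosine) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' code: negative PMI values are clipped to zero (positive PMI), the window is one word on each side, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_counts(tokens, window=1):
    """Count context words within `window` positions on either side of each target."""
    counts = defaultdict(Counter)
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[target][tokens[j]] += 1
    return counts


def ppmi_vectors(counts):
    """Turn raw counts into sparse vectors of positive PMI values (log base 2)."""
    total = sum(sum(ctx.values()) for ctx in counts.values())
    row_sums = {word: sum(ctx.values()) for word, ctx in counts.items()}
    col_sums = Counter()
    for ctx in counts.values():
        col_sums.update(ctx)
    vectors = {}
    for word, ctx in counts.items():
        vec = {}
        for context, n in ctx.items():
            pmi = math.log2(n * total / (row_sums[word] * col_sums[context]))
            if pmi > 0:  # assumption: keep only positive associations
                vec[context] = pmi
        vectors[word] = vec
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v.get(key, 0.0) for key, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


tokens = "the cat sat on the mat the dog sat on the rug".split()
vectors = ppmi_vectors(cooccurrence_counts(tokens, window=1))
print(cosine(vectors["cat"], vectors["dog"]))
print(cosine(vectors["cat"], vectors["mat"]))
```

    In this toy corpus "cat" and "dog" occur in identical contexts, so they end up closer to each other than to "mat". The stop-lists, stemming and SVD that the article studies would be applied before or after this step.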

  6. A resource-saving collective approach to biomedical semantic role labeling

    PubMed Central

    2014-01-01

    Background Biomedical semantic role labeling (BioSRL) is a natural language processing technique that identifies the semantic roles of the words or phrases in sentences describing biological processes and expresses them as predicate-argument structures (PAS’s). Currently, a major problem of BioSRL is that most systems label every node in a full parse tree independently; however, some nodes always exhibit dependency. In general SRL, collective approaches based on the Markov logic network (MLN) model have been successful in dealing with this problem. However, in BioSRL such an approach has not been attempted because it would require more training data to recognize the more specialized and diverse terms found in biomedical literature, increasing training time and computational complexity. Results We first constructed a collective BioSRL system based on MLN. This system, called collective BIOSMILE (CBIOSMILE), is trained on the BioProp corpus. To reduce the resources used in BioSRL training, we employ a tree-pruning filter to remove unlikely nodes from the parse tree and four argument candidate identifiers to retain candidate nodes in the tree. Nodes not recognized by any candidate identifier are discarded. The pruned annotated parse trees are used to train a resource-saving MLN-based system, which is referred to as resource-saving collective BIOSMILE (RCBIOSMILE). Our experimental results show that our proposed CBIOSMILE system outperforms BIOSMILE, which is the top BioSRL system. Furthermore, our proposed RCBIOSMILE maintains the same level of accuracy as CBIOSMILE using 92% less memory and 57% less training time. Conclusions This greatly improved efficiency makes RCBIOSMILE potentially suitable for training on much larger BioSRL corpora over more biomedical domains. Compared to real-world biomedical corpora, BioProp is relatively small, containing only 445 MEDLINE abstracts and 30 event triggers. It is not large enough for practical applications, such as pathway construction. We consider it of primary importance to pursue SRL training on large corpora in the future. PMID:24884358

  7. Chemical Topic Modeling: Exploring Molecular Data Sets Using a Common Text-Mining Approach.

    PubMed

    Schneider, Nadine; Fechner, Nikolas; Landrum, Gregory A; Stiefl, Nikolaus

    2017-08-28

    Big data is one of the key transformative factors which increasingly influences all aspects of modern life. Although this transformation brings vast opportunities it also generates novel challenges, not the least of which is organizing and searching this data deluge. The field of medicinal chemistry is not different: more and more data are being generated, for instance, by technologies such as DNA encoded libraries, peptide libraries, text mining of large literature corpora, and new in silico enumeration methods. Handling those huge sets of molecules effectively is quite challenging and requires compromises that often come at the expense of the interpretability of the results. In order to find an intuitive and meaningful approach to organizing large molecular data sets, we adopted a probabilistic framework called "topic modeling" from the text-mining field. Here we present the first chemistry-related implementation of this method, which allows large molecule sets to be assigned to "chemical topics" and the relationships between those topics to be investigated. In this first study, we thoroughly evaluate this novel method in different experiments and discuss both its disadvantages and advantages. We show very promising results in reproducing human-assigned concepts using the approach to identify and retrieve chemical series from sets of molecules. We have also created an intuitive visualization of the chemical topics output by the algorithm. This is a huge benefit compared to other unsupervised machine-learning methods, like clustering, which are commonly used to group sets of molecules. Finally, we applied the new method to the 1.6 million molecules of the ChEMBL22 data set to test its robustness and efficiency. In about 1 h we built a 100-topic model of this large data set in which we could identify interesting topics like "proteins", "DNA", or "steroids". Along with this publication we provide our data sets and an open-source implementation of the new method (CheTo) which will be part of an upcoming version of the open-source cheminformatics toolkit RDKit.

  8. Automatic Construction of English/Chinese Parallel Corpora.

    ERIC Educational Resources Information Center

    Yang, Christopher C.; Li, Kar Wing

    2003-01-01

    Discussion of multilingual corpora and cross-lingual information retrieval focuses on research that constructed an English/Chinese parallel corpus automatically from the World Wide Web. Presents an alignment method based on dynamic programming to identify one-to-one Chinese and English title pairs and discusses results of experiments…

  9. Using Corpora in EFL Classrooms: The Case Study of IELTS Preparation

    ERIC Educational Resources Information Center

    Smirnova, Elizaveta A.

    2017-01-01

    This article describes the gathered experience in using corpora in an IELTS preparation course. The practice demonstrates an attempt to reduce negative washback effects occurring when preparation courses just concentrate on the test format neglecting the importance of development of learners' language skills and general study skills. Some…

  10. Linguistic Corpora and Lexicography.

    ERIC Educational Resources Information Center

    Meijs, Willem

    1996-01-01

    Overviews the development of corpus linguistics, reviews the use of corpora in modern lexicography, and presents central issues in ongoing work aimed at broadening the scope of lexicographical use of corpus data. Focuses on how the field has developed in relation to the production of new monolingual English dictionaries by major British…

  11. Corpora in Language Teaching and Learning

    ERIC Educational Resources Information Center

    Boulton, Alex

    2017-01-01

    This timeline looks at explicit uses of corpora in foreign or second language (L2) teaching and learning, i.e. what happens when end-users explore corpus data, whether directly via concordancers or integrated into CALL programs, or indirectly with prepared printed materials. The underlying rationale is that such contact provides the massive…

  12. Nora: A Vocabulary Discovery Tool for Concept Extraction.

    PubMed

    Divita, Guy; Carter, Marjorie E; Durgahee, B S Begum; Pettey, Warren E; Redd, Andrew; Samore, Matthew H; Gundlapalli, Adi V

    2015-01-01

    Coverage of terms in domain-specific terminologies and ontologies is often limited in controlled medical vocabularies. Creating and augmenting such terminologies is resource intensive. We developed Nora as an interactive tool to discover terminology from text corpora; the output can then be employed to refine and enhance natural language processing-based concept extraction tasks. Nora provides a visualization of chains of words foraged from word frequency indexes from a text corpus. Domain experts direct and curate chains that contain relevant terms, which are further curated to identify lexical variants. A test of Nora demonstrated an increase of a domain lexicon in homelessness and related psychosocial factors by 38%, yielding an additional 10% extracted concepts.

  13. Local, Regional and Large Scale Integrated Networks

    DTIC Science & Technology

    1975-08-01

    e.g., [Abramson, 1970, 1973, Kleinrock, 1973, Kleinrock, 1975, Roberts, 1973, Gitman, 19??], have shown that this "fixed assignment" of the...Abramson, 1973, Kleinrock, 1973]), or intentionally avoid the issue of packet routing by proper assumptions [Gitman, 1975]. The issue of...Communications Systems," Memorandum RM-4781-PR, The Rand Corporation, February 1966. Frank, H., I. Gitman, R. Van Slyke, "Packet Radio System

  14. FacetGist: Collective Extraction of Document Facets in Large Technical Corpora.

    PubMed

    Siddiqui, Tarique; Ren, Xiang; Parameswaran, Aditya; Han, Jiawei

    2016-10-01

    Given the large volume of technical documents available, it is crucial to automatically organize and categorize these documents to be able to understand and extract value from them. Towards this end, we introduce a new research problem called Facet Extraction. Given a collection of technical documents, the goal of Facet Extraction is to automatically label each document with a set of concepts for the key facets (e.g., application, technique, evaluation metrics, and dataset) that people may be interested in. Facet Extraction has numerous applications, including document summarization, literature search, patent search and business intelligence. The major challenge in performing Facet Extraction arises from multiple sources: concept extraction, concept to facet matching, and facet disambiguation. To tackle these challenges, we develop FacetGist, a framework for facet extraction. Facet Extraction involves constructing a graph-based heterogeneous network to capture information available across multiple local sentence-level features, as well as global context features. We then formulate a joint optimization problem, and propose an efficient algorithm for graph-based label propagation to estimate the facet of each concept mention. Experimental results on technical corpora from two domains demonstrate that Facet Extraction can lead to an improvement of over 25% in both precision and recall over competing schemes.

  15. FacetGist: Collective Extraction of Document Facets in Large Technical Corpora

    PubMed Central

    Siddiqui, Tarique; Ren, Xiang; Parameswaran, Aditya; Han, Jiawei

    2017-01-01

    Given the large volume of technical documents available, it is crucial to automatically organize and categorize these documents to be able to understand and extract value from them. Towards this end, we introduce a new research problem called Facet Extraction. Given a collection of technical documents, the goal of Facet Extraction is to automatically label each document with a set of concepts for the key facets (e.g., application, technique, evaluation metrics, and dataset) that people may be interested in. Facet Extraction has numerous applications, including document summarization, literature search, patent search and business intelligence. The major challenge in performing Facet Extraction arises from multiple sources: concept extraction, concept to facet matching, and facet disambiguation. To tackle these challenges, we develop FacetGist, a framework for facet extraction. Facet Extraction involves constructing a graph-based heterogeneous network to capture information available across multiple local sentence-level features, as well as global context features. We then formulate a joint optimization problem, and propose an efficient algorithm for graph-based label propagation to estimate the facet of each concept mention. Experimental results on technical corpora from two domains demonstrate that Facet Extraction can lead to an improvement of over 25% in both precision and recall over competing schemes. PMID:28210517

  16. Jointly learning word embeddings using a corpus and a knowledge base

    PubMed Central

    Bollegala, Danushka; Maehara, Takanori; Kawarabayashi, Ken-ichi

    2018-01-01

    Methods for representing the meaning of words in vector spaces purely using the information distributed in text corpora have proved to be very valuable in various text mining and natural language processing (NLP) tasks. However, these methods still disregard the valuable semantic relational structure between words in co-occurring contexts. These beneficial semantic relational structures are contained in manually-created knowledge bases (KBs) such as ontologies and semantic lexicons, where the meanings of words are represented by defining the various relationships that exist among those words. We combine the knowledge in both a corpus and a KB to learn better word embeddings. Specifically, we propose a joint word representation learning method that uses the knowledge in the KBs, and simultaneously predicts the co-occurrences of two words in a corpus context. In particular, we use the corpus to define our objective function subject to the relational constraints derived from the KB. We further utilise the corpus co-occurrence statistics to propose two novel approaches, Nearest Neighbour Expansion (NNE) and Hedged Nearest Neighbour Expansion (HNE), that dynamically expand the KB and therefore derive more constraints that guide the optimisation process. Our experimental results over a wide range of benchmark tasks demonstrate that the proposed method statistically significantly improves the accuracy of the word embeddings learnt. It outperforms a corpus-only baseline and improves on a number of previously proposed methods that incorporate corpora and KBs in both semantic similarity prediction and word analogy detection tasks. PMID:29529052

  17. A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC.

    PubMed

    Kors, Jan A; Clematide, Simon; Akhondi, Saber A; van Mulligen, Erik M; Rebholz-Schuhmann, Dietrich

    2015-09-01

    To create a multilingual gold-standard corpus for biomedical concept recognition. We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified Medical Language System and covering a wide range of semantic groups. To reduce the annotation workload, automatically generated preannotations were provided. Individual annotations were automatically harmonized and then adjudicated, and cross-language consistency checks were carried out to arrive at the final annotations. The number of final annotations was 5530. Inter-annotator agreement scores indicate good agreement (median F-score 0.79), and are similar to those between individual annotators and the gold standard. The automatically generated harmonized annotation set for each language performed equally well as the best annotator for that language. The use of automatic preannotations, harmonized annotations, and parallel corpora helped to keep the manual annotation efforts manageable. The inter-annotator agreement scores provide a reference standard for gauging the performance of automatic annotation techniques. To our knowledge, this is the first gold-standard corpus for biomedical concept recognition in languages other than English. Other distinguishing features are the wide variety of semantic groups that are being covered, and the diversity of text genres that were annotated. © The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association.

  18. Annotation of Korean Learner Corpora for Particle Error Detection

    ERIC Educational Resources Information Center

    Lee, Sun-Hee; Jang, Seok Bae; Seo, Sang-Kyu

    2009-01-01

    In this study, we focus on particle errors and discuss an annotation scheme for Korean learner corpora that can be used to extract heuristic patterns of particle errors efficiently. We investigate different properties of particle errors so that they can be later used to identify learner errors automatically, and we provide resourceful annotation…

  19. Some Benefits of Corpora as a Language Learning Tool

    ERIC Educational Resources Information Center

    Marjanovic, Tatjana

    2012-01-01

    What this paper is meant to do is share illustrations and insights into how English learners and teachers alike can benefit from using corpora in their work. Arguments are made for their multifaceted possibilities as grammatical, lexical and discourse pools suitable for discovering ways of the language, be they regularities or idiosyncrasies. The…

  20. Learning in Parallel: Using Parallel Corpora to Enhance Written Language Acquisition at the Beginning Level

    ERIC Educational Resources Information Center

    Bluemel, Brody

    2014-01-01

    This article illustrates the pedagogical value of incorporating parallel corpora in foreign language education. It explores the development of a Chinese/English parallel corpus designed specifically for pedagogical application. The corpus tool was created to aid language learners in reading comprehension and writing development by making foreign…

  1. Corpora Processing and Computational Scaffolding for a Web-Based English Learning Environment: The CANDLE Project

    ERIC Educational Resources Information Center

    Liou, Hsien-Chin; Chang, Jason S; Chen, Hao-Jan; Lin, Chih-Cheng; Liaw, Meei-Ling; Gao, Zhao-Ming; Jang, Jyh-Shing Roger; Yeh, Yuli; Chuang, Thomas C.; You, Geeng-Neng

    2006-01-01

    This paper describes the development of an innovative web-based environment for English language learning with advanced data-driven and statistical approaches. The project uses various corpora, including a Chinese-English parallel corpus ("Sinorama") and various natural language processing (NLP) tools to construct effective English…

  2. Evaluating Bilingual and Monolingual Dictionaries for L2 Learners.

    ERIC Educational Resources Information Center

    Hunt, Alan

    1997-01-01

    A discussion of dictionaries and their use for second language (L2) learning suggests that lack of computerized modern language corpora can adversely affect bilingual dictionaries, commonly used by L2 learners, and shows how use of such corpora has benefitted two contemporary monolingual L2 learner dictionaries (1995 editions of the Longman…

  3. Learning for Semantic Parsing with Kernels under Various Forms of Supervision

    DTIC Science & Technology

    2007-08-01

    natural language sentences to their formal executable meaning representations. This is a challenging problem and is critical for developing computing...sentences are semantically tractable. This indicates that Geoquery is more challenging domain for semantic parsing than ATIS. In the past, there have been a...Combining parsers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-99), pp. 187–194

  4. Applications of Latent Variable Models in Modeling Influence and Decision Making

    DTIC Science & Technology

    2013-04-01

    by normalization φw,k ← φw,kP K φn,k . 3.3 Empirical study We studied the DIM with four text corpora: three collections of scientific articles and a...both provided funds for travel to NIPS 2011. School Foremost, I owe my advisor, David Blei, many thanks for his mentorship and support for the past four ...and former graduate students in my research lab and our broader research group, who have helped me in this program in various ways. They have served

  5. Exploring Learner Language through Corpora: Comparing and Interpreting Corpus Frequency Information

    ERIC Educational Resources Information Center

    Gablasova, Dana; Brezina, Vaclav; McEnery, Tony

    2017-01-01

    This article contributes to the debate about the appropriate use of corpus data in language learning research. It focuses on frequencies of linguistic features in language use and their comparison across corpora. The majority of corpus-based second language acquisition studies employ a comparative design in which either one or more second language…

  6. Is There a Core General Vocabulary? Introducing the "New General Service List"

    ERIC Educational Resources Information Center

    Brezina, Vaclav; Gablasova, Dana

    2015-01-01

    The current study presents a "New General Service List (new-GSL)", which is a result of robust comparison of four language corpora ("LOB," "BNC," "BE06," and "EnTenTen12") of the total size of over 12 billion running words. The four corpora were selected to represent a variety of corpus sizes and…

  7. Lexical Awareness and Development through Data Driven Learning: Attitudes and Beliefs of EFL Learners

    ERIC Educational Resources Information Center

    Asik, Asuman; Vural, Arzu Sarlanoglu; Akpinar, Kadriye Dilek

    2016-01-01

    Data-driven learning (DDL) has become an innovative approach developed from corpus linguistics. It plays a significant role in the progression of foreign language pedagogy, since it offers learners plentiful authentic corpora examples that make them analyze language rules with the help of online corpora and concordancers. The present study…

  8. Application of Learner Corpora to Second Language Learning and Teaching: An Overview

    ERIC Educational Resources Information Center

    Xu, Qi

    2016-01-01

    The paper gives an overview of learner corpora and their application to second language learning and teaching. It is proposed that there are four core components in learner corpus research, namely, corpus linguistics expertise, a good background in linguistic theory, knowledge of SLA theory, and a good understanding of foreign language teaching…

  9. Training L2 Writers to Reference Corpora as a Self-Correction Tool

    ERIC Educational Resources Information Center

    Quinn, Cynthia

    2015-01-01

    Corpora have the potential to support the L2 writing process at the discourse level in contrast to the isolated dictionary entries that many intermediate writers rely on. To take advantage of this resource, learners need to be trained, which involves practising corpus research and referencing skills as well as learning to make data-based…

  10. Stretched Verb Collocations with "Give": Their Use and Translation into Spanish Using the BNC and CREA Corpora

    ERIC Educational Resources Information Center

    Molina-Plaza, Silvia; de Gregorio-Godeo, Eduardo

    2010-01-01

    Within the context of on-going research, this paper explores the pedagogical implications of contrastive analyses of multiword units in English and Spanish based on electronic corpora as a CALL resource. The main tenets of collocations from a contrastive perspective--and the points of contact and departure between both languages--are discussed…

  11. The Effect of the Integration of Corpora in Reading Comprehension Classrooms on English as a Foreign Language Learners' Vocabulary Development

    ERIC Educational Resources Information Center

    Gordani, Yahya

    2013-01-01

    This study used a randomized pretest-posttest control group design to examine the effect of the integration of corpora in general English courses on the students' vocabulary development. To enhance the learners' lexical repertoire and thereby improve their reading comprehension, an online corpus-based approach was integrated into 42 hours of…

  12. MAXIECPC: Theoretical Background and Descriptive Research on General Statistics, Frequency Words and Keywords

    ERIC Educational Resources Information Center

    Calzada Pérez, María

    2013-01-01

    The present paper revolves around MaxiECPC, one of the various sub-corpora that make up ECPC (the European Comparable and Parallel Corpora), an electronic archive of speeches delivered at different parliaments (i.e. the European Parliament-EP; the Spanish Congreso de los Diputados-CD; and the British House of Commons-HC) from 1996 to 2009. In…

  13. Analyzing Idioms and Their Frequency in Three Advanced ILI Textbooks: A Corpus-Based Study

    ERIC Educational Resources Information Center

    Alavi, Sepideh; Rajabpoor, Aboozar

    2015-01-01

    The present study aimed at identifying and quantifying the idioms used in three ILI "Advanced" level textbooks based on three different English corpora; MICASE, BNC and the Brown Corpus, and comparing the frequencies of the idioms across the three corpora. The first step of the study involved searching the books to find multi-word…

  14. Social Media and Language Processing: How Facebook and Twitter Provide the Best Frequency Estimates for Studying Word Recognition

    ERIC Educational Resources Information Center

    Herdagdelen, Amaç; Marelli, Marco

    2017-01-01

    Corpus-based word frequencies are one of the most important predictors in language processing tasks. Frequencies based on conversational corpora (such as movie subtitles) are shown to better capture the variance in lexical decision tasks compared to traditional corpora. In this study, we show that frequencies computed from social media are…

  15. The Application of Corpora in Teaching Grammar: The Case of English Relative Clause

    ERIC Educational Resources Information Center

    Sahragard, Rahman; Kushki, Ali; Ansaripour, Ehsan

    2013-01-01

    The study was conducted to see if the provision of implementing corpora on English relative clauses would prove useful for Iranian EFL learners or not. Two writing classes were held for the participants of intermediate level. A record of 15 writing samples produced by each participant was kept in the form of a portfolio. Participants' portfolios…

  16. Learners' Writing Skills in French: Corpus Consultation and Learner Evaluation

    ERIC Educational Resources Information Center

    O'Sullivan, Ide; Chambers, Angela

    2006-01-01

    While the use of corpora and concordancing in the language-learning environment began as early as 1969 (McEnery & Wilson, 1997, p. 12), it was the work in the 1980s of Tim Johns (1986) and others which brought it to public attention. Important developments occurred in the 1990s, beginning with publications advocating the use of corpora and…

  17. Integrating Learner Corpora and Natural Language Processing: A Crucial Step towards Reconciling Technological Sophistication and Pedagogical Effectiveness

    ERIC Educational Resources Information Center

    Granger, Sylviane; Kraif, Olivier; Ponton, Claude; Antoniadis, Georges; Zampa, Virginie

    2007-01-01

    Learner corpora, electronic collections of spoken or written data from foreign language learners, offer unparalleled access to many hitherto uncovered aspects of learner language, particularly in their error-tagged format. This article aims to demonstrate the role that the learner corpus can play in CALL, particularly when used in conjunction with…

  18. Immunocytochemical distribution of locustamyoinhibiting peptide (Lom-MIP) in the nervous system of Locusta migratoria.

    PubMed

    Schoofs, L; Veelaert, D; Broeck, J V; De Loof, A

    1996-07-05

    Locustamyoinhibiting peptide (Lom-MIP) is one of the 4 identified myoinhibiting neuropeptides, isolated from brain-corpora cardiaca-corpora allata-suboesophageal ganglion complexes of the locust, Locusta migratoria. An antiserum was raised against Lom-MIP for use in immunohistochemistry. Locustamyoinhibiting peptide-like immunoreactivity (Lom-MIP-LI) was visualized in the nervous system and peripheral organs of Locusta migratoria by means of the peroxidase-antiperoxidase method. A total of 12 specific immunoreactive neurons was found in the brain. Processes of these neurons innervate the protocerebral bridge, the central body complex and distinct neuropil areas in the proto- and tritocerebrum but not in the deuterocerebrum nor in the optic lobes. The glandular cells of the corpora cardiaca, known to produce adipokinetic hormones, are contacted by Lom-MIP-LI fibers. The corpora allata were innervated by the nervus corporis allati I containing immunoreactive fibers. Lom-MIP-LI cell bodies were also found in the subesophageal ganglion, the metathoracic ganglion and the abdominal ganglia I-IV. In peripheral muscles, Lom-MIP-LI fibers innervate the heart, the oviduct, and the hindgut. In the salivary glands, Lom-MIP-LI was detected in the intracellular ductule of the parietal cells. Possible functions of Lom-MIP are discussed.

  19. Ovarian response to pregnant mare serum gonadotrophin and prostaglandin F2α in Africander and Mashona cows.

    PubMed

    Holness, D H; Hale, D H; McCabe, C T

    1980-11-01

    Oestrus was synchronised in ten Africander and eight Mashona mature dry cows by two injections of prostaglandin F2α (PG) 11 days apart. Half the cows of each breed received an injection of 3000 i.u. pregnant mare serum gonadotrophin (PMSG) two days prior to the second PG injection. All cows were observed for the incidence of oestrus, and blood samples were taken at intervals for progesterone assay. Cows were slaughtered 11 days after the second PG injection and their reproductive tracts examined. Treatment with PMSG increased numbers both of corpora lutea and of follicles more than 10 mm in diameter. When numbers of corpora lutea and follicles were considered together, the response to treatment was significant in the Africanders (P<0,01) and markedly greater than that of Mashona cows. The concentration of progesterone in plasma on the day before slaughter was significantly correlated with the mass of corpora lutea (P<0,001), total mass of ovaries (P<0,001), but not with numbers of corpora lutea. It is suggested that generally Africander cows may secrete lower levels of follicle stimulating hormone and oestrogen than Mashona cows during normal cyclic sexual activity.

  20. Improving performance of natural language processing part-of-speech tagging on clinical narratives through domain adaptation.

    PubMed

    Ferraro, Jeffrey P; Daumé, Hal; Duvall, Scott L; Chapman, Wendy W; Harkema, Henk; Haug, Peter J

    2013-01-01

    Natural language processing (NLP) tasks are commonly decomposed into subtasks, chained together to form processing pipelines. The residual error produced in these subtasks propagates, adversely affecting the end objectives. Limited availability of annotated clinical data remains a barrier to reaching state-of-the-art operating characteristics using statistically based NLP tools in the clinical domain. Here we explore the unique linguistic constructions of clinical texts and demonstrate the loss in operating characteristics when out-of-the-box part-of-speech (POS) tagging tools are applied to the clinical domain. We test a domain adaptation approach integrating a novel lexical-generation probability rule used in a transformation-based learner to boost POS performance on clinical narratives. Two target corpora from independent healthcare institutions were constructed from high frequency clinical narratives. Four leading POS taggers with their out-of-the-box models trained from general English and biomedical abstracts were evaluated against these clinical corpora. A high performing domain adaptation method, Easy Adapt, was compared to our newly proposed method ClinAdapt. The evaluated POS taggers drop in accuracy by 8.5-15% when tested on clinical narratives. The highest performing tagger reports an accuracy of 88.6%. Domain adaptation with Easy Adapt reports accuracies of 88.3-91.0% on clinical texts. ClinAdapt reports 93.2-93.9%. ClinAdapt successfully boosts POS tagging performance through domain adaptation requiring a modest amount of annotated clinical data. Improving the performance of critical NLP subtasks is expected to reduce pipeline error propagation leading to better overall results on complex processing tasks.
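    The Easy Adapt baseline named above can be sketched briefly, assuming it refers to the feature-augmentation scheme usually associated with that name: every feature is duplicated into a shared ("general") copy and a domain-specific copy, so a linear learner can weight shared and domain-specific evidence separately. The sparse dict representation, the feature names and the `easy_adapt` helper are hypothetical illustrations, not the authors' implementation, and the record does not describe how ClinAdapt itself works.

```python
def easy_adapt(features: dict, domain: str) -> dict:
    """Augment a sparse feature dict with general and domain-specific copies.

    With sparse features, the all-zero block for the other domain is
    implicit: its keys are simply absent.
    """
    if domain not in ("source", "target"):
        raise ValueError("domain must be 'source' or 'target'")
    augmented = {}
    for name, value in features.items():
        augmented[f"general:{name}"] = value   # shared across domains
        augmented[f"{domain}:{name}"] = value  # visible only to this domain
    return augmented


# Hypothetical token features for POS tagging.
newswire_token = {"word=patient": 1.0, "suffix=ent": 1.0}
clinical_token = {"word=pt": 1.0, "prev=the": 1.0}

print(easy_adapt(newswire_token, "source"))
print(easy_adapt(clinical_token, "target"))
```

    Any standard tagger or classifier can then be trained on the augmented features from both domains together; only the small annotated target set contributes to the `target:` weights.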

  1. Response of prepubertal ewes primed with monensin or progesterone to administration of FSH.

    PubMed

    Sumbung, F P; Williamson, P; Carson, R S

    1987-11-01

    Prepubertal ewe lambs were treated with FSH after progesterone priming for 12 days (Group P), monensin supplementation for 14 days (Group M) or a standard diet (Group C). Serial blood samples were taken for LH and progesterone assay, and ovariectomy was performed on half of each group 38-52 h after start of treatment to assess ovarian function, follicular steroid production in vitro and the concentration of gonadotrophin binding sites in follicles. The remaining ewe lambs were ovariectomized 8 days after FSH treatment to determine whether functional corpora lutea were present. FSH treatment was followed by a preovulatory LH surge which occurred significantly later (P < 0.05) and was better synchronized in ewes in Groups P and M than in those in Group C. At 13-15 h after the LH surge significantly more large follicles were present on ovaries from Group P and M ewes than in Group C. Follicles greater than 5 mm diameter from ewes in Groups P and M produced significantly less oestrogen and testosterone and more dihydrotestosterone, and had significantly more hCG binding sites, than did similar-sized follicles from Group C animals. Ovariectomy on Day 8 after the completion of FSH treatment showed that ewes in Groups P and M had significantly greater numbers of functional corpora lutea. These results indicate that, in prepubertal ewes, progesterone priming and monensin supplementation may delay the preovulatory LH surge, allowing follicles developing after FSH treatment more time to mature before ovulation. This may result in better luteinization of ruptured follicles in these ewes, with the formation of functional corpora lutea. (ABSTRACT TRUNCATED AT 250 WORDS)

  2. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wilson, Andrew T.; Robinson, David Gerald

    Most topic modeling algorithms that address the evolution of documents over time use the same number of topics at all times. This obscures the common occurrence in the data where new subjects arise and old ones diminish or disappear entirely. We propose an algorithm to model the birth and death of topics within an LDA-like framework. The user selects an initial number of topics, after which new topics are created and retired without further supervision. Our approach also accommodates many of the acceleration and parallelization schemes developed in recent years for standard LDA. In recent years, topic modeling algorithms such as latent semantic analysis (LSA)[17], latent Dirichlet allocation (LDA)[10] and their descendants have offered a powerful way to explore and interrogate corpora far too large for any human to grasp without assistance. Using such algorithms we are able to search for similar documents, model and track the volume of topics over time, search for correlated topics or model them with a hierarchy. Most of these algorithms are intended for use with static corpora where the number of documents and the size of the vocabulary are known in advance. Moreover, almost all current topic modeling algorithms fix the number of topics as one of the input parameters and keep it fixed across the entire corpus. While this is appropriate for static corpora, it becomes a serious handicap when analyzing time-varying data sets where topics come and go as a matter of course. This is doubly true for online algorithms that may not have the option of revising earlier results in light of new data. To be sure, these algorithms will account for changing data one way or another, but without the ability to adapt to structural changes such as entirely new topics they may do so in counterintuitive ways.

  3. Inhibition of Cyclic GMP Export by Multidrug Resistance Protein 4: A New Strategy to Treat Erectile Dysfunction?

    PubMed

    Boydens, Charlotte; Pauwels, Bart; Vanden Daele, Laura; Van de Voorde, Johan

    2017-04-01

    Intracellular cyclic guanosine monophosphate (cGMP) concentrations are regulated by degradation enzymes (phosphodiesterases) and by active transport across the plasma membrane by multidrug resistance proteins (MRPs) 4 and 5. To evaluate the functional effect of MRP-4 inhibition and the role of MRP-4-mediated cGMP export in mouse corpora cavernosa. Isometric tension of mouse corpora cavernosa was measured after cumulative addition of MK-571, an inhibitor of MRP-4, or sildenafil, a phosphodiesterase type 5 inhibitor. In addition, the effect of MRP-4 inhibition on cGMP-independent and cGMP-dependent relaxations was studied. In vivo intracavernosal pressure and mean arterial pressure measurements were performed after intracavernosal injection of MK-571. The effect of MRP-4 inhibition on cGMP content was determined using an enzyme immunoassay kit. Measurement of the effect of MK-571 on cGMP content, relaxant responses of mouse corpora cavernosa to cGMP-independent and cGMP-dependent vasodilating substances, and determination of the ratio of intracavernosal pressure to mean arterial pressure after intracavernosal injection of MK-571. MK-571 and sildenafil relaxed the corpora cavernosa concentration dependently, with sildenafil being the more potent relaxing compound. Furthermore, MK-571 enhanced relaxing responses to cGMP-dependent substances, such as sodium nitroprusside, sildenafil, acetylcholine, and electrical field stimulation, with the latter even under in vitro diabetic conditions. In contrast, cGMP-independent relaxations were not altered by MRP-4 inhibition. Intracavernosal administration of MK-571 significantly increased intracavernosal pressure, with minimal effect on mean arterial pressure. The cGMP analysis showed that MRP-4 inhibition was accompanied by increased cGMP levels. 
    MRP-4, at least when targeted locally in the penis or when combined with a phosphodiesterase type 5 inhibitor, might be a valuable alternative strategy for the treatment of (diabetic) erectile dysfunction. This study is the first to demonstrate an in vitro direct relaxant and an in vivo pro-erectile effect of the MRP-4 inhibitor, MK-571, on mouse corpora cavernosa. However, the functional effect of MRP-5-mediated export in mouse corpora cavernosa was not explored, which has been suggested to play the predominant role in cGMP export. Inhibition of MRP-4 increases basal and stimulated levels of cGMP, leading to corpora cavernosa relaxation and penile erection. Therefore, in addition to degradation of cGMP, export of cGMP by MRP-4 could contribute substantially to regulating cGMP levels in mouse corpora cavernosa. Boydens C, Pauwels B, Vanden Daele L, Van de Voorde J. Inhibition of Cyclic GMP Export by Multidrug Resistance Protein 4: A New Strategy to Treat Erectile Dysfunction? J Sex Med 2017;14:502-509. Copyright © 2017 International Society for Sexual Medicine. Published by Elsevier Inc. All rights reserved.

  4. Learning from Learners: A Non-Standard Direct Approach to the Teaching of Writing Skills in EFL in a University Context

    ERIC Educational Resources Information Center

    Fuster-Márquez, Miguel; Gregori-Signes, Carmen

    2018-01-01

    Corpora have been used in English as a foreign language materials for decades, and native corpora have been present in the classroom by means of direct approaches such as Data-Driven Learning (Johns, T., and P. King 1991. "'Should you be Persuaded'- Two Samples of Data-Driven Learning Materials." In "Classroom Concordancing,"…

  5. Domain Adaptation of Translation Models for Multilingual Applications

    DTIC Science & Technology

    2009-04-01

    expansion effect that corpus (or dictionary) based translation introduces - however, this effect is maintained even with monolingual query expansion [12...every day; bilingual web pages are harvested as parallel corpora as the quantity of non-English data on the web increases; online dictionaries of...approach is to customize translation models to a domain, by automatically selecting the resources (dictionaries, parallel corpora) that are best for

  6. Discovering Psychological Principles by Mining Naturally Occurring Data Sets.

    PubMed

    Goldstone, Robert L; Lupyan, Gary

    2016-07-01

    The very expertise with which psychologists wield their tools for achieving laboratory control may have had the unwelcome effect of blinding psychologists to the possibilities of discovering principles of behavior without conducting experiments. When creatively interrogated, a diverse range of large, real-world data sets provides powerful diagnostic tools for revealing principles of human judgment, perception, categorization, decision-making, language use, inference, problem solving, and representation. Examples of these data sets include patterns of website links, dictionaries, logs of group interactions, collections of images and image tags, text corpora, history of financial transactions, trends in twitter tag usage and propagation, patents, consumer product sales, performance in high-stakes sporting events, dialect maps, and scientific citations. The goal of this issue is to present some exemplary case studies of mining naturally existing data sets to reveal important principles and phenomena in cognitive science, and to discuss some of the underlying issues involved with conducting traditional experiments, analyses of naturally occurring data, computational modeling, and the synthesis of all three methods. Copyright © 2016 Cognitive Science Society, Inc.

  7. Estimation of the prevalence of adverse drug reactions from social media.

    PubMed

    Nguyen, Thin; Larsen, Mark E; O'Dea, Bridianne; Phung, Dinh; Venkatesh, Svetha; Christensen, Helen

    2017-06-01

    This work aims to estimate the degree of adverse drug reactions (ADR) for psychiatric medications from social media, including Twitter, Reddit, and LiveJournal. Advances in lightning-fast cluster computing were employed to process large scale data, consisting of 6.4 terabytes of data containing 3.8 billion records from all the media. Rates of ADR were quantified using the SIDER database of drugs and side-effects, and an estimated ADR rate was based on the prevalence of discussion in the social media corpora. Agreement between these measures for a sample of ten popular psychiatric drugs was evaluated using the Pearson correlation coefficient, r, with values between 0.08 and 0.50. Word2vec, a novel neural learning framework, was utilized to improve the coverage of variants of ADR terms in the unstructured text by identifying syntactically or semantically similar terms. Improved correlation coefficients, between 0.29 and 0.59, demonstrate the capability of advanced techniques in machine learning to aid in the discovery of meaningful patterns from medical data, and social media data, at scale. Copyright © 2017 Elsevier B.V. All rights reserved.
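    The agreement measure mentioned above, the Pearson correlation coefficient r, is the covariance of two series divided by the product of their standard deviations. The sketch below computes it from first principles; the function name `pearson_r` and the sample rates are hypothetical and are not data from the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / sqrt(var_x * var_y)


# Hypothetical per-drug rates: one reference value and one estimate
# derived from how often a side effect is discussed.
reference_rates = [0.10, 0.25, 0.05, 0.40, 0.15]
estimated_rates = [0.08, 0.20, 0.09, 0.31, 0.22]
print(round(pearson_r(reference_rates, estimated_rates), 2))
```

    With only ten drugs per comparison, as in the study, r carries wide uncertainty, so differences between coefficients should be read cautiously.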

  8. Development of the penis during the human fetal period (13 to 36 weeks after conception).

    PubMed

    Gallo, Carla B M; Costa, Waldemar S; Furriel, Angelica; Bastos, Ana L; Sampaio, Francisco J B

    2013-11-01

    We analyzed the development of the area of the penis and erectile structures (corpora cavernosa and corpus spongiosum) and the thickness of the tunica albuginea during the fetal period (13 to 36 weeks after conception) in humans to establish normative patterns of growth. We studied 56 male human fetuses at 13 to 36 weeks after conception. We used histochemical and morphometric techniques to analyze the parameters of total penile area, area of corpora cavernosa, area of corpus spongiosum, and thickness of tunica albuginea in the dorsal and ventral regions using ImageJ software (National Institutes of Health, Bethesda, Maryland). Between 13 and 36 weeks after conception the area of the penis varies from 0.95 to 24.25 mm². The area of the corpora cavernosa varies from 0.28 to 9.12 mm², and the area of the corpus spongiosum varies from 0.14 to 3.99 mm². The thickness of the tunica albuginea varies from 0.029 to 0.296 mm in the dorsal region and from 0.014 to 0.113 mm in the ventral region of the corpora cavernosa. We found a strong correlation between the total penile area, corpora cavernosa and corpus spongiosum with fetal age (weeks following conception). The growth rate was more intense during the second trimester (13 to 24 weeks of gestation) compared to the third trimester (25 to 36 weeks). Tunica albuginea thickness also was strongly correlated with fetal age and this structure was thicker in the dorsal vs ventral region. Copyright © 2013 American Urological Association Education and Research, Inc. Published by Elsevier Inc. All rights reserved.

  9. Effects of ageing and streptozotocin-induced diabetes on connexin43 and P2 purinoceptor expression in the rat corpora cavernosa and urinary bladder.

    PubMed

    Suadicani, Sylvia O; Urban-Maldonado, Marcia; Tar, Moses T; Melman, Arnold; Spray, David C

    2009-06-01

    To investigate whether ageing and diabetes alter the expression of the gap junction protein connexin43 (Cx43) and of particular purinoceptor (P2R) subtypes in the corpus cavernosum and urinary bladder, and determine whether changes in expression of these proteins correlate with development of erectile and bladder dysfunction in diabetic and ageing rats. Erectile and bladder function of streptozotocin (STZ)-induced diabetic, insulin-treated and age-matched control Fischer-344 rats were evaluated 2, 4 and 8 months after diabetes induction by in vivo cystometry and cavernosometry. Corporal and bladder tissue were then isolated at each of these sample times and protein expression levels of Cx43 and of various P2R subtypes were determined by Western blotting. In the corpora of control rats ageing was accompanied by a significant decrease in Cx43 and P2X(1)R, and increase in P2X(7)R expression. There was decreased Cx43 and increased P2Y(4)R expression in the ageing control rat bladder. There was a significant negative correlation between erectile capacity and P2X(1)R expression levels, and a positive correlation between bladder spontaneous activity and P2Y(4)R expression levels. There was already development of erectile dysfunction and bladder overactivity at 2 months after inducing diabetes, the earliest sample measured in the study. The development of these urogenital complications was accompanied by significant decreases in Cx43, P2Y(2)R, P2X(4)R and increase in P2X(1)R expression in the corpora, and by a doubling in Cx43 and P2Y(2)R, and significant increase in P2Y(4)R expression in the bladder. Changes in Cx43 and P2R expression were largely prevented by insulin therapy. Ageing and diabetes mellitus markedly altered the expression of the gap junction protein Cx43 and of particular P2R subtypes in the rat penile corpora and urinary bladder. 
    These changes in Cx43 and P2R expression provide the molecular substrate for altered gap junction and purinergic signalling in these tissues, and thus probably contribute to the early development of erectile dysfunction and higher detrusor activity in ageing and in diabetic rats.

  10. From language identification to language distance

    NASA Astrophysics Data System (ADS)

    Gamallo, Pablo; Pichel, José Ramom; Alegria, Iñaki

    2017-10-01

    In this paper, we define two quantitative distances to measure how far apart two languages are. The distance measure that we have identified as more accurate is based on the perplexity of n-gram models extracted from text corpora. An experiment to compare forty-four European languages has been performed. For this purpose, we computed the distances for all the possible language pairs and built a network whose nodes are languages and edges are distances. The network we have built on the basis of linguistic distances represents the current map of similarities and divergences among the main languages of Europe.
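    A perplexity-based distance of the kind described above can be sketched as follows: train a character n-gram model on text in one language, then measure how poorly it predicts text in another. This is a minimal illustration under stated assumptions (character trigrams, add-one smoothing, and a symmetric average of the two directions); the record does not specify the authors' model order, smoothing or symmetrization, and all function names and sample sentences here are hypothetical.

```python
from collections import Counter
from math import exp, log


def train_char_ngrams(text: str, n: int = 3):
    """Count character n-grams and their (n-1)-character contexts."""
    padded = " " * (n - 1) + text
    grams = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    contexts = Counter(padded[i:i + n - 1] for i in range(len(padded) - n + 1))
    vocab_size = len(set(padded)) + 1  # +1 reserves mass for unseen characters
    return grams, contexts, vocab_size


def perplexity(text: str, model, n: int = 3) -> float:
    """Perplexity of text under an add-one-smoothed character n-gram model."""
    if not text:
        raise ValueError("text must be non-empty")
    grams, contexts, vocab_size = model
    padded = " " * (n - 1) + text
    log_prob = 0.0
    positions = len(padded) - n + 1
    for i in range(positions):
        gram = padded[i:i + n]
        prob = (grams[gram] + 1) / (contexts[gram[:-1]] + vocab_size)
        log_prob += log(prob)
    return exp(-log_prob / positions)


def language_distance(text_a: str, text_b: str, n: int = 3) -> float:
    """Average of the two cross-perplexities (symmetrization is an assumption)."""
    model_a = train_char_ngrams(text_a, n)
    model_b = train_char_ngrams(text_b, n)
    return (perplexity(text_a, model_b, n) + perplexity(text_b, model_a, n)) / 2


english = "the cat eats fish in the house of the dog"
spanish = "el gato come pescado en la casa del perro"
print(language_distance(english, spanish))
```

    In practice the models would be trained and tested on separate, much larger samples per language; single sentences are used here only to keep the example short.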

  11. Improving Feature Representation Based on a Neural Network for Author Profiling in Social Media Texts

    PubMed Central

    2016-01-01

    We introduce a lexical resource for preprocessing social media data. We show that a neural network-based feature representation is enhanced by using this resource. We conducted experiments on the PAN 2015 and PAN 2016 author profiling corpora and obtained better results when performing the data preprocessing using the developed lexical resource. The resource includes dictionaries of slang words, contractions, abbreviations, and emoticons commonly used in social media. Each of the dictionaries was built for the English, Spanish, Dutch, and Italian languages. The resource is freely available. PMID:27795703

  12. Biting off More than They Can Chew? The Impact of Pedagogical Application of Corpus on Vocabulary Ability of Intermediate-Level ESL Learners in Mainland China: A Quasi-Experimental Study

    ERIC Educational Resources Information Center

    Shi, Jing

    2017-01-01

    The pedagogical values of corpora for ELT have been widely acknowledged and exploited, but their direct application in classroom teaching has entailed many difficulties. This project aims to investigate the impact of the pedagogical application of corpora on the vocabulary ability of intermediate-level ESL learners in mainland China. This…

  13. Deleterious effects of progestagen treatment in VEGF expression in corpora lutea of pregnant ewes.

    PubMed

    Letelier, C A; Sanchez, M A; Garcia-Fernandez, R A; Sanchez, B; Garcia-Palencia, P; Gonzalez-Bulnes, A; Flores, J M

    2011-06-01

    The aim of the current study was to determine the possible effects of progestagen oestrous synchronization on vascular endothelial growth factor (VEGF) expression during sheep luteogenesis and the peri-implantation period and the relationship with luteal function. At days 9, 11, 13, 15, 17 and 21 of pregnancy, the ovaries from 30 progestagen treated and 30 ewes cycling after cloprostenol injection were evaluated by ultrasonography and, thereafter, collected and processed for immunohistochemical evaluation of VEGF; blood samples were drawn for evaluating plasma progesterone. The progestagen-treated group showed smaller corpora lutea than cloprostenol-treated and lower progesterone secretion. The expression of VEGF in the luteal cells increased with time in the cloprostenol group, but not in the progestagen-treated group, which even showed a decrease between days 11 and 13. In progestagen-treated sheep, VEGF expression in granulosa-derived parenchymal lobule capillaries was correlated with the size of the luteal tissue, larger corpora lutea had higher expression, and tended to have a higher progesterone secretion. In conclusion, the current study indicates the existence of deleterious effects from exogenous progestagen treatments on progesterone secretion from induced corpora lutea, which correlate with alterations in the expression of VEGF in the luteal tissue and, this, presumably in the processes of neoangiogenesis and luteogenesis. © 2010 Blackwell Verlag GmbH.

  14. Expression of PCV2 antigen in the ovarian tissues of gilts.

    PubMed

    Tummaruk, Padet; Pearodwong, Pachara

    2016-03-01

    The present study was performed to determine the expression of porcine circovirus type 2 (PCV2) antigen in the ovarian tissue of naturally infected gilts. Ovarian tissues were obtained from 11 culled gilts. The ovarian tissues sections were divided into two groups according to PCV2 DNA detection using PCR. PCV2 antigen was assessed in the paraffin embedded ovarian tissue sections by immunohistochemistry. A total of 2,131 ovarian follicles (i.e., 1,437 primordial, 133 primary, 353 secondary and 208 antral follicles), 66 atretic follicles and 131 corpora lutea were evaluated. It was found that PCV2 antigen was detected in 280 ovarian follicles (i.e., 239 primordial follicles, 12 primary follicles, 10 secondary follicles and 19 antral follicles), 1 atretic follicles and 3 corpora lutea (P<0.05). PCV2 antigen was detected in primordial follicles more often than in secondary follicles, atretic follicles and corpora lutea (P<0.05). The detection of PCV2 antigen was found mainly in oocytes. PCV2 antigen was found in both PCV2 DNA positive and negative ovarian tissues. It can be concluded that PCV2 antigen is expressed in all types of the ovarian follicles and corpora lutea. Further studies should be carried out to determine the influence of PCV2 on porcine ovarian function and oocyte quality.

  15. PIPE: a protein–protein interaction passage extraction module for BioCreative challenge

    PubMed Central

    Chu, Chun-Han; Su, Yu-Chen; Chen, Chien Chin; Hsu, Wen-Lian

    2016-01-01

    Identifying the interactions between proteins mentioned in biomedical literatures is one of the frequently discussed topics of text mining in the life science field. In this article, we propose PIPE, an interaction pattern generation module used in the Collaborative Biocurator Assistant Task at BioCreative V (http://www.biocreative.org/) to capture frequent protein-protein interaction (PPI) patterns within text. We also present an interaction pattern tree (IPT) kernel method that integrates the PPI patterns with convolution tree kernel (CTK) to extract PPIs. Methods were evaluated on LLL, IEPA, HPRD50, AIMed and BioInfer corpora using cross-validation, cross-learning and cross-corpus evaluation. Empirical evaluations demonstrate that our method is effective and outperforms several well-known PPI extraction methods. Database URL: PMID:27524807

  16. Luteinizing hormone receptors in human ovarian follicles and corpora lutea during the menstrual cycle

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yamoto, M.; Nakano, R.; Iwasaki, M.

    The binding of 125I-labeled human luteinizing hormone (hLH) to the 2000-g fraction of human ovarian follicles and corpora lutea during the entire menstrual cycle was examined. Specific high affinity, low capacity receptors for hLH were demonstrated in the 2000-g fraction of both follicles and corpora lutea. Specific binding of 125I-labeled hLH to follicular tissue increased from the early follicular phase to the ovulatory phase. Specific binding of 125I-labeled hLH to luteal tissue increased from the early luteal phase to the midluteal phase and decreased towards the late luteal phase. The results of the present study indicate that the increase and decrease in receptors for hLH during the menstrual cycle might play an important role in the regulation of the ovarian cycle.

  17. Identifying unproven cancer treatments on the health web: addressing accuracy, generalizability and scalability.

    PubMed

    Aphinyanaphongs, Yin; Fu, Lawrence D; Aliferis, Constantin F

    2013-01-01

    Building machine learning models that identify unproven cancer treatments on the Health Web is a promising approach for dealing with the dissemination of false and dangerous information to vulnerable health consumers. Aside from the obvious requirement of accuracy, two issues are of practical importance in deploying these models in real world applications. (a) Generalizability: The models must generalize to all treatments (not just the ones used in the training of the models). (b) Scalability: The models can be applied efficiently to billions of documents on the Health Web. First, we provide methods and related empirical data demonstrating strong accuracy and generalizability. Second, by combining the MapReduce distributed architecture and high dimensionality compression via Markov Boundary feature selection, we show how to scale the application of the models to WWW-scale corpora. The present work provides evidence that (a) a very small subset of unproven cancer treatments is sufficient to build a model to identify unproven treatments on the web; (b) unproven treatments use distinct language to market their claims and this language is learnable; (c) through distributed parallelization and state of the art feature selection, it is possible to prepare the corpora and build and apply models with large scalability.

  18. Histology of the ovary of Chinchilla lanigera in captivity.

    PubMed

    Sánchez-Toranzo, G; Torres-Luque, A; Gramajo-Bühler, M C; Bühler, M I

    2014-08-01

    Chinchilla, the lanigera variety in particular, is one of the most valuable rodents in the fur industry. The chinchilla ovary is morphologically similar to that of other South American hystricognath rodents, especially as regards its anatomy and, to a lesser degree, its histology. The presence of numerous primary follicles throughout the annual cycle suggests that a few of them are recruited to initiate growth and differentiation during folliculogenesis. Primary follicles with two or more oocytes are common; this is not the case with follicles at more advanced stages, suggesting that they do not develop. Only one or two large corpora lutea (CL) and three to five small or accessories CL were observed but no corpora albicans. The presence of accessory CL may reflect the importance of continuous hormonal production to support prolonged gestation. Atretic CL were also present, showing signs of degeneration in luteal cells. The interstitial cells distributed throughout the cortex were the main histological feature shared with other species, as stated in previous reports. Antral atresia was observed in all sizes of antral follicles while basal atresia was confined exclusively to smaller follicles. Copyright © 2014 Elsevier B.V. All rights reserved.

  19. Web 2.0-Based Crowdsourcing for High-Quality Gold Standard Development in Clinical Natural Language Processing

    PubMed Central

    Deleger, Louise; Li, Qi; Kaiser, Megan; Stoutenborough, Laura

    2013-01-01

    Background A high-quality gold standard is vital for supervised, machine learning-based, clinical natural language processing (NLP) systems. In clinical NLP projects, expert annotators traditionally create the gold standard. However, traditional annotation is expensive and time-consuming. To reduce the cost of annotation, general NLP projects have turned to crowdsourcing based on Web 2.0 technology, which involves submitting smaller subtasks to a coordinated marketplace of workers on the Internet. Many studies have been conducted in the area of crowdsourcing, but only a few have focused on tasks in the general NLP field and only a handful in the biomedical domain, usually based upon very small pilot sample sizes. In addition, the quality of the crowdsourced biomedical NLP corpora were never exceptional when compared to traditionally-developed gold standards. The previously reported results on medical named entity annotation task showed a 0.68 F-measure based agreement between crowdsourced and traditionally-developed corpora. Objective Building upon previous work from the general crowdsourcing research, this study investigated the usability of crowdsourcing in the clinical NLP domain with special emphasis on achieving high agreement between crowdsourced and traditionally-developed corpora. Methods To build the gold standard for evaluating the crowdsourcing workers’ performance, 1042 clinical trial announcements (CTAs) from the ClinicalTrials.gov website were randomly selected and double annotated for medication names, medication types, and linked attributes. For the experiments, we used CrowdFlower, an Amazon Mechanical Turk-based crowdsourcing platform. We calculated sensitivity, precision, and F-measure to evaluate the quality of the crowd’s work and tested the statistical significance (P<.001, chi-square test) to detect differences between the crowdsourced and traditionally-developed annotations. 
Results The agreement between the crowd’s annotations and the traditionally-generated corpora was high for: (1) annotations (0.87, F-measure for medication names; 0.73, medication types), (2) correction of previous annotations (0.90, medication names; 0.76, medication types), and excellent for (3) linking medications with their attributes (0.96). Simple voting provided the best judgment aggregation approach. There was no statistically significant difference between the crowd and traditionally-generated corpora. Our results showed a 27.9% improvement over previously reported results on medication named entity annotation task. Conclusions This study offers three contributions. First, we proved that crowdsourcing is a feasible, inexpensive, fast, and practical approach to collect high-quality annotations for clinical text (when protected health information was excluded). We believe that well-designed user interfaces and rigorous quality control strategy for entity annotation and linking were critical to the success of this work. Second, as a further contribution to the Internet-based crowdsourcing field, we will publicly release the JavaScript and CrowdFlower Markup Language infrastructure code that is necessary to utilize CrowdFlower’s quality control and crowdsourcing interfaces for named entity annotations. Finally, to spur future research, we will release the CTA annotations that were generated by traditional and crowdsourced approaches. PMID:23548263
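
    The evaluation described in this abstract (simple voting over workers' judgments, then precision, sensitivity and F-measure against a gold standard) can be sketched as follows. This is a minimal illustration, not the study's code: the function names and the representation of annotations as hashable tuples are assumptions. The last helper shows that a rise from the previously reported 0.68 to 0.87 F-measure corresponds to a relative improvement of about 27.9%, which appears consistent with the figure quoted above.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label among several workers' judgments."""
    return Counter(labels).most_common(1)[0][0]


def precision_recall_f1(predicted, gold):
    """Compare two collections of annotations, e.g. (start, end, type) tuples.

    Recall is the same quantity the abstract calls sensitivity.
    """
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # annotations present in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


def relative_improvement(new, old):
    """Relative change of a score with respect to an earlier baseline."""
    return (new - old) / old
```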

  20. TEES 2.2: Biomedical Event Extraction for Diverse Corpora

    PubMed Central

    2015-01-01

    Background The Turku Event Extraction System (TEES) is a text mining program developed for the extraction of events, complex biomedical relationships, from scientific literature. Based on a graph-generation approach, the system detects events with the use of a rich feature set built via dependency parsing. The TEES system has achieved record performance in several of the shared tasks of its domain, and continues to be used in a variety of biomedical text mining tasks. Results The TEES system was quickly adapted to the BioNLP'13 Shared Task in order to provide a public baseline for derived systems. An automated approach was developed for learning the underlying annotation rules of event type, allowing immediate adaptation to the various subtasks, and leading to a first place in four out of eight tasks. The system for the automated learning of annotation rules is further enhanced in this paper to the point of requiring no manual adaptation to any of the BioNLP'13 tasks. Further, the scikit-learn machine learning library is integrated into the system, bringing a wide variety of machine learning methods usable with TEES in addition to the default SVM. A scikit-learn ensemble method is also used to analyze the importances of the features in the TEES feature sets. Conclusions The TEES system was introduced for the BioNLP'09 Shared Task and has since then demonstrated good performance in several other shared tasks. By applying the current TEES 2.2 system to multiple corpora from these past shared tasks an overarching analysis of the most promising methods and possible pitfalls in the evolving field of biomedical event extraction are presented. PMID:26551925

  1. TEES 2.2: Biomedical Event Extraction for Diverse Corpora.

    PubMed

    Björne, Jari; Salakoski, Tapio

    2015-01-01

    The Turku Event Extraction System (TEES) is a text mining program developed for the extraction of events, complex biomedical relationships, from scientific literature. Based on a graph-generation approach, the system detects events with the use of a rich feature set built via dependency parsing. The TEES system has achieved record performance in several of the shared tasks of its domain, and continues to be used in a variety of biomedical text mining tasks. The TEES system was quickly adapted to the BioNLP'13 Shared Task in order to provide a public baseline for derived systems. An automated approach was developed for learning the underlying annotation rules of event type, allowing immediate adaptation to the various subtasks, and leading to a first place in four out of eight tasks. The system for the automated learning of annotation rules is further enhanced in this paper to the point of requiring no manual adaptation to any of the BioNLP'13 tasks. Further, the scikit-learn machine learning library is integrated into the system, bringing a wide variety of machine learning methods usable with TEES in addition to the default SVM. A scikit-learn ensemble method is also used to analyze the importances of the features in the TEES feature sets. The TEES system was introduced for the BioNLP'09 Shared Task and has since then demonstrated good performance in several other shared tasks. By applying the current TEES 2.2 system to multiple corpora from these past shared tasks an overarching analysis of the most promising methods and possible pitfalls in the evolving field of biomedical event extraction are presented.

  2. Pharmacological Prevention and Reversion of Erectile Dysfunction after Radical Prostatectomy, By Modulation of Nitric Oxide/Cgmp Pathways

    DTIC Science & Technology

    2008-03-01

    Figure 3. Time course of the effect of bilateral cavernosal nerve resection on the smooth muscle cell content in the rat corpora cavernosa. Penile...indicates the apoptotic cells in the corpora cavernosa. Bottom: QIA for TUNEL ***P<0.001 Figure 7: Time course of the effect of bilateral...Figure 6 Effect of unilateral and bilateral cavernosal nerve resection and long-term sildenafil treatment on cell proliferation and turnover in the

  3. Gene Ontology synonym generation rules lead to increased performance in biomedical concept recognition.

    PubMed

    Funk, Christopher S; Cohen, K Bretonnel; Hunter, Lawrence E; Verspoor, Karin M

    2016-09-09

    Gene Ontology (GO) terms represent the standard for annotation and representation of molecular functions, biological processes and cellular compartments, but a large gap exists between the way concepts are represented in the ontology and how they are expressed in natural language text. The construction of highly specific GO terms is formulaic, consisting of parts and pieces from simpler terms. We present two different types of manually generated rules to help capture the variation of how GO terms can appear in natural language text. The first set of rules takes into account the compositional nature of GO and recursively decomposes the terms into their smallest constituent parts. The second set of rules generates derivational variations of these smaller terms and compositionally combines all generated variants to form the original term. By applying both types of rules, new synonyms are generated for two-thirds of all GO terms and an increase in F-measure performance for recognition of GO on the CRAFT corpus from 0.498 to 0.636 is observed. Additionally, we evaluated the combination of both types of rules over one million full text documents from Elsevier; manual validation and error analysis show we are able to recognize GO concepts with reasonable accuracy (88%) based on random sampling of annotations. In this work we present a set of simple synonym generation rules that utilize the highly compositional and formulaic nature of the Gene Ontology concepts. We illustrate how the generated synonyms aid in improving recognition of GO concepts on two different biomedical corpora. We discuss other applications of our rules for GO ontology quality assurance, explore the issue of overgeneration, and provide examples of how similar methodologies could be applied to other biomedical terminologies. Additionally, we provide all generated synonyms for use by the text-mining community.
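
    The two rule types in this abstract (recursive decomposition into constituent parts, then compositional recombination of variants) can be pictured with the toy sketch below. It is only an illustration under stated assumptions: the `PART_VARIANTS` table, the greedy longest-prefix decomposition and the function names are hypothetical and are not taken from the paper's rule set.

```python
from itertools import product

# Hypothetical variant table for illustration only; not the paper's rules.
PART_VARIANTS = {
    "regulation of": ["regulation of", "control of"],
    "cell migration": ["cell migration", "migration of cells"],
}


def decompose(term, parts):
    """Recursively split a term into known parts, longest match first.

    Words that match no known part are kept as single-word parts.
    """
    term = term.strip()
    if not term:
        return []
    for part in sorted(parts, key=len, reverse=True):
        if term == part or term.startswith(part + " "):
            return [part] + decompose(term[len(part):], parts)
    head, _, rest = term.partition(" ")
    return [head] + decompose(rest, parts)


def synonyms(term, variants=PART_VARIANTS):
    """Combine every variant of every part back into full-term candidates."""
    pieces = decompose(term, variants)
    options = [variants.get(piece, [piece]) for piece in pieces]
    return {" ".join(combo) for combo in product(*options)}
```

    Because every combination is emitted, a table with many variants per part grows multiplicatively, which is one way to picture the overgeneration issue the abstract mentions.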

  4. Ontogenetic Profile of the Expression of Thyroid Hormone Receptors in Rat and Human Corpora Cavernosa of the Penis

    PubMed Central

    Carosa, Eleonora; Di Sante, Stefania; Rossi, Simona; Castri, Alessandra; D'Adamo, Fabio; Gravina, Giovanni Luca; Ronchi, Piero; Kostrouch, Zdenek; Dolci, Susanna; Lenzi, Andrea; Jannini, Emmanuele A

    2010-01-01

    Introduction In the last few years, various studies have underlined a correlation between thyroid function and male sexual function, hypothesizing a direct action of thyroid hormones on the penis. Aim To study the spatiotemporal distribution of mRNA for the thyroid hormone nuclear receptors (TR) α1, α2 and β in the penis and smooth muscle cells (SMCs) of the corpora cavernosa of rats and humans during development. Methods We used several molecular biology techniques to study the TR expression in whole tissues or primary cultures from human and rodent penile tissues of different ages. Main Outcome Measure We measured our data by semi-quantitative reverse transcription polymerase chain reaction (RT-PCR) amplification, Northern blot and immunohistochemistry. Results We found that TRα1 and TRα2 are both expressed in the penis and in SMCs during ontogenesis without development-dependent changes. However, in the rodent model, TRβ shows an increase from 3 to 6 days post natum (dpn) to 20 dpn, remaining high in adulthood. The same expression profile was observed in humans. While the expression of TRβ is strictly regulated by development, TRα1 is the principal isoform present in corpora cavernosa, suggesting its importance in SMC function. These results have been confirmed by immunohistochemistry localization in SMCs and endothelial cells of the corpora cavernosa. Conclusions The presence of TRs in the penis provides the biological basis for the direct action of thyroid hormones on this organ. Given this evidence, physicians would be advised to investigate sexual function in men with thyroid disorders. Carosa E, Di Sante S, Rossi S, Castri A, D'Adamo F, Gravina GL, Ronchi P, Kostrouch Z, Dolci S, Lenzi A, and Jannini EA. Ontogenetic profile of the expression of thyroid hormone receptors in rat and human corpora cavernosa of the penis. J Sex Med 2010;7:1381–1390. PMID:20141582

  5. Surgical anatomy of penis in exstrophy-epispadias: a study of arrangement of fascial planes and superficial vessels of surgical significance.

    PubMed

    Kureel, Shiv Narain; Gupta, Archika; Singh, Chandra Shekhar; Kumar, Manoj

    2013-10-01

    To study the anatomic arrangement of the fascial planes and superficial vessels in relationship to the laid-open urethral plate, glans, corpus spongiosum, and corpora cavernosa in the penis of patients with exstrophy or epispadias. Of 6 patients, 4 had classic exstrophy and 2 had incontinent epispadias. These patients had presented beyond adolescence without previous intervention and were selected for the present study. Using a 1.5-T magnetic resonance imaging scanner and compatible 3-in. surface coil, the epispadiac penises were studied using fast spin echo sequences and contrast-enhanced sequences. In 2 patients, angiography of the superficial vessels was also performed using multidetector row helical computed tomography. The imaging findings were also verified during the subsequent reconstructive surgery. A clear demarcation of the skin, dartos fascia, Buck's fascia, corpora cavernosa, corpus spongiosum, and the intraglanular planes were seen with the course of the blood vessels. The penile dartos received axial pattern vessels from the external pudendal vessels, with collateral branches from the dorsal penile artery as transverse branches at the shaft of the penis and preputial branches at the coronal sulcus. Buck's fascia sleeved the corpora cavernosa, enveloped the neurovascular bundle, and fused with the corpus spongiosum without crossing the midline. Intraglanular extension of Buck's fascia separated the intraglanular vascular arcade from the tip of the corpora. Parallel to the ventral midline, axial pattern vessels to the skin-dartos complex are present, with an additional supply to the prepuce from the terminal penile arteries. These findings can be used for designing the skin coverage. The subfascial plane between the tip of the corpora and the intraglanular vascular arcade and the plane of cleavage between the cavernosa-spongiosum interface can be used for efficient corporal urethral separation. Copyright © 2013 Elsevier Inc. All rights reserved.

  6. Electronic publishing: opportunities and challenges for clinical linguistics and phonetics.

    PubMed

    Powell, Thomas W; Müller, Nicole; Ball, Martin J

    2003-01-01

    This paper discusses the contributions of informatics technology to the field of clinical linguistics and phonetics. The electronic publication of research reports and books has facilitated both the dissemination and the retrieval of scientific information. Electronic archives of speech and language corpora, too, stimulate research efforts. Although technology provides many opportunities, there remain significant challenges. Establishment and maintenance of scientific archives is largely dependent upon volunteer efforts, and there are few standards to ensure long-term access. Coordinated efforts and peer review are necessary to ensure utility and quality.

  7. All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning.

    PubMed

    Airola, Antti; Pyysalo, Sampo; Björne, Jari; Pahikkala, Tapio; Ginter, Filip; Salakoski, Tapio

    2008-11-19

    Automated extraction of protein-protein interactions (PPI) is an important and widely studied task in biomedical text mining. We propose a graph kernel based approach for this task. In contrast to earlier approaches to PPI extraction, the introduced all-paths graph kernel has the capability to make use of full, general dependency graphs representing the sentence structure. We evaluate the proposed method on five publicly available PPI corpora, providing the most comprehensive evaluation done for a machine learning based PPI-extraction system. We additionally perform a detailed evaluation of the effects of training and testing on different resources, providing insight into the challenges involved in applying a system beyond the data it was trained on. Our method is shown to achieve state-of-the-art performance with respect to comparable evaluations, with 56.4 F-score and 84.8 AUC on the AImed corpus. We show that the graph kernel approach performs on state-of-the-art level in PPI extraction, and note the possible extension to the task of extracting complex interactions. Cross-corpus results provide further insight into how the learning generalizes beyond individual corpora. Further, we identify several pitfalls that can make evaluations of PPI-extraction systems incomparable, or even invalid. These include incorrect cross-validation strategies and problems related to comparing F-score results achieved on different evaluation resources. Recommendations for avoiding these pitfalls are provided.

  8. Semi-Supervised Recurrent Neural Network for Adverse Drug Reaction mention extraction.

    PubMed

    Gupta, Shashank; Pawar, Sachin; Ramrakhiyani, Nitin; Palshikar, Girish Keshav; Varma, Vasudeva

    2018-06-13

    Social media is a useful platform to share health-related information due to its vast reach. This makes it a good candidate for public-health monitoring tasks, specifically for pharmacovigilance. We study the problem of extraction of Adverse-Drug-Reaction (ADR) mentions from social media, particularly from Twitter. Medical information extraction from social media is challenging, mainly due to short and highly informal nature of text, as compared to more technical and formal medical reports. Current methods in ADR mention extraction rely on supervised learning methods, which suffer from labeled data scarcity problem. The state-of-the-art method uses deep neural networks, specifically a class of Recurrent Neural Network (RNN) which is Long-Short-Term-Memory network (LSTM). Deep neural networks, due to their large number of free parameters rely heavily on large annotated corpora for learning the end task. But in the real-world, it is hard to get large labeled data, mainly due to the heavy cost associated with the manual annotation. To this end, we propose a novel semi-supervised learning based RNN model, which can leverage unlabeled data also present in abundance on social media. Through experiments we demonstrate the effectiveness of our method, achieving state-of-the-art performance in ADR mention extraction. In this study, we tackle the problem of labeled data scarcity for Adverse Drug Reaction mention extraction from social media and propose a novel semi-supervised learning based method which can leverage large unlabeled corpus available in abundance on the web. Through empirical study, we demonstrate that our proposed method outperforms fully supervised learning based baseline which relies on large manually annotated corpus for a good performance.

  9. Effects of ageing and streptozotocin–induced diabetes on connexin43 and P2 purinoceptor expression in the rat corpora cavernosa and urinary bladder

    PubMed Central

    Suadicani, Sylvia O.; Urban–Maldonado, Marcia; Tar, Moses T.; Melman, Arnold; Spray, David C.

    2012-01-01

    OBJECTIVE To investigate whether ageing and diabetes alter the expression of the gap junction protein connexin43 (Cx43) and of particular purinoceptor (P2R) subtypes in the corpus cavernosum and urinary bladder, and determine whether changes in expression of these proteins correlate with development of erectile and bladder dysfunction in diabetic and ageing rats. MATERIALS AND METHODS Erectile and bladder function of streptozotocin (STZ)-induced diabetic, insulin-treated and age-matched control Fischer-344 rats were evaluated 2, 4 and 8 months after diabetes induction by in vivo cystometry and cavernosometry. Corporal and bladder tissue were then isolated at each of these sample times and protein expression levels of Cx43 and of various P2R subtypes were determined by Western blotting. RESULTS In the corpora of control rats ageing was accompanied by a significant decrease in Cx43 and P2X1R, and increase in P2X7R expression. There was decreased Cx43 and increased P2Y4R expression in the ageing control rat bladder. There was a significant negative correlation between erectile capacity and P2X1R expression levels, and a positive correlation between bladder spontaneous activity and P2Y4R expression levels. There was already development of erectile dysfunction and bladder overactivity at 2 months after inducing diabetes, the earliest sample measured in the study. The development of these urogenital complications was accompanied by significant decreases in Cx43, P2Y2R, P2X4R and increase in P2X1R expression in the corpora, and by a doubling in Cx43 and P2Y2R, and significant increase in P2Y4R expression in the bladder. Changes in Cx43 and P2R expression were largely prevented by insulin therapy. CONCLUSION Ageing and diabetes mellitus markedly altered the expression of the gap junction protein Cx43 and of particular P2R subtypes in the rat penile corpora and urinary bladder. 
These changes in Cx43 and P2R expression provide the molecular substrate for altered gap junction and purinergic signalling in these tissues, and thus probably contribute to the early development of erectile dysfunction and higher detrusor activity in ageing and in diabetic rats. PMID:19154470

  10. Towards a Generalizable Time Expression Model for Temporal Reasoning in Clinical Notes

    PubMed Central

    Velupillai, Sumithra; Mowery, Danielle L.; Abdelrahman, Samir; Christensen, Lee; Chapman, Wendy W

    2015-01-01

    Accurate temporal identification and normalization is imperative for many biomedical and clinical tasks such as generating timelines and identifying phenotypes. A major natural language processing challenge is developing and evaluating a generalizable temporal modeling approach that performs well across corpora and institutions. Our long-term goal is to create such a model. We initiate our work on reaching this goal by focusing on temporal expression (TIMEX3) identification. We present a systematic approach to 1) generalize existing solutions for automated TIMEX3 span detection, and 2) assess similarities and differences by various instantiations of TIMEX3 models applied on separate clinical corpora. When evaluated on the 2012 i2b2 and the 2015 Clinical TempEval challenge corpora, our conclusion is that our approach is successful – we achieve competitive results for automated classification, and we identify similarities and differences in TIMEX3 modeling that will be informative in the development of a simplified, general temporal model. PMID:26958265

  11. Sex Determination in Bees. IV. Genetic Control of Juvenile Hormone Production in MELIPONA QUADRIFASCIATA (Apidae)

    PubMed Central

    Kerr, Warwick Estevam; Akahira, Yukio; Camargo, Conceição A.

    1975-01-01

    Cell number and volume of corpora allata were determined for 8 phases of development, from the first prepupal stage to adults 30 days old, in the social Apidae Melipona quadrifasciata. In the second prepupal stage a strong correlation was found between cell number and body weight (r=0.651**), and between cell number and corpora allata volume in the prepupal stage (r=0.535*), which indicates that juvenile hormone has a definite role in caste determination in Melipona. The distribution of the volume of corpus allatum suggests a 3:1 segregation between bees with high volume of corpora allata against low and medium volume. This implies that genes xa and xb code for an enzyme that directly participates in juvenile hormone production. It was also concluded that the number of cells in the second prepupal stage is more important than the weight of the prepupa for caste determination. A scheme summarizing the genic control of sex and caste determination in Melipona bees in the prepupal phase is given. PMID:1213273

  12. Taking Advantage of the "Big Mo"—Momentum in Everyday English and Swedish and in Physics Teaching

    NASA Astrophysics Data System (ADS)

    Haglund, Jesper; Jeppsson, Fredrik; Ahrenberg, Lars

    2015-06-01

    Science education research suggests that our everyday intuitions of motion and interaction of physical objects fit well with how physicists use the term "momentum". Corpus linguistics provides an easily accessible approach to study language in different domains, including everyday language. Analysis of language samples from English text corpora reveals a trend of increasing metaphorical use of "momentum" in non-science domains, and through conceptual metaphor analysis, we show that the use of the word in everyday language, as opposed to for instance "force", is largely adequate from a physics point of view. In addition, "momentum" has recently been borrowed into Swedish as a metaphor in domains such as sports, politics and finance, with meanings similar to those in physics. As an implication for educational practice, we find support for the suggestion to introduce the term "momentum" to English-speaking pupils at an earlier age than what is typically done in the educational system today, thereby capitalising on their intuitions and experiences of everyday language. For Swedish-speaking pupils, and possibly also relevant to other languages, the parallel between "momentum" and the corresponding physics term in the students' mother tongue could be made explicit.

  13. Negated bio-events: analysis and identification

    PubMed Central

    2013-01-01

    Background Negation occurs frequently in scientific literature, especially in biomedical literature. It has previously been reported that around 13% of sentences found in biomedical research articles contain negation. Historically, the main motivation for identifying negated events has been to ensure their exclusion from lists of extracted interactions. However, recently, there has been a growing interest in negative results, which has resulted in negation detection being identified as a key challenge in biomedical relation extraction. In this article, we focus on the problem of identifying negated bio-events, given gold standard event annotations. Results We have conducted a detailed analysis of three open access bio-event corpora containing negation information (i.e., GENIA Event, BioInfer and BioNLP’09 ST), and have identified the main types of negated bio-events. We have analysed the key aspects of a machine learning solution to the problem of detecting negated events, including selection of negation cues, feature engineering and the choice of learning algorithm. Combining the best solutions for each aspect of the problem, we propose a novel framework for the identification of negated bio-events. We have evaluated our system on each of the three open access corpora mentioned above. The performance of the system significantly surpasses the best results previously reported on the BioNLP’09 ST corpus, and achieves even better results on the GENIA Event and BioInfer corpora, both of which contain more varied and complex events. Conclusions Recently, in the field of biomedical text mining, the development and enhancement of event-based systems has received significant interest. The ability to identify negated events is a key performance element for these systems. We have conducted the first detailed study on the analysis and identification of negated bio-events. Our proposed framework can be integrated with state-of-the-art event extraction systems. 
The resulting systems will be able to extract bio-events with attached polarities from textual documents, which can serve as the foundation for more elaborate systems that are able to detect mutually contradicting bio-events. PMID:23323936

  14. Using automatic alignment to analyze endangered language data: Testing the viability of untrained alignment

    PubMed Central

    DiCanio, Christian; Nam, Hosung; Whalen, Douglas H.; Timothy Bunnell, H.; Amith, Jonathan D.; García, Rey Castillo

    2013-01-01

    While efforts to document endangered languages have steadily increased, the phonetic analysis of endangered language data remains a challenge. The transcription of large documentation corpora is, by itself, a tremendous feat. Yet, the process of segmentation remains a bottleneck for research with data of this kind. This paper examines whether a speech processing tool, forced alignment, can facilitate the segmentation task for small data sets, even when the target language differs from the training language. The authors also examined whether a phone set with contextualization outperforms a more general one. The accuracy of two forced aligners trained on English (hmalign and p2fa) was assessed using corpus data from Yoloxóchitl Mixtec. Overall, agreement performance was relatively good, with accuracy at 70.9% within 30 ms for hmalign and 65.7% within 30 ms for p2fa. Segmental and tonal categories influenced accuracy as well. For instance, additional stop allophones in hmalign's phone set aided alignment accuracy. Agreement differences between aligners also corresponded closely with the types of data on which the aligners were trained. Overall, using existing alignment systems was found to have potential for making phonetic analysis of small corpora more efficient, with more allophonic phone sets providing better agreement than general ones. PMID:23967953
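
    The agreement figures above (for example, 70.9% within 30 ms) are the kind of number a boundary-tolerance check produces. The sketch below is only an illustration under stated assumptions: the function name `agreement_within`, the one-to-one pairing of manual and aligner boundaries, and the sample times are all hypothetical and are not taken from the study.

```python
def agreement_within(reference, predicted, tolerance_s=0.030):
    """Fraction of predicted boundaries that fall within tolerance_s
    seconds of the corresponding reference (hand-labelled) boundaries.

    Assumes the two lists are already paired one-to-one, in order.
    """
    if len(reference) != len(predicted):
        raise ValueError("boundary lists must be the same length")
    if not reference:
        return 0.0
    hits = sum(1 for r, p in zip(reference, predicted)
               if abs(r - p) <= tolerance_s)
    return hits / len(reference)


# Hypothetical boundary times in seconds (illustrative, not study data).
manual = [0.100, 0.250, 0.480, 0.730]
forced = [0.110, 0.295, 0.470, 0.735]

# Three of the four boundaries differ by 30 ms or less.
print(f"{agreement_within(manual, forced):.1%} of boundaries within 30 ms")
# prints "75.0% of boundaries within 30 ms"
```

    Changing `tolerance_s` shows how sensitive an agreement score is to the chosen threshold.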

  15. Using automatic alignment to analyze endangered language data: testing the viability of untrained alignment.

    PubMed

    DiCanio, Christian; Nam, Hosung; Whalen, Douglas H; Bunnell, H Timothy; Amith, Jonathan D; García, Rey Castillo

    2013-09-01

    While efforts to document endangered languages have steadily increased, the phonetic analysis of endangered language data remains a challenge. The transcription of large documentation corpora is, by itself, a tremendous feat. Yet, the process of segmentation remains a bottleneck for research with data of this kind. This paper examines whether a speech processing tool, forced alignment, can facilitate the segmentation task for small data sets, even when the target language differs from the training language. The authors also examined whether a phone set with contextualization outperforms a more general one. The accuracy of two forced aligners trained on English (hmalign and p2fa) was assessed using corpus data from Yoloxóchitl Mixtec. Overall, agreement performance was relatively good, with accuracy at 70.9% within 30 ms for hmalign and 65.7% within 30 ms for p2fa. Segmental and tonal categories influenced accuracy as well. For instance, additional stop allophones in hmalign's phone set aided alignment accuracy. Agreement differences between aligners also corresponded closely with the types of data on which the aligners were trained. Overall, using existing alignment systems was found to have potential for making phonetic analysis of small corpora more efficient, with more allophonic phone sets providing better agreement than general ones.

  16. Image Location Estimation by Salient Region Matching.

    PubMed

    Qian, Xueming; Zhao, Yisi; Han, Junwei

    2015-11-01

    Nowadays, locations of images have been widely used in many application scenarios for large geo-tagged image corpora. As to images which are not geographically tagged, we estimate their locations with the help of the large geo-tagged image set by content-based image retrieval. In this paper, we exploit spatial information of useful visual words to improve image location estimation (or content-based image retrieval performances). We proposed to generate visual word groups by mean-shift clustering. To improve the retrieval performance, spatial constraint is utilized to code the relative position of visual words. We proposed to generate a position descriptor for each visual word and build fast indexing structure for visual word groups. Experiments show the effectiveness of our proposed approach.

  17. Computing quality scores and uncertainty for approximate pattern matching in geospatial semantic graphs

    DOE PAGES

    Stracuzzi, David John; Brost, Randolph C.; Phillips, Cynthia A.; ...

    2015-09-26

    Geospatial semantic graphs provide a robust foundation for representing and analyzing remote sensor data. In particular, they support a variety of pattern search operations that capture the spatial and temporal relationships among the objects and events in the data. However, in the presence of large data corpora, even a carefully constructed search query may return a large number of unintended matches. This work considers the problem of calculating a quality score for each match to the query, given that the underlying data are uncertain. As a result, we present a preliminary evaluation of three methods for determining both match quality scores and associated uncertainty bounds, illustrated in the context of an example based on overhead imagery data.

  18. Chapter 16: text mining for translational bioinformatics.

    PubMed

    Cohen, K Bretonnel; Hunter, Lawrence E

    2013-04-01

    Text mining for translational bioinformatics is a new field with tremendous research potential. It is a subfield of biomedical natural language processing that concerns itself directly with the problem of relating basic biomedical research to clinical practice, and vice versa. Applications of text mining fall both into the category of T1 translational research (translating basic science results into new interventions) and T2 translational research, or translational research for public health. Potential use cases include better phenotyping of research subjects, and pharmacogenomic research. A variety of methods for evaluating text mining applications exist, including corpora, structured test suites, and post hoc judging. Two basic principles of linguistic structure are relevant for building text mining applications. One is that linguistic structure consists of multiple levels. The other is that every level of linguistic structure is characterized by ambiguity. There are two basic approaches to text mining: rule-based, also known as knowledge-based; and machine-learning-based, also known as statistical. Many systems are hybrids of the two approaches. Shared tasks have had a strong effect on the direction of the field. Like all translational bioinformatics software, text mining software for translational bioinformatics can be considered health-critical and should be subject to the strictest standards of quality assurance and software testing.

  19. Detection of IUPAC and IUPAC-like chemical names.

    PubMed

    Klinger, Roman; Kolárik, Corinna; Fluck, Juliane; Hofmann-Apitius, Martin; Friedrich, Christoph M

    2008-07-01

    Chemical compounds like small signal molecules or other biologically active chemical substances are an important entity class in life science publications and patents. Several representations and nomenclatures for chemicals like SMILES, InChI, IUPAC or trivial names exist. Only SMILES and InChI names allow a direct structure search, but in biomedical texts trivial names and IUPAC-like names are used more frequently. While trivial names can be found with a dictionary-based approach and in such a way mapped to their corresponding structures, it is not possible to enumerate all IUPAC names. In this work, we present a new machine learning approach based on conditional random fields (CRF) to find mentions of IUPAC and IUPAC-like names in scientific text as well as its evaluation and the conversion rate with available name-to-structure tools. We present an IUPAC name recognizer with an F(1) measure of 85.6% on a MEDLINE corpus. The evaluation of different CRF orders and offset conjunction orders demonstrates the importance of these parameters. An evaluation of hand-selected patent sections containing large enumerations and terms with mixed nomenclature shows a good performance on these cases (F(1) measure 81.5%). Remaining recognition problems are to detect correct borders of the typically long terms, especially when occurring in parentheses or enumerations. We demonstrate the scalability of our implementation by providing results from a full MEDLINE run. We plan to publish the corpora, annotation guideline as well as the conditional random field model as a UIMA component.

  20. NLP-PIER: A Scalable Natural Language Processing, Indexing, and Searching Architecture for Clinical Notes.

    PubMed

    McEwan, Reed; Melton, Genevieve B; Knoll, Benjamin C; Wang, Yan; Hultman, Gretchen; Dale, Justin L; Meyer, Tim; Pakhomov, Serguei V

    2016-01-01

    Many design considerations must be addressed in order to provide researchers with full text and semantic search of unstructured healthcare data such as clinical notes and reports. Institutions looking at providing this functionality must also address the big data aspects of their unstructured corpora. Because these systems are complex and demand a non-trivial investment, there is an incentive to make the system capable of servicing future needs as well, further complicating the design. We present architectural best practices as lessons learned in the design and implementation of NLP-PIER (Patient Information Extraction for Research), a scalable, extensible, and secure system for processing, indexing, and searching clinical notes at the University of Minnesota.

  1. Neural Influences on Sonic Hedgehog and Apoptosis in the Rat Penis

    PubMed Central

    Bond, Christopher; Tang, Yi; Podlasek, Carol A.

    2010-01-01

    The role of sonic hedgehog (SHH) in maintaining corpora cavernosal morphology in the adult penis has been established; however, the mechanism of how SHH itself is regulated remains unclear. Since decreased SHH protein is a cause of smooth muscle apoptosis and erectile dysfunction (ED) in the penis, and SHH treatment can suppress cavernous nerve (CN) injury-induced apoptosis, the question of how SHH signaling is regulated is significant. It is likely that neural input is involved in this process since two models of neuropathy-induced ED exhibit decreased SHH protein and increased apoptosis in the penis. We propose the hypothesis that SHH abundance in the corpora cavernosa is regulated by SHH signaling in the pelvic ganglia, neural activity, or neural transport of a trophic factor from the pelvic ganglia to the corpora. We have examined each of these potential mechanisms. SHH inhibition in the penis shows a 12-fold increase in smooth muscle apoptosis. SHH inhibition in the pelvic ganglia causes significantly increased apoptosis (1.3-fold) and decreased SHH protein (1.1-fold) in the corpora cavernosa. SHH protein is not transported by the CN. Colchicine treatment of the CN resulted in significantly increased smooth muscle apoptosis (1.2-fold) and decreased SHH protein (1.3-fold) in the penis. Lidocaine treatment of the CN caused a similar increase in apoptosis (1.6-fold) and decrease in SHH protein (1.3-fold) in the penis. These results show that neural activity and a trophic factor from the pelvic ganglia/CN are necessary to regulate SHH protein and smooth muscle abundance in the penis. PMID:18256331

  2. Ageing causes cytoplasmic retention of MaxiK channels in rat corporal smooth muscle cells

    PubMed Central

    Davies, KP; Stanevsky, Y; Moses, T; Chang, JS; Chance, MR; Melman, A

    2007-01-01

    The MaxiK channel plays a critical role in the regulation of corporal smooth muscle tone and thereby erectile function. Given that ageing results in a decline in erectile function, we determined changes in the expression of MaxiK, which might impact erectile function. Quantitative-polymerase chain reaction demonstrated that although there is no significant change in transcription of the α- and β-subunits that comprise the MaxiK channel, there are significant changes in the expression of transcripts encoding different splice variants. One transcript, SV1, is 13-fold increased in expression in the ageing rat corpora. SV1 has previously been reported to trap other isoforms of the MaxiK channel in the cytoplasm. Correlating with increased expression of SV1, we observed in older rats there is approximately a 13-fold decrease in MaxiK protein in the corpora cell membrane and a greater proportion is retained in the cytoplasm (approximately threefold). These experiments demonstrate that ageing of the corpora is accompanied by changes in alternative splicing and cellular localization of the MaxiK channel. PMID:17287835

  3. Ageing causes cytoplasmic retention of MaxiK channels in rat corporal smooth muscle cells.

    PubMed

    Davies, K P; Stanevsky, Y; Tar, M T; Moses, T; Chang, J S; Chance, M R; Melman, A

    2007-01-01

    The MaxiK channel plays a critical role in the regulation of corporal smooth muscle tone and thereby erectile function. Given that ageing results in a decline in erectile function, we determined changes in the expression of MaxiK, which might impact erectile function. Quantitative-polymerase chain reaction demonstrated that although there is no significant change in transcription of the alpha- and beta-subunits that comprise the MaxiK channel, there are significant changes in the expression of transcripts encoding different splice variants. One transcript, SV1, is 13-fold increased in expression in the ageing rat corpora. SV1 has previously been reported to trap other isoforms of the MaxiK channel in the cytoplasm. Correlating with increased expression of SV1, we observed in older rats there is approximately a 13-fold decrease in MaxiK protein in the corpora cell membrane and a greater proportion is retained in the cytoplasm (approximately threefold). These experiments demonstrate that ageing of the corpora is accompanied by changes in alternative splicing and cellular localization of the MaxiK channel.

  4. Mining protein phosphorylation information from biomedical literature using NLP parsing and Support Vector Machines.

    PubMed

    Raja, Kalpana; Natarajan, Jeyakumar

    2018-07-01

    Extraction of protein phosphorylation information from biomedical literature has gained much attention because of the importance in numerous biological processes. In this study, we propose a text mining methodology which consists of two phases, NLP parsing and SVM classification to extract phosphorylation information from literature. First, using NLP parsing we divide the data into three base-forms depending on the biomedical entities related to phosphorylation and further classify into ten sub-forms based on their distribution with phosphorylation keyword. Next, we extract the phosphorylation entity singles/pairs/triplets and apply SVM to classify the extracted singles/pairs/triplets using a set of features applicable to each sub-form. The performance of our methodology was evaluated on three corpora namely PLC, iProLink and hPP corpus. We obtained promising results of >85% F-score on ten sub-forms of training datasets on cross validation test. Our system achieved overall F-score of 93.0% on iProLink and 96.3% on hPP corpus test datasets. Furthermore, our proposed system achieved best performance on cross corpus evaluation and outperformed the existing system with recall of 90.1%. The performance analysis of our unique system on three corpora reveals that it extracts protein phosphorylation information efficiently in both non-organism specific general datasets such as PLC and iProLink, and human specific dataset such as hPP corpus. Copyright © 2018 Elsevier B.V. All rights reserved.

  5. Relationships between ovarian blood flow and ovarian response to eCG-treatment of dairy cows.

    PubMed

    Honnens, A; Niemann, H; Herzog, K; Paul, V; Meyer, H H D; Bollwein, H

    2009-07-01

    The goal of the present study was to investigate ovarian blood flow and ovarian response in cows undergoing a gonadotropin treatment to induce a superovulatory response, using transrectal colour Doppler sonography. Forty-two cows including 19 cross-bred, 14 German Holstein and 9 German Black Pied cows were examined sonographically before hormonal stimulation on Day 10 of the oestrous cycle, three days after administration of eCG (Day 13) and seven days after artificial insemination (Day 7 p.i.). After each Doppler examination, blood was collected for determination of total oestrogens (E) and progesterone (P4) in peripheral plasma. The blood flow volume (BFV) and pulsatility index (PI), which is a measure for blood flow resistance, were determined in the ovarian artery, and B-mode sonography was used to count dominant follicles and corpora lutea. Important criteria to assess the ovarian response following the hormonal treatment were the number of follicles >5 mm in diameter on Day 13 and the number of corpora lutea on Day 7 p.i. per cow. The number of follicles ranged from 2 to 61 (mean ± S.E.M.: 17.5 ± 1.7) and corpora lutea from 0 to 50 (mean ± S.E.M.: 17.0 ± 1.6). The BFV increased from 28.4 to 45.0 ml/min between Days 10 and 13 and reached a maximum of 108.5 ml/min on Day 7 p.i. The PI decreased from 6.25 on Day 10 to 4.70 on Day 13 and to 2.10 on Day 7 p.i. The BFV and PI on Day 13 did not correlate with the number of follicles (P>0.05). However, on Day 7 p.i. the number of corpora lutea correlated positively with the BFV (r=0.64; P<0.0001), and an inverse relationship was found for the PI (r=-0.51; P=0.0005). There were no correlations (P>0.05) between the BFV and PI on Day 10 and the number of follicles on Day 13 or the number of corpora lutea on Day 7 p.i. Results of the present study show that in cows, a hormonal treatment to induce a superovulatory response yielded a marked increase in BFV and a marked decrease in PI in the ovarian artery. However, there was no correlation between BFV and PI in the ovarian arteries before hormonal stimulation and the number of follicles and corpora lutea that developed after stimulation. Thus BFV and PI measured in the ovarian arteries have limited diagnostic value to predict the outcome of a gonadotropin treatment.

  6. Mining the pharmacogenomics literature—a survey of the state of the art

    PubMed Central

    Cohen, K. Bretonnel; Garten, Yael; Shah, Nigam H.

    2012-01-01

    This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research. PMID:22833496

  7. Mining the pharmacogenomics literature--a survey of the state of the art.

    PubMed

    Hahn, Udo; Cohen, K Bretonnel; Garten, Yael; Shah, Nigam H

    2012-07-01

    This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research.

  8. PKDE4J: Entity and relation extraction for public knowledge discovery.

    PubMed

    Song, Min; Kim, Won Chul; Lee, Dahee; Heo, Go Eun; Kang, Keun Young

    2015-10-01

    Due to an enormous number of scientific publications that cannot be handled manually, there is a rising interest in text-mining techniques for automated information extraction, especially in the biomedical field. Such techniques provide effective means of information search, knowledge discovery, and hypothesis generation. Most previous studies have primarily focused on the design and performance improvement of either named entity recognition or relation extraction. In this paper, we present PKDE4J, a comprehensive text-mining system that integrates dictionary-based entity extraction and rule-based relation extraction in a highly flexible and extensible framework. Starting with the Stanford CoreNLP, we developed the system to cope with multiple types of entities and relations. The system also has fairly good performance in terms of accuracy as well as the ability to configure text-processing components. We demonstrate its competitive performance by evaluating it on many corpora and found that it surpasses existing systems with average F-measures of 85% for entity extraction and 81% for relation extraction. Copyright © 2015 Elsevier Inc. All rights reserved.

  9. Comparing published scientific journal articles to their pre-print versions

    DOE PAGES

    Klein, Martin; Broadwell, Peter; Farb, Sharon E.; ...

    2018-02-05

    Academic publishers claim that they add value to scholarly communications by coordinating reviews and contributing and enhancing text during publication. These contributions come at a considerable cost: US academic libraries paid $1.7 billion for serial subscriptions in 2008 alone. Library budgets, in contrast, are flat and not able to keep pace with serial price inflation. Here, we have investigated the publishers’ value proposition by conducting a comparative study of pre-print papers from two distinct science, technology, and medicine corpora and their final published counterparts. This comparison had two working assumptions: (1) If the publishers’ argument is valid, the text of a pre-print paper should vary measurably from its corresponding final published version, and (2) by applying standard similarity measures, we should be able to detect and quantify such differences. Our analysis revealed that the text contents of the scientific papers generally changed very little from their pre-print to final published versions. These findings contribute empirical indicators to discussions of the added value of commercial publishers and therefore should influence libraries’ economic decisions regarding access to scholarly publications.
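
    The abstract does not say which similarity measures were applied. As a rough illustration only, the sketch below uses the matching ratio from Python's standard `difflib` module as a stand-in; the helper name `word_similarity` and the two sample sentences are hypothetical.

```python
from difflib import SequenceMatcher


def word_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two texts, compared word by word.

    Uses difflib's ratio 2*M/T, where M is the number of matching
    tokens and T is the total number of tokens in both texts.
    This is a stand-in measure, not the one used in the study.
    """
    return SequenceMatcher(None, a.split(), b.split()).ratio()


# Hypothetical pre-print and published sentences (not from any real paper).
preprint = "We measure the similarity of text between versions"
published = "We measure the similarity of text between the two versions"

score = word_similarity(preprint, published)
print(f"similarity: {score:.3f}")
```

    A score near 1.0 would correspond to the "changed very little" finding reported above; real comparisons would first need to normalise markup, whitespace, and reference formatting.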

  10. Identifying biological concepts from a protein-related corpus with a probabilistic topic model

    PubMed Central

    Zheng, Bin; McLean, David C; Lu, Xinghua

    2006-01-01

    Background Biomedical literature, e.g., MEDLINE, contains a wealth of knowledge regarding functions of proteins. Major recurring biological concepts within such text corpora represent the domains of this body of knowledge. The goal of this research is to identify the major biological topics/concepts from a corpus of protein-related MEDLINE© titles and abstracts by applying a probabilistic topic model. Results The latent Dirichlet allocation (LDA) model was applied to the corpus. Based on the Bayesian model selection, 300 major topics were extracted from the corpus. The majority of identified topics/concepts was found to be semantically coherent and most represented biological objects or concepts. The identified topics/concepts were further mapped to the controlled vocabulary of the Gene Ontology (GO) terms based on mutual information. Conclusion The major and recurring biological concepts within a collection of MEDLINE documents can be extracted by the LDA model. The identified topics/concepts provide parsimonious and semantically-enriched representation of the texts in a semantic space with reduced dimensionality and can be used to index text. PMID:16466569

  11. Automated annotation of functional imaging experiments via multi-label classification

    PubMed Central

    Turner, Matthew D.; Chakrabarti, Chayan; Jones, Thomas B.; Xu, Jiawei F.; Fox, Peter T.; Luger, George F.; Laird, Angela R.; Turner, Jessica A.

    2013-01-01

    Identifying the experimental methods in human neuroimaging papers is important for grouping meaningfully similar experiments for meta-analyses. Currently, this can only be done by human readers. We present the performance of common machine learning (text mining) methods applied to the problem of automatically classifying or labeling this literature. Labeling terms are from the Cognitive Paradigm Ontology (CogPO), the text corpora are abstracts of published functional neuroimaging papers, and the methods use the performance of a human expert as training data. We aim to replicate the expert's annotation of multiple labels per abstract identifying the experimental stimuli, cognitive paradigms, response types, and other relevant dimensions of the experiments. We use several standard machine learning methods: naive Bayes (NB), k-nearest neighbor, and support vector machines (specifically SMO or sequential minimal optimization). Exact match performance ranged from only 15% in the worst cases to 78% in the best cases. NB methods combined with binary relevance transformations performed strongly and were robust to overfitting. This collection of results demonstrates what can be achieved with off-the-shelf software components and little to no pre-processing of raw text. PMID:24409112
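
    "Exact match" in multi-label classification is commonly read as the share of documents whose full predicted label set equals the expert's label set; that reading is an assumption here, since the abstract does not define the metric. A minimal sketch with a hypothetical function name and made-up labels:

```python
def exact_match_ratio(gold, predicted):
    """Share of documents whose predicted label set equals the gold
    label set exactly (often called subset accuracy).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    matches = sum(set(g) == set(p) for g, p in zip(gold, predicted))
    return matches / len(gold)


# Hypothetical annotations for four abstracts (illustrative labels only).
gold_labels = [
    {"visual", "button press"},
    {"auditory"},
    {"visual", "speech"},
    {"none"},
]
predicted_labels = [
    {"button press", "visual"},  # same set, different order: counts
    {"auditory"},
    {"visual"},                  # one label missing: does not count
    {"none"},
]

print(exact_match_ratio(gold_labels, predicted_labels))  # prints 0.75
```

    Because a single missing or extra label makes the whole document count as wrong, this metric is strict, which is consistent with the wide 15% to 78% range quoted above.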

  12. Comparing published scientific journal articles to their pre-print versions

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Klein, Martin; Broadwell, Peter; Farb, Sharon E.

    Academic publishers claim that they add value to scholarly communications by coordinating reviews and contributing and enhancing text during publication. These contributions come at a considerable cost: US academic libraries paid $1.7 billion for serial subscriptions in 2008 alone. Library budgets, in contrast, are flat and not able to keep pace with serial price inflation. Here, we have investigated the publishers’ value proposition by conducting a comparative study of pre-print papers from two distinct science, technology, and medicine corpora and their final published counterparts. This comparison had two working assumptions: (1) If the publishers’ argument is valid, the text of a pre-print paper should vary measurably from its corresponding final published version, and (2) by applying standard similarity measures, we should be able to detect and quantify such differences. Our analysis revealed that the text contents of the scientific papers generally changed very little from their pre-print to final published versions. These findings contribute empirical indicators to discussions of the added value of commercial publishers and therefore should influence libraries’ economic decisions regarding access to scholarly publications.
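
    As a rough illustration of what comparing two versions of a text with standard similarity measures can look like, the sketch below computes a word-set Jaccard similarity and an order-sensitive sequence ratio. The abstract does not say which measures the authors used, so both measures, the tokenizer, and the two example sentences are assumptions chosen for illustration.

```python
import difflib
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase a text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard_similarity(a: str, b: str) -> float:
    """Share of distinct words that the two texts have in common."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def sequence_similarity(a: str, b: str) -> float:
    """Order-sensitive similarity of the two token sequences (0.0 to 1.0)."""
    matcher = difflib.SequenceMatcher(None, tokenize(a), tokenize(b), autojunk=False)
    return matcher.ratio()


# Hypothetical pre-print and published versions of one sentence.
preprint = "We measure how much the text changes before publication."
published = "We measure how much the text changes prior to publication."

jac = jaccard_similarity(preprint, published)
seq = sequence_similarity(preprint, published)
```

    Values close to 1.0 on both measures would indicate that a published version differs little from its pre-print; the set-based measure ignores word order, while the sequence ratio also penalizes reordering.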

  13. Dose-Volume Parameters of the Corpora Cavernosa Do Not Correlate With Erectile Dysfunction After External Beam Radiotherapy for Prostate Cancer: Results From a Dose-Escalation Trial

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wielen, Gerard J. van der; Hoogeman, Mischa S.; Dohle, Gert R.

    2008-07-01

    Purpose: To analyze the correlation between dose-volume parameters of the corpora cavernosa and erectile dysfunction (ED) after external beam radiotherapy (EBRT) for prostate cancer. Methods and Materials: Between June 1997 and February 2003, a randomized dose-escalation trial comparing 68 Gy and 78 Gy was conducted. Patients at our institute were asked to participate in an additional part of the trial evaluating sexual function. After exclusion of patients with less than 2 years of follow-up, ED at baseline, or treatment with hormonal therapy, 96 patients were eligible. The proximal corpora cavernosa (crura), the superiormost 1-cm segment of the crura, and the penile bulb were contoured on the planning computed tomography scan and dose-volume parameters were calculated. Results: Two years after EBRT, 35 of the 96 patients had developed ED. No statistically significant correlations between ED 2 years after EBRT and dose-volume parameters of the crura, the superiormost 1-cm segment of the crura, or the penile bulb were found. The few patients using potency aids typically indicated to have ED. Conclusion: No correlation was found between ED after EBRT for prostate cancer and radiation dose to the crura or penile bulb. The present study is the largest study evaluating the correlation between ED and radiation dose to the corpora cavernosa after EBRT for prostate cancer. Until there is clear evidence that sparing the penile bulb or crura will reduce ED after EBRT, we advise to be careful in sparing these structures, especially when this involves reducing treatment margins.

  14. Prenatal development of the normal human vertebral corpora in different segments of the spine.

    PubMed

    Nolting, D; Hansen, B F; Keeling, J; Kjaer, I

    1998-11-01

    Vertebral columns from 13 normal human fetuses (10-24 weeks of gestation) that had aborted spontaneously were investigated as part of the legal autopsy procedure. The investigation included spinal cord analysis. To analyze the formation of the normal human vertebral corpora along the spine, including the early location and disappearance of the notochord. Reference material on the development of the normal human vertebral corpora is needed for interpretation of published observations on prenatal malformations in the spine, which include observations of various types of malformation (anencephaly, spina bifida) and various genotypes (trisomy 18, 21 and 13, as well as triploidy). The vertebral columns were studied by using radiography (Faxitron X-ray apparatus, Faxitron Model 43,855, Hewlett Packard) in lateral, frontal, and axial views and histology (decalcification, followed by toluidine blue and alcian blue staining) in and axial view. Immunohistochemical marking with Keratin Wide Spectrum also was done. Notochordal tissue (positive on marking with Keratin Wide Spectrum [DAKO, Denmark]) was located anterior to the cartilaginous body center in the youngest fetuses. The process of disintegration of the notochord and the morphology of the osseous vertebral corpora in the lumbosacral, thoracic, and cervical segments are described. Marked differences appeared in axial views, which were verified on horizontal histologic sections. Also, the increase in size was different in the different segments, being most pronounced in the thoracic and upper lumbar bodies. The lower thoracic bodies were the first to ossify. The morphologic changes observed by radiography were verified histologically. In this study, normal prenatal standards were established for the early development of the vertebral column. These standards can be used in the future--for evaluation of pathologic deviations in the human vertebral column in the second trimester.

  15. Mice null for Frizzled4 (Fzd4-/-) are infertile and exhibit impaired corpora lutea formation and function.

    PubMed

    Hsieh, Minnie; Boerboom, Derek; Shimada, Masayuki; Lo, Yuet; Parlow, Albert F; Luhmann, Ulrich F O; Berger, Wolfgang; Richards, JoAnne S

    2005-12-01

    Previous studies showed that transcripts encoding specific Wnt ligands and Frizzled receptors including Wnt4, Frizzled1 (Fzd1), and Frizzled4 (Fzd4) were expressed in a cell-specific manner in the adult mouse ovary. Overlapping expression of Wnt4 and Fzd4 mRNA in small follicles and corpora lutea led us to hypothesize that the infertility of mice null for Fzd4 (Fzd4-/-) might involve impaired follicular growth or corpus luteum formation. Analyses at defined stages of reproductive function indicate that immature Fzd4-/- mouse ovaries contain follicles at many stages of development and respond to exogenous hormone treatments in a manner similar to their wild-type littermates, indicating that the processes controlling follicular development and follicular cell responses to gonadotropins are intact. Adult Fzd4-/- mice also exhibit normal mating behavior and ovulate, indicating that endocrine events controlling these processes occur. However, Fzd4-/- mice fail to become pregnant and do not produce offspring. Histological and functional analyses of ovaries from timed mating pairs at Days 1.5-7.5 postcoitus (p.c.) indicate that the corpora lutea of the Fzd4-/- mice do not develop normally. Expression of luteal cell-specific mRNAs (Lhcgr, Prlr, Cyp11a1 and Sfrp4) is reduced, luteal cell morphology is altered, and markers of angiogenesis and vascular formation (Efnb1, Efnb2, Ephb4, Vegfa, Vegfc) are low in the Fzd4-/- mice. Although a recently identified, high-affinity FZD4 ligand Norrin (Norrie disease pseudoglioma homolog) is expressed in the ovary, adult Ndph-/- mice contain functional corpora lutea and do not phenocopy Fzd4-/- mice. Thus, Fzd4 appears to impact the formation of the corpus luteum by mechanisms that more closely phenocopy Prlr null mice.

  16. A rule-based named-entity recognition method for knowledge extraction of evidence-based dietary recommendations

    PubMed Central

    2017-01-01

    Evidence-based dietary information represented as unstructured text is crucial information that must be accessible to help dietitians keep up with the new knowledge that arrives daily in newly published scientific reports. Different named-entity recognition (NER) methods have been introduced previously to extract useful information from the biomedical literature. They focus on, for example, extracting gene mentions, protein mentions, relationships between genes and proteins, chemical concepts, and relationships between drugs and diseases. In this paper, we present a novel NER method, called drNER, for knowledge extraction of evidence-based dietary information. To the best of our knowledge, this is the first attempt at extracting dietary concepts. DrNER is a rule-based NER method that consists of two phases. The first involves the detection and determination of entity mentions, and the second involves the selection and extraction of the entities. We evaluate the method using text corpora from heterogeneous sources, including text from several scientifically validated web sites and text from scientific publications. Evaluation of the method showed that drNER gives good results and can be used for knowledge extraction of evidence-based dietary recommendations. PMID:28644863

  17. Evaluating Hierarchical Structure in Music Annotations

    PubMed Central

    McFee, Brian; Nieto, Oriol; Farbood, Morwaread M.; Bello, Juan Pablo

    2017-01-01

    Music exhibits structure at multiple scales, ranging from motifs to large-scale functional components. When inferring the structure of a piece, different listeners may attend to different temporal scales, which can result in disagreements when they describe the same piece. In the field of music informatics research (MIR), it is common to use corpora annotated with structural boundaries at different levels. By quantifying disagreements between multiple annotators, previous research has yielded several insights relevant to the study of music cognition. First, annotators tend to agree when structural boundaries are ambiguous. Second, this ambiguity seems to depend on musical features, time scale, and genre. Furthermore, it is possible to tune current annotation evaluation metrics to better align with these perceptual differences. However, previous work has not directly analyzed the effects of hierarchical structure because the existing methods for comparing structural annotations are designed for “flat” descriptions, and do not readily generalize to hierarchical annotations. In this paper, we extend and generalize previous work on the evaluation of hierarchical descriptions of musical structure. We derive an evaluation metric which can compare hierarchical annotations holistically across multiple levels. Using this metric, we investigate inter-annotator agreement on the multilevel annotations of two different music corpora, investigate the influence of acoustic properties on hierarchical annotations, and evaluate existing hierarchical segmentation algorithms against the distribution of inter-annotator agreement. PMID:28824514

  18. Heroes or Health Victims?: Exploring How the Elite Media Frames Veterans on Veterans Day.

    PubMed

    Rhidenour, Kayla B; Barrett, Ashley K; Blackburn, Kate G

    2017-11-27

    We examine the frames the elite news media uses to portray veterans on and surrounding Veterans Day 2012, 2013, 2014, and 2015. We use mental health illness and media framing literature to explore how, why, and to what extent Veterans Day news coverage uses different media frames across the four consecutive years. We compiled a Media Coverage Corpora for each year, which contains the quotes and paraphrased remarks used in all veterans news stories for that year. In our primary study, we applied the meaning extraction method (MEM) to extract emergent media frames for Veterans Day 2014 and compiled a word frequency list, which captures the words most commonly used within the corpora. In post hoc analyses, we collected news stories and compiled word frequency lists for Veterans Day 2012, 2013, and 2015. Our findings reveal dissenting frames across 2012, 2013, and 2014 Veterans Day media coverage. Word frequency results suggest the 2012 and 2013 media frames largely celebrate Veterans as heroes, but the 2014 coverage depicts veterans as victimized by their wartime experiences. Furthermore, our results demonstrate how the prevailing 2015 media frames could be a reaction to 2014 frames that portrayed veterans as health victims. We consider the ramifications of this binary portrayal of veterans as either health victims or heroes and discuss the implications of these dueling frames for veterans' access to healthcare resources.

  19. Scavenger receptor-B1 and luteal function in mice.

    PubMed

    Jiménez, Leonor Miranda; Binelli, Mario; Bertolin, Kalyne; Pelletier, R Marc; Murphy, Bruce D

    2010-08-01

    During luteinization, circulating high-density lipoproteins supply cholesterol to ovarian cells via the scavenger receptor-B1 (SCARB1). In the mouse, SCARB1 is expressed in cytoplasm and periphery of theca, granulosa, and cumulus cells of developing follicles and increases dramatically during formation of corpora lutea. Blockade of ovulation in mice with meloxicam, a prostaglandin synthase-2 inhibitor, resulted in follicles with oocytes entrapped in unexpanded cumulus complexes and with granulosa cells with luteinized morphology and expressing SCARB1 characteristic of luteinization. Mice bearing null mutation of the Scarb1 gene (SCARB1(-/-)) had ovaries with small corpora lutea, large follicles with hypertrophied theca cells, and follicular cysts with blood-filled cavities. Plasma progesterone concentrations were decreased 50% in mice with Scarb1 gene disruption. When SCARB1(-/-) mice were treated with a combination of mevinolin [an inhibitor of 3-hydroxy-3-methylglutaryl CoA reductase (HMGR)] and chloroquine (an inhibitor of lysosomal processing of low-density lipoproteins), serum progesterone was further reduced. HMGR protein expression increased in SCARB1(-/-) mice, independent of treatment. It was concluded that theca, granulosa, and cumulus cells express SCARB1 during follicle development, but maximum expression depends on luteinization. Knockout of SCARB1(-/-) leads to ovarian pathology and suboptimal luteal steroidogenesis. Therefore, SCARB1 expression is essential for maintaining normal ovarian cholesterol homeostasis and luteal steroid synthesis.

  20. NLP-PIER: A Scalable Natural Language Processing, Indexing, and Searching Architecture for Clinical Notes

    PubMed Central

    McEwan, Reed; Melton, Genevieve B.; Knoll, Benjamin C.; Wang, Yan; Hultman, Gretchen; Dale, Justin L.; Meyer, Tim; Pakhomov, Serguei V.

    2016-01-01

    Many design considerations must be addressed in order to provide researchers with full text and semantic search of unstructured healthcare data such as clinical notes and reports. Institutions looking at providing this functionality must also address the big data aspects of their unstructured corpora. Because these systems are complex and demand a non-trivial investment, there is an incentive to make the system capable of servicing future needs as well, further complicating the design. We present architectural best practices as lessons learned in the design and implementation of NLP-PIER (Patient Information Extraction for Research), a scalable, extensible, and secure system for processing, indexing, and searching clinical notes at the University of Minnesota. PMID:27570663

  1. Reflective Random Indexing and indirect inference: a scalable method for discovery of implicit connections.

    PubMed

    Cohen, Trevor; Schvaneveldt, Roger; Widdows, Dominic

    2010-04-01

    The discovery of implicit connections between terms that do not occur together in any scientific document underlies the model of literature-based knowledge discovery first proposed by Swanson. Corpus-derived statistical models of semantic distance such as Latent Semantic Analysis (LSA) have been evaluated previously as methods for the discovery of such implicit connections. However, LSA in particular is dependent on a computationally demanding method of dimension reduction as a means to obtain meaningful indirect inference, limiting its ability to scale to large text corpora. In this paper, we evaluate the ability of Random Indexing (RI), a scalable distributional model of word associations, to draw meaningful implicit relationships between terms in general and biomedical language. Proponents of this method have achieved comparable performance to LSA on several cognitive tasks while using a simpler and less computationally demanding method of dimension reduction than LSA employs. In this paper, we demonstrate that the original implementation of RI is ineffective at inferring meaningful indirect connections, and evaluate Reflective Random Indexing (RRI), an iterative variant of the method that is better able to perform indirect inference. RRI is shown to lead to more clearly related indirect connections and to outperform existing RI implementations in the prediction of future direct co-occurrence in the MEDLINE corpus. 2009 Elsevier Inc. All rights reserved.
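
    The difference between plain Random Indexing and a reflective variant can be sketched on a toy corpus. This is a simplified illustration of the general idea, not the authors' implementation: the dimensionality, the number of non-zero entries, the two-document corpus, and the absence of any term weighting or normalization are all assumptions. Terms "a" and "c" never share a document, but both co-occur with "b".

```python
import math
import random
from typing import Dict, List

DIM = 1000    # dimensionality of the reduced space (illustrative value)
NONZERO = 10  # number of +1/-1 entries in each random index vector


def random_index_vector(rng: random.Random) -> List[float]:
    """Sparse ternary vector: mostly zeros, a few random +1/-1 entries."""
    vec = [0.0] * DIM
    for pos in rng.sample(range(DIM), NONZERO):
        vec[pos] = rng.choice((-1.0, 1.0))
    return vec


def add_into(total: List[float], vec: List[float]) -> None:
    """Accumulate vec into total in place."""
    for i, value in enumerate(vec):
        total[i] += value


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def term_vectors(docs: List[List[str]], doc_vecs: List[List[float]]) -> Dict[str, List[float]]:
    """Each term vector is the sum of the vectors of the documents containing it."""
    terms: Dict[str, List[float]] = {}
    for doc, dvec in zip(docs, doc_vecs):
        for term in set(doc):
            add_into(terms.setdefault(term, [0.0] * DIM), dvec)
    return terms


def doc_vectors(docs: List[List[str]], terms: Dict[str, List[float]]) -> List[List[float]]:
    """Each document vector is the sum of the vectors of its terms."""
    out = []
    for doc in docs:
        dvec = [0.0] * DIM
        for term in set(doc):
            add_into(dvec, terms[term])
        out.append(dvec)
    return out


# Toy corpus: "a" and "c" never co-occur, but both co-occur with "b".
docs = [["a", "b"], ["b", "c"]]
rng = random.Random(0)
index_vecs = [random_index_vector(rng) for _ in docs]

basic = term_vectors(docs, index_vecs)                     # plain Random Indexing
reflected = term_vectors(docs, doc_vectors(docs, basic))   # one reflective cycle

basic_sim = cosine(basic["a"], basic["c"])
reflective_sim = cosine(reflected["a"], reflected["c"])
```

    In the plain model, "a" and "c" inherit two different, nearly orthogonal random document vectors, so their similarity stays near zero. After one reflective cycle, information from the shared neighbor "b" flows into both term vectors, so the indirect connection becomes visible as a clearly higher cosine.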

  2. Detecting evolutionary forces in language change.

    PubMed

    Newberry, Mitchell G; Ahern, Christopher A; Clark, Robin; Plotkin, Joshua B

    2017-11-09

    Both language and genes evolve by transmission over generations with opportunity for differential replication of forms. The understanding that gene frequencies change at random by genetic drift, even in the absence of natural selection, was a seminal advance in evolutionary biology. Stochastic drift must also occur in language as a result of randomness in how linguistic forms are copied between speakers. Here we quantify the strength of selection relative to stochastic drift in language evolution. We use time series derived from large corpora of annotated texts dating from the 12th to 21st centuries to analyse three well-known grammatical changes in English: the regularization of past-tense verbs, the introduction of the periphrastic 'do', and variation in verbal negation. We reject stochastic drift in favour of selection in some cases but not in others. In particular, we infer selection towards the irregular forms of some past-tense verbs, which is likely driven by changing frequencies of rhyming patterns over time. We show that stochastic drift is stronger for rare words, which may explain why rare forms are more prone to replacement than common ones. This work provides a method for testing selective theories of language change against a null model and reveals an underappreciated role for stochasticity in language evolution.

  3. Performance analysis of Supply Chain Management with Supply Chain Operation reference model

    NASA Astrophysics Data System (ADS)

    Hasibuan, Abdurrozzaq; Arfah, Mahrani; Parinduri, Luthfi; Hernawati, Tri; Suliawati; Harahap, Bonar; Rahmah Sibuea, Siti; Krianto Sulaiman, Oris; purwadi, Adi

    2018-04-01

    This research was conducted at PT. Shamrock Manufacturing Corpora, a company that is required to think creatively in implementing a competitive strategy by producing goods/services of higher quality at lower cost. It is therefore necessary to measure the performance of Supply Chain Management in order to improve competitiveness, and the company is required to optimize its production output to meet the export quality standard. This research begins with the creation of initial dimensions based on the Supply Chain Management processes, i.e., Plan, Source, Make, Delivery, and Return, with a hierarchy based on the Supply Chain Operation Reference attributes, namely Reliability, Responsiveness, Agility, Cost, and Asset. Key Performance Indicator identification serves as a benchmark in performance measurement, whereas Snorm De Boer normalization serves to equalize Key Performance Indicator values. The Analytical Hierarchy Process is used to assist in determining priority criteria. Measurement of Supply Chain Management performance at PT. Shamrock Manufacturing Corpora shows that SC Responsiveness (0.649) has a higher weight (priority) than the other alternatives. The performance analysis using the Supply Chain Operation Reference model indicates that Supply Chain Management performance at PT. Shamrock Manufacturing Corpora is good, because its score falls in the 50-100 range of the monitoring system, which is rated good.

  4. Positivity of the English Language

    PubMed Central

    Kloumann, Isabel M.; Danforth, Christopher M.; Harris, Kameron Decker; Bliss, Catherine A.; Dodds, Peter Sheridan

    2012-01-01

    Over the last million years, human language has emerged and evolved as a fundamental instrument of social communication and semiotic representation. People use language in part to convey emotional information, leading to the central and contingent questions: (1) What is the emotional spectrum of natural language? and (2) Are natural languages neutrally, positively, or negatively biased? Here, we report that the human-perceived positivity of over 10,000 of the most frequently used English words exhibits a clear positive bias. More deeply, we characterize and quantify distributions of word positivity for four large and distinct corpora, demonstrating that their form is broadly invariant with respect to frequency of word use. PMID:22247779

  5. Detection of IUPAC and IUPAC-like chemical names

    PubMed Central

    Klinger, Roman; Kolářik, Corinna; Fluck, Juliane; Hofmann-Apitius, Martin; Friedrich, Christoph M.

    2008-01-01

    Motivation: Chemical compounds like small signal molecules or other biologically active chemical substances are an important entity class in life science publications and patents. Several representations and nomenclatures for chemicals like SMILES, InChI, IUPAC or trivial names exist. Only SMILES and InChI names allow a direct structure search, but in biomedical texts trivial names and IUPAC-like names are used more frequently. While trivial names can be found with a dictionary-based approach and in such a way mapped to their corresponding structures, it is not possible to enumerate all IUPAC names. In this work, we present a new machine learning approach based on conditional random fields (CRF) to find mentions of IUPAC and IUPAC-like names in scientific text as well as its evaluation and the conversion rate with available name-to-structure tools. Results: We present an IUPAC name recognizer with an F1 measure of 85.6% on a MEDLINE corpus. The evaluation of different CRF orders and offset conjunction orders demonstrates the importance of these parameters. An evaluation of hand-selected patent sections containing large enumerations and terms with mixed nomenclature shows a good performance on these cases (F1 measure 81.5%). Remaining recognition problems are to detect correct borders of the typically long terms, especially when occurring in parentheses or enumerations. We demonstrate the scalability of our implementation by providing results from a full MEDLINE run. Availability: We plan to publish the corpora, annotation guideline as well as the conditional random field model as a UIMA component. Contact: roman.klinger@scai.fraunhofer.de PMID:18586724

  6. A novel procedure on next generation sequencing data analysis using text mining algorithm.

    PubMed

    Zhao, Weizhong; Chen, James J; Perkins, Roger; Wang, Yuping; Liu, Zhichao; Hong, Huixiao; Tong, Weida; Zou, Wen

    2016-05-13

    Next-generation sequencing (NGS) technologies have provided researchers with vast possibilities in various biological and biomedical research areas. Efficient data mining strategies are in high demand for large scale comparative and evolutional studies to be performed on the large amounts of data derived from NGS projects. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. We report a novel procedure to analyse NGS data using topic modeling. It consists of four major procedures: NGS data retrieval, preprocessing, topic modeling, and data mining using Latent Dirichlet Allocation (LDA) topic outputs. The NGS data set of the Salmonella enterica strains were used as a case study to show the workflow of this procedure. The perplexity measurement of the topic numbers and the convergence efficiencies of Gibbs sampling were calculated and discussed for achieving the best result from the proposed procedure. The output topics by LDA algorithms could be treated as features of Salmonella strains to accurately describe the genetic diversity of fliC gene in various serotypes. The results of a two-way hierarchical clustering and data matrix analysis on LDA-derived matrices successfully classified Salmonella serotypes based on the NGS data. The implementation of topic modeling in NGS data analysis provides a new way to elucidate genetic information from NGS data, and identify the gene-phenotype relationships and biomarkers, especially in the era of biological and medical big data.

  7. Bengali-English Relevant Cross Lingual Information Access Using Finite Automata

    NASA Astrophysics Data System (ADS)

    Banerjee, Avishek; Bhattacharyya, Swapan; Hazra, Simanta; Mondal, Shatabdi

    2010-10-01

    CLIR techniques search unrestricted texts and typically extract terms and relationships from bilingual electronic dictionaries or bilingual text collections, then use them to translate query and/or document representations into a compatible set of representations with a common feature set. In this paper, we focus on a dictionary-based approach, using a bilingual data dictionary in combination with statistics-based methods to avoid the problem of ambiguity; the paper also addresses the development of human-computer interface aspects of NLP (natural language processing). Intelligent web search in a regional language such as Bengali depends on two major aspects: CLIA (cross-language information access) and NLP. In our previous work with IIT, KGP, we developed content-based CLIA in which content-based searching is trained on Bengali corpora with the help of a Bengali data dictionary. Here we introduce intelligent search, which recognizes the sense of a sentence and offers a more realistic approach to human-computer interaction.

  8. Automated extraction and semantic analysis of mutation impacts from the biomedical literature

    PubMed Central

    2012-01-01

    Background Mutations as sources of evolution have long been the focus of attention in the biomedical literature. Accessing the mutational information and their impacts on protein properties facilitates research in various domains, such as enzymology and pharmacology. However, manually curating the rich and fast growing repository of biomedical literature is expensive and time-consuming. As a solution, text mining approaches have increasingly been deployed in the biomedical domain. While the detection of single-point mutations is well covered by existing systems, challenges still exist in grounding impacts to their respective mutations and recognizing the affected protein properties, in particular kinetic and stability properties together with physical quantities. Results We present an ontology model for mutation impacts, together with a comprehensive text mining system for extracting and analysing mutation impact information from full-text articles. Organisms, as sources of proteins, are extracted to help disambiguation of genes and proteins. Our system then detects mutation series to correctly ground detected impacts using novel heuristics. It also extracts the affected protein properties, in particular kinetic and stability properties, as well as the magnitude of the effects and validates these relations against the domain ontology. The output of our system can be provided in various formats, in particular by populating an OWL-DL ontology, which can then be queried to provide structured information. The performance of the system is evaluated on our manually annotated corpora. In the impact detection task, our system achieves a precision of 70.4%-71.1%, a recall of 71.3%-71.5%, and grounds the detected impacts with an accuracy of 76.5%-77%. The developed system, including resources, evaluation data and end-user and developer documentation is freely available under an open source license at http://www.semanticsoftware.info/open-mutation-miner. 
    Conclusion We present Open Mutation Miner (OMM), the first comprehensive, fully open-source approach to automatically extract impacts and related relevant information from the biomedical literature. We assessed the performance of our work on manually annotated corpora and the results show the reliability of our approach. The representation of the extracted information into a structured format facilitates knowledge management and aids in database curation and correction. Furthermore, access to the analysis results is provided through multiple interfaces, including web services for automated data integration and desktop-based solutions for end user interactions. PMID:22759648

  9. Circumferential Peyronie's disease involving both the corpora cavernosa.

    PubMed

    Narita, T; Kudo, H; Matsumoto, K

    1995-05-01

    An extraordinary form of Peyronie's disease is reported. The patient was a 52 year old male, who died of a malignant thymoma with multiple bone metastases, extensive pleural carcinomatosis of the left lung and some metastatic nodules in the liver and the mesenterium. At autopsy, the proximal and middle portions of the penis were very hard. Macroscopically, the entire tunica albuginea of both the corpora cavernosa was markedly thickened, 2-4 mm, and calcified. Microscopically, the tunica albuginea showed extensive hyaline degeneration, calcification and ossifying foci with osteoblasts and osteoclasts. Inflammatory cells were frequently found beneath the thickened tunica albuginea. In the corpus cavernosum, cavernous arteries showed marked intimal thickening and medial muscular degeneration with a few inflammatory cells. Smooth muscles of the stroma were extensively atrophic and degenerative, and some of them were infiltrated with a few inflammatory cells. In the corpus spongiosum, the tunica albuginea was not thickened, but the smooth muscle in the stroma was atrophic and degenerative and a few inflammatory cells were also found. Surprisingly, there was no Littré's gland around the urethra. In Peyronie's disease, the dorsal part of the penis is usually involved, and less frequently lateral or ventral sites are involved. The circumferential involvement of both the corpora cavernosa has not been reported until now, as far as the authors know.

  10. A universal multilingual weightless neural network tagger via quantitative linguistics.

    PubMed

    Carneiro, Hugo C C; Pedreira, Carlos E; França, Felipe M G; Lima, Priscila M V

    2017-07-01

    In the last decade, given the availability of corpora in several distinct languages, research on multilingual part-of-speech tagging started to grow. Amongst the novelties there is mWANN-Tagger (multilingual weightless artificial neural network tagger), a weightless neural part-of-speech tagger capable of being used for mostly-suffix-oriented languages. The tagger was subjected to corpora in eight languages of quite distinct natures and had a remarkable accuracy with very low sample deviation in every one of them, indicating the robustness of weightless neural systems for part-of-speech tagging tasks. However, mWANN-Tagger needed to be tuned for every new corpus, since each one required a different parameter configuration. For mWANN-Tagger to be truly multilingual, it should be usable for any new language with no need of parameter tuning. This article proposes a study that aims to find a relation between the lexical diversity of a language and the parameter configuration that would produce the best performing mWANN-Tagger instance. Preliminary analyses suggested that a single parameter configuration may be applied to the eight aforementioned languages. The mWANN-Tagger instance produced by this configuration was as accurate as the language-dependent ones obtained through tuning. Afterwards, the weightless neural tagger was further subjected to new corpora in languages that range from very isolating to polysynthetic ones. The best performing instances of mWANN-Tagger are again the ones produced by the universal parameter configuration. Hence, mWANN-Tagger can be applied to new corpora with no need of parameter tuning, making it a universal multilingual part-of-speech tagger. Further experiments with Universal Dependencies treebanks reveal that mWANN-Tagger may be extended and that it has potential to outperform most state-of-the-art part-of-speech taggers if better word representations are provided. Copyright © 2017 Elsevier Ltd. All rights reserved.

  11. Desiderata for ontologies to be used in semantic annotation of biomedical documents.

    PubMed

    Bada, Michael; Hunter, Lawrence

    2011-02-01

    A wealth of knowledge valuable to the translational research scientist is contained within the vast biomedical literature, but this knowledge is typically in the form of natural language. Sophisticated natural-language-processing systems are needed to translate text into unambiguous formal representations grounded in high-quality consensus ontologies, and these systems in turn rely on gold-standard corpora of annotated documents for training and testing. To this end, we are constructing the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-text biomedical journal articles that are being manually annotated with the entire sets of terms from select vocabularies, predominantly from the Open Biomedical Ontologies (OBO) library. Our efforts in building this corpus have illuminated infelicities of these ontologies with respect to the semantic annotation of biomedical documents, and we propose desiderata whose implementation could substantially improve their utility in this task; these include the integration of overlapping terms across OBOs, the resolution of OBO-specific ambiguities, the integration of the BFO with the OBOs and the use of mid-level ontologies, the inclusion of noncanonical instances, and the expansion of relations and realizable entities. Copyright © 2010 Elsevier Inc. All rights reserved.

  12. Social Media and Language Processing: How Facebook and Twitter Provide the Best Frequency Estimates for Studying Word Recognition.

    PubMed

    Herdağdelen, Amaç; Marelli, Marco

    2017-05-01

    Corpus-based word frequencies are one of the most important predictors in language processing tasks. Frequencies based on conversational corpora (such as movie subtitles) are shown to better capture the variance in lexical decision tasks compared to traditional corpora. In this study, we show that frequencies computed from social media are currently the best frequency-based estimators of lexical decision reaction times (up to 3.6% increase in explained variance). The results are robust (observed for Twitter- and Facebook-based frequencies on American English and British English datasets) and are still substantial when we control for corpus size. © 2016 The Authors. Cognitive Science published by Wiley Periodicals, Inc. on behalf of Cognitive Science Society.

  13. Toward a complete dataset of drug-drug interaction information from publicly available sources.

    PubMed

    Ayvaz, Serkan; Horn, John; Hassanzadeh, Oktie; Zhu, Qian; Stan, Johann; Tatonetti, Nicholas P; Vilar, Santiago; Brochhausen, Mathias; Samwald, Matthias; Rastegar-Mojarad, Majid; Dumontier, Michel; Boyce, Richard D

    2015-06-01

    Although potential drug-drug interactions (PDDIs) are a significant source of preventable drug-related harm, there is currently no single complete source of PDDI information. In the current study, all publicly available sources of PDDI information that could be identified using a comprehensive and broad search were combined into a single dataset. The combined dataset merged fourteen different sources including 5 clinically-oriented information sources, 4 Natural Language Processing (NLP) Corpora, and 5 Bioinformatics/Pharmacovigilance information sources. As a comprehensive PDDI source, the merged dataset might benefit the pharmacovigilance text mining community by making it possible to compare the representativeness of NLP corpora for PDDI text extraction tasks, and specifying elements that can be useful for future PDDI extraction purposes. An analysis of the overlap between and across the data sources showed that there was little overlap. Even comprehensive PDDI lists such as DrugBank, KEGG, and the NDF-RT had less than 50% overlap with each other. Moreover, all of the comprehensive lists had incomplete coverage of two data sources that focus on PDDIs of interest in most clinical settings. Based on this information, we think that systems that provide access to the comprehensive lists, such as APIs into RxNorm, should be careful to inform users that the lists may be incomplete with respect to PDDIs that drug experts suggest clinicians be aware of. In spite of the low degree of overlap, several dozen cases were identified where PDDI information provided in drug product labeling might be augmented by the merged dataset. Moreover, the combined dataset was also shown to improve the performance of an existing PDDI NLP pipeline and a recently published PDDI pharmacovigilance protocol. Future work will focus on improvement of the methods for mapping between PDDI information sources, identifying methods to improve the use of the merged dataset in PDDI NLP algorithms, integrating high-quality PDDI information from the merged dataset into Wikidata, and making the combined dataset accessible as Semantic Web Linked Data. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
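
    The abstract reports pairwise overlap between sources without defining how it was computed. One plausible reading, sketched here under that assumption, treats each PDDI as an unordered pair of drug names and reports the share of one source's pairs that also occur in another. The helper names and the placeholder drug names are hypothetical and do not come from the study.

```python
def pair(drug_a: str, drug_b: str) -> frozenset:
    """Order-independent, case-insensitive key for an interaction between two drugs."""
    return frozenset((drug_a.lower(), drug_b.lower()))


def overlap_pct(source: set, other: set) -> float:
    """Percentage of interactions in `source` that also appear in `other`."""
    if not source:
        return 0.0
    return 100.0 * len(source & other) / len(source)


# Placeholder drug names, for illustration only (not real interaction data).
source_a = {pair("A", "B"), pair("C", "D"), pair("E", "F"), pair("G", "H")}
source_b = {pair("b", "a"), pair("E", "X")}

a_in_b = overlap_pct(source_a, source_b)  # 1 of 4 pairs is shared
b_in_a = overlap_pct(source_b, source_a)  # 1 of 2 pairs is shared
```

    The measure is asymmetric: a small source can be largely contained in a large one while the reverse overlap stays low, which is why both directions are computed. Real sources would also need drug names mapped to common identifiers before comparison.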

  14. Normalization of relative and incomplete temporal expressions in clinical narratives.

    PubMed

    Sun, Weiyi; Rumshisky, Anna; Uzuner, Ozlem

    2015-09-01

    To improve the normalization of relative and incomplete temporal expressions (RI-TIMEXes) in clinical narratives. We analyzed the RI-TIMEXes in temporally annotated corpora and propose two hypotheses regarding the normalization of RI-TIMEXes in the clinical narrative domain: the anchor point hypothesis and the anchor relation hypothesis. We annotated the RI-TIMEXes in three corpora to study the characteristics of RI-TIMEXes in different domains. This informed the design of our RI-TIMEX normalization system for the clinical domain, which consists of an anchor point classifier, an anchor relation classifier, and a rule-based RI-TIMEX text span parser. We experimented with different feature sets and performed an error analysis for each system component. The annotation confirmed the hypotheses that we can simplify the RI-TIMEXes normalization task using two multi-label classifiers. Our system achieves anchor point classification, anchor relation classification, and rule-based parsing accuracy of 74.68%, 87.71%, and 57.2% (82.09% under relaxed matching criteria), respectively, on the held-out test set of the 2012 i2b2 temporal relation challenge. Experiments with feature sets reveal some interesting findings, such as: the verbal tense feature does not inform the anchor relation classification in clinical narratives as much as the tokens near the RI-TIMEX. Error analysis showed that underrepresented anchor point and anchor relation classes are difficult to detect. We formulate the RI-TIMEX normalization problem as a pair of multi-label classification problems. Considering only RI-TIMEX extraction and normalization, the system achieves statistically significant improvement over the RI-TIMEX results of the best systems in the 2012 i2b2 challenge. © The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  15. Scaling-up NLP Pipelines to Process Large Corpora of Clinical Notes.

    PubMed

    Divita, G; Carter, M; Redd, A; Zeng, Q; Gupta, K; Trautner, B; Samore, M; Gundlapalli, A

    2015-01-01

    This article is part of the Focus Theme of Methods of Information in Medicine on "Big Data and Analytics in Healthcare". This paper describes the scale-up efforts at the VA Salt Lake City Health Care System to address processing large corpora of clinical notes through a natural language processing (NLP) pipeline. The use case described is a current project focused on detecting the presence of an indwelling urinary catheter in hospitalized patients and subsequent catheter-associated urinary tract infections. An NLP algorithm using v3NLP was developed to detect the presence of an indwelling urinary catheter in hospitalized patients. The algorithm was tested on a small corpus of notes on patients for whom the presence or absence of a catheter was already known (reference standard). In planning for a scale-up, we estimated that the original algorithm would have taken 2.4 days to run on a larger corpus of notes for this project (550,000 notes), and 27 days for a corpus of 6 million records representative of a national sample of notes. We approached scaling-up NLP pipelines through three techniques: pipeline replication via multi-threading, intra-annotator threading for tasks that can be further decomposed, and remote annotator services which enable annotator scale-out. The scale-up resulted in reducing the average time to process a record from 206 milliseconds to 17 milliseconds, or a 12-fold increase in performance, when applied to a corpus of 550,000 notes. Purposely simplistic in nature, these scale-up efforts are the straightforward evolution from small scale NLP processing to larger scale extraction without incurring associated complexities that are inherited by the use of the underlying UIMA framework. These efforts represent generalizable and widely applicable techniques that will aid other computationally complex NLP pipelines that need to be scaled out for processing and analyzing big data.
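
    The first of the three techniques, pipeline replication via multi-threading, can be sketched roughly as follows. This is not the v3NLP implementation: `annotate` is a hypothetical stand-in for one pipeline run on a single note, the keyword rule and the sample notes are invented, and Python is used purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def annotate(note: str) -> bool:
    """Hypothetical stand-in for one NLP pipeline run on a single note."""
    text = note.lower()
    return "catheter" in text or "foley" in text


def run_replicated(notes, workers=4):
    """Spread the notes across worker threads, each acting as a pipeline replica."""
    # pool.map returns results in the same order as the input notes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(annotate, notes))


# Invented sample notes, for illustration only.
notes = [
    "Foley catheter in place.",
    "No acute distress.",
    "Indwelling catheter removed.",
]
flags = run_replicated(notes)

# Per-record times reported in the abstract, in milliseconds.
speedup = 206 / 17
```

    The ratio of the two reported per-record times is about 12.1, in line with the stated 12-fold gain. In CPython, threads mainly help with I/O-bound work, so a CPU-bound annotator would more likely be replicated with processes.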

  16. Text mining, a race against time? An attempt to quantify possible variations in text corpora of medical publications throughout the years.

    PubMed

    Wagner, Mathias; Vicinus, Benjamin; Muthra, Sherieda T; Richards, Tereza A; Linder, Roland; Frick, Vilma Oliveira; Groh, Andreas; Rubie, Claudia; Weichert, Frank

    2016-06-01

    The continuous growth of medical sciences literature indicates the need for automated text analysis. Scientific writing which is neither unitary, transcending social situation nor defined by a timeless idea is subject to constant change as it develops in response to evolving knowledge, aims at different goals, and embodies different assumptions about nature and communication. The objective of this study was to evaluate whether publication dates should be considered when performing text mining. A search of PUBMED for combined references to chemokine identifiers and particular cancer related terms was conducted to detect changes over the past 36 years. Text analyses were performed using freeware available from the World Wide Web. TOEFL Scores of territories hosting institutional affiliations as well as various readability indices were investigated. Further assessment was conducted using Principal Component Analysis. Laboratory examination was performed to evaluate the quality of attempts to extract content from the examined linguistic features. The PUBMED search yielded a total of 14,420 abstracts (3,190,219 words). The range of findings in laboratory experimentation was coherent with the variability of the results described in the analyzed body of literature. Increased concurrence of chemokine identifiers together with cancer related terms was found at the abstract and sentence level, whereas complexity of sentences remained fairly stable. The findings of the present study indicate that concurrent references to chemokines and cancer increased over time whereas text complexity remained stable. Copyright © 2016 Elsevier Ltd. All rights reserved.

  17. Penile embryology and anatomy.

    PubMed

    Yiee, Jenny H; Baskin, Laurence S

    2010-06-29

    Knowledge of penile embryology and anatomy is essential to any pediatric urologist in order to fully understand and treat congenital anomalies. Sex differentiation of the external genitalia occurs between the 7th and 17th weeks of gestation. The Y chromosome initiates male differentiation through the SRY gene, which triggers testicular development. Under the influence of androgens produced by the testes, external genitalia then develop into the penis and scrotum. Dorsal nerves supply penile skin sensation and lie within Buck's fascia. These nerves are notably absent at the 12 o'clock position. Perineal nerves supply skin sensation to the ventral shaft skin and frenulum. Cavernosal nerves lie within the corpora cavernosa and are responsible for sexual function. Paired cavernosal, dorsal, and bulbourethral arteries have extensive anastomotic connections. During erection, the cavernosal artery causes engorgement of the cavernosa, while the deep dorsal artery leads to glans enlargement. The majority of venous drainage occurs through a single, deep dorsal vein into which multiple emissary veins from the corpora and circumflex veins from the spongiosum drain. The corpora cavernosa and spongiosum are all made of spongy erectile tissue. Buck's fascia circumferentially envelops all three structures, splitting into two leaves ventrally at the spongiosum. The male urethra is composed of six parts: bladder neck, prostatic, membranous, bulbous, penile, and fossa navicularis. The urethra receives its blood supply from both proximal and distal directions.

  18. Penile Embryology and Anatomy

    PubMed Central

    Yiee, Jenny H.; Baskin, Laurence S.

    2010-01-01

    Knowledge of penile embryology and anatomy is essential to any pediatric urologist in order to fully understand and treat congenital anomalies. Sex differentiation of the external genitalia occurs between the 7th and 17th weeks of gestation. The Y chromosome initiates male differentiation through the SRY gene, which triggers testicular development. Under the influence of androgens produced by the testes, external genitalia then develop into the penis and scrotum. Dorsal nerves supply penile skin sensation and lie within Buck's fascia. These nerves are notably absent at the 12 o'clock position. Perineal nerves supply skin sensation to the ventral shaft skin and frenulum. Cavernosal nerves lie within the corpora cavernosa and are responsible for sexual function. Paired cavernosal, dorsal, and bulbourethral arteries have extensive anastomotic connections. During erection, the cavernosal artery causes engorgement of the cavernosa, while the deep dorsal artery leads to glans enlargement. The majority of venous drainage occurs through a single, deep dorsal vein into which multiple emissary veins from the corpora and circumflex veins from the spongiosum drain. The corpora cavernosa and spongiosum are all made of spongy erectile tissue. Buck's fascia circumferentially envelops all three structures, splitting into two leaves ventrally at the spongiosum. The male urethra is composed of six parts: bladder neck, prostatic, membranous, bulbous, penile, and fossa navicularis. The urethra receives its blood supply from both proximal and distal directions. PMID:20602076

  19. Juvenile Hormone Biosynthesis Gene Expression in the corpora allata of Honey Bee (Apis mellifera L.) Female Castes

    PubMed Central

    Rosa, Gustavo Conrado Couto; Moda, Livia Maria; Martins, Juliana Ramos; Bitondi, Márcia Maria Gentile; Hartfelder, Klaus; Simões, Zilá Luz Paulino

    2014-01-01

    Juvenile hormone (JH) controls key events in the honey bee life cycle, viz. caste development and age polyethism. We quantified transcript abundance of 24 genes involved in the JH biosynthetic pathway in the corpora allata-corpora cardiaca (CA-CC) complex. The expression of six of these genes showing relatively high transcript abundance was contrasted with CA size, hemolymph JH titer, as well as JH degradation rates and JH esterase (jhe) transcript levels. Gene expression did not match the contrasting JH titers in queen and worker fourth instar larvae, but jhe transcript abundance and JH degradation rates were significantly lower in queen larvae. Consequently, transcriptional control of JHE is of importance in regulating larval JH titers and caste development. In contrast, the same analyses applied to adult worker bees allowed us to infer that the high JH levels in foragers are due to increased JH synthesis. Upon RNAi-mediated silencing of the methyl farnesoate epoxidase gene (mfe) encoding the enzyme that catalyzes methyl farnesoate-to-JH conversion, the JH titer was decreased, thus corroborating that JH titer regulation in adult honey bees depends on this final JH biosynthesis step. The molecular pathway differences underlying JH titer regulation in larval caste development versus adult age polyethism lead us to propose that mfe and jhe genes be assayed when addressing questions on the role(s) of JH in social evolution. PMID:24489805

  20. Evaluating the state of the art in coreference resolution for electronic medical records

    PubMed Central

    Bodnari, Andreea; Shen, Shuying; Forbush, Tyler; Pestian, John; South, Brett R

    2012-01-01

    Background: The fifth i2b2/VA Workshop on Natural Language Processing Challenges for Clinical Records conducted a systematic review on resolution of noun phrase coreference in medical records. Informatics for Integrating Biology and the Bedside (i2b2) and the Veterans Affairs (VA) Consortium for Healthcare Informatics Research (CHIR) partnered to organize the coreference challenge. They provided the research community with two corpora of medical records for the development and evaluation of the coreference resolution systems. These corpora contained various record types (ie, discharge summaries, pathology reports) from multiple institutions. Methods: The coreference challenge provided the community with two annotated ground truth corpora and evaluated systems on coreference resolution in two ways: first, it evaluated systems for their ability to identify mentions of concepts and to link together those mentions. Second, it evaluated the ability of the systems to link together ground truth mentions that refer to the same entity. Twenty teams representing 29 organizations and nine countries participated in the coreference challenge. Results: The teams' system submissions showed that machine-learning and rule-based approaches worked best when augmented with external knowledge sources and coreference clues extracted from document structure. The systems performed better in coreference resolution when provided with ground truth mentions. Overall, the systems struggled in solving coreference resolution for cases that required domain knowledge. PMID:22366294

  1. Unsupervised learning of natural languages

    PubMed Central

    Solan, Zach; Horn, David; Ruppin, Eytan; Edelman, Shimon

    2005-01-01

    We address the problem, fundamental to linguistics, bioinformatics, and certain other disciplines, of using corpora of raw symbolic sequential data to infer underlying rules that govern their production. Given a corpus of strings (such as text, transcribed speech, chromosome or protein sequence data, sheet music, etc.), our unsupervised algorithm recursively distills from it hierarchically structured patterns. The adios (automatic distillation of structure) algorithm relies on a statistical method for pattern extraction and on structured generalization, two processes that have been implicated in language acquisition. It has been evaluated on artificial context-free grammars with thousands of rules, on natural languages as diverse as English and Chinese, and on protein data correlating sequence with function. This unsupervised algorithm is capable of learning complex syntax, generating grammatical novel sentences, and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics. PMID:16087885

  2. Unsupervised learning of natural languages.

    PubMed

    Solan, Zach; Horn, David; Ruppin, Eytan; Edelman, Shimon

    2005-08-16

    We address the problem, fundamental to linguistics, bioinformatics, and certain other disciplines, of using corpora of raw symbolic sequential data to infer underlying rules that govern their production. Given a corpus of strings (such as text, transcribed speech, chromosome or protein sequence data, sheet music, etc.), our unsupervised algorithm recursively distills from it hierarchically structured patterns. The adios (automatic distillation of structure) algorithm relies on a statistical method for pattern extraction and on structured generalization, two processes that have been implicated in language acquisition. It has been evaluated on artificial context-free grammars with thousands of rules, on natural languages as diverse as English and Chinese, and on protein data correlating sequence with function. This unsupervised algorithm is capable of learning complex syntax, generating grammatical novel sentences, and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics.

  3. Tailoring vocabularies for NLP in sub-domains: a method to detect unused word sense.

    PubMed

    Figueroa, Rosa L; Zeng-Treitler, Qing; Goryachev, Sergey; Wiechmann, Eduardo P

    2009-11-14

    We developed a method to help tailor a comprehensive vocabulary system (e.g. the UMLS) for a sub-domain (e.g. clinical reports) in support of natural language processing (NLP). The method detects unused senses in a sub-domain by comparing the relational neighborhood of a word/term in the vocabulary with the semantic neighborhood of the word/term in the sub-domain. The semantic neighborhood of the word/term in the sub-domain is determined using latent semantic analysis (LSA). We trained and tested the unused sense detection on two clinical text corpora: one contains discharge summaries and the other outpatient visit notes. We were able to detect unused senses with precision from 79% to 87%, recall from 48% to 74%, and an area under the receiver operating characteristic curve (AUC) of 72% to 87%.

  4. Chemical Entity Recognition and Resolution to ChEBI

    PubMed Central

    Grego, Tiago; Pesquita, Catia; Bastos, Hugo P.; Couto, Francisco M.

    2012-01-01

    Chemical entities are ubiquitous throughout the biomedical literature, and the development of text-mining systems that can efficiently identify those entities is required. Due to the lack of available corpora and data resources, the community has focused its efforts on the development of gene and protein named entity recognition systems, but with the release of ChEBI and the availability of an annotated corpus, this task can be addressed. We developed a machine-learning-based method for chemical entity recognition and a lexical-similarity-based method for chemical entity resolution and compared them with Whatizit, a popular dictionary-based method. Our methods outperformed the dictionary-based method in all tasks, yielding an improvement in F-measure of 20% for the entity recognition task, 2–5% for the entity-resolution task, and 15% for combined entity recognition and resolution tasks. PMID:25937941

  5. Challenges for automatically extracting molecular interactions from full-text articles.

    PubMed

    McIntosh, Tara; Curran, James R

    2009-09-24

    The increasing availability of full-text biomedical articles will allow more biomedical knowledge to be extracted automatically with greater reliability. However, most Information Retrieval (IR) and Extraction (IE) tools currently process only abstracts. The lack of corpora has limited the development of tools that are capable of exploiting the knowledge in full-text articles. As a result, there has been little investigation into the advantages of full-text document structure, and the challenges developers will face in processing full-text articles. We manually annotated passages from full-text articles that describe interactions summarised in a Molecular Interaction Map (MIM). Our corpus tracks the process of identifying facts to form the MIM summaries and captures any factual dependencies that must be resolved to extract the fact completely. For example, a fact in the results section may require a synonym defined in the introduction. The passages are also annotated with negated and coreference expressions that must be resolved. We describe the guidelines for identifying relevant passages and possible dependencies. The corpus includes 2162 sentences from 78 full-text articles. Our corpus analysis demonstrates the necessity of full-text processing; identifies the article sections where interactions are most commonly stated; and quantifies the proportion of interaction statements requiring coherent dependencies. Further, it allows us to report on the relative importance of identifying synonyms and resolving negated expressions. We also experiment with an oracle sentence retrieval system using the corpus as a gold-standard evaluation set. We introduce the MIM corpus, a unique resource that maps interaction facts in a MIM to annotated passages within full-text articles. It is an invaluable case study providing guidance to developers of biomedical IR and IE systems, and can be used as a gold-standard evaluation set for full-text IR tasks.

  6. Acute inflammatory proteins constitute the organic matrix of prostatic corpora amylacea and calculi in men with prostate cancer

    PubMed Central

    Sfanos, Karen S.; Wilson, Brice A.; De Marzo, Angelo M.; Isaacs, William B.

    2009-01-01

    Corpora amylacea (CA) are a frequent microscopic finding in radical prostatectomy specimens from men undergoing treatment for prostate cancer. Although often observed histologically to be associated with inflammation, the contribution of CA to prostatitis-related symptoms of unknown etiology or to prostate carcinogenesis remains unclear. Prostatic calculi (PC), which potentially represent calcified forms of CA, are less common but can cause urological disease including urinary retention and prostatitis. We conducted a comprehensive compositional analysis of CA/PC to gain insight into their biogenesis. Infrared spectroscopy analysis of calculi collected from 23 patients confirmed a prevalence of calcium phosphate in the form of hydroxyapatite. This result sets PC apart from most urinary stones, which largely are composed of calcium oxalate. Tandem mass spectrometry-based proteomic analysis of CA/PC revealed that lactoferrin is the predominant protein component, a result that was confirmed by Western blot analysis. Other proteins identified, including calprotectin, myeloperoxidase, and α-defensins, are proteins contained in neutrophil granules. Immunohistochemistry (IHC) suggested the source of lactoferrin to be prostate-infiltrating neutrophils as well as inflamed prostate epithelium; however, IHC for calprotectin suggested prostate-infiltrating neutrophils as a major source of the protein, because it was absent from other prostate compartments. This study represents a definitive analysis of the protein composition of prostatic CA and calculi and suggests that acute inflammation has a role in their biogenesis—an intriguing finding, given the prevalence of CA in prostatectomy specimens and the hypothesized role for inflammation in prostate carcinogenesis. PMID:19202053

  7. Peptidomics of Neuropeptidergic Tissues of the Tsetse Fly Glossina morsitans morsitans

    NASA Astrophysics Data System (ADS)

    Caers, Jelle; Boonen, Kurt; Van Den Abbeele, Jan; Van Rompay, Liesbeth; Schoofs, Liliane; Van Hiel, Matthias B.

    2015-12-01

    Neuropeptides and peptide hormones are essential signaling molecules that regulate nearly all physiological processes. The recent release of the tsetse fly genome allowed the construction of a detailed in silico neuropeptide database (International Glossina Genome Consortium, Science 344, 380-386 (2014)), as well as an in-depth mass spectrometric analysis of the most important neuropeptidergic tissues of this medically and economically important insect species. Mass spectrometric confirmation of predicted peptides is a vital step in the functional characterization of neuropeptides, as in vivo peptides can be modified, cleaved, or even mispredicted. Using a nanoscale reversed phase liquid chromatography coupled to a Q Exactive Orbitrap mass spectrometer, we detected 51 putative bioactive neuropeptides encoded by 19 precursors: adipokinetic hormone (AKH) I and II, allatostatin A and B, capability/pyrokinin (capa/PK), corazonin, calcitonin-like diuretic hormone (CT/DH), FMRFamide, hugin, leucokinin, myosuppressin, natalisin, neuropeptide-like precursor (NPLP) 1, orcokinin, pigment dispersing factor (PDF), RYamide, SIFamide, short neuropeptide F (sNPF) and tachykinin. In addition, propeptides, truncated and spacer peptides derived from seven additional precursors were found, and include the precursors of allatostatin C, crustacean cardioactive peptide, corticotropin releasing factor-like diuretic hormone (CRF/DH), ecdysis triggering hormone (ETH), ion transport peptide (ITP), neuropeptide F, and proctolin, respectively. The majority of the identified neuropeptides are present in the central nervous system, with only a limited number of peptides in the corpora cardiaca-corpora allata and midgut. Owing to the large number of identified peptides, this study can be used as a reference for comparative studies in other insects.

  8. Clinical documentation variations and NLP system portability: a case study in asthma birth cohorts across institutions.

    PubMed

    Sohn, Sunghwan; Wang, Yanshan; Wi, Chung-Il; Krusemark, Elizabeth A; Ryu, Euijung; Ali, Mir H; Juhn, Young J; Liu, Hongfang

    2017-11-30

    To assess clinical documentation variations across health care institutions using different electronic medical record systems and investigate how they affect natural language processing (NLP) system portability. Birth cohorts from Mayo Clinic and Sanford Children's Hospital (SCH) were used in this study (n = 298 for each). Documentation variations regarding asthma between the 2 cohorts were examined in various aspects: (1) overall corpus at the word level (ie, lexical variation), (2) topics and asthma-related concepts (ie, semantic variation), and (3) clinical note types (ie, process variation). We compared those statistics and explored NLP system portability for asthma ascertainment in 2 stages: prototype and refinement. There exist notable lexical variations (word-level similarity = 0.669) and process variations (differences in major note types containing asthma-related concepts). However, semantic-level corpora were relatively homogeneous (topic similarity = 0.944, asthma-related concept similarity = 0.971). The NLP system for asthma ascertainment had an F-score of 0.937 at Mayo, and produced 0.813 (prototype) and 0.908 (refinement) when applied at SCH. The criteria for asthma ascertainment are largely dependent on asthma-related concepts. Therefore, we believe that semantic similarity is important to estimate NLP system portability. As the Mayo Clinic and SCH corpora were relatively homogeneous at a semantic level, the NLP system, developed at Mayo Clinic, was imported to SCH successfully with proper adjustments to deal with the intrinsic corpus heterogeneity. © The Author 2017. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com

  9. Developing a corpus of clinical notes manually annotated for part-of-speech.

    PubMed

    Pakhomov, Serguei V; Coden, Anni; Chute, Christopher G

    2006-06-01

    This paper presents a project whose main goal is to construct a corpus of clinical text manually annotated for part-of-speech (POS) information. We describe and discuss the process of training three domain experts to perform linguistic annotation. Three domain experts were trained to perform manual annotation of a corpus of clinical notes. A part of this corpus was combined with the Penn Treebank corpus of general purpose English text and another part was set aside for testing. The corpora were then used for training and testing statistical part-of-speech taggers. We list some of the challenges as well as encouraging results pertaining to inter-rater agreement and consistency of annotation. We used the Trigrams'n'Tags (TnT) [T. Brants, TnT-a statistical part-of-speech tagger, In: Proceedings of NAACL/ANLP-2000 Symposium, 2000] tagger trained on general English data to achieve 89.79% correctness. The same tagger trained on a portion of the medical data annotated for this project improved the performance to 94.69%. Furthermore, we find that discriminating between different types of discourse represented by different sections of clinical text may be very beneficial to improve correctness of POS tagging. Our preliminary experimental results indicate the necessity for adapting state-of-the-art POS taggers to the sublanguage domain of clinical text.

  10. SAIL: Summation-bAsed Incremental Learning for Information-Theoretic Text Clustering.

    PubMed

    Cao, Jie; Wu, Zhiang; Wu, Junjie; Xiong, Hui

    2013-04-01

    Information-theoretic clustering aims to exploit information-theoretic measures as the clustering criteria. A common practice on this topic is the so-called Info-Kmeans, which performs K-means clustering with KL-divergence as the proximity function. While expert efforts on Info-Kmeans have shown promising results, a remaining challenge is to deal with high-dimensional sparse data such as text corpora. Indeed, it is possible that the centroids contain many zero-value features for high-dimensional text vectors, which leads to infinite KL-divergence values and creates a dilemma in assigning objects to centroids during the iteration process of Info-Kmeans. To meet this challenge, in this paper, we propose a Summation-bAsed Incremental Learning (SAIL) algorithm for Info-Kmeans clustering. Specifically, by using an equivalent objective function, SAIL replaces the computation of KL-divergence by the incremental computation of Shannon entropy. This can avoid the zero-feature dilemma caused by the use of KL-divergence. To improve the clustering quality, we further introduce the variable neighborhood search scheme and propose the V-SAIL algorithm, which is then accelerated by a multithreaded scheme in PV-SAIL. Our experimental results on various real-world text collections have shown that, with SAIL as a booster, the clustering performance of Info-Kmeans can be significantly improved. Also, V-SAIL and PV-SAIL indeed help improve the clustering quality at a lower cost of computation.
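
    The zero-feature dilemma described in this abstract can be illustrated with a minimal sketch. It is not the SAIL algorithm and does not reproduce the authors' objective function; the toy distributions and the function names kl_divergence and entropy are assumptions made for illustration only. It shows that KL-divergence becomes infinite when a centroid has a zero where the document does not, whereas Shannon entropy stays finite.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # 0 * log(0 / q) is taken as 0 by convention
        if q_i == 0.0:
            return math.inf  # p has mass on a feature where q has none
        total += p_i * math.log(p_i / q_i)
    return total

def entropy(p):
    """Shannon entropy in nats; zero entries contribute nothing."""
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0.0)

# Toy term distributions over a 4-word vocabulary (hypothetical values).
document = [0.5, 0.5, 0.0, 0.0]
centroid = [0.0, 0.5, 0.25, 0.25]  # sparse centroid: first term never seen

print(kl_divergence(document, centroid))  # infinite: assignment is undefined
print(entropy(centroid))                  # finite despite the zero feature
```

    With sparse, high-dimensional text vectors, zeros like the one in the toy centroid are the common case, which is why a criterion built on entropy rather than on direct KL-divergence evaluation is attractive.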

  11. Mining of relations between proteins over biomedical scientific literature using a deep-linguistic approach.

    PubMed

    Rinaldi, Fabio; Schneider, Gerold; Kaljurand, Kaarel; Hess, Michael; Andronis, Christos; Konstandi, Ourania; Persidis, Andreas

    2007-02-01

    The amount of new discoveries (as published in the scientific literature) in the biomedical area is growing at an exponential rate. This growth makes it very difficult to filter the most relevant results, and thus the extraction of the core information becomes very expensive. Therefore, there is a growing interest in text processing approaches that can deliver selected information from scientific publications, which can limit the amount of human intervention normally needed to gather those results. This paper presents and evaluates an approach aimed at automating the process of extracting functional relations (e.g. interactions between genes and proteins) from scientific literature in the biomedical domain. The approach, using a novel dependency-based parser, is based on a complete syntactic analysis of the corpus. We have implemented a state-of-the-art text mining system for biomedical literature, based on a deep-linguistic, full-parsing approach. The results are validated on two different corpora: the manually annotated genomics information access (GENIA) corpus and the automatically annotated arabidopsis thaliana circadian rhythms (ATCR) corpus. We show how a deep-linguistic approach (contrary to common belief) can be used in a real world text mining application, offering high-precision relation extraction, while at the same time retaining a sufficient recall.

  12. Dealing with extreme data diversity: extraction and fusion from the growing types of document formats

    NASA Astrophysics Data System (ADS)

    David, Peter; Hansen, Nichole; Nolan, James J.; Alcocer, Pedro

    2015-05-01

    The growth in text data available online is accompanied by a growth in the diversity of available documents. Corpora with extreme heterogeneity in terms of file formats, document organization, page layout, text style, and content are common. The absence of meaningful metadata describing the structure of online and open-source data leads to text extraction results that contain no information about document structure and are cluttered with page headers and footers, web navigation controls, advertisements, and other items that are typically considered noise. We describe an approach to document structure and metadata recovery that uses visual analysis of documents to infer the communicative intent of the author. Our algorithm identifies the components of documents such as titles, headings, and body content, based on their appearance. Because it operates on an image of a document, our technique can be applied to any type of document, including scanned images. Our approach to document structure recovery considers a finer-grained set of component types than prior approaches. In this initial work, we show that a machine learning approach to document structure recovery using a feature set based on the geometry and appearance of images of documents achieves a 60% greater F1-score than a baseline random classifier.

  13. The presence of English and Spanish dyslexia in the Web

    NASA Astrophysics Data System (ADS)

    Rello, Luz; Baeza-Yates, Ricardo

    2012-09-01

    In this study we present a lower bound of the prevalence of dyslexia in the Web for English and Spanish. On the basis of analysis of corpora written by dyslexic people, we propose a classification of the different kinds of dyslexic errors. A representative data set of dyslexic words is used to calculate this lower bound in web pages containing English and Spanish dyslexic errors. We also present an analysis of dyslexic errors in major Internet domains, social media sites, and throughout English- and Spanish-speaking countries. To show the independence of our estimations from the presence of other kinds of errors, we compare them with the overall lexical quality of the Web and with the error rate of noncorrected corpora. The presence of dyslexic errors in the Web motivates work in web accessibility for dyslexic users.

  14. Permanent alterations in catecholamine concentrations in discrete areas of brain in the offspring of rats treated with methylamphetamine and chlorpromazine

    PubMed Central

    Tonge, Sally R.

    1973-01-01

    Methylamphetamine hydrochloride (80 mg/l.) and/or chlorpromazine hydrochloride (200 mg/l.) have been administered in the drinking water of female Wistar rats during pregnancy and suckling. The offspring were weaned at 21 days and thereafter received no drugs. Nine months later, male offspring were killed and noradrenaline and normetanephrine concentrations were determined in eight discrete areas of the brains: neocortex, hippocampus, striatum, thalamus, hypothalamus, corpora quadrigemina, pons/medulla, and amygdala region. Both drugs appeared to have permanently altered catecholamine concentrations in several areas of the brain. There was evidence of antagonism between the effects of the two drugs in the hippocampus, striatum, thalamus, and corpora quadrigemina, where the individual drugs produced altered noradrenaline concentrations but a combination of the two had no effect. PMID:4722052

  15. The Hebrew CHILDES corpus: transcription and morphological analysis

    PubMed Central

    Albert, Aviad; MacWhinney, Brian; Nir, Bracha

    2014-01-01

    We present a corpus of transcribed spoken Hebrew that reflects spoken interactions between children and adults. The corpus is an integral part of the CHILDES database, which distributes similar corpora for over 25 languages. We introduce a dedicated transcription scheme for the spoken Hebrew data that is sensitive to both the phonology and the standard orthography of the language. We also introduce a morphological analyzer that was specifically developed for this corpus. The analyzer adequately covers the entire corpus, producing detailed correct analyses for all tokens. Evaluation on a new corpus reveals high coverage as well. Finally, we describe a morphological disambiguation module that selects the correct analysis of each token in context. The result is a high-quality morphologically-annotated CHILDES corpus of Hebrew, along with a set of tools that can be applied to new corpora. PMID:25419199

  16. Evaluating a Pivot-Based Approach for Bilingual Lexicon Extraction

    PubMed Central

    Kim, Jae-Hoon; Kwon, Hong-Seok; Seo, Hyeong-Won

    2015-01-01

    A pivot-based approach for bilingual lexicon extraction is based on the similarity of context vectors represented by words in a pivot language like English. In this paper, in order to show the validity and usability of the pivot-based approach, we evaluate it together with two different methods for estimating context vectors: one estimates them from two parallel corpora based on word association between source words (resp., target words) and pivot words, and the other estimates them from two parallel corpora based on word alignment tools for statistical machine translation. Empirical results on two language pairs (Korean-Spanish and Korean-French) show that the pivot-based approach is very promising for resource-poor languages and support its validity and usability. Furthermore, our method also performs well for words with low frequency. PMID:25983745

  17. Morphosyntactic annotation of CHILDES transcripts

    PubMed Central

    Sagae, Kenji; Davis, Eric; Lavie, Alon; MacWhinney, Brian; Wintner, Shuly

    2014-01-01

    Corpora of child language are essential for research in child language acquisition and psycholinguistics. Linguistic annotation of the corpora provides researchers with better means for exploring the development of grammatical constructions and their usage. We describe a project whose goal is to annotate the English section of the CHILDES database with grammatical relations in the form of labeled dependency structures. We have produced a corpus of over 18,800 utterances (approximately 65,000 words) with manually curated gold-standard grammatical relation annotations. Using this corpus, we have developed a highly accurate data-driven parser for the English CHILDES data, which we used to automatically annotate the remainder of the English section of CHILDES. We have also extended the parser to Spanish, and are currently working on supporting more languages. The parser and the manually and automatically annotated data are freely available for research purposes. PMID:20334720

  18. SyllabO+: A new tool to study sublexical phenomena in spoken Quebec French.

    PubMed

    Bédard, Pascale; Audet, Anne-Marie; Drouin, Patrick; Roy, Johanna-Pascale; Rivard, Julie; Tremblay, Pascale

    2017-10-01

    Sublexical phonotactic regularities in language have a major impact on language development, as well as on speech processing and production throughout the entire lifespan. To understand the impact of phonotactic regularities on speech and language functions at the behavioral and neural levels, it is essential to have access to oral language corpora to study these complex phenomena in different languages. Yet, probably because of their complexity, oral language corpora remain less common than written language corpora. This article presents the first corpus and database of spoken Quebec French syllables and phones: SyllabO+. This corpus contains phonetic transcriptions of over 300,000 syllables (over 690,000 phones) extracted from recordings of 184 healthy adult native Quebec French speakers, ranging in age from 20 to 97 years. To ensure the representativeness of the corpus, these recordings were made in both formal and familiar communication contexts. Phonotactic distributional statistics (e.g., syllable and co-occurrence frequencies, percentages, percentile ranks, transition probabilities, and pointwise mutual information) were computed from the corpus. An open-access online application to search the database was developed, and is available at www.speechneurolab.ca/syllabo . In this article, we present a brief overview of the corpus, as well as the syllable and phone databases, and we discuss their practical applications in various fields of research, including cognitive neuroscience, psycholinguistics, neurolinguistics, experimental psychology, phonetics, and phonology. Nonacademic practical applications are also discussed, including uses in speech-language pathology.
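
    As a rough sketch of two of the distributional statistics named in this abstract, transition probabilities and pointwise mutual information can be computed from syllable sequences as shown below. The toy sequences and the function name bigram_stats are hypothetical; nothing here comes from the SyllabO+ data or tooling, and the exact estimators used by the authors may differ.

```python
import math
from collections import Counter

def bigram_stats(sequences):
    """Return (transition probability, PMI) dicts for adjacent unit pairs."""
    unigrams = Counter()
    bigrams = Counter()
    firsts = Counter()  # how often each unit is followed by another unit
    for seq in sequences:
        unigrams.update(seq)
        for a, b in zip(seq, seq[1:]):
            bigrams[(a, b)] += 1
            firsts[a] += 1
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    transition = {}
    pmi = {}
    for (a, b), count in bigrams.items():
        # P(b | a): share of a's continuations that are b
        transition[(a, b)] = count / firsts[a]
        # PMI in bits: log2( P(a, b) / (P(a) * P(b)) )
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return transition, pmi

# Hypothetical syllabified words, one list of syllables per word.
words = [["ba", "na", "na"], ["na", "na"], ["ba", "to"]]
transition, pmi = bigram_stats(words)
print(transition[("ba", "na")])  # half of the continuations of "ba"
print(pmi[("ba", "to")] > pmi[("na", "na")])  # rarer units, stronger association
```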

  19. Making adjustments to event annotations for improved biological event extraction.

    PubMed

    Baek, Seung-Cheol; Park, Jong C

    2016-09-16

    Current state-of-the-art approaches to biological event extraction train statistical models in a supervised manner on corpora annotated with event triggers and event-argument relations. Inspecting such corpora, we observe that there is ambiguity in the span of event triggers (e.g., "transcriptional activity" vs. "transcriptional"), leading to inconsistencies across event trigger annotations. Such inconsistencies make it quite likely that similar phrases are annotated with different spans of event triggers, suggesting the possibility that a statistical learning algorithm misses an opportunity for generalizing from such event triggers. We anticipate that adjustments to the span of event triggers to reduce these inconsistencies would meaningfully improve the present performance of event extraction systems. In this study, we look into this possibility with the corpora provided by the 2009 BioNLP shared task as a proof of concept. We propose an Informed Expectation-Maximization (EM) algorithm, which trains models using the EM algorithm with a posterior regularization technique, which consults the gold-standard event trigger annotations in the form of constraints. We further propose four constraints on the possible event trigger annotations to be explored by the EM algorithm. The algorithm is shown to outperform the state-of-the-art algorithm on the development corpus in a statistically significant manner and on the test corpus by a narrow margin. The analysis of the annotations generated by the algorithm shows that there are various types of ambiguity in event annotations, even though they could be small in number.

  20. Biomechanically Preferred Consonant-Vowel Combinations Fail to Appear in Adult Spoken Corpora

    PubMed Central

    Whalen, D. H.; Giulivi, Sara; Nam, Hosung; Levitt, Andrea G.; Hallé, Pierre; Goldstein, Louis M.

    2012-01-01

    Certain consonant/vowel (CV) combinations are more frequent than would be expected from the individual C and V frequencies alone, both in babbling and, to a lesser extent, in adult language, based on dictionary counts: Labial consonants co-occur with central vowels more often than chance would dictate; coronals co-occur with front vowels, and velars with back vowels (Davis & MacNeilage, 1994). Plausible biomechanical explanations have been proposed, but it is also possible that infants are mirroring the frequency of the CVs that they hear. As noted, previous assessments of adult language were based on dictionaries; these “type” counts are incommensurate with the babbling measures, which are necessarily “token” counts. We analyzed the tokens in two spoken corpora for English, two for French and one for Mandarin. We found that the adult spoken CV preferences correlated with the type counts for Mandarin and French, not for English. Correlations between the adult spoken corpora and the babbling results had all three possible outcomes: significantly positive (French), uncorrelated (Mandarin), and significantly negative (English). There were no correlations of the dictionary data with the babbling results when we consider all nine combinations of consonants and vowels. The results indicate that spoken frequencies of CV combinations can differ from dictionary (type) counts and that the CV preferences apparent in babbling are biomechanically driven and can ignore the frequencies of CVs in the ambient spoken language. PMID:23420980
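
    The distinction this abstract draws between dictionary "type" counts and spoken "token" counts can be made concrete with a small sketch. The mini-lexicon, the simplified consonant and vowel symbols, the usage frequencies and the function name cv_counts are all hypothetical and are not taken from the corpora analyzed in the study.

```python
from collections import Counter

def cv_counts(pronunciations, token_frequency):
    """Count CV combinations once per dictionary entry (types)
    and once per spoken occurrence (tokens)."""
    type_counts = Counter()
    token_counts = Counter()
    for word, cv_pairs in pronunciations.items():
        for cv in cv_pairs:
            type_counts[cv] += 1                             # dictionary count
            token_counts[cv] += token_frequency.get(word, 0)  # usage-weighted
    return type_counts, token_counts

# Hypothetical mini-lexicon: word -> list of (consonant, vowel) pairs.
pronunciations = {
    "baby": [("b", "e"), ("b", "i")],
    "daddy": [("d", "a"), ("d", "i")],
    "bee": [("b", "i")],
}
# Hypothetical number of times each word occurs in a spoken corpus.
token_frequency = {"baby": 10, "daddy": 3, "bee": 1}

type_counts, token_counts = cv_counts(pronunciations, token_frequency)
print(type_counts[("b", "i")], token_counts[("b", "i")])
```

    Because a few frequent words can dominate the token counts, the two tallies can rank CV combinations differently, which is the reason the abstract treats them as incommensurate.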

  1. NegBio: a high-performance tool for negation and uncertainty detection in radiology reports.

    PubMed

    Peng, Yifan; Wang, Xiaosong; Lu, Le; Bagheri, Mohammadhadi; Summers, Ronald; Lu, Zhiyong

    2018-01-01

    Negative and uncertain medical findings are frequent in radiology reports, but discriminating them from positive findings remains challenging for information extraction. Here, we propose a new algorithm, NegBio, to detect negative and uncertain findings in radiology reports. Unlike previous rule-based methods, NegBio utilizes patterns on universal dependencies to identify the scope of triggers that are indicative of negation or uncertainty. We evaluated NegBio on four datasets, including two public benchmarking corpora of radiology reports, a new radiology corpus that we annotated for this work, and a public corpus of general clinical texts. Evaluation on these datasets demonstrates that NegBio is highly accurate for detecting negative and uncertain findings and compares favorably to a widely-used state-of-the-art system NegEx (an average of 9.5% improvement in precision and 5.1% in F1-score). https://github.com/ncbi-nlp/NegBio.

  2. Biomedical information retrieval across languages.

    PubMed

    Daumke, Philipp; Markó, Kornél; Poprat, Michael; Schulz, Stefan; Klar, Rüdiger

    2007-06-01

    This work presents a new dictionary-based approach to biomedical cross-language information retrieval (CLIR) that addresses many of the general and domain-specific challenges in current CLIR research. Our method is based on a multilingual lexicon that was generated partly manually and partly automatically, and currently covers six European languages. It contains morphologically meaningful word fragments, termed subwords. Using subwords instead of entire words significantly reduces the number of lexical entries necessary to sufficiently cover a specific language and domain. Mediation between queries and documents is based on these subwords as well as on lists of word-n-grams that are generated from large monolingual corpora and constitute possible translation units. The translations are then sent to a standard Internet search engine. This process makes our approach an effective tool for searching the biomedical content of the World Wide Web in different languages. We evaluate this approach using the OHSUMED corpus, a large medical document collection, within a cross-language retrieval setting.

  3. A generative model for scientific concept hierarchies.

    PubMed

    Datta, Srayan; Adar, Eytan

    2018-01-01

    In many scientific disciplines, each new 'product' of research (method, finding, artifact, etc.) is often built upon previous findings, leading to extension and branching of scientific concepts over time. We aim to understand the evolution of scientific concepts by placing them in phylogenetic hierarchies where scientific keyphrases from large, longitudinal academic corpora are used as a proxy for scientific concepts. These hierarchies exhibit various important properties, including power-law degree distribution, power-law component size distribution, existence of a giant component and a lower probability of extending an older concept. We present a generative model based on preferential attachment to simulate the graphical and temporal properties of these hierarchies, which helps us understand the underlying process behind scientific concept evolution and may be useful in simulating and predicting scientific evolution.

  4. A generative model for scientific concept hierarchies

    PubMed Central

    Adar, Eytan

    2018-01-01

    In many scientific disciplines, each new ‘product’ of research (method, finding, artifact, etc.) is often built upon previous findings, leading to extension and branching of scientific concepts over time. We aim to understand the evolution of scientific concepts by placing them in phylogenetic hierarchies where scientific keyphrases from large, longitudinal academic corpora are used as a proxy for scientific concepts. These hierarchies exhibit various important properties, including power-law degree distribution, power-law component size distribution, existence of a giant component and a lower probability of extending an older concept. We present a generative model based on preferential attachment to simulate the graphical and temporal properties of these hierarchies, which helps us understand the underlying process behind scientific concept evolution and may be useful in simulating and predicting scientific evolution. PMID:29474409

  5. hSMR3A as a Marker for Patients With Erectile Dysfunction

    PubMed Central

    Tong, Yuehong; Tar, Moses; Monrose, Val; DiSanto, Michael; Melman, Arnold; Davies, Kelvin P.

    2007-01-01

    Purpose: We recently reported that Vcsa1 is one of the most down-regulated genes in the corpora of rats in 3 distinct models of erectile dysfunction. Since gene transfer of plasmids expressing Vcsa1 or intracorporeal injection of its mature peptide product sialorphin into the corpora of aging rats was shown to restore erectile function, we proposed that the Vcsa1 gene has a direct role in erectile function. To determine if similar changes in gene expression occur in the corpora of human subjects with erectile dysfunction we identified a human homologue of Vcsa1 (hSMR3A) and determined the level of expression of hSMR3A in patients. Materials and Methods: hSMR3A was identified as a homologue of Vcsa1 by searching protein databases for proteins with similarity. hSMR3A cDNA was generated and subcloned into the plasmid pVAX to generate pVAX-hSMR3A. pVAX-hSMR3A (25 or 100 μg) was intracorporeally injected into aging rats. The effect on erectile physiology was compared histologically and by measuring intracorporeal pressure/blood pressure with controls treated with the empty plasmid pVAX. Total RNA was extracted from human corporeal tissue obtained from patients undergoing previously scheduled penile surgery. Patients were grouped according to normal erectile function (3), erectile dysfunction and diabetes (5) and patients without diabetes but with erectile dysfunction (5). Quantitative reverse-transcriptase polymerase chain reaction was used to determine the hSMR3A expression level. Results: Intracorporeal injection of 25 μg pVAX-hSMR3A was able to significantly increase the intracorporeal pressure-to-blood pressure ratio in aging rats compared to age matched controls. Higher amounts (100 μg) of gene transfer of the plasmid caused less of an improvement in the intracorporeal pressure-to-blood pressure ratio compared to controls, although there was histological and visual evidence that the animals were post-priapitic. These physiological effects were similar to previously reported effects of intracorporeal injection of pVAX-Vcsa1 into the corpora of aging rats, establishing hSMR3A as a functional homologue of Vcsa1. More than 10-fold down-regulation in hSMR3A transcript expression was observed in the corpora of patients with vs without erectile dysfunction. In patients with diabetes associated and nondiabetes associated erectile dysfunction hSMR3A expression was found to be down-regulated. Conclusions: These results suggest that hSMR3A can act as a marker for erectile dysfunction associated with diabetic and nondiabetic etiologies. Given that our previous studies demonstrated that gene transfer of the Vcsa1 gene and intracorporeal injection of its protein product in rats can restore erectile function, these results suggest that therapies that increase the hSMR3A gene and product expression could potentially have a positive impact on erectile function. PMID:17512016

  6. hSMR3A as a marker for patients with erectile dysfunction.

    PubMed

    Tong, Yuehong; Tar, Moses; Monrose, Val; DiSanto, Michael; Melman, Arnold; Davies, Kelvin P

    2007-07-01

    We recently reported that Vcsa1 is one of the most down-regulated genes in the corpora of rats in 3 distinct models of erectile dysfunction. Since gene transfer of plasmids expressing Vcsa1 or intracorporeal injection of its mature peptide product sialorphin into the corpora of aging rats was shown to restore erectile function, we proposed that the Vcsa1 gene has a direct role in erectile function. To determine if similar changes in gene expression occur in the corpora of human subjects with erectile dysfunction we identified a human homologue of Vcsa1 (hSMR3A) and determined the level of expression of hSMR3A in patients. hSMR3A was identified as a homologue of Vcsa1 by searching protein databases for proteins with similarity. hSMR3A cDNA was generated and subcloned into the plasmid pVAX to generate pVAX-hSMR3A. pVAX-hSMR3A (25 or 100 μg) was intracorporeally injected into aging rats. The effect on erectile physiology was compared histologically and by measuring intracorporeal pressure/blood pressure with controls treated with the empty plasmid pVAX. Total RNA was extracted from human corporeal tissue obtained from patients undergoing previously scheduled penile surgery. Patients were grouped according to normal erectile function (3), erectile dysfunction and diabetes (5) and patients without diabetes but with erectile dysfunction (5). Quantitative reverse-transcriptase polymerase chain reaction was used to determine the hSMR3A expression level. Intracorporeal injection of 25 μg pVAX-hSMR3A was able to significantly increase the intracorporeal pressure-to-blood pressure ratio in aging rats compared to age matched controls. Higher amounts (100 μg) of gene transfer of the plasmid caused less of an improvement in the intracorporeal pressure-to-blood pressure ratio compared to controls, although there was histological and visual evidence that the animals were post-priapitic. These physiological effects were similar to previously reported effects of intracorporeal injection of pVAX-Vcsa1 into the corpora of aging rats, establishing hSMR3A as a functional homologue of Vcsa1. More than 10-fold down-regulation in hSMR3A transcript expression was observed in the corpora of patients with vs without erectile dysfunction. In patients with diabetes associated and nondiabetes associated erectile dysfunction hSMR3A expression was found to be down-regulated. These results suggest that hSMR3A can act as a marker for erectile dysfunction associated with diabetic and nondiabetic etiologies. Given that our previous studies demonstrated that gene transfer of the Vcsa1 gene and intracorporeal injection of its protein product in rats can restore erectile function, these results suggest that therapies that increase the hSMR3A gene and product expression could potentially have a positive impact on erectile function.

  7. Reading Ability and Print Exposure: Item Response Theory Analysis of the Author Recognition Test

    PubMed Central

    Moore, Mariah; Gordon, Peter C.

    2015-01-01

    In the Author Recognition Test (ART) participants are presented with a series of names and foils and are asked to indicate which ones they recognize as authors. The test is a strong predictor of reading skill, with this predictive ability generally explained as occurring because author knowledge is likely acquired through reading or other forms of print exposure. This large-scale study (1012 college student participants) used Item Response Theory (IRT) to analyze item (author) characteristics to facilitate identification of the determinants of item difficulty, provide a basis for further test development, and to optimize scoring of the ART. Factor analysis suggests a potential two factor structure of the ART differentiating between literary vs. popular authors. Effective and ineffective author names were identified so as to facilitate future revisions of the ART. Analyses showed that the ART is a highly significant predictor of time spent encoding words as measured using eye-tracking during reading. The relationship between the ART and time spent reading provided a basis for implementing a higher penalty for selecting foils, rather than the standard method of ART scoring (names selected minus foils selected). The findings provide novel support for the view that the ART is a valid indicator of reading volume. Further, they show that frequency data can be used to select items of appropriate difficulty and that frequency data from corpora based on particular time periods and types of text may allow test adaptation for different populations. PMID:25410405

  8. Reading ability and print exposure: item response theory analysis of the author recognition test.

    PubMed

    Moore, Mariah; Gordon, Peter C

    2015-12-01

    In the author recognition test (ART), participants are presented with a series of names and foils and are asked to indicate which ones they recognize as authors. The test is a strong predictor of reading skill, and this predictive ability is generally explained as occurring because author knowledge is likely acquired through reading or other forms of print exposure. In this large-scale study (1,012 college student participants), we used item response theory (IRT) to analyze item (author) characteristics in order to facilitate identification of the determinants of item difficulty, provide a basis for further test development, and optimize scoring of the ART. Factor analysis suggested a potential two-factor structure of the ART, differentiating between literary and popular authors. Effective and ineffective author names were identified so as to facilitate future revisions of the ART. Analyses showed that the ART is a highly significant predictor of the time spent encoding words, as measured using eyetracking during reading. The relationship between the ART and time spent reading provided a basis for implementing a higher penalty for selecting foils, rather than the standard method of ART scoring (names selected minus foils selected). The findings provide novel support for the view that the ART is a valid indicator of reading volume. Furthermore, they show that frequency data can be used to select items of appropriate difficulty, and that frequency data from corpora based on particular time periods and types of texts may allow adaptations of the test for different populations.
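
    The standard ART scoring rule quoted in this abstract (names selected minus foils selected) and the idea of a heavier foil penalty can be sketched as follows. The abstract does not state the size of the higher penalty, so the foil_penalty parameter and its value of 2.0 are purely illustrative assumptions, as are the placeholder item names and the function name art_score.

```python
def art_score(selected, authors, foils, foil_penalty=1.0):
    """Score one participant: author hits minus penalised foil selections."""
    chosen = set(selected)
    hits = len(chosen & set(authors))
    false_alarms = len(chosen & set(foils))
    return hits - foil_penalty * false_alarms

# Hypothetical item lists; real ART forms use actual author names.
authors = {"Author A", "Author B", "Author C"}
foils = {"Foil X", "Foil Y"}
selected = ["Author A", "Author B", "Author C", "Foil X"]

# Standard rule: each foil cancels one correct author.
standard = art_score(selected, authors, foils)
# Illustrative stricter rule: each foil costs two points (assumed weight).
stricter = art_score(selected, authors, foils, foil_penalty=2.0)
print(standard, stricter)
```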

  9. Physiological roles of trehalose in Leptinotarsa larvae revealed by RNA interference of trehalose-6-phosphate synthase and trehalase genes.

    PubMed

    Shi, Ji-Feng; Xu, Qing-Yu; Sun, Qiang-Kun; Meng, Qing-Wei; Mu, Li-Li; Guo, Wen-Chao; Li, Guo-Qing

    2016-10-01

    Trehalose is proposed to serve multiple physiological roles in insects. However, its importance remains largely unconfirmed. In the present paper, we knocked down either a trehalose biosynthesis gene (trehalose-6-phosphate synthase, LdTPS) or each of three degradation genes (soluble trehalases LdTRE1a, LdTRE1b or membrane-bound LdTRE2) in Leptinotarsa decemlineata by RNA interference (RNAi). Knockdown of LdTPS decreased trehalose content and caused larval and pupal lethality. The LdTPS RNAi survivors consumed a greater amount of foliage, obtained a heavier body mass, accumulated more glycogen, lipid and proline, and had a smaller amount of chitin compared with the controls. Ingestion of trehalose but not glucose rescued the food consumption increase and larval mass rise, increased survivorship, and recovered glycogen, lipid and chitin to the normal levels. In contrast, silencing of LdTRE1a increased trehalose content and resulted in larval and pupal lethality. The surviving LdTRE1a RNAi hypomorphs fed a smaller quantity of food, had a lighter body weight, depleted lipid and several glucogenic amino acids, and contained a smaller amount of chitin. Neither trehalose nor glucose ingestion rescued these LdTRE1a RNAi defects. Silencing of LdTRE1b caused little effects. Knockdown of LdTRE2 caused larval death, increased trehalose contents in several tissues and diminished glycogen in the brain-corpora cardiaca-corpora allata complex (BCC). Feeding glucose but not trehalose partially rescued the high mortality rate and recovered glycogen content in the BCC. It seems that trehalose is involved in feeding regulation, sugar absorption, brain energy supply and chitin biosynthesis in L. decemlineata larvae. Copyright © 2016 Elsevier Ltd. All rights reserved.

  10. Frequency of word-use predicts rates of lexical evolution throughout Indo-European history.

    PubMed

    Pagel, Mark; Atkinson, Quentin D; Meade, Andrew

    2007-10-11

    Greek speakers say "ουρά", Germans "schwanz" and the French "queue" to describe what English speakers call a 'tail', but all of these languages use a related form of 'two' to describe the number after one. Among more than 100 Indo-European languages and dialects, the words for some meanings (such as 'tail') evolve rapidly, being expressed across languages by dozens of unrelated words, while others evolve much more slowly--such as the number 'two', for which all Indo-European language speakers use the same related word-form. No general linguistic mechanism has been advanced to explain this striking variation in rates of lexical replacement among meanings. Here we use four large and divergent language corpora (English, Spanish, Russian and Greek) and a comparative database of 200 fundamental vocabulary meanings in 87 Indo-European languages to show that the frequency with which these words are used in modern language predicts their rate of replacement over thousands of years of Indo-European language evolution. Across all 200 meanings, frequently used words evolve at slower rates and infrequently used words evolve more rapidly. This relationship holds separately and identically across parts of speech for each of the four language corpora, and accounts for approximately 50% of the variation in historical rates of lexical replacement. We propose that the frequency with which specific words are used in everyday language exerts a general and law-like influence on their rates of evolution. Our findings are consistent with social models of word change that emphasize the role of selection, and suggest that owing to the ways that humans use language, some words will evolve slowly and others rapidly across all languages.

  11. Compressed Natural Gas Safety in Transit Operations

    DOT National Transportation Integrated Search

    1995-09-14

    This report examines the safety issues relating to the use of Compressed Natural Gas (CNG) in transit service. The safety issues were determined by on-site surveys performed by Battelle of Columbus, Ohio and Science Applications International Corpora...

  12. WARCProcessor: An Integrative Tool for Building and Management of Web Spam Corpora.

    PubMed

    Callón, Miguel; Fdez-Glez, Jorge; Ruano-Ordás, David; Laza, Rosalía; Pavón, Reyes; Fdez-Riverola, Florentino; Méndez, Jose Ramón

    2017-12-22

    In this work we present the design and implementation of WARCProcessor, a novel multiplatform integrative tool for building scientific datasets to facilitate experimentation in web spam research. The developed application allows the user to specify multiple criteria that change the way in which new corpora are generated whilst reducing the number of repetitive and error-prone tasks related to existing corpus maintenance. For this goal, WARCProcessor supports up to six commonly used data sources for web spam research and is able to store the output corpus in standard WARC format together with complementary metadata files. Additionally, the application facilitates the automatic and concurrent download of web sites from the Internet, making it possible to configure the depth of the links to be followed as well as the behaviour when redirected URLs appear. WARCProcessor supports both an interactive GUI and a command line utility for execution in the background.

  13. Birth of the cool: a two-centuries decline in emotional expression in Anglophone fiction.

    PubMed

    Morin, Olivier; Acerbi, Alberto

    2017-12-01

    The presence of emotional words and content in stories has been shown to enhance a story's memorability, and its cultural success. Yet, recent cultural trends run in the opposite direction. Using the Google Books corpus, coupled with two metadata-rich corpora of Anglophone fiction books, we show a decrease in emotionality in English-speaking literature starting plausibly in the nineteenth century. We show that this decrease cannot be explained by changes unrelated to emotionality (such as demographic dynamics concerning age or gender balance, changes in vocabulary richness, or changes in the prevalence of literary genres), and that, in our three corpora, the decrease is driven almost entirely by a decline in the proportion of positive emotion-related words, while the frequency of negative emotion-related words shows little if any decline. Consistently with previous studies, we also find a link between ageing and negative emotionality at the individual level.

  14. Emergence of linguistic laws in human voice

    PubMed Central

    Torre, Iván González; Luque, Bartolo; Lacasa, Lucas; Luque, Jordi; Hernández-Fernández, Antoni

    2017-01-01

    Linguistic laws constitute one of the quantitative cornerstones of modern cognitive sciences and have been routinely investigated in written corpora, or in the equivalent transcription of oral corpora. This means that inferences of statistical patterns of language in acoustics are biased by the arbitrary, language-dependent segmentation of the signal, which virtually precludes the possibility of making comparative studies between human voice and other animal communication systems. Here we bridge this gap by proposing a method that allows such patterns to be measured in acoustic signals of arbitrary origin, without needing access to the underlying language corpus. The method has been applied to sixteen different human languages, successfully recovering some well-known laws of human communication at timescales even below the phoneme and finding yet another link between complexity and criticality in a biological system. These methods further pave the way for new comparative studies in animal communication or the analysis of signals of unknown code. PMID:28272418

  15. Emergence of linguistic laws in human voice

    NASA Astrophysics Data System (ADS)

    Torre, Iván González; Luque, Bartolo; Lacasa, Lucas; Luque, Jordi; Hernández-Fernández, Antoni

    2017-03-01

    Linguistic laws constitute one of the quantitative cornerstones of modern cognitive sciences and have been routinely investigated in written corpora, or in the equivalent transcription of oral corpora. This means that inferences of statistical patterns of language in acoustics are biased by the arbitrary, language-dependent segmentation of the signal, which virtually precludes the possibility of making comparative studies between human voice and other animal communication systems. Here we bridge this gap by proposing a method that allows such patterns to be measured in acoustic signals of arbitrary origin, without needing access to the underlying language corpus. The method has been applied to sixteen different human languages, successfully recovering some well-known laws of human communication at timescales even below the phoneme and finding yet another link between complexity and criticality in a biological system. These methods further pave the way for new comparative studies in animal communication or the analysis of signals of unknown code.

  16. Subtitle-Based Word Frequencies as the Best Estimate of Reading Behavior: The Case of Greek

    PubMed Central

    Dimitropoulou, Maria; Duñabeitia, Jon Andoni; Avilés, Alberto; Corral, José; Carreiras, Manuel

    2010-01-01

    Previous evidence has shown that word frequencies calculated from corpora based on film and television subtitles can readily account for reading performance, since the language used in subtitles closely approximates everyday language. The present study examines this issue in a society with increased exposure to subtitle reading. We compiled SUBTLEX-GR, a subtitle-based corpus consisting of more than 27 million Modern Greek words, and tested to what extent subtitle-based frequency estimates and those taken from a written corpus of Modern Greek account for the lexical decision performance of young Greek adults who are exposed to subtitle reading on a daily basis. Results showed that SUBTLEX-GR frequency estimates effectively accounted for participants’ reading performance in two different visual word recognition experiments. More importantly, different analyses showed that frequencies estimated from a subtitle corpus explained the obtained results significantly better than traditional frequencies derived from written corpora. PMID:21833273

  17. WARCProcessor: An Integrative Tool for Building and Management of Web Spam Corpora

    PubMed Central

    Callón, Miguel; Fdez-Glez, Jorge; Ruano-Ordás, David; Laza, Rosalía; Pavón, Reyes; Méndez, Jose Ramón

    2017-01-01

    In this work we present the design and implementation of WARCProcessor, a novel multiplatform integrative tool for building scientific datasets to facilitate experimentation in web spam research. The developed application allows the user to specify multiple criteria that change the way in which new corpora are generated whilst reducing the number of repetitive and error-prone tasks related to existing corpus maintenance. For this goal, WARCProcessor supports up to six commonly used data sources for web spam research and is able to store the output corpus in standard WARC format together with complementary metadata files. Additionally, the application facilitates the automatic and concurrent download of web sites from the Internet, making it possible to configure the depth of the links to be followed as well as the behaviour when redirected URLs appear. WARCProcessor supports both an interactive GUI and a command line utility for execution in the background. PMID:29271913

  18. Negation’s Not Solved: Generalizability Versus Optimizability in Clinical Natural Language Processing

    PubMed Central

    Wu, Stephen; Miller, Timothy; Masanz, James; Coarr, Matt; Halgrim, Scott; Carrell, David; Clark, Cheryl

    2014-01-01

    A review of published work in clinical natural language processing (NLP) may suggest that the negation detection task has been “solved.” This work proposes that an optimizable solution does not equal a generalizable solution. We introduce a new machine learning-based Polarity Module for detecting negation in clinical text, and extensively compare its performance across domains. Using four manually annotated corpora of clinical text, we show that negation detection performance suffers when there is no in-domain development (for manual methods) or training data (for machine learning-based methods). Various factors (e.g., annotation guidelines, named entity characteristics, the amount of data, and lexical and syntactic context) play a role in making generalizability difficult, but none completely explains the phenomenon. Furthermore, generalizability remains challenging because it is unclear whether to use a single source for accurate data, combine all sources into a single model, or apply domain adaptation methods. The most reliable means to improve negation detection is to manually annotate in-domain training data (or, perhaps, manually modify rules); this is a strategy for optimizing performance, rather than generalizing it. These results suggest a direction for future work in domain-adaptive and task-adaptive methods for clinical NLP. PMID:25393544

  19. Disambiguating ambiguous biomedical terms in biomedical narrative text: an unsupervised method.

    PubMed

    Liu, H; Lussier, Y A; Friedman, C

    2001-08-01

    With the growing use of Natural Language Processing (NLP) techniques for information extraction and concept indexing in the biomedical domain, a method that quickly and efficiently assigns the correct sense of an ambiguous biomedical term in a given context is needed concurrently. The current status of word sense disambiguation (WSD) in the biomedical domain is that handcrafted rules are used based on contextual material. The disadvantages of this approach are (i) generating WSD rules manually is a time-consuming and tedious task, (ii) maintenance of rule sets becomes increasingly difficult over time, and (iii) handcrafted rules are often incomplete and perform poorly in new domains comprised of specialized vocabularies and different genres of text. This paper presents a two-phase unsupervised method to build a WSD classifier for an ambiguous biomedical term W. The first phase automatically creates a sense-tagged corpus for W, and the second phase derives a classifier for W using the derived sense-tagged corpus as a training set. A formative experiment was performed, which demonstrated that classifiers trained on the derived sense-tagged corpora achieved an overall accuracy of about 97%, with greater than 90% accuracy for each individual ambiguous term.

  20. Blood transfusion and resuscitation using penile corpora: an experimental study.

    PubMed

    Abolyosr, Ahmad; Sayed, M A; Elanany, Fathy; Smeika, M A; Shaker, S E

    2005-10-01

    To test the feasibility of using the penile corpora cavernosa for blood transfusion and resuscitation purposes. Three male donkeys were used for autologous blood transfusion into the corpus cavernosum during three sessions with a 1-week interval between each. Two blood units (450 mL each) were transfused per session to each donkey. Moreover, three dogs were bled up until a state of shock was produced. The mean arterial blood pressure decreased to 60 mm Hg. The withdrawn blood (mean volume 396.3 mL) was transfused back into their corpora cavernosa under 150 mm Hg pressure. Different transfusion parameters were assessed. The Assiut faculty of medicine ethical committee approved the study before its initiation. For the donkey model, the mean time of blood collection was 12 minutes. The mean time needed to establish corporal access was 22 seconds. The mean time of blood transfusion was 14.2 minutes. The mean rate of blood transfusion was 31.7 mL/min. Mild penile elongation with or without mild penile tumescence was observed on four occasions. All penile shafts returned spontaneously to their pretransfusion state at a maximum of 5 minutes after cessation of blood transfusion. No extravasation, hematoma formation, or color changes occurred. Regarding the dog model, the mean rate of transfusion was 35.2 mL/min. All dogs were resuscitated at the end of the transfusion. The corpus cavernosum is a feasible, simple, rapid, and effective alternative route for blood transfusion and venous access. It can be resorted to whenever necessary. It is a reliable means for volume replacement and resuscitation in males.

  1. Two-stage repair with long channel technique for primary severe hypospadias.

    PubMed

    Yang, Tianyou; Xie, Qigen; Liang, Qifeng; Xu, Yeqing; Su, Cheng

    2014-07-01

    To introduce a 2-stage repair with long channel technique for primary severe hypospadias. Between March 2010 and November 2013, 16 children with primary severe hypospadias underwent 2-stage repair with long channel technique. The technique applied in the first stage was almost the same as Bracka 2-stage repair. The second stage was usually performed 6 months later. A small transverse skin incision, distal to the meatal opening and about 1 cm in length, was made. Dissection was carried out deep to the surface of the corpora cavernosa and a plane between the subcutaneous tissue and corpora cavernosa was reached. A long channel between the subcutaneous tissue and corpora cavernosa was created from the para-meatus incision to the apex of the glans. A rectangular, pedicled scrotal septal skin flap was elevated and tubularized into a neourethra around a stenting tube. The neourethra was delivered through the subcutaneous channel and fixed at the apex of the glans. The mean operation time of the first and second stages was 65 and 55 minutes, respectively. The mean age at the first and second operation was 28 and 36 months, respectively. The mean follow-up was 10 months. No fistula, glans dehiscence, urethral stricture, or meatal stenosis was recorded. One scrotal surgical wound infection occurred after the second stage and healed successfully with antibiotic treatment. The overall cosmetic and functional outcomes after the second stage were excellent. Two-stage repair with long channel technique was applicable for primary severe hypospadias, with excellent short-term outcomes.

  2. Corpora and Data Preparation for Information Extraction

    DTIC Science & Technology

    1993-09-01

    technical publications in fields such as communications, airline transportation, rubber & plastics, and food marketing. The Japanese-language...types in the U.S., for example, avocado farms, electric popcorn popper sales, management consulting. The template-filling task required that products

  3. BioTextQuest(+): a knowledge integration platform for literature mining and concept discovery.

    PubMed

    Papanikolaou, Nikolas; Pavlopoulos, Georgios A; Pafilis, Evangelos; Theodosiou, Theodosios; Schneider, Reinhard; Satagopam, Venkata P; Ouzounis, Christos A; Eliopoulos, Aristides G; Promponas, Vasilis J; Iliopoulos, Ioannis

    2014-11-15

    The iterative process of finding relevant information in biomedical literature and performing bioinformatics analyses might result in an endless loop for an inexperienced user, considering the exponential growth of scientific corpora and the plethora of tools designed to mine PubMed(®) and related biological databases. Herein, we describe BioTextQuest(+), a web-based interactive knowledge exploration platform with significant advances to its predecessor (BioTextQuest), aiming to bridge processes such as bioentity recognition, functional annotation, document clustering and data integration towards literature mining and concept discovery. BioTextQuest(+) enables PubMed and OMIM querying, retrieval of abstracts related to a targeted request and optimal detection of genes, proteins, molecular functions, pathways and biological processes within the retrieved documents. The front-end interface facilitates the browsing of document clustering per subject, the analysis of term co-occurrence, the generation of tag clouds containing highly represented terms per cluster and at-a-glance popup windows with information about relevant genes and proteins. Moreover, to support experimental research, BioTextQuest(+) addresses integration of its primary functionality with biological repositories and software tools able to deliver further bioinformatics services. The Google-like interface extends beyond simple use by offering a range of advanced parameterization for expert users. We demonstrate the functionality of BioTextQuest(+) through several exemplary research scenarios including author disambiguation, functional term enrichment, knowledge acquisition and concept discovery linking major human diseases, such as obesity and ageing. The service is accessible at http://bioinformatics.med.uoc.gr/biotextquest. Contact: g.pavlopoulos@gmail.com or georgios.pavlopoulos@esat.kuleuven.be. Supplementary data are available at Bioinformatics online.

  4. Automatic Parsing of Parental Verbal Input

    PubMed Central

    Sagae, Kenji; MacWhinney, Brian; Lavie, Alon

    2006-01-01

    To evaluate theoretical proposals regarding the course of child language acquisition, researchers often need to rely on the processing of large numbers of syntactically parsed utterances, both from children and their parents. Because it is so difficult to do this by hand, there are currently no parsed corpora of child language input data. To automate this process, we developed a system that combined the MOR tagger, a rule-based parser, and statistical disambiguation techniques. The resultant system obtained nearly 80% correct parses for the sentences spoken to children. To achieve this level, we had to construct a particular processing sequence that minimizes problems caused by the coverage/ambiguity trade-off in parser design. These procedures are particularly appropriate for use with the CHILDES database, an international corpus of transcripts. The data and programs are now freely available over the Internet. PMID:15190707

  5. Selective Arterial Embolization of Idiopathic Priapism

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Cohen, Gary S.; Braunstein, Larry; Ball, David S.

    1996-11-15

    We report a case of idiopathic priapism that was only identified as high-flow or arterial priapism after drainage of the corpora cavernosa. Following failure of conservative and surgical treatment attempts, two consecutive embolizations of a unilateral penile artery were performed with Gelfoam particles.

  6. TNF-alpha infusion impairs corpora cavernosa reactivity.

    PubMed

    Carneiro, Fernando S; Zemse, Saiprazad; Giachini, Fernanda R C; Carneiro, Zidonia N; Lima, Victor V; Webb, R Clinton; Tostes, Rita C

    2009-03-01

    Erectile dysfunction (ED), as well as cardiovascular diseases (CVDs), is associated with endothelial dysfunction and increased levels of proinflammatory cytokines, such as tumor necrosis factor-alpha (TNF-alpha). We hypothesized that increased TNF-alpha levels impair cavernosal function. In vitro organ bath studies were used to measure cavernosal reactivity in mice infused with vehicle or TNF-alpha (220 ng/kg/min) for 14 days. Gene expression of nitric oxide synthase isoforms was evaluated by real-time polymerase chain reaction. Corpora cavernosa from TNF-alpha-infused mice exhibited decreased nitric oxide (NO)-dependent relaxation, which was associated with decreased endothelial nitric oxide synthase (eNOS) and neuronal nitric oxide synthase (nNOS) cavernosal expression. Cavernosal strips from the TNF-alpha-infused mice displayed decreased nonadrenergic-noncholinergic (NANC)-induced relaxation (59.4 +/- 6.2 vs. control: 76.2 +/- 4.7; 16 Hz) compared with the control animals. These responses were associated with decreased gene expression of eNOS and nNOS (P < 0.05). Sympathetic-mediated, as well as phenylephrine (PE)-induced, contractile responses (PE-induced contraction; 1.32 +/- 0.06 vs. control: 0.9 +/- 0.09, mN) were increased in cavernosal strips from TNF-alpha-infused mice. Additionally, infusion of TNF-alpha increased cavernosal responses to endothelin-1 and endothelin receptor A subtype (ET(A)) receptor expression (P < 0.05) and slightly decreased tumor necrosis factor-alpha receptor 1 (TNFR1) expression (P = 0.063). Corpora cavernosa from TNF-alpha-infused mice display increased contractile responses and decreased NANC nerve-mediated relaxation associated with decreased eNOS and nNOS gene expression. These changes may trigger ED and indicate that TNF-alpha plays a detrimental role in erectile function. Blockade of TNF-alpha actions may represent an alternative therapeutic approach for ED, especially in pathologic conditions associated with increased levels of this cytokine.

  7. Assessing the readability of ClinicalTrials.gov

    PubMed Central

    Wu, Danny TY; Hanauer, David A; Mei, Qiaozhu; Clark, Patricia M; An, Lawrence C; Proulx, Joshua; Zeng, Qing T; Vydiswaran, VG Vinod; Collins-Thompson, Kevyn

    2016-01-01

    Objective: ClinicalTrials.gov serves critical functions of disseminating trial information to the public and helping the trials recruit participants. This study assessed the readability of trial descriptions at ClinicalTrials.gov using multiple quantitative measures. Materials and Methods: The analysis included all 165,988 trials registered at ClinicalTrials.gov as of April 30, 2014. To obtain benchmarks, the authors also analyzed 2 other medical corpora: (1) all 955 Health Topics articles from MedlinePlus and (2) a random sample of 100,000 clinician notes retrieved from an electronic health records system intended for conveying internal communication among medical professionals. The authors characterized each of the corpora using 4 surface metrics, and then applied 5 different scoring algorithms to assess their readability. The authors hypothesized that clinician notes would be most difficult to read, followed by trial descriptions and MedlinePlus Health Topics articles. Results: Trial descriptions have the longest average sentence length (26.1 words) across all corpora; 65% of the words they use are not covered by a basic medical English dictionary. In comparison, average sentence length of MedlinePlus Health Topics articles is 61% shorter, vocabulary size is 95% smaller, and dictionary coverage is 46% higher. All 5 scoring algorithms consistently rated ClinicalTrials.gov trial descriptions the most difficult corpus to read, even harder than clinician notes. On average, it requires 18 years of education to properly understand these trial descriptions according to the results generated by the readability assessment algorithms. Discussion and Conclusion: Trial descriptions at ClinicalTrials.gov are extremely difficult to read. Significant work is warranted to improve their readability in order to achieve ClinicalTrials.gov’s goal of facilitating information dissemination and subject recruitment. PMID:26269536
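
    The abstract above names three of the surface measures it reports (average sentence length, vocabulary size, dictionary coverage) without giving their exact definitions. The following is a minimal Python sketch of how such measures could be computed; the regex tokenization, the `surface_metrics` helper and the tiny `BASIC_WORDS` dictionary are illustrative assumptions, not the authors' method.

```python
import re


def surface_metrics(text: str, dictionary: set[str]) -> dict[str, float]:
    """Compute three simple surface metrics for a text (illustrative only)."""
    # Naive sentence split on terminal punctuation; real systems use better segmenters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive word tokenization: lowercase runs of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    covered = sum(1 for word in words if word in dictionary)
    return {
        "avg_sentence_length": len(words) / len(sentences),
        "vocabulary_size": len(set(words)),
        "dictionary_coverage": covered / len(words),
    }


# Hypothetical stand-in for a basic medical English dictionary.
BASIC_WORDS = {"the", "trial", "tests", "a", "new", "is", "it", "safe"}

sample = "The trial tests a new anticoagulant. Is it safe?"
print(surface_metrics(sample, BASIC_WORDS))
```

    In this toy sample, "anticoagulant" is the only word missing from the stand-in dictionary, so coverage falls just below 1.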

  8. Abnormal morphology of the penis in male rats exposed neonatally to diethylstilbestrol is associated with altered profile of estrogen receptor-alpha protein, but not of androgen receptor protein: a developmental and immunocytochemical study.

    PubMed

    Goyal, H O; Braden, T D; Williams, C S; Dalvi, P; Mansour, M M; Mansour, M; Williams, J W; Bartol, F F; Wiley, A A; Birch, L; Prins, G S

    2004-05-01

    Objectives of the study were to determine developmental changes in morphology and expression of androgen receptor (AR) and estrogen receptor (ER)alpha in the body of the rat penis exposed neonatally to diethylstilbestrol (DES). Male pups received DES at a dose of 10 microg per rat on alternate days from Postnatal Day 2 to Postnatal Day 12. Controls received olive oil vehicle only. Tissue samples were collected on Days 18 (prepuberty), 41 (puberty), and 120 (adult) of age. DES-induced abnormalities were evident at 18 days of age and included smaller, lighter, and thinner penis, loss of cavernous spaces and associated smooth muscle cells, and increased deposition of fat cells in the corpora cavernosa penis. Fat cells virtually filled the entire area of the corpora cavernosa at puberty and adulthood. Plasma testosterone (T) was reduced to an undetectable level, while LH was unaltered in all treated groups. AR-positive cells were ubiquitous and their profile (incidence and staining intensity) did not differ between control and treated rats of the respective age groups. Conversely, ERalpha-positive cells were limited to the stroma of the corpus spongiosum in all age groups of both control and treated rats, but the expression in treated rats at 18 days was up-regulated in stromal cells of the corpora cavernosa, coincident with the presence of morphological abnormalities. Hence, this study reports for the first time DES-induced developmental, morphological abnormalities in the body of the penis and suggests that these abnormalities may have resulted from decreased T and/or overexpression of ERalpha.

  9. Timing of mating and ovarian response in llamas (Lama glama) treated with pFSH.

    PubMed

    Ratto, M H; Gatica, R; Correa, J E

    1997-08-01

    The effect of the timing of mating on ovarian response in llamas was evaluated using 20 adult llamas weighing 90-120 kg which had been in oestrus for 5 days and were treated with 20 mg pFSH every 12 h for the following 5 days (total dose: 200 mg of FSH-NIH-P1). They were randomly allocated to Group A (n = 10) and mated immediately at the end of pFSH treatment or to Group B (n = 10) and mated 36 h after the end of pFSH treatment. Llamas of both groups were given hCG (750 iu, i.m.) immediately after mating. A second mating was allowed 12 h later. Ova and embryos were recovered by non-surgical uterine flushing 7 days after the first mating. Ovarian response was evaluated immediately afterwards via laparoscopy. The mean ovulation rate of 4.5 corpora lutea for Group A was significantly lower (P < 0.01) than the mean of 13.8 observed for Group B. The total ovarian response (number of corpora lutea + follicles > 10 mm) was also significantly higher (P < 0.01) in Group B than in Group A. Twenty-seven ova were recovered in each group, corresponding to 60% and 20% (P < 0.01) of the corpora lutea observed in Groups A and B, respectively; however, no significant difference (P > 0.05) in fertilisation rate was observed. The results show that pFSH induces superovulation in llamas treated during oestrus and that a 36-h interval between the end of FSH treatment and mating increases ovulation rate and the total ovarian response but does not affect the number of ova/embryos recovered.
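
    The recovery percentages quoted above can be checked from the other figures in the abstract. The short Python sketch below assumes (the abstract does not say so explicitly) that each percentage is the number of ova recovered divided by the total corpora lutea in the group, with that total taken as the mean ovulation rate times the 10 llamas per group; the names are illustrative.

```python
# Figures quoted in the abstract; the formula itself is an assumption.
LLAMAS_PER_GROUP = 10
MEAN_CORPORA_LUTEA = {"A": 4.5, "B": 13.8}
OVA_RECOVERED = {"A": 27, "B": 27}


def recovery_rate(group: str) -> float:
    """Ova recovered as a fraction of the group's estimated total corpora lutea."""
    total_corpora_lutea = MEAN_CORPORA_LUTEA[group] * LLAMAS_PER_GROUP
    return OVA_RECOVERED[group] / total_corpora_lutea


for group in ("A", "B"):
    print(f"Group {group}: {recovery_rate(group):.1%}")
```

    Under that assumption the rates come out at 27/45 = 60% for Group A and 27/138, roughly 20%, for Group B, matching the reported values.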

  10. Losartan, an Angiotensin type I receptor, restores erectile function by downregulation of cavernous renin-angiotensin system in streptozocin-induced diabetic rats.

    PubMed

    Yang, Rong; Yang, Bin; Wen, Yanting; Fang, Feng; Cui, Souxi; Lin, Guiting; Sun, Zeyu; Wang, Run; Dai, Yutian

    2009-03-01

    The high incidence of erectile dysfunction (ED) in diabetes highlights the need for good treatment strategies. Recent evidence indicates that blockade of the angiotensin type I receptor (AT1) may reverse ED from various diseases. To explore the role of cavernous renin-angiotensin system (RAS) in the pathogenesis of diabetic ED and the role of losartan in the treatment of diabetic ED. The AT1 blocker (ARB) losartan (30 mg/kg/d) was administered to rats with streptozocin (65 mg/kg)-induced diabetes. Erectile function, cavernous structure, and tissue gene and protein expression of RAS in the corpora cavernosa were studied. We sought to determine the changes of cavernous RAS in the condition of diabetes and after treatment with losartan. RAS components (angiotensinogen, [pro]renin receptor, angiotensin-converting enzyme [ACE], and AT1) were expressed in cavernosal tissue. In diabetic rats, RAS components were upregulated, resulting in the increased concentration of angiotensin II (Ang II) in the corpora. A positive feedback loop for Ang II formation in cavernosum was also identified, which could contribute to overactivity of cavernous RAS in diabetic rats. Administration of losartan blocked the effect of Ang II, downregulated the expression of AT1 and Ang II generated locally, and partially restored erectile function: the losartan-treated group showed an improved intracavernous pressure/mean systemic arterial pressure ratio compared with the diabetic group (0.480 +/- 0.031 vs. 0.329 +/- 0.020, P < 0.01). However, losartan could not elevate the reduced smooth muscle/collagen ratio in diabetic rats. The cavernous RAS plays a role in modulating erectile function in corpora cavernosa and is involved in the pathogenesis of diabetic ED. ARB can restore diabetic ED through downregulating cavernous RAS.

  11. Morphological and histological characters of penile organization in eleven species of molossid bats.

    PubMed

    Comelis, Manuela T; Bueno, Larissa M; Góes, Rejane M; Taboga, S R; Morielle-Versute, Eliana

    2018-04-01

    The penis is the reproductive organ that ensures efficient copulation and success of internal fertilization in all species of mammals, with special challenges for bats, where copulation can occur during flight. Comparative anatomical analyses of different species of bats can contribute to a better understanding of morphological diversity of this organ, concerning organization and function. In this study, we describe the external morphology and histomorphology of the penis and baculum in eleven species of molossid bats. The present study showed that penile organization in these species displayed the basic vascular mammalian pattern and had a similar pattern concerning the presence of the tissues constituting the penis, exhibiting three types of erectile tissue (the corpus cavernosum, accessory cavernous tissue, and corpus spongiosum) around the urethra. However, certain features varied among the species, demonstrating that most species are distinguishable by glans and baculum morphology and glans histological organization. Major variations in glans morphology were genus-specific, and the greatest similarities were shared by Eumops species and N. laticaudatus. The greatest interspecific similarities occurred between M. molossus and M. rufus and between Eumops species. Save for M. molossus and M. rufus, morphology of the baculum was species-specific; and in E. perotis, it did not occur in all specimens, indicating that it is probably under selection. In the histological organization, the most evident differences were number of septa and localization of the corpora cavernosa. In species with a baculum (Molossus, Eumops and Nyctinomops species), the corpora cavernosa predominantly occupied the dorsal region of the penile glans and is associated with the proximal (basal) portion of the baculum. In species that do not have a baculum (Cynomops, Molossops and Neoplatymops species), the corpora cavernosa predominantly occupied the ventro-lateral region of the glans. 
Copyright © 2018 Elsevier GmbH. All rights reserved.

  12. Building an ontology of pulmonary diseases with natural language processing tools using textual corpora.

    PubMed

    Baneyx, Audrey; Charlet, Jean; Jaulent, Marie-Christine

    2007-01-01

    Pathologies and acts are classified in thesauri to help physicians to code their activity. In practice, the use of thesauri is not sufficient to reduce variability in coding and thesauri are not suitable for computer processing. We think the automation of the coding task requires a conceptual modeling of medical items: an ontology. Our task is to help lung specialists code acts and diagnoses with software that represents medical knowledge of this concerned specialty by an ontology. The objective of the reported work was to build an ontology of pulmonary diseases dedicated to the coding process. To carry out this objective, we develop a precise methodological process for the knowledge engineer in order to build various types of medical ontologies. This process is based on the need to express precisely in natural language the meaning of each concept using differential semantics principles. A differential ontology is a hierarchy of concepts and relationships organized according to their similarities and differences. Our main research hypothesis is to apply natural language processing tools to corpora to develop the resources needed to build the ontology. We consider two corpora, one composed of patient discharge summaries and the other being a teaching book. We propose to combine two approaches to enrich the ontology building: (i) a method which consists of building terminological resources through distributional analysis and (ii) a method based on the observation of corpus sequences in order to reveal semantic relationships. Our ontology currently includes 1550 concepts and the software implementing the coding process is still under development. Results show that the proposed approach is operational and indicates that the combination of these methods and the comparison of the resulting terminological structures give interesting clues to a knowledge engineer for the building of an ontology.

  13. Creating Realistic Corpora for Security and Forensic Education

    DTIC Science & Technology

    2011-05-01

    School of Information and Library Science University of North Carolina Chapel Hill, NC kamwoods@email.unc.edu Christopher A. Lee School of...Information and Library Science University of North Carolina Chapel Hill, NC callee@ils.unc.edu Simson Garfinkel Graduate School of Operational and

  14. Translation Ambiguity in and out of Context

    ERIC Educational Resources Information Center

    Prior, Anat; Wintner, Shuly; MacWhinney, Brian; Lavie, Alon

    2011-01-01

    We compare translations of single words, made by bilingual speakers in a laboratory setting, with contextualized translation choices of the same items, made by professional translators and extracted from parallel language corpora. The translation choices in both cases show moderate convergence, demonstrating that decontextualized translation…

  15. BAAL/CUP Seminars 2009

    ERIC Educational Resources Information Center

    Cutting, Joan; Murphy, Brona

    2010-01-01

    The seminar, organised by Joan Cutting and Brona Murphy, aimed: (1) to bring together researchers involved in both emergent and established academic corpora (written and spoken) as well as linguists, lecturers and teachers researching in education, be it language teaching, language-teacher training or continuing professional development in…

  16. 31 CFR 358.7 - Where do I send my bearer corpora and detached bearer coupons to be converted?

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... detached bearer coupons to be converted to: Bureau of the Public Debt, Division of Customer Service, P. O... Relating to Money and Finance (Continued) FISCAL SERVICE, DEPARTMENT OF THE TREASURY BUREAU OF THE PUBLIC...

  17. Accessory corpora lutea formation in pregnant Hokkaido sika deer (Cervus nippon yesoensis) investigated by examination of ovarian dynamics and steroid hormone concentrations.

    PubMed

    Yanagawa, Yojiro; Matsuura, Yukiko; Suzuki, Masatsugu; Saga, Shin-Ichi; Okuyama, Hideto; Fukui, Daisuke; Bando, Gen; Nagano, Masashi; Katagiri, Seiji; Takahashi, Yoshiyuki; Tsubota, Toshio

    2015-01-01

    Generally, sika deer conceive a single fetus, but approximately 80% of pregnant females have two corpora lutea (CLs). The function of the accessory CL (ACL) is unknown; moreover, the process of ACL formation is unclear, and understanding this is necessary to know its role. To elucidate the process of ACL formation, the ovarian dynamics of six adult Hokkaido sika deer females were examined ultrasonographically together with peripheral estradiol-17β and progesterone concentrations. ACLs formed in three females that conceived at the first estrus of the breeding season, but not in those females that conceived at the second estrus. After copulation, postconception ovulation of the dominant follicle of the first wave is induced by an increase in estradiol-17β, which leads to formation of an ACL. A relatively low concentration of progesterone after the first estrus of the breeding season is considered to be responsible for the increase in estradiol-17β after copulation.

  18. Accessory corpora lutea formation in pregnant Hokkaido sika deer (Cervus nippon yesoensis) investigated by examination of ovarian dynamics and steroid hormone concentrations

    PubMed Central

    YANAGAWA, Yojiro; MATSUURA, Yukiko; SUZUKI, Masatsugu; SAGA, Shin-ichi; OKUYAMA, Hideto; FUKUI, Daisuke; BANDO, Gen; NAGANO, Masashi; KATAGIRI, Seiji; TAKAHASHI, Yoshiyuki; TSUBOTA, Toshio

    2014-01-01

    Generally, sika deer conceive a single fetus, but approximately 80% of pregnant females have two corpora lutea (CLs). The function of the accessory CL (ACL) is unknown; moreover, the process of ACL formation is unclear, and understanding this is necessary to know its role. To elucidate the process of ACL formation, the ovarian dynamics of six adult Hokkaido sika deer females were examined ultrasonographically together with peripheral estradiol-17β and progesterone concentrations. ACLs formed in three females that conceived at the first estrus of the breeding season, but not in those females that conceived at the second estrus. After copulation, postconception ovulation of the dominant follicle of the first wave is induced by an increase in estradiol-17β, which leads to formation of an ACL. A relatively low concentration of progesterone after the first estrus of the breeding season is considered to be responsible for the increase in estradiol-17β after copulation. PMID:25482110

  19. Rhythm histograms and musical meter: A corpus study of Malian percussion music.

    PubMed

    London, Justin; Polak, Rainer; Jacoby, Nori

    2017-04-01

    Studies of musical corpora have given empirical grounding to the various features that characterize particular musical styles and genres. Palmer & Krumhansl (1990) found that in Western classical music the likeliest places for a note to occur are the most strongly accented beats in a measure, and this was also found in subsequent studies using both Western classical and folk music corpora (Huron & Ommen, 2006; Temperley, 2010). We present a rhythmic analysis of a corpus of 15 performances of percussion music from Bamako, Mali. In our corpus, the relative frequency of note onsets in a given metrical position does not correspond to patterns of metrical accent, though there is a stable relationship between onset frequency and metrical position. The implications of this non-congruence between simple statistical likelihood and metrical structure for the ways in which meter and metrical accent may be learned and understood are discussed, along with importance of cross-cultural studies for psychological research.
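
    The onset-frequency measure described in this abstract can be illustrated with a minimal sketch. It assumes onsets have already been quantized to integer grid positions; the function name, the four-position cycle, and the toy onset list are hypothetical and are not taken from the study.

```python
from collections import Counter


def onset_histogram(onsets, positions_per_cycle):
    """Return the relative frequency of note onsets at each metrical position.

    `onsets` holds integer grid indices (an assumption of this sketch);
    each index is folded into one metrical cycle with the modulo operator.
    """
    counts = Counter(pos % positions_per_cycle for pos in onsets)
    total = sum(counts.values())
    if total == 0:
        # No onsets: return an all-zero histogram instead of dividing by zero.
        return [0.0] * positions_per_cycle
    return [counts.get(p, 0) / total for p in range(positions_per_cycle)]


# Hypothetical onset grid indices, for illustration only.
toy_onsets = [0, 2, 3, 4, 6, 7, 8, 10, 11, 14, 15, 3]
hist = onset_histogram(toy_onsets, 4)
for position, share in enumerate(hist):
    print(f"position {position}: {share:.3f}")
```

    In this toy input the most frequent position is the last one in the cycle rather than the first, which is the kind of mismatch between onset frequency and metrical accent that the abstract reports.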

  20. Association between erythrocyte 2,3-diphosphoglycerate levels and reproduction capacity in Long-Evans rats.

    PubMed

    Noble, N A; Brewer, G J

    1982-03-01

    During genetic selection of rats for high and low levels of red cell 2,3-diphosphoglycerate (DPG) the decreased fertility in Low-DPG animals was due to significantly (P less than 0.01) fewer offspring born per litter. The rat lines were intercrossed and animals at the tails of the F2 2,3-diphosphoglycerate distribution were mated. Subsequent matings of F3 offspring were monitored. Low-DPG F3 pregnant females killed at 20 days of gestation showed significantly (P less than 0.05) fewer corpora lutea than High-DPG F3 females. There were also significantly (P less than 0.01) fewer corpora lutea in Low-DPG line rats compared to High-DPG rats. It is concluded that the relationship between 2,3-diphosphoglycerate levels and fertility is not due to inbreeding but to a possible genetic linkage, a shared biochemical determinant or a relationship through the effect of 2,3-diphosphoglycerate levels on oxygen delivery to tissue.

  1. Specification of Drosophila Corpora Cardiaca Neuroendocrine Cells from Mesoderm Is Regulated by Notch Signaling

    PubMed Central

    Park, Sangbin; Bustamante, Erika L.; Antonova, Julie; McLean, Graeme W.; Kim, Seung K.

    2011-01-01

    Drosophila neuroendocrine cells comprising the corpora cardiaca (CC) are essential for systemic glucose regulation and represent functional orthologues of vertebrate pancreatic α-cells. Although Drosophila CC cells have been regarded as developmental orthologues of pituitary gland, the genetic regulation of CC development is poorly understood. From a genetic screen, we identified multiple novel regulators of CC development, including Notch signaling factors. Our studies demonstrate that the disruption of Notch signaling can lead to the expansion of CC cells. Live imaging demonstrates localized emergence of extra precursor cells as the basis of CC expansion in Notch mutants. Contrary to a recent report, we unexpectedly found that CC cells originate from head mesoderm. We show that Tinman expression in head mesoderm is regulated by Notch signaling and that the combination of Daughterless and Tinman is sufficient for ectopic CC specification in mesoderm. Understanding the cellular, genetic, signaling, and transcriptional basis of CC cell specification and expansion should accelerate discovery of molecular mechanisms regulating ontogeny of organs that control metabolism. PMID:21901108

  2. Trends in gel dosimetry: Preliminary bibliometric overview of active growth areas, research trends and hot topics from Gore’s 1984 paper onwards

    NASA Astrophysics Data System (ADS)

    Baldock, C.

    2017-05-01

    John Gore’s seminal 1984 paper on gel dosimetry spawned a vibrant research field ranging from fundamental science through to clinical applications. A preliminary bibliometric study was undertaken of the gel dosimetry family of publications inspired by, and resulting from, Gore’s original 1984 paper to determine active growth areas, research trends and hot topics from Gore’s paper up to and including 2016. Themes and trends of the gel dosimetry research field were bibliometrically explored by way of co-occurrence term maps using the titles and abstracts text corpora from the Web of Science database for all relevant papers from 1984 to 2016. Visualisation of similarities was used by way of the VOSviewer visualisation tool to generate cluster maps of gel dosimetry knowledge domains and the associated citation impact of topics within the domains. Heat maps were then generated to assist in the understanding of active growth areas, research trends, and emerging and hot topics in gel dosimetry.

  3. Ontology construction and application in practice case study of health tourism in Thailand.

    PubMed

    Chantrapornchai, Chantana; Choksuchat, Chidchanok

    2016-01-01

    Ontology is one of the key components in semantic webs. It contains the core knowledge for an effective search. However, building an ontology requires carefully collected knowledge, which is very domain-sensitive. In this work, we present the practice of ontology construction for a case study of health tourism in Thailand. The whole process follows the METHONTOLOGY approach, which consists of the following phases: information gathering, corpus study, ontology engineering, evaluation, publishing, and application construction. Different sources of data, such as structured web documents like HTML and other documents, are acquired in the information gathering process. The tourism corpora from various tourism texts and standards are explored. The ontology is evaluated in two ways: automatic reasoning using Pellet and RacerPro, and questionnaires completed by two groups of experts, namely tourism domain experts and ontology experts. The ontology usability is demonstrated via the semantic web application and via example axioms. The developed ontology is the first health tourism ontology in Thailand with a published application.

  4. Estimating affective word covariates using word association data.

    PubMed

    Van Rensbergen, Bram; De Deyne, Simon; Storms, Gert

    2016-12-01

    Word ratings on affective dimensions are an important tool in psycholinguistic research. Traditionally, they are obtained by asking participants to rate words on each dimension, a time-consuming procedure. As such, there has been some interest in computationally generating norms, by extrapolating words' affective ratings using their semantic similarity to words for which these values are already known. So far, most attempts have derived similarity from word co-occurrence in text corpora. In the current paper, we obtain similarity from word association data. We use these similarity ratings to predict the valence, arousal, and dominance of 14,000 Dutch words with the help of two extrapolation methods: Orientation towards Paradigm Words and k-Nearest Neighbors. The resulting estimates show very high correlations with human ratings when using Orientation towards Paradigm Words, and even higher correlations when using k-Nearest Neighbors. We discuss possible theoretical accounts of our results and compare our findings with previous attempts at computationally generating affective norms.
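
    The k-Nearest Neighbors extrapolation mentioned in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: similarity is taken as the cosine between association-response count vectors, the estimate is an unweighted mean over the k most similar rated words, and every word, count, and rating below is invented for the example.

```python
import math


def cosine(counts_a, counts_b):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(v * counts_b.get(k, 0) for k, v in counts_a.items())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_estimate(target, associations, known_ratings, k=2):
    """Estimate a rating for `target` as the mean rating of its k most
    similar words among those with known ratings (unweighted: an assumption)."""
    candidates = [w for w in known_ratings if w != target]
    candidates.sort(
        key=lambda w: cosine(associations[target], associations[w]),
        reverse=True,
    )
    neighbors = candidates[:k]
    return sum(known_ratings[w] for w in neighbors) / len(neighbors)


# Hypothetical association counts (cue -> response -> frequency).
associations = {
    "sunny": {"warm": 3, "beach": 2},
    "holiday": {"beach": 3, "warm": 1, "fun": 2},
    "party": {"fun": 3, "music": 2},
    "funeral": {"sad": 3, "black": 2},
    "storm": {"dark": 2, "sad": 1, "rain": 3},
    "vacation": {"beach": 2, "fun": 2, "warm": 1},
}
# Hypothetical valence ratings on a -1..1 scale for the already-rated words.
known_valence = {
    "sunny": 0.8, "holiday": 0.9, "party": 0.7, "funeral": -0.9, "storm": -0.5,
}

estimate = knn_estimate("vacation", associations, known_valence, k=2)
print(f"estimated valence for 'vacation': {estimate:.2f}")
```

    The same function would apply to arousal or dominance by swapping in a different dictionary of known ratings; weighting neighbors by similarity is a common variant that this sketch leaves out.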

  5. Context-Aware Adaptive Hybrid Semantic Relatedness in Biomedical Science

    NASA Astrophysics Data System (ADS)

    Emadzadeh, Ehsan

    Text mining of biomedical literature and clinical notes is a very active field of research in biomedical science. Semantic analysis is one of the core modules for different Natural Language Processing (NLP) solutions. Methods for calculating semantic relatedness of two concepts can be very useful in solutions to different problems such as relationship extraction, ontology creation and question answering [1--6]. Several techniques exist for calculating semantic relatedness of two concepts. These techniques utilize different knowledge sources and corpora. So far, researchers have attempted to find the best hybrid method for each domain by combining semantic relatedness techniques and data sources manually. In this work, attempts were made to eliminate the need for manually combining semantic relatedness methods when targeting new contexts or resources by proposing an automated method, which attempted to find the best combination of semantic relatedness techniques and resources to achieve the best semantic relatedness score in every context. This may help the research community find the best hybrid method for each context considering the available algorithms and resources.

  6. Progressive Learning of Topic Modeling Parameters: A Visual Analytics Framework.

    PubMed

    El-Assady, Mennatallah; Sevastjanova, Rita; Sperrle, Fabian; Keim, Daniel; Collins, Christopher

    2018-01-01

    Topic modeling algorithms are widely used to analyze the thematic composition of text corpora but remain difficult to interpret and adjust. Addressing these limitations, we present a modular visual analytics framework, tackling the understandability and adaptability of topic models through a user-driven reinforcement learning process which does not require a deep understanding of the underlying topic modeling algorithms. Given a document corpus, our approach initializes two algorithm configurations based on a parameter space analysis that enhances document separability. We abstract the model complexity in an interactive visual workspace for exploring the automatic matching results of two models, investigating topic summaries, analyzing parameter distributions, and reviewing documents. The main contribution of our work is an iterative decision-making technique in which users provide a document-based relevance feedback that allows the framework to converge to a user-endorsed topic distribution. We also report feedback from a two-stage study which shows that our technique results in topic model quality improvements on two independent measures.

  7. Machine learning with naturally labeled data for identifying abbreviation definitions.

    PubMed

    Yeganova, Lana; Comeau, Donald C; Wilbur, W John

    2011-06-09

    The rapid growth of biomedical literature requires accurate text analysis and text processing tools. Detecting abbreviations and identifying their definitions is an important component of such tools. Most existing approaches for the abbreviation definition identification task employ rule-based methods. While achieving high precision, rule-based methods are limited to the rules defined and fail to capture many uncommon definition patterns. Supervised learning techniques, which offer more flexibility in detecting abbreviation definitions, have also been applied to the problem. However, they require manually labeled training data. In this work, we develop a machine learning algorithm for abbreviation definition identification in text which makes use of what we term naturally labeled data. Positive training examples are naturally occurring potential abbreviation-definition pairs in text. Negative training examples are generated by randomly mixing potential abbreviations with unrelated potential definitions. The machine learner is trained to distinguish between these two sets of examples. Then, the learned feature weights are used to identify the abbreviation full form. This approach does not require manually labeled training data. We evaluate the performance of our algorithm on the Ab3P, BIOADI and Medstract corpora. Our system demonstrated results that compare favourably to the existing Ab3P and BIOADI systems. We achieve an F-measure of 91.36% on Ab3P corpus, and an F-measure of 87.13% on BIOADI corpus which are superior to the results reported by Ab3P and BIOADI systems. Moreover, we outperform these systems in terms of recall, which is one of our goals.
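
    The construction of "naturally labeled" training data described in this abstract can be sketched as below. This is only an illustration of the idea of pairing each abbreviation with a randomly chosen unrelated definition; the function name, the seed parameter, and the toy pairs are assumptions, and the feature extraction and learner used by the authors are not reproduced here.

```python
import random


def make_training_pairs(positive_pairs, seed=0):
    """Build labeled examples from (abbreviation, definition) pairs.

    Positives (label 1) are the pairs as they occur in text. Negatives
    (label 0) pair each abbreviation with a randomly drawn definition
    that belongs to a different pair.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    definitions = [definition for _, definition in positive_pairs]
    examples = [(abbr, definition, 1) for abbr, definition in positive_pairs]
    for abbr, true_definition in positive_pairs:
        unrelated = [d for d in definitions if d != true_definition]
        if unrelated:
            examples.append((abbr, rng.choice(unrelated), 0))
    return examples


# Hypothetical abbreviation-definition pairs, for illustration only.
pairs = [
    ("HMM", "hidden Markov model"),
    ("EHR", "electronic health record"),
    ("NLP", "natural language processing"),
]
training = make_training_pairs(pairs)
for abbr, definition, label in training:
    print(label, abbr, "->", definition)
```

    A classifier trained on such examples never sees hand-annotated labels, which is the property the abstract emphasizes.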

  8. Discovering body site and severity modifiers in clinical texts

    PubMed Central

    Dligach, Dmitriy; Bethard, Steven; Becker, Lee; Miller, Timothy; Savova, Guergana K

    2014-01-01

    Objective To research computational methods for discovering body site and severity modifiers in clinical texts. Methods We cast the task of discovering body site and severity modifiers as a relation extraction problem in the context of a supervised machine learning framework. We utilize rich linguistic features to represent the pairs of relation arguments and delegate the decision about the nature of the relationship between them to a support vector machine model. We evaluate our models using two corpora that annotate body site and severity modifiers. We also compare the model performance to a number of rule-based baselines. We conduct cross-domain portability experiments. In addition, we carry out feature ablation experiments to determine the contribution of various feature groups. Finally, we perform error analysis and report the sources of errors. Results The performance of our method for discovering body site modifiers achieves F1 of 0.740–0.908 and our method for discovering severity modifiers achieves F1 of 0.905–0.929. Discussion Results indicate that both methods perform well on both in-domain and out-domain data, approaching the performance of human annotators. The most salient features are token and named entity features, although syntactic dependency features also contribute to the overall performance. The dominant sources of errors are infrequent patterns in the data and inability of the system to discern deeper semantic structures. Conclusions We investigated computational methods for discovering body site and severity modifiers in clinical texts. Our best system is released open source as part of the clinical Text Analysis and Knowledge Extraction System (cTAKES). PMID:24091648

  9. Discovering body site and severity modifiers in clinical texts.

    PubMed

    Dligach, Dmitriy; Bethard, Steven; Becker, Lee; Miller, Timothy; Savova, Guergana K

    2014-01-01

    To research computational methods for discovering body site and severity modifiers in clinical texts. We cast the task of discovering body site and severity modifiers as a relation extraction problem in the context of a supervised machine learning framework. We utilize rich linguistic features to represent the pairs of relation arguments and delegate the decision about the nature of the relationship between them to a support vector machine model. We evaluate our models using two corpora that annotate body site and severity modifiers. We also compare the model performance to a number of rule-based baselines. We conduct cross-domain portability experiments. In addition, we carry out feature ablation experiments to determine the contribution of various feature groups. Finally, we perform error analysis and report the sources of errors. The performance of our method for discovering body site modifiers achieves F1 of 0.740-0.908 and our method for discovering severity modifiers achieves F1 of 0.905-0.929. Results indicate that both methods perform well on both in-domain and out-domain data, approaching the performance of human annotators. The most salient features are token and named entity features, although syntactic dependency features also contribute to the overall performance. The dominant sources of errors are infrequent patterns in the data and inability of the system to discern deeper semantic structures. We investigated computational methods for discovering body site and severity modifiers in clinical texts. Our best system is released open source as part of the clinical Text Analysis and Knowledge Extraction System (cTAKES).

  10. The Timing and Construction of Preference: A Quantitative Study

    ERIC Educational Resources Information Center

    Kendrick, Kobin H.; Torreira, Francisco

    2015-01-01

    Conversation-analytic research has argued that the timing and construction of preferred responding actions (e.g., acceptances) differ from that of dispreferred responding actions (e.g., rejections), potentially enabling early response prediction by recipients. We examined 195 preferred and dispreferred responding actions in telephone corpora and…

  11. Exploring Business Request Genres: Students' Rhetorical Choices

    ERIC Educational Resources Information Center

    Nguyen, Hai; Miller, Jennifer

    2012-01-01

    This article presents selective findings from an ongoing study that investigates rhetorical differences in business letter writing between Vietnamese students taking an English for Specific Purposes course in Vietnam and business professionals. Rhetorical analyses are based on two corpora, namely, scenario (N = 20) and authentic business letters…

  12. Modeling Spanish Mood Choice in Belief Statements

    ERIC Educational Resources Information Center

    Robinson, Jason R.

    2013-01-01

    This work develops a computational methodology new to linguistics that empirically evaluates competing linguistic theories on Spanish verbal mood choice through the use of computational techniques to learn mood and other hidden linguistic features from Spanish belief statements found in corpora. The machine learned probabilistic linguistic models…

  13. Uncertainties in Engineering Design. Mathematical Theory and Numerical Experience.

    DTIC Science & Technology

    1985-08-01

    Theoretical Manual, Noetic Technologies Corporation, St. Louis, Missouri, 1985. ...international center of study and research for foreign students in numerical mathematics who are supported by foreign governments or exchange agencies

  14. Mapping texts through dimensionality reduction and visualization techniques for interactive exploration of document collections

    NASA Astrophysics Data System (ADS)

    de Andrade Lopes, Alneu; Minghim, Rosane; Melo, Vinícius; Paulovich, Fernando V.

    2006-01-01

    The sheer volume of information currently available often impairs the tasks of searching, browsing and analyzing information pertinent to a topic of interest. This paper presents a methodology to create a meaningful graphical representation of document corpora targeted at supporting exploration of correlated documents. The purpose of such an approach is to produce a map from a document body on a research topic or field based on the analysis of their contents, and similarities amongst articles. The document map is generated, after text pre-processing, by projecting the data in two dimensions using Latent Semantic Indexing. The projection is followed by hierarchical clustering to support sub-area identification. The map can be interactively explored, helping to narrow down the search for relevant articles. Tests were performed using a collection of documents pre-classified into three research subject classes: Case-Based Reasoning, Information Retrieval, and Inductive Logic Programming. The map produced was capable of separating the main areas and approaching documents by their similarity, revealing possible topics, and identifying boundaries between them. The tool can deal with the exploration of inter-topic and intra-topic relationships and is useful in many contexts that require deciding on relevant articles to read, such as scientific research, education, and training.

  15. Linguistic positivity in historical texts reflects dynamic environmental and psychological factors.

    PubMed

    Iliev, Rumen; Hoover, Joe; Dehghani, Morteza; Axelrod, Robert

    2016-12-06

    People use more positive words than negative words. Referred to as "linguistic positivity bias" (LPB), this effect has been found across cultures and languages, prompting the conclusion that it is a panhuman tendency. However, although multiple competing explanations of LPB have been proposed, there is still no consensus on what mechanism(s) generate LPB or even on whether it is driven primarily by universal cognitive features or by environmental factors. In this work we propose that LPB has remained unresolved because previous research has neglected an essential dimension of language: time. In four studies conducted with two independent, time-stamped text corpora (Google books Ngrams and the New York Times), we found that LPB in American English has decreased during the last two centuries. We also observed dynamic fluctuations in LPB that were predicted by changes in objective environment, i.e., war and economic hardships, and by changes in national subjective happiness. In addition to providing evidence that LPB is a dynamic phenomenon, these results suggest that cognitive mechanisms alone cannot account for the observed dynamic fluctuations in LPB. At the least, LPB likely arises from multiple interacting mechanisms involving subjective, objective, and societal factors. In addition to having theoretical significance, our results demonstrate the value of newly available data sources in addressing long-standing scientific questions.

  16. Linguistic positivity in historical texts reflects dynamic environmental and psychological factors

    PubMed Central

    Iliev, Rumen; Hoover, Joe; Dehghani, Morteza

    2016-01-01

    People use more positive words than negative words. Referred to as “linguistic positivity bias” (LPB), this effect has been found across cultures and languages, prompting the conclusion that it is a panhuman tendency. However, although multiple competing explanations of LPB have been proposed, there is still no consensus on what mechanism(s) generate LPB or even on whether it is driven primarily by universal cognitive features or by environmental factors. In this work we propose that LPB has remained unresolved because previous research has neglected an essential dimension of language: time. In four studies conducted with two independent, time-stamped text corpora (Google books Ngrams and the New York Times), we found that LPB in American English has decreased during the last two centuries. We also observed dynamic fluctuations in LPB that were predicted by changes in objective environment, i.e., war and economic hardships, and by changes in national subjective happiness. In addition to providing evidence that LPB is a dynamic phenomenon, these results suggest that cognitive mechanisms alone cannot account for the observed dynamic fluctuations in LPB. At the least, LPB likely arises from multiple interacting mechanisms involving subjective, objective, and societal factors. In addition to having theoretical significance, our results demonstrate the value of newly available data sources in addressing long-standing scientific questions. PMID:27872286

  17. Encoding Sequential Information in Semantic Space Models: Comparing Holographic Reduced Representation and Random Permutation

    PubMed Central

    Recchia, Gabriel; Sahlgren, Magnus; Kanerva, Pentti; Jones, Michael N.

    2015-01-01

    Circular convolution and random permutation have each been proposed as neurally plausible binding operators capable of encoding sequential information in semantic memory. We perform several controlled comparisons of circular convolution and random permutation as means of encoding paired associates as well as encoding sequential information. Random permutations outperformed convolution with respect to the number of paired associates that can be reliably stored in a single memory trace. Performance was equal on semantic tasks when using a small corpus, but random permutations were ultimately capable of achieving superior performance due to their higher scalability to large corpora. Finally, “noisy” permutations in which units are mapped to other units arbitrarily (no one-to-one mapping) perform nearly as well as true permutations. These findings increase the neurological plausibility of random permutations and highlight their utility in vector space models of semantics. PMID:25954306
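
    The two binding operators compared in this abstract can be written down in a few lines. The sketch below is a plain-Python illustration of the operators themselves; the vector dimensionality, the Gaussian initialization, and all function names are assumptions, and none of the paper's experiments are reproduced.

```python
import math
import random


def random_vector(n, rng):
    """Random vector with entries drawn from N(0, 1/n), a common choice."""
    return [rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]


def circular_convolution(a, b):
    """Bind two vectors: result[i] = sum_j a[j] * b[(i - j) mod n]."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]


def make_permutation(n, rng):
    """Return a random one-to-one reordering of the indices 0..n-1."""
    perm = list(range(n))
    rng.shuffle(perm)
    return perm


def permute(vec, perm):
    """Apply the permutation: output[i] = vec[perm[i]]."""
    return [vec[p] for p in perm]


def inverse_permute(vec, perm):
    """Undo `permute`, which is possible because perm is one-to-one."""
    out = [0.0] * len(vec)
    for i, p in enumerate(perm):
        out[p] = vec[i]
    return out


rng = random.Random(42)
n = 64  # small dimensionality, chosen only to keep the sketch fast
word_a = random_vector(n, rng)
word_b = random_vector(n, rng)
perm = make_permutation(n, rng)

bound_by_convolution = circular_convolution(word_a, word_b)
# One simple permutation-based scheme (an assumption of this sketch):
# mark the first item's position by permuting it, then add the second item.
bound_by_permutation = [x + y for x, y in zip(permute(word_a, perm), word_b)]
recovered = inverse_permute(permute(word_a, perm), perm)
```

    Permuting a vector costs one pass over its entries, whereas the direct convolution above is quadratic in the dimensionality (FFT-based implementations reduce this), which is one plausible reading of the scalability difference the abstract mentions.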

  18. Intelligent E-Learning Systems: Automatic Construction of Ontologies

    NASA Astrophysics Data System (ADS)

    Peso, Jesús del; de Arriaga, Fernando

    2008-05-01

    In recent years a new generation of Intelligent E-Learning Systems (ILS) has emerged with enhanced functionality due, mainly, to influences from Distributed Artificial Intelligence, to the use of cognitive modelling, to the extensive use of the Internet, and to new educational ideas such as student-centered education and Knowledge Management. The automatic construction of ontologies provides means of automatically updating the knowledge bases of the respective ILS, and of increasing interoperability and communication among systems that share the same ontology. The paper presents a new approach, able to produce ontologies from a small number of documents such as those obtained from the Internet, without the assistance of large corpora, by using simple syntactic rules and some semantic information. The method is independent of the natural language used. The use of a multi-agent system increases the flexibility and capability of the method. Although the method can be easily improved, the results obtained so far are promising.

  19. Integrating hidden Markov model and PRAAT: a toolbox for robust automatic speech transcription

    NASA Astrophysics Data System (ADS)

    Kabir, A.; Barker, J.; Giurgiu, M.

    2010-09-01

    An automatic time-aligned phone transcription toolbox for English speech corpora has been developed. The toolbox is especially useful for generating robust automatic transcriptions, and it can produce phone-level transcriptions using speaker-independent as well as speaker-dependent models without manual intervention. The system is based on the standard Hidden Markov Model (HMM) approach and was successfully tested on a large audiovisual speech corpus, the GRID corpus. One of the most powerful features of the toolbox is its increased flexibility in speech processing: the speech community can import the automatic transcription generated by the HMM Toolkit (HTK) into a popular transcription software package, PRAAT, and vice versa. The toolbox has been evaluated through statistical analysis on GRID data, which shows that the automatic transcription deviates by an average of 20 ms from the manual transcription.
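
    A deviation figure of the kind reported above can be computed as a mean absolute difference between paired boundary times. The sketch below is only an illustration: the abstract does not state how boundaries were paired or which statistic was used, so the function name and the one-to-one pairing of boundaries are assumptions.

```python
def mean_abs_deviation_ms(auto_times, manual_times):
    """Mean absolute difference, in milliseconds, between paired boundary times.

    Both arguments are sequences of phone-boundary times in seconds, assumed
    to be aligned one-to-one (the i-th automatic boundary corresponds to the
    i-th manual boundary).
    """
    if len(auto_times) != len(manual_times):
        raise ValueError("boundary lists must have the same length")
    if not auto_times:
        raise ValueError("at least one boundary is required")
    total = sum(abs(a - m) for a, m in zip(auto_times, manual_times))
    return 1000.0 * total / len(auto_times)


if __name__ == "__main__":
    # Hypothetical boundary times in seconds, invented for the example.
    automatic = [0.10, 0.52, 0.91]
    manual = [0.12, 0.50, 0.93]
    print(mean_abs_deviation_ms(automatic, manual))
```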

  20. An empirical generative framework for computational modeling of language acquisition.

    PubMed

    Waterfall, Heidi R; Sandbank, Ben; Onnis, Luca; Edelman, Shimon

    2010-06-01

    This paper reports progress in developing a computer model of language acquisition in the form of (1) a generative grammar that is (2) algorithmically learnable from realistic corpus data, (3) viable in its large-scale quantitative performance and (4) psychologically real. First, we describe new algorithmic methods for unsupervised learning of generative grammars from raw CHILDES data and give an account of the generative performance of the acquired grammars. Next, we summarize findings from recent longitudinal and experimental work that suggests how certain statistically prominent structural properties of child-directed speech may facilitate language acquisition. We then present a series of new analyses of CHILDES data indicating that the desired properties are indeed present in realistic child-directed speech corpora. Finally, we suggest how our computational results, behavioral findings, and corpus-based insights can be integrated into a next-generation model aimed at meeting the four requirements of our modeling framework.

  1. The Faces in Infant-Perspective Scenes Change over the First Year of Life

    PubMed Central

    Jayaraman, Swapnaa; Fausey, Caitlin M.; Smith, Linda B.

    2015-01-01

    Mature face perception has its origins in the face experiences of infants. However, little is known about the basic statistics of faces in early visual environments. We used head cameras to capture and analyze over 72,000 infant-perspective scenes from 22 infants aged 1-11 months as they engaged in daily activities. The frequency of faces in these scenes declined markedly with age: for the youngest infants, faces were present 15 minutes in every waking hour but only 5 minutes for the oldest infants. In general, the available faces were well characterized by three properties: (1) they belonged to relatively few individuals; (2) they were close and visually large; and (3) they presented views showing both eyes. These three properties most strongly characterized the face corpora of our youngest infants and constitute environmental constraints on the early development of the visual system. PMID:26016988

  2. Neuropathologic findings in an aged albino gorilla.

    PubMed

    Márquez, M; Serafin, A; Fernández-Bellon, H; Serrat, S; Ferrer-Admetlla, A; Bertranpetit, J; Ferrer, I; Pumarola, M

    2008-07-01

    Pallido-nigral spheroids associated with iron deposition have been observed in some aged clinically normal nonhuman primates. In humans, similar findings are observed in neurodegeneration with brain iron accumulation diseases, which, in some cases, show associated mutations in pantothenate kinase 2 gene (PANK2). Here we present an aged gorilla, 40 years old, suffering during the last 2 years of life from progressive tetraparesis, nystagmus, and dyskinesia of the arms, hands, and neck, with accompanying abnormal behavior. The postmortem neuropathologic examination revealed, in addition to aging-associated changes in the brain, numerous corpora amylacea in some brain areas, especially the substantia nigra, and large numbers of axonal spheroids associated with iron accumulation in the internal globus pallidus. Sequencing of the gorilla PANK2 gene failed to detect any mutation. The clinical, neuropathologic, and genetic findings in this gorilla point to an age-related pallido-nigral degeneration that presented PKAN-like neurologic deficits.

  3. Linguistic Corpora and Language Teaching.

    ERIC Educational Resources Information Center

    Murison-Bowie, Simon

    1996-01-01

    Examines issues raised by corpus linguistics concerning the description of language. The article argues that it is necessary to start from correct descriptions of linguistic units and the contexts in which they occur. Corpus linguistics has joined with language teaching by sharing a recognition of the importance of a larger, schematic view of…

  4. Quantitative Investigations in Hungarian Phonotactics and Syllable Structure

    ERIC Educational Resources Information Center

    Grimes, Stephen M.

    2010-01-01

    This dissertation investigates statistical properties of segment collocation and syllable geometry of the Hungarian language. A corpus and dictionary based approach to studying language phonologies is outlined. In order to conduct research on Hungarian, a phonological lexicon was created by compiling existing dictionaries and corpora and using a…

  5. Does "Word Coach" Coach Words?

    ERIC Educational Resources Information Center

    Cobb, Tom; Horst, Marlise

    2011-01-01

    This study reports on the design and testing of an integrated suite of vocabulary training games for Nintendo[TM] collectively designated "My Word Coach" (Ubisoft, 2008). The games' design is based on a wide range of learning research, from classic studies on recycling patterns to frequency studies of modern corpora. Its general usage…

  6. Some Statistical Properties of Tonality, 1650-1900

    ERIC Educational Resources Information Center

    White, Christopher Wm.

    2013-01-01

    This dissertation investigates the statistical properties present within corpora of common practice music, involving a data set of more than 8,000 works spanning from 1650 to 1900, and focusing specifically on the properties of the chord progressions contained therein. In the first chapter, methodologies concerning corpus analysis are presented…

  7. Natural Learning Case Study Archives

    ERIC Educational Resources Information Center

    Lawler, Robert W.

    2015-01-01

    Natural Learning Case Study Archives (NLCSA) is a research facility for those interested in using case study analysis to deepen their understanding of common sense knowledge and natural learning (how the mind interacts with everyday experiences to develop common sense knowledge). The database comprises three case study corpora based on experiences…

  8. 31 CFR 358.1 - What special terms apply to this part?

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... unmatured bearer securities are bearer bonds. BECCS means the Treasury's Bearer Corpora Conversion System... BECCS. Callable means a United States Treasury security subject to call before maturity. Callable Coupons means the coupons associated with a callable security that are due after the date the security is...

  9. Adaptable, high recall, event extraction system with minimal configuration.

    PubMed

    Miwa, Makoto; Ananiadou, Sophia

    2015-01-01

    Biomedical event extraction has been a major focus of biomedical natural language processing (BioNLP) research since the first BioNLP shared task was held in 2009. Accordingly, a large number of event extraction systems have been developed. Most such systems, however, have been developed for specific tasks and/or incorporated task specific settings, making their application to new corpora and tasks problematic without modification of the systems themselves. There is thus a need for event extraction systems that can achieve high levels of accuracy when applied to corpora in new domains, without the need for exhaustive tuning or modification, whilst retaining competitive levels of performance. We have enhanced our state-of-the-art event extraction system, EventMine, to alleviate the need for task-specific tuning. Task-specific details are specified in a configuration file, while extensive task-specific parameter tuning is avoided through the integration of a weighting method, a covariate shift method, and their combination. The task-specific configuration and weighting method have been employed within the context of two different sub-tasks of BioNLP shared task 2013, i.e. Cancer Genetics (CG) and Pathway Curation (PC), removing the need to modify the system specifically for each task. With minimal task specific configuration and tuning, EventMine achieved the 1st place in the PC task, and 2nd in the CG, achieving the highest recall for both tasks. The system has been further enhanced following the shared task by incorporating the covariate shift method and entity generalisations based on the task definitions, leading to further performance improvements. We have shown that it is possible to apply a state-of-the-art event extraction system to new tasks with high levels of performance, without having to modify the system internally. Both covariate shift and weighting methods are useful in facilitating the production of high recall systems. These methods and their combination can adapt a model to the target data with no deep tuning and little manual configuration.

  10. Mining for Evidence in Enterprise Corpora

    ERIC Educational Resources Information Center

    Almquist, Brian Alan

    2011-01-01

    The primary research aim of this dissertation is to identify the strategies that best meet the information retrieval needs as expressed in the "e-discovery" scenario. This task calls for a high-recall system that, in response to a request for all available relevant documents to a legal complaint, effectively prioritizes documents from an…

  11. The Dynamics of Syntax Acquisition: Facilitation between Syntactic Structures

    ERIC Educational Resources Information Center

    Keren-Portnoy, Tamar; Keren, Michael

    2011-01-01

    This paper sets out to show how facilitation between different clause structures operates over time in syntax acquisition. The phenomenon of facilitation within given structures has been widely documented, yet inter-structure facilitation has rarely been reported so far. Our findings are based on the naturalistic production corpora of six toddlers…

  12. Development and Use of a Corpus Tailored for Legal English Learning

    ERIC Educational Resources Information Center

    Skier, Jason; Vibulphol, Jutarat

    2016-01-01

    While corpus linguistics has been applied towards many specific academic purposes, reports are few regarding its use to facilitate learning of legal English by non-native English speakers. Specialized corpora are required because legal English often differs significantly from ordinary usage, with words such as bar, motion, and hearing having…

  13. Cultivating Effective Corpus Use by Language Learners

    ERIC Educational Resources Information Center

    Kennedy, Claire; Miceli, Tiziana

    2017-01-01

    While there is widespread agreement on the expected benefits of hands-on access to corpora for language learners, reports abound of the difficulties involved in realising those benefits in practice. A particular focus of discussion is the challenge of transferring the skills of the corpus linguist to learners, so that they can explore this type of…

  14. A Shared Platform for Studying Second Language Acquisition

    ERIC Educational Resources Information Center

    MacWhinney, Brian

    2017-01-01

    The study of second language acquisition (SLA) can benefit from the same process of datasharing that has proven effective in areas such as first language acquisition and aphasiology. Researchers can work together to construct a shared platform that combines data from spoken and written corpora, online tutors, and Web-based experimentation. Many of…

  15. Discovering English with the Sketch Engine

    ERIC Educational Resources Information Center

    Thomas, James

    2014-01-01

    "Discovering English with the Sketch Engine" is the title of a new book (Thomas, 2014) which introduces the use of corpora in language study, teaching, writing and translating. It focuses on using the Sketch Engine to identify patterns of normal usage in many aspects of English ranging from morphology to discourse and pragmatics. This…

  16. Submission Letters across English Language Teaching and Mathematics: The Case of Iranian Professionals

    ERIC Educational Resources Information Center

    Jalilifar, Alireza

    2009-01-01

    Submitting an article to an English journal for publication requires enclosing an accompanying cover letter. Yet, the phraseology and rhetorical conventions of such letters are not comprehensively documented in literature. This article investigates two English corpora of genuine electronic submission letters to journal editors by Iranian English…

  17. Cognition, Corpora, and Computing: Triangulating Research in Usage-Based Language Learning

    ERIC Educational Resources Information Center

    Ellis, Nick C.

    2017-01-01

    Usage-based approaches explore how we learn language from our experience of language. Related research thus involves the analysis of the usage from which learners learn and of learner usage as it develops. This program involves considerable data recording, transcription, and analysis, using a variety of corpus and computational techniques, many of…

  18. Automatic Selection of Suitable Sentences for Language Learning Exercises

    ERIC Educational Resources Information Center

    Pilán, Ildikó; Volodina, Elena; Johansson, Richard

    2013-01-01

    In our study we investigated second and foreign language (L2) sentence readability, an area little explored so far in the case of several languages, including Swedish. The outcome of our research consists of two methods for sentence selection from native language corpora based on Natural Language Processing (NLP) and machine learning (ML)…

  19. On the Application of Corpus of Contemporary American English in Vocabulary Instruction

    ERIC Educational Resources Information Center

    Yusu, Xu

    2014-01-01

    The development of corpus linguistics has laid a theoretical foundation and provided technical support for breaking the bottleneck in traditional vocabulary instruction in China. Corpora allow access to authentic data and show frequency patterns of words and grammatical constructions. Such patterns can be used to improve language materials or to directly…

  20. 31 CFR 358.21 - Can these regulations be amended?

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... 31 Money and Finance: Treasury 2 2010-07-01 2010-07-01 false Can these regulations be amended? 358.21 Section 358.21 Money and Finance: Treasury Regulations Relating to Money and Finance (Continued... CONVERSION OF BEARER CORPORA AND DETACHED BEARER COUPONS § 358.21 Can these regulations be amended? We may at...

  1. 31 CFR 358.20 - Can these regulations be waived?

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... 31 Money and Finance: Treasury 2 2010-07-01 2010-07-01 false Can these regulations be waived? 358.20 Section 358.20 Money and Finance: Treasury Regulations Relating to Money and Finance (Continued... CONVERSION OF BEARER CORPORA AND DETACHED BEARER COUPONS § 358.20 Can these regulations be waived? We reserve...

  2. Investigating Move Structure of English Applied Linguistics Research Article Discussions Published in International and Thai Journals

    ERIC Educational Resources Information Center

    Amnuai, Wirada; Wannaruk, Anchalee

    2013-01-01

    This study investigates the rhetorical move structure of English applied linguistics research article Discussions published in Thai and international journals. Two corpora comprising 30 Thai Discussions and 30 international Discussions were analyzed using Yang & Allison's (2003) move model. Based on the analysis, both similarities and…

  3. The Co-Occurrence of Quotatives with Mimetic Performances.

    ERIC Educational Resources Information Center

    Buchstaller, Isabelle

    2003-01-01

    This paper discusses mimesis, the direct representation and total imitation of an event. It studies the co-occurrence of quotative verbs with mimetic enactment based on two corpora of U.S. American English, both available through the University of Pennsylvania Data Consortium. The Switchboard Corpus has 542 speakers ranging in age from 20-60 years…

  4. Corpus-Based Approaches to Language Description for Specialized Academic Writing

    ERIC Educational Resources Information Center

    Flowerdew, John

    2017-01-01

    Language description is a fundamental requirement for second language (L2) syllabus design. The greatest advances in language description in recent decades have been made with the help of electronic corpora. Such language description is the theme of this article. The article first introduces some basic concepts and principles in corpus research.…

  5. Non-Arbitrariness in Mapping Word Form to Meaning: Cross-Linguistic Formal Markers of Word Concreteness

    ERIC Educational Resources Information Center

    Reilly, Jamie; Hung, Jinyi; Westbury, Chris

    2017-01-01

    Arbitrary symbolism is a linguistic doctrine that predicts an orthogonal relationship between word forms and their corresponding meanings. Recent corpora analyses have demonstrated violations of arbitrary symbolism with respect to concreteness, a variable characterizing the sensorimotor salience of a word. In addition to qualitative semantic…

  6. Unlearning Overgenerated "Be" through Data-Driven Learning in the Secondary EFL Classroom

    ERIC Educational Resources Information Center

    Moon, Soyeon; Oh, Sun-Young

    2018-01-01

    This paper reports on the cognitive and affective benefits of data-driven learning (DDL), in which Korean EFL learners at the secondary level notice and unlearn their "overgenerated 'be'" by comparing native English-speaker and learner corpora with guided induction. To select the target language item and compile learner-corpus-based…

  7. Category Induction via Distributional Analysis: Evidence from a Serial Reaction Time Task

    ERIC Educational Resources Information Center

    Hunt, Ruskin H.; Aslin, Richard N.

    2010-01-01

    Category formation lies at the heart of a number of higher-order behaviors, including language. We assessed the ability of human adults to learn, from distributional information alone, categories embedded in a sequence of input stimuli using a serial reaction time task. Artificial grammars generated corpora of input strings containing a…

  8. Exploring the Effects and Use of a Chinese-English Parallel Concordancer

    ERIC Educational Resources Information Center

    Gao, Zhao-Ming

    2011-01-01

    Previous studies on self-correction using corpora involve monolingual concordances and intervention from instructors such as marking of errors, the use of modified concordances, and other simplifications of the task. Can L2 learners independently refine their previous outputs by simply using a parallel concordancer without any hints about their…

  9. Second Language Vocabulary Assessment: Current Practices and New Directions

    ERIC Educational Resources Information Center

    Read, John

    2007-01-01

    This paper surveys some current developments in second language vocabulary assessment, with particular attention to the ways in which computer corpora can provide better quality information about the frequency of words and how they are used in specific contexts. The relative merits of different word lists are discussed, including the Academic Word…

  10. A Corpus-Based Approach to Online Materials Development for Writing Research Articles

    ERIC Educational Resources Information Center

    Chang, Ching-Fen; Kuo, Chih-Hua

    2011-01-01

    There has been increasing interest in the possible applications of corpora to both linguistic research and pedagogy. This study takes a corpus-based, genre-analytic approach to discipline-specific materials development. Combining corpus analysis with genre analysis makes it possible to develop teaching materials that are not only authentic but…

  11. The Paradigmatic Hearts of Subjects Which Their "English" Flows Through

    ERIC Educational Resources Information Center

    Pilcher, Nick; Richards, Kendall

    2016-01-01

    Much research into the use of corpora and discourse to support higher education students on pre-sessional and in-sessional courses champions subject specificity. Drawing on the work of writers such as Bakhtin [(1981). "The dialogic imagination: Four essays by MM Bakhtin" (M. Holquist, Ed.; C. Emerson & M. Holquist, Trans.). Austin:…

  12. Corpora and Collocations in Chinese-English Dictionaries for Chinese Users

    ERIC Educational Resources Information Center

    Xia, Lixin

    2015-01-01

    The paper identifies the major problems of the Chinese-English dictionary in representing collocational information after an extensive survey of nine dictionaries popular among Chinese users. It is found that the Chinese-English dictionary only provides the collocation types of "v+n" and "v+n," but completely ignores those of…

  13. Fine-Grained Linguistic Soft Constraints on Statistical Natural Language Processing Models

    DTIC Science & Technology

    2009-01-01

    ... 4 Monolingually-Derived Phrasal Paraphrase Generation for Statistical Machine Translation ... 4.4 Spanish-English (S2E) results ... 4.5 Gains from using larger monolingual corpora for ... 4.2 Visual example of a phrasal distributional profile ... 4.3 Monolingual corpus-based distributional

  14. CALL and Less Commonly Taught Languages--Still a Way to Go

    ERIC Educational Resources Information Center

    Ward, Monica

    2016-01-01

    Many Computer Assisted Language Learning (CALL) innovations mainly apply to the Most Commonly Taught Languages (MCTLs), especially English. Recent manifestations of CALL for MCTLs such as corpora, Mobile Assisted Language Learning (MALL) and Massively Open Online Courses (MOOCs) are found less frequently in the world of Less Commonly Taught…

  15. Radioisotope penile plethysmography: A technique for evaluating corpora cavernosal blood flow during early tumescence

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Schwartz, A.N.; Graham, M.M.; Ferency, G.F.

    1989-04-01

    Radioisotope penile plethysmography is a nuclear medicine technique which assists in the evaluation of patients with erectile dysfunction. This technique attempts to noninvasively quantitate penile corpora cavernosal blood flow during early penile tumescence using technetium-99m-labeled red blood cells. Penile images and counts were acquired in a steady-state blood-pool phase prior to and after the administration of intracorporal papaverine. Penile counts, images, and time-activity curves were computer analyzed in order to determine peak corporal flow and volume changes. Peak corporal flow rates were compared to arterial integrity (determined by angiography) and venosinusoidal corporal leak (determined by cavernosometry). Peak corporal flow correlated well with arterial integrity (r = 0.91) but did not correlate with venosinusoidal leak parameters (r = 0.01). This report focuses on the methodology and the assumptions which form the foundation of this technique. The strong correlation of peak corporal flow and angiography suggests that radioisotope penile plethysmography could prove useful in the evaluation of arterial inflow disorders in patients with erectile dysfunction.

  16. Virtopsy: postmortem imaging of laryngeal foreign bodies.

    PubMed

    Oesterhelweg, Lars; Bolliger, Stephan A; Thali, Michael J; Ross, Steffen

    2009-05-01

    Death from corpora aliena in the larynx is a well-known entity in forensic pathology. The correct diagnosis of this cause of death is difficult without an autopsy, and misdiagnoses by external examination alone are common. To determine the postmortem usefulness of modern imaging techniques in the diagnosis of foreign bodies in the larynx, multislice computed tomography, magnetic resonance imaging, and postmortem full-body computed tomography-angiography were performed. Three decedents with a suspected foreign body in the larynx underwent the 3 different imaging techniques before medicolegal autopsy. Multislice computed tomography has a high diagnostic value in the noninvasive localization of a foreign body and abnormalities in the larynx. The differentiation between neoplasm or soft foreign bodies (eg, food) is possible, but difficult, by unenhanced multislice computed tomography. By magnetic resonance imaging, the discrimination of the soft tissue structures and soft foreign bodies is much easier. In addition to the postmortem multislice computed tomography, the combination with postmortem angiography will increase the diagnostic value. Postmortem, cross-sectional imaging methods are highly valuable procedures for the noninvasive detection of corpora aliena in the larynx.

  17. Evoked Cavernous Activity: Normal Values

    PubMed Central

    Yang, Claire C.; Yilmaz, Ugur; Vicars, Brenda G.

    2009-01-01

    Purpose: We present normative data for evoked cavernous activity (ECA), an electrodiagnostic test that evaluates the autonomic innervation of the corpora cavernosa. Material and Methods: We enrolled 37 healthy, sexually active and potent men for the study. Each subject completed an IIEF questionnaire and underwent simultaneous ECA and hand and foot sympathetic skin response (SSR) testing. The sympathetic skin response tests were performed as autonomic controls. Results: Thirty-six men had discernible ECA and SSRs. The mean IIEF erectile domain score was 27. ECA is a low frequency wave that is morphologically and temporally similar in both corpora. The amplitudes of the responses were highly variable. The latencies, although variable, always occurred after the hand SSR. There was no change in the quality or the latency of the ECA with age. Conclusions: ECA is measurable in healthy, potent men in a wide range of ages. Similar to other evoked responses of the autonomic nervous system, the measured waveform is highly variable, but its presence is consistent. The association between ECA and erectile function is to be determined. PMID:18423763

  18. Distributed smoothed tree kernel for protein-protein interaction extraction from the biomedical literature

    PubMed Central

    Murugesan, Gurusamy; Abdulkadhar, Sabenabanu; Natarajan, Jeyakumar

    2017-01-01

    Automatic extraction of protein-protein interaction (PPI) pairs from biomedical literature is a widely examined task in biological information extraction. Currently, many kernel-based approaches, such as linear kernels, tree kernels, graph kernels and combinations of multiple kernels, have achieved promising results in the PPI task. However, most of these kernel methods fail to capture the semantic relation information between two entities. In this paper, we present a special type of tree kernel for PPI extraction, known as the Distributed Smoothed Tree kernel (DSTK), which exploits both syntactic (structural) and semantic vector information. DSTK comprises distributed trees with syntactic information along with distributional semantic vectors representing the semantic information of the sentences or phrases. To generate a robust machine learning model, a feature-based kernel and DSTK were combined using an ensemble support vector machine (SVM). Five different corpora (AIMed, BioInfer, HPRD50, IEPA, and LLL) were used for evaluating the performance of our system. Experimental results show that our system achieves a better f-score on the five corpora than other state-of-the-art systems. PMID:29099838
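
    The f-score used to compare systems on such corpora is the harmonic mean of precision and recall over extracted pairs. The following is a minimal sketch, assuming that each interaction is an unordered pair of protein names and that matching is exact; the helper name and the toy pairs are hypothetical and are not taken from the paper or its corpora.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F1) for two collections of extracted pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


def pair(protein_a, protein_b):
    """Unordered interaction pair, so (A, B) and (B, A) compare equal."""
    return frozenset((protein_a, protein_b))


if __name__ == "__main__":
    # Placeholder protein names, invented for the example.
    gold = [pair("P1", "P2"), pair("P1", "P3"), pair("P2", "P4")]
    predicted = [pair("P2", "P1"), pair("P4", "P2"), pair("P3", "P4")]
    print(precision_recall_f1(predicted, gold))
```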

  20. BioCreative V CDR task corpus: a resource for chemical disease relation extraction.

    PubMed

    Li, Jiao; Sun, Yueping; Johnson, Robin J; Sciaky, Daniela; Wei, Chih-Hsuan; Leaman, Robert; Davis, Allan Peter; Mattingly, Carolyn J; Wiegers, Thomas C; Lu, Zhiyong

    2016-01-01

    Community-run, formal evaluations and manually annotated text corpora are critically important for advancing biomedical text-mining research. Recently in BioCreative V, a new challenge was organized for the tasks of disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction. Given the nature of both tasks, a test collection is required to contain both disease/chemical annotations and relation annotations in the same set of articles. Despite previous efforts in biomedical corpus construction, none was found to be sufficient for the task. Thus, we developed our own corpus called BC5CDR during the challenge by inviting a team of Medical Subject Headings (MeSH) indexers for disease/chemical entity annotation and Comparative Toxicogenomics Database (CTD) curators for CID relation annotation. To ensure high annotation quality and productivity, detailed annotation guidelines and automatic annotation tools were provided. The resulting BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions. Each entity annotation includes both the mention text spans and normalized concept identifiers, using MeSH as the controlled vocabulary. To ensure accuracy, the entities were first captured independently by two annotators followed by a consensus annotation: the average inter-annotator agreement (IAA) scores were 87.49% and 96.05% for diseases and chemicals, respectively, in the test set according to the Jaccard similarity coefficient. Our corpus was successfully used for the BioCreative V challenge tasks and should serve as a valuable resource for the text-mining research community. Database URL: http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/. Published by Oxford University Press 2016. This work is written by US Government employees and is in the public domain in the United States.
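
    The Jaccard similarity coefficient mentioned for inter-annotator agreement is the size of the intersection of two annotation sets divided by the size of their union. The sketch below assumes each annotation is a (start, end, concept identifier) tuple and that two annotations agree only when all three fields match; the abstract does not spell out the matching criteria, and the identifiers shown are placeholders rather than real MeSH codes.

```python
def jaccard(annotations_a, annotations_b):
    """Jaccard similarity |A & B| / |A | B| between two annotation sets."""
    a, b = set(annotations_a), set(annotations_b)
    if not a and not b:
        # Two empty annotation sets are treated as full agreement (assumption).
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Hypothetical annotations: (start offset, end offset, concept identifier).
    annotator_1 = {(0, 8, "CONCEPT_1"), (15, 27, "CONCEPT_2"), (40, 52, "CONCEPT_3")}
    annotator_2 = {(0, 8, "CONCEPT_1"), (15, 27, "CONCEPT_2"), (60, 71, "CONCEPT_4")}
    print(jaccard(annotator_1, annotator_2))
```

    Here two of the four distinct annotations are shared, so the agreement is 0.5.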

  1. A new technique for augmentation phalloplasty: albugineal surgery with bilateral saphenous grafts--three years of experience.

    PubMed

    Austoni, E; Guarneri, A; Cazzaniga, A

    2002-09-01

    Penile augmentation surgery is a highly controversial issue due to the low level of standardisation of surgical techniques. The aim of the study is to illustrate a new technique to solve the problem of enlarging the penis by means of additive surgery on the albuginea of the corpora cavernosa, guaranteeing a real increase in size of the erect penis. Between 1995 and 1997, 39 patients who requested an increase in the diameter of their penises underwent augmentation phalloplasty with bilateral saphena grafts. The patients considered eligible for surgery were patients with either hypoplasia of the penis or functional penile dysmorphophobia. All the patients included in our study presented normal erection at screening. The average penis diameter in a flaccid state and during erection was found to be 2.1 cm (1.6-2.7 cm) and 2.9 cm (2.2-3.7 cm), respectively. Before surgery the patients were informed of the experimental nature of the surgical procedure. The increase in volume of the corpora cavernosa was achieved by applying saphena grafts to longitudinal openings made bilaterally in the albuginea along the whole length of the penis. No major complications and specifically no losses of sensitivity of the penis or erection deficiencies occurred during the post-operative follow-up period. All the patients resumed their sexual activity in 4 months. A measurement of the penile dimensions was carried out 9 months after surgery. No clinically meaningful increases in the diameter of the flaccid penis were documented. The average penis diameter during erection was found to be 4.2 cm (3.4-4.9 cm) with post-surgery increases in diameter varying from 1.1 to 2.1 cm (p<0.01). The penile enlargement phalloplasty technique with albuginea surgery suggested by the authors is definitely indicated for increasing the volume of the corpora cavernosa during erection. Albuginea surgery with saphena grafts has been found to be free from aesthetic and functional complications with excellent patient satisfaction.

  2. The carbohydrate deposits detected by histochemical methods in the molecular layer of the dentate gyrus in the hippocampal formation of patients with schizophrenia, Down's syndrome and dementia, and aged person.

    PubMed

    Nishimura, A; Ikemoto, K; Satoh, K; Yamamoto, Y; Rand, S; Brinkmann, B; Nishi, K

    2000-11-01

    Post-mortem brain tissue was obtained from 28 patients with brain disorders, of which 15 had clinically diagnosed schizophrenia, 6 Alzheimer type dementia, 5 dementia with tangles and 2 cases of Down's syndrome. The controls were 22 cases from autopsies without brain disorders or with no known episodes of brain disorder. The tissues were stained for the detection of carbohydrate deposits in the hippocampal formation, using lectin, immunohistochemical and conventional staining methods. The staining revealed the existence of spherical deposits in the inner and middle molecular layers of the dentate gyrus in the hippocampal formation which contained fucose, galactose, N-acetyl galactosamine, N-acetyl glucosamine, sialic acid, mannose and chondroitin sulfate. The number of the deposits was higher in patients with brain disorders such as schizophrenia, Alzheimer type dementia, dementia with tangles or Down's syndrome, and in some aged individuals, in comparison to those in younger individuals. No deposits were detected in a few younger or aged individuals. Spherical deposits 3-10 μm in diameter may be an immature form of the corpora amylacea, since they were similar in their histochemical characteristics with lectin, immunohistochemical and conventional staining methods. However, differing staining ability by hematoxylin, periodic acid Schiff's reagent and antibodies against the intracellular degraded proteins such as ubiquitin and tau-protein was observed. The antibodies against ubiquitin and tau-protein showed clear reactivity with the corpora amylacea and no reactivity with spherical deposits, indicating that the corpora amylacea have an intracellular origin and spherical deposits an extracellular matrix origin. The results obtained in this study indicate that not only neuronal degeneration but also unusual glycometabolism in neurons may disturb the neuronal function and cause brain disorders, and that spherical deposits may cause dysfunction of the neuronal network in the dentate gyrus of the hippocampus which is closely linked with recognition and memory functions.

  3. Tetrodotoxin-insensitive electrical field stimulation-induced contractions on Crotalus durissus terrificus corpus cavernosum.

    PubMed

    Campos, Rafael; Mónica, Fabíola Z; Rodrigues, Renata Lopes; Rojas-Moscoso, Julio Alejandro; Moreno, Ronilson Agnaldo; Cogo, José Carlos; de Oliveira, Marco Antonio; Antunes, Edson; De Nucci, Gilberto

    2017-01-01

    Reptiles are the first amniotes to develop an intromittent penis; however, until now the mechanisms involved in the electrical field stimulation-induced contraction of corpora cavernosa isolated from Crotalus durissus terrificus had not been investigated. Crotalus and rabbit corpora cavernosa were mounted in 10 mL organ baths for isometric tension recording. Electrical field stimulation (EFS)-induced contractions were performed in presence/absence of phentolamine (10 μM), guanethidine (30 μM), tetrodotoxin (1 μM and 1 mM), A-803467 (10 μM), 3-iodo-L-Tyrosine (1 mM), salsolinol (3 μM) and a modified Krebs solution (equimolar substitution of NaCl by N-methyl-D-glucamine). Immunohistochemistry for tyrosine hydroxylase was also performed. Electrical field stimulation (EFS; 8 Hz and 16 Hz) caused contractions in both Crotalus and rabbit corpora cavernosa. The contractions were abolished by previous incubation with either phentolamine or guanethidine. Tetrodotoxin (1 μM) also abolished the EFS-induced contractions of rabbit CC, but did not affect EFS-induced contractions of Crotalus CC. Addition of A-803467 (10 μM) did not change the EFS-induced contractions of Crotalus CC but abolished rabbit CC contractions. 3-iodo-L-Tyrosine and salsolinol had no effect on EFS-induced contractions of Crotalus CC and rabbit CC. Replacement of NaCl by N-methyl-D-glucamine (NMDG) abolished EFS-induced contractions of rabbit CC, but did not affect Crotalus CC. The presence of tyrosine hydroxylase was identified in endothelial cells only of Crotalus CC. Since the EFS-induced contractions of Crotalus CC are dependent on catecholamine release, insensitive to TTX, insensitive to A-803467 and to NaCl replacement, this indicates that the source of catecholamine is unlikely to be adrenergic terminals. The finding that tyrosine hydroxylase is present in endothelial cells suggests that these cells can modulate Crotalus CC tone.

  4. Assessing the readability of ClinicalTrials.gov.

    PubMed

    Wu, Danny T Y; Hanauer, David A; Mei, Qiaozhu; Clark, Patricia M; An, Lawrence C; Proulx, Joshua; Zeng, Qing T; Vydiswaran, V G Vinod; Collins-Thompson, Kevyn; Zheng, Kai

    2016-03-01

    ClinicalTrials.gov serves critical functions of disseminating trial information to the public and helping the trials recruit participants. This study assessed the readability of trial descriptions at ClinicalTrials.gov using multiple quantitative measures. The analysis included all 165,988 trials registered at ClinicalTrials.gov as of April 30, 2014. To obtain benchmarks, the authors also analyzed 2 other medical corpora: (1) all 955 Health Topics articles from MedlinePlus and (2) a random sample of 100,000 clinician notes retrieved from an electronic health records system intended for conveying internal communication among medical professionals. The authors characterized each of the corpora using 4 surface metrics, and then applied 5 different scoring algorithms to assess their readability. The authors hypothesized that clinician notes would be most difficult to read, followed by trial descriptions and MedlinePlus Health Topics articles. Trial descriptions have the longest average sentence length (26.1 words) across all corpora; 65% of their words used are not covered by a basic medical English dictionary. In comparison, average sentence length of MedlinePlus Health Topics articles is 61% shorter, vocabulary size is 95% smaller, and dictionary coverage is 46% higher. All 5 scoring algorithms consistently rated ClinicalTrials.gov trial descriptions the most difficult corpus to read, even harder than clinician notes. On average, it requires 18 years of education to properly understand these trial descriptions according to the results generated by the readability assessment algorithms. Trial descriptions at ClinicalTrials.gov are extremely difficult to read. Significant work is warranted to improve their readability in order to achieve ClinicalTrials.gov's goal of facilitating information dissemination and subject recruitment. Published by Oxford University Press on behalf of the American Medical Informatics Association 2015. This work is written by US Government employees and is in the public domain in the US.

  5. Using latent semantic analysis and the predication algorithm to improve extraction of meanings from a diagnostic corpus.

    PubMed

    Jorge-Botana, Guillermo; Olmos, Ricardo; León, José Antonio

    2009-11-01

    There is currently a widespread interest in indexing and extracting taxonomic information from large text collections. An example is the automatic categorization of informally written medical or psychological diagnoses, followed by the extraction of epidemiological information or even terms and structures needed to formulate guiding questions as a heuristic tool for helping doctors. Vector space models have been successfully used to this end (Lee, Cimino, Zhu, Sable, Shanker, Ely & Yu, 2006; Pakhomov, Buntrock & Chute, 2006). In this study we use a computational model known as Latent Semantic Analysis (LSA) on a diagnostic corpus with the aim of retrieving definitions (in the form of lists of semantic neighbors) of common structures it contains (e.g. "storm phobia", "dog phobia") or less common structures that might be formed by logical combinations of categories and diagnostic symptoms (e.g. "gun personality" or "germ personality"). In the quest to bring definitions into line with the meaning of structures and make them in some way representative, various problems commonly arise while recovering content using vector space models. We propose some approaches which bypass these problems, such as Kintsch's (2001) predication algorithm and some corrections to the way lists of neighbors are obtained, which have already been tested on semantic spaces in a non-specific domain (Jorge-Botana, León, Olmos & Hassan-Montero, under review). The results support the idea that the predication algorithm may also be useful for extracting more precise meanings of certain structures from scientific corpora, and that the introduction of some corrections based on vector length may increase its efficiency on non-representative terms.

  6. Large-scale evidence of dependency length minimization in 37 languages

    PubMed Central

    Futrell, Richard; Mahowald, Kyle; Gibson, Edward

    2015-01-01

    Explaining the variation between human languages and the constraints on that variation is a core goal of linguistics. In the last 20 y, it has been claimed that many striking universals of cross-linguistic variation follow from a hypothetical principle that dependency length—the distance between syntactically related words in a sentence—is minimized. Various models of human sentence production and comprehension predict that long dependencies are difficult or inefficient to process; minimizing dependency length thus enables effective communication without incurring processing difficulty. However, despite widespread application of this idea in theoretical, empirical, and practical work, there is not yet large-scale evidence that dependency length is actually minimized in real utterances across many languages; previous work has focused either on a small number of languages or on limited kinds of data about each language. Here, using parsed corpora of 37 diverse languages, we show that overall dependency lengths for all languages are shorter than conservative random baselines. The results strongly suggest that dependency length minimization is a universal quantitative property of human languages and support explanations of linguistic variation in terms of general properties of human information processing. PMID:26240370
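
    To make the quantity in the abstract above concrete, the following Python sketch computes the total dependency length of one parsed sentence and compares it with a randomly reordered version of the same tree. It is a simplified illustration only: the head-index encoding, the function names and the example sentence are assumptions, and the unconstrained shuffle used here is not the baseline construction used in the study, which the abstract does not describe.

```python
import random


def total_dependency_length(heads):
    """Sum of distances between each word and its head.

    heads[i] is the 1-based position of the head of word i + 1;
    0 marks the root, which contributes no dependency.
    """
    return sum(abs((i + 1) - h) for i, h in enumerate(heads) if h != 0)


def shuffled_dependency_length(heads, rng):
    """Dependency length of the same tree after a random word reordering."""
    order = list(range(1, len(heads) + 1))
    rng.shuffle(order)
    # Map each original position to its new position in the shuffled order.
    new_pos = {old: new for new, old in enumerate(order, start=1)}
    return sum(
        abs(new_pos[i + 1] - new_pos[h]) for i, h in enumerate(heads) if h != 0
    )


# Hypothetical parse of "The dog chased the cat":
# The -> dog, dog -> chased, chased = root, the -> cat, cat -> chased.
heads = [2, 3, 0, 5, 3]

observed = total_dependency_length(heads)  # 1 + 1 + 1 + 2
rng = random.Random(0)
baseline = [shuffled_dependency_length(heads, rng) for _ in range(1000)]
mean_baseline = sum(baseline) / len(baseline)
```

    If dependency length is minimized, the observed value should tend to fall below the mean of the random reorderings when aggregated over many sentences of a corpus.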

  7. A New Parallel Corpus Approach to Japanese Learners' English, Using Their Corrected Essays

    ERIC Educational Resources Information Center

    Miki, Nozomi

    2010-01-01

    This research introduces unique parallel corpora to uncover linguistic behaviors in L2 argumentative writing in exact correspondence to their appropriate forms provided by English native speakers (NSs). The current paper targets the mysterious behavior of "I think" in argumentative prose. "I think" is regarded as arguably problematic and…

  8. Lexical Bundles in the Academic Writing of Advanced Chinese EFL Learners

    ERIC Educational Resources Information Center

    Wei, Yaoyu; Lei, Lei

    2011-01-01

    The present study investigated the use of lexical bundles in the academic writing of advanced Chinese EFL learners. A corpus of doctoral dissertations by the learners and a corpus of published journal articles by professional writers were collected for the study. Four-word lexical bundles in the two corpora were identified and analysed. Results…

  9. Learning a Generative Probabilistic Grammar of Experience: A Process-Level Model of Language Acquisition

    ERIC Educational Resources Information Center

    Kolodny, Oren; Lotem, Arnon; Edelman, Shimon

    2015-01-01

    We introduce a set of biologically and computationally motivated design choices for modeling the learning of language, or of other types of sequential, hierarchically structured experience and behavior, and describe an implemented system that conforms to these choices and is capable of unsupervised learning from raw natural-language corpora. Given…

  10. English and French Journal Abstracts in the Language Sciences: Three Exploratory Studies

    ERIC Educational Resources Information Center

    Van Bonn, Sarah; Swales, John M.

    2007-01-01

    This article compares French and English academic article abstracts from the language sciences in an attempt to understand how and why language choice might affect this part-genre--both in actual use and according to authors' linguistic and rhetorical perceptions. Two corpora are used: Corpus A consists of abstracts from a French linguistics…

  11. Data-Driven Learning of Speech Acts Based on Corpora of DVD Subtitles

    ERIC Educational Resources Information Center

    Kitao, S. Kathleen; Kitao, Kenji

    2013-01-01

    Data-driven learning (DDL) is an inductive approach to language learning in which students study examples of authentic language and use them to find patterns of language use. This inductive approach to learning has the advantages of being learner-centered, encouraging hypothesis testing and learner autonomy, and helping develop learning skills.…

  12. The Extension of the Progressive Aspect in Black South African English

    ERIC Educational Resources Information Center

    van Rooy, Bertus

    2006-01-01

    The extension of the progressive aspect to stative verbs has been identified as a characteristic feature of New Varieties of English across the world, including the English of black South Africans (BSAfE). This paper examines the use of the progressive aspect in BSAfE, by doing a comparative analysis of three corpora of argumentative student…

  13. Towards an Auditory Account of Speech Rhythm: Application of a Model of the Auditory "Primal Sketch" to Two Multi-Language Corpora

    ERIC Educational Resources Information Center

    Lee, Christopher S.; Todd, Neil P. McAngus

    2004-01-01

    The world's languages display important differences in their rhythmic organization; most particularly, different languages seem to privilege different phonological units (mora, syllable, or stress foot) as their basic rhythmic unit. There is now considerable evidence that such differences have important consequences for crucial aspects of language…

  14. Polysemous Verbs and Modality in Native and Non-Native Argumentative Writing: A Corpus-Based Study

    ERIC Educational Resources Information Center

    Salazar, Danica; Verdaguer, Isabel

    2009-01-01

    The present study is a corpus-based analysis of a selection of polysemous lexical verbs used to express modality in student argumentative writing. Twenty-three lexical verbs were searched for in three 100,000-word corpora of argumentative essays written in English by American, Filipino and Spanish university students. Concordance lines were…

  15. Corpus Linguistics and Language Testing: Navigating Uncharted Waters

    ERIC Educational Resources Information Center

    Egbert, Jesse

    2017-01-01

    The use of corpora and corpus linguistic methods in language testing research is increasing at an accelerated pace. The growing body of language testing research that uses corpus linguistic data is a testament to their utility in test development and validation. Although there are many reasons to be optimistic about the future of using corpus data…

  16. Data-Driven Learning for Beginners: The Case of German Verb-Preposition Collocations

    ERIC Educational Resources Information Center

    Vyatkina, Nina

    2016-01-01

    Research on data-driven learning (DDL), or teaching and learning languages with the help of electronic corpora, has shown that it is both effective and efficient. Nevertheless, DDL is still far from common pedagogical practice, not least because the empirical research on it is still limited and narrowly focused. This study addresses some gaps in…

  17. The Pedagogical Mediation of a Developmental Learner Corpus for Classroom-Based Language Instruction

    ERIC Educational Resources Information Center

    Belz, Julie A.; Vyatkina, Nina

    2008-01-01

    Although corpora have been used in language teaching for some time, few empirical studies explore their impact on learning outcomes. We provide a microgenetic account of learners' responses to corpus-driven instructional units for German modal particles and pronominal "da"-compounds. The units are based on developmental corpus data produced by…

  18. Adverbials of Result: Phraseology and Functions in the Problem-Solution Pattern

    ERIC Educational Resources Information Center

    Charles, Maggie

    2011-01-01

    This paper combines the use of corpus techniques with discourse analysis in order to investigate adverbials of result in the writing of advanced academic student writers. It focuses in detail on the phraseology and functions of "thus," "therefore," "then," "hence," "so" and "consequently." Two corpora of native-speaker theses are examined: 190,000…

  19. You Are Your Words: Modeling Students' Vocabulary Knowledge with Natural Language Processing Tools

    ERIC Educational Resources Information Center

    Allen, Laura K.; McNamara, Danielle S.

    2015-01-01

    The current study investigates the degree to which the lexical properties of students' essays can inform stealth assessments of their vocabulary knowledge. In particular, we used indices calculated with the natural language processing tool, TAALES, to predict students' performance on a measure of vocabulary knowledge. To this end, two corpora were…

  20. Lexicogrammar in the International Construction Industry: A Corpus-Based Case Study of Japanese-Hong-Kongese On-Site Interactions in English

    ERIC Educational Resources Information Center

    Handford, Michael; Matous, Petr

    2011-01-01

    The purpose of this research is to identify and interpret statistically significant lexicogrammatical items that are used in on-site spoken communication in the international construction industry, initially through comparisons with reference corpora of everyday spoken and business language. Several data sources, including audio and video…

  1. A Submodularity Framework for Data Subset Selection

    DTIC Science & Technology

    2013-09-01

    37 7 List of Language Modeling Corpora in the Arabic-to-English NIST Task ... 37 8...Task (Arabic-to-English) ... 39 10 Baseline BLEU (%) PER Scores on Transtac Task (English-to-Arabic) ... 39 11...Comparison of BLEU (%) PER Scores on Transtac Task (Arabic-to-English) ... 39 12 Comparison of BLEU (%) PER Scores on Transtac Task (English-to-Arabic

  2. Spoken Grammar Awareness Raising: Does It Affect the Listening Ability of Iranian EFL Learners?

    ERIC Educational Resources Information Center

    Rashtchi, Mojgan; Afzali, Mahnaz

    2011-01-01

    Advances in spoken corpora analysis have brought about new insights into language pedagogy and have led to an awareness of the characteristics of spoken language. Current findings have shown that grammar of spoken language is different from written language. However, most listening and speaking materials are concocted based on written grammar and…

  3. What Do We Want EAP Teaching Materials for?

    ERIC Educational Resources Information Center

    Harwood, Nigel

    2005-01-01

    This paper explores the various anti-textbook arguments in the literature to determine their relevance to the field of EAP. I distinguish between what I call a "strong" and a "weak" anti-textbook line, then review the corpus-based studies which compare the language EAP textbooks teach with corpora of the language academic writers use. After…

  4. Frequent Collocates and Major Senses of Two Prepositions in ESL and ENL Corpora

    ERIC Educational Resources Information Center

    Nkemleke, Daniel

    2009-01-01

    This contribution assesses in quantitative terms frequent collocates and major senses of "between" and "through" in the corpus of Cameroonian English (CCE), the corpus of East-African (Kenya and Tanzania) English which is part of the International Corpus of English (ICE) project (ICE-EA), and the London Oslo/Bergen (LOB) corpus…

  5. Activation of Alternative Wnt Signaling Pathways in Human Mammary Gland and Breast Cancer Cells

    DTIC Science & Technology

    2006-06-01

    Parlow AF, Luhmann UF, Berger W, and Richards JS. Mice null for Frizzled4 (Fzd4–/–) are infertile and exhibit impaired corpora lutea formation and...proliferation, and matrix metalloproteinases production and regulated angiogenesis. J Immunol 170: 3369-3376, 2003. 39. Lobov IB, Rao S, Carroll TJ, Vallance

  6. The Development of a Corpus-Based Tool for Exploring Domain-Specific Collocational Knowledge in English

    ERIC Educational Resources Information Center

    Huang, Ping-Yu; Chen, Chien-Ming; Tsao, Nai-Lung; Wible, David

    2015-01-01

    Since it was published, Coxhead's (2000) Academic Word List (AWL) has been frequently used in English for academic purposes (EAP) classrooms, included in numerous teaching materials, and re-examined in light of various domain-specific corpora. Although well-received, the AWL has been criticized for ignoring some important facts that words still…

  7. What Data for Data-Driven Learning?

    ERIC Educational Resources Information Center

    Boulton, Alex

    2012-01-01

    Corpora have multiple affordances, not least for use by teachers and learners of a foreign language (L2) in what has come to be known as "data-driven learning" or DDL. The corpus and concordance interface were originally conceived by and for linguists, so other users need to adopt the role of "language researcher" to make the most of them. Despite…

  8. Learning English Grammar with a Corpus: Experimenting with Concordancing in a University Grammar Course

    ERIC Educational Resources Information Center

    Vannestal, Maria Estling; Lindquist, Hans

    2007-01-01

    Corpora have been used for pedagogical purposes for more than two decades but empirical studies are relatively rare, particularly in the context of grammar teaching. The present study focuses on students' attitudes towards grammar and how these attitudes are affected by the introduction of concordancing. The principal aims of the project were to…

  9. Will Dutch Become Flemish? Autonomous Developments in Belgian Dutch

    ERIC Educational Resources Information Center

    Van de Velde, Hans; Kissine, Mikhail; Tops, Evie; van der Harst, Sander; van Hout, Roeland

    2010-01-01

    In this paper a series of studies of standard Dutch pronunciation in Belgium and the Netherlands is presented. The research is based on two speech corpora: a diachronic corpus of radio speech (1935-1995) and a synchronic corpus of Belgian and Netherlandic standard Dutch from different regions at the turn of the millennium. It is shown that two…

  10. Language Assessment and the Inseparability of Lexis and Grammar: Focus on the Construct of Speaking

    ERIC Educational Resources Information Center

    Römer, Ute

    2017-01-01

    This paper aims to connect recent corpus research on phraseology with current language testing practice. It discusses how corpora and corpus-analytic techniques can illuminate central aspects of speech and help in conceptualizing the notion of lexicogrammar in second language speaking assessment. The description of speech and some of its core…

  11. Students' Engagement in Reflective Tasks: An Investigation of Interactive and Non-Interactive Discourse Corpora

    ERIC Educational Resources Information Center

    Farr, Fiona; Riordan, Elaine

    2012-01-01

    Reflective learning, a practice carrying relatively high educational value, has been with us for some time. Its popularity has grown to the extent that it is often adopted unquestioningly by educational practitioners. However, there are some important questions to be asked in relation to reflective practice. In reality, its impact on improved and…

  12. Erectile dysfunction due to a 'hidden' penis after pelvic trauma.

    PubMed

    Simonis, L A; Borovets, S; Van Driel, M F; Ten Duis, H J; Mensink, H J

    1999-02-01

    We describe a 26-year-old patient who presented with a dorsally retracted 'hidden' penis, which was entrapped in scar tissue and prevesical fat, 20 years after a pelvic fracture with symphysiolysis. Penile 'lengthening' was performed by V-Y plasty, removal of fatty tissue, and dissection of the entrapped corpora cavernosa followed by ventral fixation.

  13. The Construction of Stance in Reporting Clauses: A Cross-Disciplinary Study of Theses

    ERIC Educational Resources Information Center

    Charles, Maggie

    2006-01-01

    Using a corpus-based approach, this paper investigates the construction of stance in finite reporting clauses with "that"-clause complementation. The data are drawn from two corpora of theses in contrasting disciplines: a social science--politics--and a natural science--materials science. A network for the analysis of reporting clauses is…

  14. Euphemism as a Core Feature of "Patientese": A Comparative Study between English and French

    ERIC Educational Resources Information Center

    Faure, Pascaline

    2016-01-01

    The purpose of this lexicological study is to present a typology of patients' euphemizing lay denominations of medical terms illustrated by examples in English and French. Various textbooks and lexicons dealing with English and French for medical purposes served as corpora. The euphemisms were classified according to the three semantic processes…

  15. Biomechanically Preferred Consonant-Vowel Combinations Fail to Appear in Adult Spoken Corpora

    ERIC Educational Resources Information Center

    Whalen, D. H.; Giulivi, Sara; Nam, Hosung; Levitt, Andrea G.; Halle, Pierre; Goldstein, Louis M.

    2012-01-01

    Certain consonant/vowel (CV) combinations are more frequent than would be expected from the individual C and V frequencies alone, both in babbling and, to a lesser extent, in adult language, based on dictionary counts: Labial consonants co-occur with central vowels more often than chance would dictate; coronals co-occur with front vowels, and…

  16. Granulosa cell and oocyte mitochondrial abnormalities in a mouse model of fragile X primary ovarian insufficiency.

    PubMed

    Conca Dioguardi, Carola; Uslu, Bahar; Haynes, Monique; Kurus, Meltem; Gul, Mehmet; Miao, De-Qiang; De Santis, Lucia; Ferrari, Maurizio; Bellone, Stefania; Santin, Alessandro; Giulivi, Cecilia; Hoffman, Gloria; Usdin, Karen; Johnson, Joshua

    2016-06-01

    We hypothesized that the mitochondria of granulosa cells (GC) and/or oocytes might be abnormal in a mouse model of fragile X premutation (FXPM). Mice heterozygous and homozygous for the FXPM have increased death (atresia) of large ovarian follicles and fewer corpora lutea, with a gene dosage effect manifesting in decreased litter size. Furthermore, granulosa cells (GC) and oocytes of FXPM mice have decreased mitochondrial content, structurally abnormal mitochondria, and reduced expression of critical mitochondrial genes. Because this mouse allele produces the mutant Fragile X mental retardation 1 (Fmr1) transcript and reduced levels of wild-type (WT) Fmr1 protein (FMRP), but does not produce a Repeat Associated Non-ATG Translation (RAN)-translation product, our data lend support to the idea that Fmr1 mRNA with large numbers of CGG-repeats is intrinsically deleterious in the ovary. Mitochondrial dysfunction has been detected in somatic cells of human and mouse FXPM carriers, and mitochondria are essential for oogenesis and ovarian follicle development. FX-associated primary ovarian insufficiency (FXPOI) is seen in women with FXPM alleles. These alleles have 55-200 CGG repeats in the 5' UTR of an X-linked gene known as FMR1. The molecular basis of the pathology seen in this disorder is unclear but is thought to involve either some deleterious consequence of overexpression of RNA with long CGG-repeat tracts or of the generation of a repeat-associated non-AUG translation (RAN translation) product that is toxic. Analysis of ovarian function in a knock-in FXPM mouse model carrying 130 CGG repeats was performed as follows on WT, PM/+, and PM/PM genotypes. Histomorphometric assessment of follicle and corpora lutea numbers in ovaries from 8-month-old mice was executed, along with litter size analysis. Mitochondrial DNA copy number was quantified in oocytes and GC using quantitative PCR, and cumulus granulosa mitochondrial content was measured by flow cytometric analysis after staining of cells with Mitotracker dye. Transmission electron micrographs were prepared of GC within small growing follicles and mitochondrial architecture was compared. Quantitative RT-PCR analysis of key genes involved in mitochondrial structure and recycling was performed. A defect was found in follicle survival at the large antral stage in PM/+ and PM/PM mice. Litter size was significantly decreased in PM/PM mice, and corpora lutea were significantly reduced in mice of both mutant genotypes. Mitochondrial DNA copy number was significantly decreased in GC and metaphase II eggs in mutants. Flow cytometric analysis revealed that PM/+ and PM/PM animals lack the cumulus GC that harbor the greatest mitochondrial content as found in wild-type animals. Electron microscopic evaluation of GC of small growing follicles revealed mitochondrial structural abnormalities, including disorganized and vacuolar cristae. Finally, aberrant mitochondrial gene expression was detected. Mitofusin 2 (Mfn2) and Optic atrophy 1 (Opa1), genes involved in mitochondrial fusion and structure, respectively, were significantly decreased in whole ovaries of both mutant genotypes. Mitochondrial fission factor 1 (Mff1) was significantly decreased in PM/+ and PM/PM GC and eggs compared with wild-type controls. Data from the mouse model used for these studies should be viewed with some caution when considering parallels to the human FXPOI condition. Our data lend support to the idea that Fmr1 mRNA with large numbers of CGG-repeats is intrinsically deleterious in the ovary. FXPM disease states, including FXPOI, may share mitochondrial dysfunction as a common underlying mechanism. Studies were supported by NIH R21 071873 (J.J./G.H), The Albert McKern Fund for Perinatal Research (J.J.), NIH Intramural Funds (K.U.), and a TUBITAK Research Fellowship Award (B.U.). No conflict(s) of interest or competing interest(s) are noted. © The Author 2016. Published by Oxford University Press on behalf of the European Society of Human Reproduction and Embryology. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  17. On the creation of a clinical gold standard corpus in Spanish: Mining adverse drug reactions.

    PubMed

    Oronoz, Maite; Gojenola, Koldo; Pérez, Alicia; de Ilarraza, Arantza Díaz; Casillas, Arantza

    2015-08-01

    The advances achieved in Natural Language Processing make it possible to automatically mine information from electronically created documents. Many Natural Language Processing methods that extract information from texts make use of annotated corpora, but these are scarce in the clinical domain due to legal and ethical issues. In this paper we present the creation of the IxaMed-GS gold standard composed of real electronic health records written in Spanish and manually annotated by experts in pharmacology and pharmacovigilance. The experts mainly annotated entities related to diseases and drugs, but also relationships between entities indicating adverse drug reaction events. To help the experts in the annotation task, we adapted a general corpus linguistic analyzer to the medical domain. The quality of the annotation process in the IxaMed-GS corpus has been assessed by measuring the inter-annotator agreement, which was 90.53% for entities and 82.86% for events. In addition, the corpus has been used for the automatic extraction of adverse drug reaction events using machine learning. Copyright © 2015 Elsevier Inc. All rights reserved.

  18. Crying wolf? Biosecurity and metacommunication in the context of the 2009 swine flu pandemic.

    PubMed

    Nerlich, Brigitte; Koteyko, Nelya

    2012-07-01

    This article explores how the 2009 pandemic of swine flu (H1N1) intersected with issues of biosecurity in the context of an increasing entanglement between the spread of disease and the spread of information. Drawing on research into metacommunication, the article studies the rise of communication about ways in which swine flu was communicated, both globally and locally, during the pandemic. It examines and compares two corpora of texts, namely UK newspaper articles and blogs, written between 28 March and 11 June 2009, that is, the period from the start of the outbreak till the WHO announcement of the pandemic. Findings show that the interaction between traditional and digital media as well as the interaction between warnings about swine flu and previous warnings about other epidemics contributed to a heightened discourse of blame and counter-blame but also, more surprisingly, self-blame and reflections about the role of the media in pandemic communication. The consequences of this increase in metacommunication for research into crisis communication are explored. Copyright © 2011 Elsevier Ltd. All rights reserved.

  19. Sophia: A Expedient UMLS Concept Extraction Annotator.

    PubMed

    Divita, Guy; Zeng, Qing T; Gundlapalli, Adi V; Duvall, Scott; Nebeker, Jonathan; Samore, Matthew H

    2014-01-01

    An opportunity exists for meaningful concept extraction and indexing from large corpora of clinical notes in the Veterans Affairs (VA) electronic medical record. Currently available tools such as MetaMap, cTAKES and HITex do not scale up to address this big data need. Sophia, a rapid UMLS concept extraction annotator, was developed to fulfill a mandate and address extraction where high throughput is needed while preserving performance. We report on the development, testing and benchmarking of Sophia against MetaMap and cTAKES. Sophia demonstrated improved performance on recall as compared to cTAKES and MetaMap (0.71 vs 0.66 and 0.38). The overall f-score was similar to cTAKES and an improvement over MetaMap (0.53 vs 0.57 and 0.43). With regard to speed of processing records, we noted Sophia to be several fold faster than cTAKES and the scaled-out MetaMap service. Sophia offers a viable alternative for high-throughput information extraction tasks.

  20. Thought Leaders during Crises in Massive Social Networks

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Corley, Courtney D.; Farber, Robert M.; Reynolds, William

    The vast amount of social media data that can be gathered from the internet, coupled with workflows that utilize both commodity systems and massively parallel supercomputers, such as the Cray XMT, opens new vistas for research to support health, defense, and national security. Computer technology now enables the analysis of graph structures containing more than 4 billion vertices joined by 34 billion edges along with metrics and massively parallel algorithms that exhibit near-linear scalability with the number of processors. The challenge lies in making this massive data and analysis comprehensible to an analyst and end-users that require actionable knowledge to carry out their duties. Simply stated, we have developed language- and content-agnostic techniques to reduce large graphs built from vast media corpora into forms people can understand. Specifically, our tools and metrics act as a survey tool to identify 'thought leaders', those members that lead or reflect the thoughts and opinions of an online community, independent of the source language.

  1. Primary extramedullary plasmacytoma of the penis: a case report.

    PubMed

    Wang, Yao; Li, Hong-Yan; Liang, Ting-Ting; Han, Yu-Ping; Wang, Xue-Ju; Wei, Xin; Fan, Li; Wang, Wei-Hua

    2013-10-01

    Extramedullary plasmacytoma involving the penis is extremely rare. Here, we describe a case of primary extramedullary plasmacytoma of the penis in a 64-year-old man who presented with a palpable penile mass. Nuclear magnetic resonance imaging revealed the presence of a large, round non-encapsulated mass in the perineum. A contrast-enhanced computed tomography scan of the pelvis showed that the mass was located in the tunica albuginea and corpora cavernosa at the base of the penis. The mass encased the urethra and demonstrated no marked enhancement during the arterial phase. The patient underwent successful surgical resection of the tumor. Histologically, the tumor was composed primarily of neoplastic plasma cells that were positive for CD38, vimentin and Ki 67. Postoperatively, the patient recovered well and exhibited no evidence of development of multiple myeloma, local recurrence or distant metastasis at 2 months post-surgery. To the best of our knowledge, our case represents the first documented case of human primary extramedullary plasmacytoma of the penis.

  2. Optical urethrotomy under local anaesthesia is a feasible option in urethral stricture disease.

    PubMed

    Munks, D G; Alli, M O; Goad, E H Abdel

    2010-01-01

    The aim of our study was to assess the feasibility of performing optical urethrotomy for urethral stricture disease under local anaesthesia. A total of 33 patients with radiologically proven urethral stricture underwent optical urethrotomy by a single operator under local anaesthesia. Of these patients, 23 (70%) had stricture involving the corpus spongiosum and 18 (55%) of the patients were dependent on supra-pubic catheters. The procedure was successful in 30 cases (91%). The procedure was very well tolerated (average visual analogue pain score of 2/10) with an extremely low complication rate. The large number of patients with urethral stricture disease and the premium on operating time on formal theatre slates encouraged us to perform these procedures under local anaesthetic. Although most patients had severe stricture disease, the majority of cases were successful and very well tolerated. Optical urethrotomy under local anaesthesia could be a viable option in the absence of formal theatre time and the facilities to perform general anaesthesia.

  3. Complex temporal topic evolution modelling using the Kullback-Leibler divergence and the Bhattacharyya distance.

    PubMed

    Andrei, Victor; Arandjelović, Ognjen

    2016-12-01

    The rapidly expanding corpus of medical research literature presents major challenges in the understanding of previous work, the extraction of maximum information from collected data, and the identification of promising research directions. We present a case for the use of advanced machine learning techniques as an aid in this task and introduce a novel methodology that is shown to be capable of extracting meaningful information from large longitudinal corpora and of tracking complex temporal changes within them. Our framework is based on (i) the discretization of time into epochs, (ii) epoch-wise topic discovery using a hierarchical Dirichlet process-based model, and (iii) a temporal similarity graph which allows for the modelling of complex topic changes. More specifically, this is the first work that discusses and distinguishes between two groups of particularly challenging topic evolution phenomena: topic splitting and speciation and topic convergence and merging, in addition to the more widely recognized emergence and disappearance and gradual evolution. The proposed framework is evaluated on a public medical literature corpus.

  4. Sophia: A Expedient UMLS Concept Extraction Annotator

    PubMed Central

    Divita, Guy; Zeng, Qing T; Gundlapalli, Adi V.; Duvall, Scott; Nebeker, Jonathan; Samore, Matthew H.

    2014-01-01

    An opportunity exists for meaningful concept extraction and indexing from large corpora of clinical notes in the Veterans Affairs (VA) electronic medical record. Currently available tools such as MetaMap, cTAKES and HITex do not scale up to address this big data need. Sophia, a rapid UMLS concept extraction annotator, was developed to fulfill a mandate and address extraction where high throughput is needed while preserving performance. We report on the development, testing and benchmarking of Sophia against MetaMap and cTAKES. Sophia demonstrated improved performance on recall as compared to cTAKES and MetaMap (0.71 vs 0.66 and 0.38). The overall f-score was similar to cTAKES and an improvement over MetaMap (0.53 vs 0.57 and 0.43). With regard to speed of processing records, we noted Sophia to be several fold faster than cTAKES and the scaled-out MetaMap service. Sophia offers a viable alternative for high-throughput information extraction tasks. PMID:25954351

  5. Applied Linguistics Project: Student-Led Computer Assisted Research in High School EAL/EAP

    ERIC Educational Resources Information Center

    Bohát, Róbert; Rödlingová, Beata; Horáková, Nina

    2015-01-01

    The Applied Linguistics Project (ALP) started at the International School of Prague (ISP) in 2013. Every year, Grade 9 English as an Additional Language (EAL) students identify an area of learning in need of improvement and design a research method followed by data collection and analysis using basic computer software tools or online corpora.…

  6. Automated Measurement of Syntactic Complexity in Corpus-Based L2 Writing Research and Implications for Writing Assessment

    ERIC Educational Resources Information Center

    Lu, Xiaofei

    2017-01-01

    Research investigating corpora of English learners' language raises new questions about how syntactic complexity is defined theoretically and operationally for second language (L2) writing assessment. I show that syntactic complexity is important in construct definitions and L2 writing rating scales as well as in L2 writing research. I describe…

  7. An Analysis of Document Category Prediction Responses to Classifier Model Parameter Treatment Permutations within the Software Design Patterns Subject Domain

    ERIC Educational Resources Information Center

    Pankau, Brian L.

    2009-01-01

    This empirical study evaluates the document category prediction effectiveness of Naive Bayes (NB) and K-Nearest Neighbor (KNN) classifier treatments built from different feature selection and machine learning settings and trained and tested against textual corpora of 2300 Gang-Of-Four (GOF) design pattern documents. Analysis of the experiment's…

  8. Using Google as a Super Corpus to Drive Written Language Learning: A Comparison with the British National Corpus

    ERIC Educational Resources Information Center

    Sha, Guoquan

    2010-01-01

    Data-driven learning (DDL), or corpus-based language learning, involves the learner in an exploratory task to discover appropriate expressions or collocates regarding his writing. However, the problematic units of meaning in each learner's writing are so diverse that conventional corpora often prove futile. The search engine Google with the…

  9. The Effects of Using Corpora on Revision Tasks in L2 Writing with Coded Error Feedback

    ERIC Educational Resources Information Center

    Tono, Yukio; Satake, Yoshiho; Miura, Aika

    2014-01-01

    This study reports on the results of classroom research investigating the effects of corpus use in the process of revising compositions in English as a foreign language. Our primary aim was to investigate the relationship between the information extracted from corpus data and how that information actually helped in revising different types of…

  10. Unraveling the Dyad: Using Recurrence Analysis to Explore Patterns of Syntactic Coordination between Children and Caregivers in Conversation

    ERIC Educational Resources Information Center

    Dale, Rick; Spivey, Michael J.

    2006-01-01

    Recurrence analysis is introduced as a means to investigate syntactic coordination between child and caregiver. Three CHILDES (MacWhinney, 2000) corpora are analyzed and demonstrate coordination between children and their caregivers in terms of word-class n-gram sequences. Results further indicate that trade-offs in leading or following this…

  11. A Corpus-Based Study on the Use of the Logical Connector "Thus" in the Academic Writing of Turkish EFL Learners

    ERIC Educational Resources Information Center

    Uçar, Serpil; Yükselir, Ceyhun

    2017-01-01

    This research was conducted to investigate how frequently Turkish advanced learners of English use the logical connector "thus" in their academic prose and to investigate whether it was overused, underused or misused semantically in comparison to English native speakers. The data were collected from three corpora; Corpus of Contemporary…

  12. History Classroom Interactions and the Transmission of the Recent Memory of Human Rights Violations in Chile

    ERIC Educational Resources Information Center

    Oteíza, Teresa; Henríquez, Rodrigo; Pinuer, Claudio

    2015-01-01

    The purpose of this article is to examine history classroom interactions in Chilean secondary schools in relation to the transmission of historical memories of human rights violations committed by Augusto Pinochet's dictatorship from 1973 to 1990. The corpora for this research comprise history lessons filmed in the two types of public schools…

  13. Estimating the Latent Number of Types in Growing Corpora with Reduced Cost-Accuracy Trade-Off

    ERIC Educational Resources Information Center

    Hidaka, Shohei

    2016-01-01

    The number of unique words in children's speech is one of the most basic statistics indicating their language development. We may, however, face difficulties when trying to accurately evaluate the number of unique words in a child's growing corpus over time with a limited sample size. This study proposes a novel technique to estimate the latent number…

  14. Pedagogical Models of Concordance Use: Correlations between Concordance User Preferences

    ERIC Educational Resources Information Center

    Ballance, Oliver James

    2017-01-01

    One of the most promising avenues of research in computer-assisted language learning is the potential for language learners to make use of language corpora. However, using a corpus requires use of a corpus tool as an interface, typically a concordancer. How such a tool can be made most accessible to learners is an important issue. Specifically,…

  15. The Use of Corpora and IT in Evaluating Oral Task Competence for Tourism English

    ERIC Educational Resources Information Center

    Fuentes, Alejandro Curado

    2004-01-01

    This paper describes a method in oral fluency evaluation for Tourism English according to a corpus-based lexical approach. Our main research focus is placed on measuring oral skill competence among Tourism English (TE) learners by contrasting their word use and linguistic fluency, achieved in two types of oral tasks, with corpus data frequencies.…

  16. The Vocabulary Thresholds of Business Textbooks and Business Research Articles for EFL Learners

    ERIC Educational Resources Information Center

    Hsu, Wenhua

    2011-01-01

    This research compiled two corpora, one for English-medium textbooks for business core courses (totaling 7.2 million running words) and the other for business research articles (7.62 million tokens), to form a basis of analysis. The results show that knowledge of the most frequent 3500 word families and 5000 word families plus proper nouns would…

  17. Separating Fact and Fiction: The Real Story of Corpus Use in Language Teaching

    ERIC Educational Resources Information Center

    Boulton, Alex

    2013-01-01

    This paper investigates uses of corpora in language learning ("data-driven learning") through analysis of a 600K-word corpus of empirical research papers in the field. The corpus can tell us much--the authors and the countries the studies are conducted in, the types of publication, and so on. The corpus investigation itself starts with…

  18. Recent Developments in Corpus Linguistics and Corpus-Based Research/Department of Linguistics and Modern Language Studies at the Hong Kong Institute of Education

    ERIC Educational Resources Information Center

    Xie, Qin

    2015-01-01

    Corpus linguistics has transformed the landscape of empirical research on languages in recent decades. The proliferation of corpus technology has enabled researchers worldwide to conduct research in their own geographical locations with few hindrances. It has become increasingly commonplace for researchers to compile their own corpora for specific…

  19. Data Analysis Project: Leveraging Massive Textual Corpora Using n-Gram Statistics

    DTIC Science & Technology

    2008-05-01

    oysters cereals oatmeal, sorghum, Frosted Flakes, wheat, Cheerios, oats, maize, rye, millet, bran cities Chicago, Beijing, San Francisco, Los Angeles...Coolmax flatworms no extractions fruits apples, oranges, bananas, grapes, strawberries, peaches, mangoes, mango, peach, pineapple fungi email directories...incidents, leap years, renewals, monsoon, collaborators, HBO, Showtime, carrots, VAT, proposals organisms fungi, humans, algae, Coliform bacteria, E. coli

  20. A Comparison of the Effectiveness of EFL Students' Use of Dictionaries and an Online Corpus for the Enhancement of Revision Skills

    ERIC Educational Resources Information Center

    Mueller, Charles M.; Jacobsen, Natalia D.

    2016-01-01

    Qualitative research focusing primarily on advanced-proficiency second language (L2) learners suggests that online corpora can function as useful reference tools for language learners, especially when addressing phraseological issues. However, the feasibility and effectiveness of online corpus consultation for learners at a basic level of L2…

  1. Optimizing Estimated Loss Reduction for Active Sampling in Rank Learning

    DTIC Science & Technology

    2008-01-01

    active learning framework for SVM-based and boosting-based rank learning. Our approach suggests sampling based on maximizing the estimated loss differential over unlabeled data. Experimental results on two benchmark corpora show that the proposed model substantially reduces the labeling effort, and achieves superior performance rapidly with as much as 30% relative improvement over the margin-based sampling

  2. Imaging the ovary.

    PubMed

    Feng, Yi; Tamadon, Amin; Hsueh, Aaron J W

    2018-05-01

    During each reproductive cycle, the ovary exhibits tissue remodelling and cyclic vasculature changes associated with hormonally regulated folliculogenesis, follicle rupture, luteal formation and regression. However, the relationships among different types of follicles and corpora lutea are unclear, and the role of ovarian vasculature in folliculogenesis and luteal dynamics has not been extensively investigated. Understanding of ovarian physiology and pathophysiology relies upon elucidation of ovarian morphology and architecture. This paper summarizes the literature on traditional approaches to the imaging of ovarian structures and discusses recent advances in ovarian imaging. Traditional in-vivo ultrasound, together with histological and electron microscopic approaches provide detailed views of the ovary at organ, tissue and molecular levels. However, in-vivo imaging is limited to antral and larger follicles whereas histological imaging is mainly two-dimensional in nature. Also discussed are emerging approaches in the use of near-infrared fluorophores to image follicles in live animals to detect preantral follicles as well as visualizing ovarian structures using CLARITY in fixed whole ovaries to elucidate three-dimensional interrelationships among follicles, corpora lutea and ovarian vasculature. Advances in ovarian imaging techniques provide new understanding of ovarian physiology and allow for the development of better tools to diagnose ovarian pathophysiology. Copyright © 2018 Reproductive Healthcare Ltd. All rights reserved.

  3. BALDEY: A database of auditory lexical decisions.

    PubMed

    Ernestus, Mirjam; Cutler, Anne

    2015-01-01

    In an auditory lexical decision experiment, 5541 spoken content words and pseudowords were presented to 20 native speakers of Dutch. The words vary in phonological make-up and in number of syllables and stress pattern, and are further representative of the native Dutch vocabulary in that most are morphologically complex, comprising two stems or one stem plus derivational and inflectional suffixes, with inflections representing both regular and irregular paradigms; the pseudowords were matched in these respects to the real words. The BALDEY ("biggest auditory lexical decision experiment yet") data file includes response times and accuracy rates, with for each item morphological information plus phonological and acoustic information derived from automatic phonemic segmentation of the stimuli. Two initial analyses illustrate how this data set can be used. First, we discuss several measures of the point at which a word has no further neighbours and compare the degree to which each measure predicts our lexical decision response outcomes. Second, we investigate how well four different measures of frequency of occurrence (from written corpora, spoken corpora, subtitles, and frequency ratings by 75 participants) predict the same outcomes. These analyses motivate general conclusions about the auditory lexical decision task. The (publicly available) BALDEY database lends itself to many further analyses.

  4. Anatomy of the clitoris.

    PubMed

    O'Connell, Helen E; Sanjeevan, Kalavampara V; Hutson, John M

    2005-10-01

    We present a comprehensive account of clitoral anatomy, including its component structures, neurovascular supply, relationship to adjacent structures (the urethra, vagina and vestibular glands, and connective tissue supports), histology and immunohistochemistry. We related recent anatomical findings to the historical literature to determine when data on accurate anatomy became available. An extensive review of the current and historical literature was done. The studies reviewed included dissection and microdissection, magnetic resonance imaging (MRI), 3-dimensional sectional anatomy reconstruction, histology and immunohistochemical studies. The clitoris is a multiplanar structure with a broad attachment to the pubic arch and via extensive supporting tissue to the mons pubis and labia. Centrally it is attached to the urethra and vagina. Its components include the erectile bodies (paired bulbs and paired corpora, which are continuous with the crura) and the glans clitoris. The glans is a midline, densely neural, non-erectile structure that is the only external manifestation of the clitoris. All other components are composed of erectile tissue with the composition of the bulbar erectile tissue differing from that of the corpora. The clitoral and perineal neurovascular bundles are large, paired terminations of the pudendal neurovascular bundles. The clitoral neurovascular bundles ascend along the ischiopubic rami to meet each other and pass along the superior surface of the clitoral body supplying the clitoris. The neural trunks pass largely intact into the glans. These nerves are at least 2 mm in diameter even in infancy. The cavernous or autonomic neural anatomy is microscopic and difficult to define consistently. MRI complements dissection studies and clarifies the anatomy. Clitoral pharmacology and histology appear to parallel those of penile tissue, although the clinical impact is vastly different.
Typical textbook descriptions of the clitoris lack detail and include inaccuracies. It is impossible to convey clitoral anatomy in a single diagram showing only 1 plane, as is typically provided in textbooks, which reveal it as a flat structure. MRI provides a multiplanar representation of clitoral anatomy in the live state, which is a major advantage, and complements dissection materials. The work of Kobelt in the early 19th century provides a most comprehensive and accurate description of clitoral anatomy, and modern study provides objective images and few novel findings. The bulbs appear to be part of the clitoris. They are spongy in character and in continuity with the other parts of the clitoris. The distal urethra and vagina are intimately related structures, although they are not erectile in character. They form a tissue cluster with the clitoris. This cluster appears to be the locus of female sexual function and orgasm.

  5. Adaptable, high recall, event extraction system with minimal configuration

    PubMed Central

    2015-01-01

    Background: Biomedical event extraction has been a major focus of biomedical natural language processing (BioNLP) research since the first BioNLP shared task was held in 2009. Accordingly, a large number of event extraction systems have been developed. Most such systems, however, have been developed for specific tasks and/or incorporated task-specific settings, making their application to new corpora and tasks problematic without modification of the systems themselves. There is thus a need for event extraction systems that can achieve high levels of accuracy when applied to corpora in new domains, without the need for exhaustive tuning or modification, whilst retaining competitive levels of performance. Results: We have enhanced our state-of-the-art event extraction system, EventMine, to alleviate the need for task-specific tuning. Task-specific details are specified in a configuration file, while extensive task-specific parameter tuning is avoided through the integration of a weighting method, a covariate shift method, and their combination. The task-specific configuration and weighting method have been employed within the context of two different sub-tasks of BioNLP shared task 2013, i.e. Cancer Genetics (CG) and Pathway Curation (PC), removing the need to modify the system specifically for each task. With minimal task-specific configuration and tuning, EventMine achieved 1st place in the PC task and 2nd in the CG task, achieving the highest recall for both tasks. The system has been further enhanced following the shared task by incorporating the covariate shift method and entity generalisations based on the task definitions, leading to further performance improvements. Conclusions: We have shown that it is possible to apply a state-of-the-art event extraction system to new tasks with high levels of performance, without having to modify the system internally.
Both covariate shift and weighting methods are useful in facilitating the production of high recall systems. These methods and their combination can adapt a model to the target data with no deep tuning and little manual configuration. PMID:26201408

  6. Looking for Structure: Is the Two-Word Stage of Language Development in Apes and Human Children the Same or Different?

    ERIC Educational Resources Information Center

    Patkowski, Mark

    2014-01-01

    Previously published corpora of two-word utterances by three chimpanzees and three human children were compared to determine whether, as has been claimed, apes possess the same basic syntactic and semantic capacities as 2-year old children. Some similarities were observed in the type of semantic relations expressed by the two groups; however,…

  7. Teaching Specialized Vocabulary by Integrating a Corpus-Based Approach: Implications for ESP Course Design at the University Level

    ERIC Educational Resources Information Center

    Hou, Hsiao-I

    2014-01-01

    The purpose of this study is to demonstrate how to integrate two in-house specialized corpora into a university-level English for Specific Purposes (ESP) course for nonnative speakers of English. The ESP course was an introductory level of wine tasting for Applied English Department students at a university specializing in hospitality in Taiwan.…

  8. Linguistic Models, Acquisition Theories, and Learner Corpora: Morphological Productivity in SLA Research Exemplified by Complex Verbs in German

    ERIC Educational Resources Information Center

    Lüdeling, Anke; Hirschmann, Hagen; Shadrova, Anna

    2017-01-01

    The present study analyzes morphological productivity for complex verbs in second language acquisition by analyzing a corpus of German as a Foreign Language (GFL). It shows that advanced learners of GFL use prefix and particle verbs relatively frequently and productively but less so than native speakers do and discusses these findings in the light…

  9. Metrics for Systems Thinking in the Human Dimension

    DTIC Science & Technology

    2016-11-01

    corpora of documents. 2 Methodology Overview We present a human-in-the-loop methodology that assists researchers and analysts by characterizing...supervised learning methods. Building on this foundation, we present an unsupervised, human-in-the-loop methodology that utilizes topic models to...the definition of strong systems thinking and in the interpretation of topics, but this is what makes the human-in-the-loop methodology so effective

  10. Heat-shock protein-25/27 phosphorylation by the delta isoform of protein kinase C.

    PubMed Central

    Maizels, E T; Peters, C A; Kline, M; Cutler, R E; Shanmugam, M; Hunzicker-Dunn, M

    1998-01-01

    Small heat-shock proteins (sHSPs) are widely expressed 25-28 kDa proteins whose functions are dynamically regulated by phosphorylation. While recent efforts have clearly delineated a stress-responsive p38 mitogen-activated protein-kinase (MAPK)-dependent kinase pathway culminating in activation of the heat-shock (HSP)-kinases, mitogen-activated protein-kinase-activated protein kinase-2 and -3, not all sHSP phosphorylation events can be explained by the p38 MAPK-dependent pathway. The contribution of protein kinase C (PKC) to sHSP phosphorylation was suggested by early studies but later questioned on the basis of the reported poor ability of purified PKC to phosphorylate sHSP in vitro. The current study re-evaluates the role of PKC in sHSP phosphorylation in the light of the isoform complexity of the PKC family. We evaluated the sHSP phosphorylation status in rat corpora lutea obtained from two stages of pregnancy, mid-pregnancy and late-pregnancy, which express different levels of the novel PKC isoform, PKC-delta. Two-dimensional Western blot analysis showed that HSP-27 was more highly phosphorylated in vivo in corpora lutea of late pregnancy, corresponding to the developmental stage in which PKC-delta is abundant and active. Late-pregnant luteal extracts contained a lipid-sensitive HSP-kinase activity which exactly co-purified with PKC-delta using hydroxyapatite and S-Sepharose column chromatography. To determine whether there might be preferential phosphorylation of sHSP by a particular PKC isoform, purified recombinant PKC isoforms corresponding to those PKC isoforms detected in rat corpora lutea were evaluated for HSP-kinase activity in vitro. Recombinant PKC-delta effectively catalysed the phosphorylation of sHSP in vitro, and PKC-alpha was 30-50% as effective as an HSP-kinase; other PKCs tested (beta1, beta2, epsilon and zeta) were poor HSP-kinases. These results show that select PKC family members can function as direct HSP-kinases in vitro. 
Moreover, the observation of enhanced luteal HSP-27 phosphorylation in vivo, in late pregnancy, when PKC-delta is abundant and active, suggests that select PKC family members contribute to sHSP phosphorylation events in vivo. PMID:9620873

  11. Doppler sonography of the uterine arteries during a superovulatory regime in cattle. Uterine blood flow in superovulated cattle.

    PubMed

    Honnens, A; Niemann, H; Paul, V; Meyer, H H D; Bollwein, H

    2008-09-15

    Transrectal color Doppler sonography was used to investigate the effects of a gonadotropin treatment to induce superovulation on uterine blood flow and its relationship with steroid hormone levels, ovarian response and embryo yield in dairy cows. The estrous cycle of 42 cows was synchronized by using PGF(2alpha) during diestrus and GnRH 48 h later (Day 0). Cows were examined on the day of eCG (2750 IU)-administration (Day 10), 3 days after eCG (Day 13) and 7 days after artificial insemination (Day 22), including the determination of total estrogens (E) and progesterone (P(4)) in peripheral plasma. Eight days after insemination (Day 23) the uterus was flushed and the number of total ova and embryos as well as transferable embryos was determined. The ovarian response was defined by the number of follicles >5.0 mm in diameter on Day 13 and the number of corpora lutea on Day 22. Uterine blood flow was reflected by the blood flow volume (BFV) and the pulsatility index (PI) in the uterine arteries. Both variables showed distinct changes throughout the superovulatory cycle: BFV increased by 94% and PI decreased by 30% between Days 10 and 22 (P<0.0001). On Day 13, BFV but not PI correlated with follicle numbers (r=0.35; P<0.05); no correlation was found with E and P(4) (P>0.05). On Day 22, BFV correlated positively and PI correlated negatively with the number of corpora lutea (r=0.45 and r=-0.37; P<0.05) and P(4) (r=0.39 and r=-0.30; P<0.05). The number of transferable embryos was solely related to BFV measured on Day 13 (r=0.32; P<0.05). Our results show for the first time that in cows a superovulatory treatment is associated with a marked increase in BFV and a marked decrease in PI in the uterine arteries, concurrent with the development of multiple follicles and corpora lutea. However, transrectal color Doppler sonography of the uterine arteries does not facilitate the prediction of embryo yields following superovulatory treatment.

  12. Induced ovulation and conception in lactating sows.

    PubMed

    Hausler, C L; Hodson, H H; Kuo, D C; Kinney, T J; Rauwolf, V A; Strack, L E

    1980-05-01

    Fifty lactating sows were injected with 1,500 IU pregnant mare serum gonadotrophin (PMSG) at an average of 25 days postpartum. Twenty-four of these sows received prostaglandin F2 alpha (PGF2 alpha) 24 hr prior to PMSG. Ninety-six hours after the PMSG injection, 1,000 IU of human chorionic gonadotrophin (HCG) were injected. Artificial insemination was performed at 24 and 36 to 42 hr post-HCG. The PMSG/HCG treatment resulted in pregnancy in 17 of 20 sows slaughtered from 34 to 43 days postbreeding and in 23 of 30 sows allowed to complete gestation. Mean numbers of corpora lutea (33) and viable embryos (15) were counted at slaughter. Litter sizes were averaged (11) for those sows allowed to farrow. Treatment with PGF2 alpha prior to PMSG injection had no effect on conception rates, number of corpora lutea, number of embryos or litter size in the lactating sows. In a second experiment, the same hormone treatments were administered to lactating sows beginning on day 5, 10, 15 or 20 postpartum. Pregnancy rates were 0/10, 2/10, 8/10 and 6/10, respectively (P less than .05, chi-square). At slaughter (30 to 40 days postbreeding), corpora lutea and embryo numbers recorded from pregnant sows were 23.0, 9.5; 31.5, 15.3, and 28.0, 18.8, respectively, for the sows in the day 10, day 15 and day 20 groups. In a third experiment, sows were given PMSG-HCG as previously described on either day 5 (five sows) or day 10 (14 sows) postpartum. Laparotomy of these sows 2 to 5 days postbreeding revealed minimal ovarian responsiveness at day 5, but 43% of the animals responded with multiple ovulations at day 10. The low pregnancy rate seen at day 10 in Exp. 2 may reflect embryonic mortality due to unfavorable uterine environment. We conclude that the PMSG/HCG treatment followed by timed artificial insemination of lactating sows will induce ovulation and conception as early as 15 days postfarrowing.
Pregnancy is thus concurrent with lactation, eliminating the need for early weaning and reducing the interval between successive farrowings.

  13. Correlations between ultrasonographic characteristics of corpora lutea and systemic concentrations of progesterone during the discrete stages of corpora lutea lifespan and secretory activity in cyclic ewes.

    PubMed

    Gallienne, Jacqueline; Gregg, Caroline; LeBlanc, Evan; Yaakob, Norazlin; Wu, Di; Davies, Kate; Rawlings, Norman; Pierson, Roger; Deardon, Rob; Bartlewski, Pawel

    2012-05-01

    Associations between physical characteristics and functionality of corpora lutea (CL) have previously been reported in monovulatory species, albeit several studies in cattle and humans have refuted the existence of temporal relationships between CL size, echotexture and serum progesterone (P(4)) concentrations. The main objective of the present study was to examine whether or not there were correlations between ultrasonographic image attributes of CL and systemic concentrations of P(4) during the discrete stages of the luteal phase in two breeds of sheep differing in ovulation rates (non-prolific Western White Face [WWF] ewes and prolific Finn [F] sheep). Transrectal ovarian ultrasonography utilized a 7.5-MHz linear-array transducer connected to a portable scanner (Aloka SSD-500) and the images were analyzed using commercially available image analytical software (Image ProPlus(®)) validated for the present application in sheep. The correlations were assessed using the Pearson's Product Moment (PPM) analysis and also, to increase the accuracy of statistical tests, the analysis of covariance (ANCOVA), with the number of CL as a co-factor. In WWF ewes, serum concentrations of P(4) correlated significantly with the total luteal area (TLA) during the CL growth phase (days 3-6; day 0 = ovulation) and functional luteolysis (days 12-15), and with numerical pixel values (NPVs--pixel intensity) during luteolysis; the results obtained by using two different statistical methods were generally similar. In prolific F ewes, serum P(4) concentrations were directly correlated with TLA during CL growth (days 3-6; ANCOVA), functional luteolysis (days 13-14; PPM), and structural CL regression (days 11-14; PPM and ANCOVA), and with NPVs during functional luteolysis (PPM and ANCOVA). 
We concluded that systemic P(4) concentrations could only be accurately predicted from the changes in luteal area during CL growth and regression, and from NPVs during luteolysis, in both prolific and non-prolific ewes, but the changes in size and echotexture of the luteal glands at mid-cycle were not indicative of serum P(4) concentrations in sheep.

  14. Biomedical literature classification using encyclopedic knowledge: a Wikipedia-based bag-of-concepts approach.

    PubMed

    Mouriño García, Marcos Antonio; Pérez Rodríguez, Roberto; Anido Rifón, Luis E

    2015-01-01

    Automatic classification of text documents into a set of categories has a lot of applications. Among those applications, the automatic classification of biomedical literature stands out as an important application for automatic document classification strategies. Biomedical staff and researchers have to deal with a lot of literature in their daily activities, so a system that gives access to documents of interest in a simple and effective way would be useful; thus, these documents must be sorted based on some criteria, that is to say, they have to be classified. Documents to classify are usually represented following the bag-of-words (BoW) paradigm. Features are words in the text (thus suffering from synonymy and polysemy) and their weights are just based on their frequency of occurrence. This paper presents an empirical study of the efficiency of a classifier that leverages encyclopedic background knowledge (concretely Wikipedia) in order to create bag-of-concepts (BoC) representations of documents, understanding concept as "unit of meaning", and thus tackling synonymy and polysemy. Besides, the weighting of concepts is based on their semantic relevance in the text. For the evaluation of the proposal, empirical experiments have been conducted with one of the commonly used corpora for evaluating classification and retrieval of biomedical information, OHSUMED, and also with a purpose-built corpus of MEDLINE biomedical abstracts, UVigoMED. Results obtained show that the Wikipedia-based bag-of-concepts representation outperforms the classical bag-of-words representation by up to 157% in the single-label classification problem and up to 100% in the multi-label problem for the OHSUMED corpus, and by up to 122% in the single-label classification problem and up to 155% in the multi-label problem for the UVigoMED corpus.
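
    The contrast between the two representations described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' system: the concept mapping `CONCEPT_MAP` and both function names are hypothetical, and concepts are weighted by raw counts here, whereas the abstract says the paper weights them by semantic relevance.

```python
import re
from collections import Counter

# Hypothetical surface-form -> concept table, for illustration only.
# A real system would derive this from Wikipedia; this one is hand-written.
CONCEPT_MAP = {
    "heart attack": "Myocardial_infarction",
    "myocardial infarction": "Myocardial_infarction",
}


def bag_of_words(text):
    """Count lowercase word tokens (the classical BoW representation)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def bag_of_concepts(text, concept_map=CONCEPT_MAP):
    """Count concept identifiers instead of surface words."""
    lowered = text.lower()
    bag = Counter()
    # Match longer phrases first so they are not split into shorter ones.
    for phrase in sorted(concept_map, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        hits = len(re.findall(pattern, lowered))
        if hits:
            bag[concept_map[phrase]] += hits
            lowered = re.sub(pattern, " ", lowered)
    return bag


doc_a = "Heart attack risk"
doc_b = "Myocardial infarction risk"
print(bag_of_words(doc_a), bag_of_words(doc_b))        # different word features
print(bag_of_concepts(doc_a), bag_of_concepts(doc_b))  # same concept feature
```

    In this toy case the two documents share only the word "risk" under BoW but map to the same concept under BoC, which is the synonymy problem the abstract refers to.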

  15. QTLTableMiner++: semantic mining of QTL tables in scientific articles.

    PubMed

    Singh, Gurnoor; Kuzniar, Arnold; van Mulligen, Erik M; Gavai, Anand; Bachem, Christian W; Visser, Richard G F; Finkers, Richard

    2018-05-25

    A quantitative trait locus (QTL) is a genomic region that correlates with a phenotype. Most of the experimental information about QTL mapping studies is described in tables of scientific publications. Traditional text mining techniques aim to extract information from unstructured text rather than from tables. We present QTLTableMiner++ (QTM), a table mining tool that extracts and semantically annotates QTL information buried in (heterogeneous) tables of plant science literature. QTM is a command line tool written in the Java programming language. This tool takes scientific articles from the Europe PMC repository as input and extracts QTL tables using keyword matching and ontology-based concept identification. The tables are further normalized using rules derived from table properties such as captions, column headers and table footers. Furthermore, table columns are classified into three categories, namely column descriptors, properties and values, based on column headers and data types of cell entries. Abbreviations found in the tables are expanded using the Schwartz and Hearst algorithm. Finally, the content of QTL tables is semantically enriched with domain-specific ontologies (e.g. Crop Ontology, Plant Ontology and Trait Ontology) using the Apache Solr search platform and the results are stored in a relational database and a text file. The performance of the QTM tool was assessed by precision and recall based on the information retrieved from two manually annotated corpora of open access articles, i.e. QTL mapping studies in tomato (Solanum lycopersicum) and in potato (S. tuberosum). In summary, QTM detected QTL statements in tomato with 74.53% precision and 92.56% recall and in potato with 82.82% precision and 98.94% recall. QTM is a unique tool that aids in providing QTL information in machine-readable and semantically interoperable formats.
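
    For readers unfamiliar with the two evaluation measures quoted above, the following Python sketch shows how precision and recall are usually computed against a manually annotated reference set. The function name and the toy identifiers are assumptions for illustration; the abstract does not describe how QTM's evaluation was implemented.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of predicted items against a gold set.

    precision = correct predictions / all predictions
    recall    = correct predictions / all gold items
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy example with made-up statement identifiers.
extracted = {"q1", "q2", "q3", "q4"}
annotated = {"q1", "q2", "q3", "q5", "q6"}
print(precision_recall(extracted, annotated))  # (0.75, 0.6)
```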

  16. How strongly do word reading times and lexical decision times correlate? Combining data from eye movement corpora and megastudies.

    PubMed

    Kuperman, Victor; Drieghe, Denis; Keuleers, Emmanuel; Brysbaert, Marc

    2013-01-01

    We assess the amount of shared variance between three measures of visual word recognition latencies: eye movement latencies, lexical decision times, and naming times. After partialling out the effects of word frequency and word length, two well-documented predictors of word recognition latencies, we see that 7-44% of the variance is uniquely shared between lexical decision times and naming times, depending on the frequency range of the words used. A similar analysis of eye movement latencies shows that the percentage of variance they uniquely share either with lexical decision times or with naming times is much lower. It is 5-17% for gaze durations and lexical decision times in studies with target words presented in neutral sentences, but drops to 0.2% for corpus studies in which eye movements to all words are analysed. Correlations between gaze durations and naming latencies are lower still. These findings suggest that processing times in isolated word processing and continuous text reading are affected by specific task demands and presentation format, and that lexical decision times and naming times are not very informative in predicting eye movement latencies in text reading once the effects of word frequency and word length are taken into account. The difference between controlled experiments and natural reading suggests that reading strategies and stimulus materials may determine the degree to which the immediacy-of-processing assumption and the eye-mind assumption apply. Fixation times are more likely to exclusively reflect the lexical processing of the currently fixated word in controlled studies with unpredictable target words rather than in natural reading of sentences or texts.

  17. Bayesian Recurrent Neural Network for Language Modeling.

    PubMed

    Chien, Jen-Tzung; Ku, Yuan-Chu

    2016-02-01

    A language model (LM) is calculated as the probability of a word sequence that provides the solution to word prediction for a variety of information systems. A recurrent neural network (RNN) is a powerful way to learn the large-span dynamics of a word sequence in the continuous space. However, the training of the RNN-LM is an ill-posed problem because of too many parameters from a large dictionary size and a high-dimensional hidden layer. This paper presents a Bayesian approach to regularize the RNN-LM and apply it to continuous speech recognition. We aim to penalize an overly complicated RNN-LM by compensating for the uncertainty of the estimated model parameters, which is represented by a Gaussian prior. The objective function in a Bayesian classification network is formed as the regularized cross-entropy error function. The regularized model is constructed not only by calculating the regularized parameters according to the maximum a posteriori criterion but also by estimating the Gaussian hyperparameter by maximizing the marginal likelihood. A rapid approximation to a Hessian matrix is developed to implement the Bayesian RNN-LM (BRNN-LM) by selecting a small set of salient outer-products. The proposed BRNN-LM achieves a sparser model than the RNN-LM. Experiments on different corpora show the robustness of system performance by applying the rapid BRNN-LM under different conditions.
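
    The "regularized cross-entropy error function" mentioned above can be written out in its generic textbook form. The notation below is an assumption (the abstract gives no symbols): \mathbf{w} are the network parameters, y_{tk} the one-hot target for word k at position t, \hat{y}_{tk} the predicted probability, and \alpha the precision of a zero-mean isotropic Gaussian prior. The paper's exact formulation may differ.

```latex
% Zero-mean Gaussian prior over the parameters (assumed isotropic):
p(\mathbf{w} \mid \alpha) = \mathcal{N}\!\left(\mathbf{w} \mid \mathbf{0},\, \alpha^{-1}\mathbf{I}\right)

% The negative log-posterior is, up to an additive constant, the
% cross-entropy error plus a quadratic penalty on the weights:
E(\mathbf{w}) = -\sum_{t=1}^{T} \sum_{k=1}^{K} y_{tk} \ln \hat{y}_{tk}(\mathbf{w})
                + \frac{\alpha}{2}\, \mathbf{w}^{\top}\mathbf{w}

% Maximum a posteriori estimate:
\mathbf{w}_{\mathrm{MAP}} = \arg\min_{\mathbf{w}} E(\mathbf{w})
```

    Under these assumptions, a larger \alpha corresponds to a tighter prior and a stronger penalty on large weights; the abstract states that this hyperparameter is estimated by maximizing the marginal likelihood rather than being set by hand.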

  18. Men, women…who cares? A population-based study on sex differences and gender roles in empathy and moral cognition.

    PubMed

    Baez, Sandra; Flichtentrei, Daniel; Prats, María; Mastandueno, Ricardo; García, Adolfo M; Cetkovich, Marcelo; Ibáñez, Agustín

    2017-01-01

    Research on sex differences in empathy has revealed mixed findings. Whereas experimental and neuropsychological measures show no consistent sex effect, self-report data consistently indicates greater empathy in women. However, available results mainly come from separate populations with relatively small samples, which may inflate effect sizes and hinder comparability between both empirical corpora. To elucidate the issue, we conducted two large-scale studies. First, we examined whether sex differences emerge in a large population-based sample (n = 10,802) when empathy is measured with an experimental empathy-for-pain paradigm. Moreover, we investigated the relationship between empathy and moral judgment. In the second study, a subsample (n = 334) completed a self-report empathy questionnaire. Results showed some sex differences in the experimental paradigm, but with minuscule effect sizes. Conversely, women did portray themselves as more empathic through self-reports. In addition, utilitarian responses to moral dilemmas were less frequent in women, although these differences also had small effect sizes. These findings suggest that sex differences in empathy are highly driven by the assessment measure. In particular, self-reports may induce biases leading individuals to assume gender-role stereotypes. Awareness of the role of measurement instruments in this field may hone our understanding of the links between empathy, sex differences, and gender roles.

  19. Your Laptop to the Rescue: Using the Child Language Data Exchange System Archive and CLAN Utilities to Improve Child Language Sample Analysis.

    PubMed

    Ratner, Nan Bernstein; MacWhinney, Brian

    2016-05-01

    In this article, we review the advantages of language sample analysis (LSA) and explain how clinicians can make the process of LSA faster, easier, more accurate, and more insightful than LSA done "by hand" by using free, available software programs such as Computerized Language Analysis (CLAN). We demonstrate the utility of CLAN analysis in studying the expressive language of a very large cohort of 24-month-old toddlers tracked in a recent longitudinal study; toddlers in particular are the most likely group to receive LSA by clinicians, but existing reference "norms" for this population are based on fairly small cohorts of children. Finally, we demonstrate how a CLAN utility such as KidEval can now extract potential normative data from the very large number of corpora now available for English and other languages at the Child Language Data Exchange System project site. Most of the LSA measures that we studied appear to show developmental profiles suggesting that they may be of specifically higher value for children at certain ages, because they do not show an even developmental trajectory from 2 to 7 years of age.

  20. Parsing clinical text: how good are the state-of-the-art parsers?

    PubMed

    Jiang, Min; Huang, Yang; Fan, Jung-wei; Tang, Buzhou; Denny, Josh; Xu, Hua

    2015-01-01

    Parsing, which generates a syntactic structure of a sentence (a parse tree), is a critical component of natural language processing (NLP) research in any domain including medicine. Although parsers developed in the general English domain, such as the Stanford parser, have been applied to clinical text, there are no formal evaluations and comparisons of their performance in the medical domain. In this study, we investigated the performance of three state-of-the-art parsers: the Stanford parser, the Bikel parser, and the Charniak parser, using the following two datasets: (1) A Treebank containing 1,100 sentences that were randomly selected from progress notes used in the 2010 i2b2 NLP challenge and manually annotated according to a Penn Treebank based guideline; and (2) the MiPACQ Treebank, which is developed based on pathology notes and clinical notes, containing 13,091 sentences. We conducted three experiments on both datasets. First, we measured the performance of the three state-of-the-art parsers on the clinical Treebanks with their default settings. Then we re-trained the parsers using the clinical Treebanks and evaluated their performance using the 10-fold cross validation method. Finally we re-trained the parsers by combining the clinical Treebanks with the Penn Treebank. Our results showed that the original parsers achieved lower performance in clinical text (Bracketing F-measure in the range of 66.6%-70.3%) compared to general English text. After retraining on the clinical Treebank, all parsers achieved better performance, with the best performance from the Stanford parser that reached the highest Bracketing F-measure of 73.68% on progress notes and 83.72% on the MiPACQ corpus using 10-fold cross validation.
When the combined clinical Treebanks and Penn Treebank were used, of the three parsers, the Charniak parser achieved the highest Bracketing F-measure of 73.53% on progress notes and the Stanford parser reached the highest F-measure of 84.15% on the MiPACQ corpus. Our study demonstrates that re-training using clinical Treebanks is critical for improving general English parsers' performance on clinical text, and combining clinical and open domain corpora might achieve optimal performance for parsing clinical text.
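
    The Bracketing F-measure reported in this abstract is, in its usual form, the harmonic mean of labelled-bracket precision and recall. The Python sketch below shows that calculation on a toy parse. The span encoding `(label, start, end)` and the function name are assumptions, and standard scoring tools apply extra conventions (for example around punctuation) that are not modelled here.

```python
from collections import Counter


def bracketing_f_measure(predicted, gold):
    """F-measure over labelled constituent spans given as (label, start, end)."""
    pred, ref = Counter(predicted), Counter(gold)
    # Multiset intersection: each gold bracket can be matched at most once.
    matched = sum((pred & ref).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(pred.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Toy five-token sentence; the parser misses one gold noun phrase.
gold_spans = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
parser_spans = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5)]
print(bracketing_f_measure(parser_spans, gold_spans))  # precision 1.0, recall 0.75
```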

  1. Biological network extraction from scientific literature: state of the art and challenges.

    PubMed

    Li, Chen; Liakata, Maria; Rebholz-Schuhmann, Dietrich

    2014-09-01

    Networks of molecular interactions explain complex biological processes, and all known information on molecular events is contained in a number of public repositories including the scientific literature. Metabolic and signalling pathways are often viewed separately, even though both types are composed of interactions involving proteins and other chemical entities. It is necessary to be able to combine data from all available resources to judge the functionality, complexity and completeness of any given network overall, but especially the full integration of relevant information from the scientific literature is still an ongoing and complex task. Currently, the text-mining research community is steadily moving towards processing the full body of the scientific literature by making use of rich linguistic features such as full text parsing, to extract biological interactions. The next step will be to combine these with information from scientific databases to support hypothesis generation for the discovery of new knowledge and the extension of biological networks. The generation of comprehensive networks requires technologies such as entity grounding, coordination resolution and co-reference resolution, which are not fully solved and are required to further improve the quality of results. Here, we analyse the state of the art for the extraction of network information from the scientific literature and the evaluation of extraction methods against reference corpora, discuss challenges involved and identify directions for future research. © The Author 2013. Published by Oxford University Press.

  2. Structural Metadata Research in the EARS Program

    DTIC Science & Technology

    2005-01-01

    detecting structural information in the word stream (the so-called “structural MDE” portion of the EARS program); other MDE efforts on speaker ... diarization are overviewed in [13]. The rest of this paper is organized as follows. We describe the structural MDE tasks, performance measurement, and corpora...tems have only recently been introduced, with NIST reporting results with the Wilcoxon signed rank test for speaker-level average score differences

  3. A Study of Quantitative Measurements of Programmer Productivity for Fleet Material Support Office (FMSO).

    DTIC Science & Technology

    1982-12-01

    paper examines the various measures discussed in the literature and used in selected corporations which develop software. It presents several methods for... D. SELECTED INDUSTRY METHODS FOR MEASURING PRODUCTIVITY: 1. IBM; 2. Amdahl

  4. Insights from a Learner Corpus as Opposed to a Native Corpus about Cohesive Devices in an Academic Writing Context

    ERIC Educational Resources Information Center

    Ersanli, Ceylan Yangin

    2015-01-01

    This study reports on the insights from an EFL learner corpus (a total of 151 essays and 49,690 words) generated from essays collected over the years in a Turkish state university from freshmen students enrolling in the Advanced Writing course. The comparison of cohesive devices in the non-native corpus (NNC) with those in a native corpus (NC)…

  5. Sources of Variability in Language Development of Children with Cochlear Implants: Age at Implantation, Parental Language, and Early Features of Children's Language Construction

    ERIC Educational Resources Information Center

    Szagun, Gisela; Schramm, Satyam A.

    2016-01-01

    The aim of the present study was to analyze the relative influence of age at implantation, parental expansions, and child language internal factors on grammatical progress in children with cochlear implants (CI). Data analyses used two longitudinal corpora of spontaneous speech samples, one with twenty-two and one with twenty-six children,…

  6. CHOOSING A FORM OF BUSINESS ORGANIZATION.

    DTIC Science & Technology

    Contents: Non-tax considerations in the choice of business form Tax considerations in the choice of a form of business association Tax control and...choice of business form Subchapter S corporations (Int. Rev. Code of 1954, paragraphs 1371-77), permitting shareholders of small business corporations...to be taxed directly on the corporation’s earnings rather than have the corporation taxed as an entity at normal corporate rates Choice of business

  7. Informational Packaging, Level of Formality, and the Use of Circumstance Adverbials in L1 and L2 Student Academic Presentations

    ERIC Educational Resources Information Center

    Zareva, Alla

    2009-01-01

    The analysis of circumstance adverbials in this paper was based on L1 and L2 corpora of student presentations, each consisting of approximately 30,000 words. The overall goal of the investigation was to identify specific functions L1 and L2 college students attributed to circumstance adverbials (the most frequently used adverbial class in…

  8. Advances in understanding of mammalian penile evolution, human penile anatomy and human erection physiology: Clinical implications for physicians and surgeons

    PubMed Central

    Hsieh, Cheng-Hsing; Liu, Shih-Ping; Hsu, Geng-Long; Chen, Heng-Shuen; Molodysky, Eugen; Chen, Ying-Hui; Yu, Hong-Jeng

    2012-01-01

    Summary Recent studies substantiate a model of the tunica albuginea of the corpora cavernosa as a bi-layered structure with a 360° complete inner circular layer and a 300° incomplete outer longitudinal coat spanning from the bulbospongiosus and ischiocavernosus proximally and extending continuously into the distal ligament within the glans penis. The anatomical location and histology of the distal ligament invites convincing parallels with the quadrupedal os penis and therefore constitutes potential evidence of the evolutionary process. In the corpora cavernosa, a chamber design is responsible for facilitating rigid erections. For investigating its venous factors exclusively, hemodynamic studies have been performed on both fresh and defrosted human male cadavers. In each case, a rigid erection was unequivocally attainable following venous removal. This clearly has significant ramifications in relation to penile venous surgery and its role in treating impotent patients. One deep dorsal vein, 2 cavernosal veins and 2 pairs of para-arterial veins (as opposed to 1 single vein) are situated between Buck’s fascia and the tunica albuginea. These newfound insights into penile tunical, venous anatomy and erection physiology were inspired by and, in turn, enhance clinical applications routinely encountered by physicians and surgeons, such as penile morphological reconstruction, penile implantation and penile venous surgery. PMID:22739749

  9. A frame selective dynamic programming approach for noise robust pitch estimation.

    PubMed

    Yarra, Chiranjeevi; Deshmukh, Om D; Ghosh, Prasanta Kumar

    2018-04-01

    The principles of the existing pitch estimation techniques are often different and complementary in nature. In this work, a frame selective dynamic programming (FSDP) method is proposed which exploits the complementary characteristics of two existing methods, namely, sub-harmonic to harmonic ratio (SHR) and sawtooth-wave inspired pitch estimator (SWIPE). Using variants of SHR and SWIPE, the proposed FSDP method classifies all the voiced frames into two classes: the first class consists of the frames where a confidence score maximization criterion is used for pitch estimation, while for the second class, a dynamic programming (DP) based approach is proposed. Experiments are performed on speech signals separately from KEELE, CSLU, and PaulBaghsaw corpora under clean and additive white Gaussian noise at 20, 10, 5, and 0 dB SNR conditions using four baseline schemes including SHR, SWIPE, and two DP based techniques. The pitch estimation performance of FSDP, when averaged over all SNRs, is found to be better than those of the baseline schemes suggesting the benefit of applying smoothness constraint using DP in selected frames in the proposed FSDP scheme. The VuV classification error from FSDP is also found to be lower than that from all four baseline schemes in almost all SNR conditions on three corpora.
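
    The idea of imposing a smoothness constraint with dynamic programming can be sketched generically in Python. This is not the FSDP method: it applies DP to every frame instead of a selected subset, and the cost function, the `jump_penalty` parameter and the function name are all assumptions made for illustration.

```python
def smooth_pitch_track(candidates, scores, jump_penalty=0.01):
    """Pick one pitch candidate per frame with a Viterbi-style DP.

    candidates[t] -- candidate pitch values (Hz) for frame t
    scores[t][i]  -- confidence of candidates[t][i] (higher is better)
    The path cost is minus the summed confidence plus
    jump_penalty * |pitch change| between consecutive frames.
    """
    if not candidates:
        return []
    best = [[-s for s in scores[0]]]  # best[t][j]: cheapest path ending at j
    back = []                         # back[t-1][j]: predecessor of j at frame t
    for t in range(1, len(candidates)):
        costs, pointers = [], []
        for j, pitch in enumerate(candidates[t]):
            options = [
                best[-1][i] + jump_penalty * abs(pitch - previous)
                for i, previous in enumerate(candidates[t - 1])
            ]
            i_best = min(range(len(options)), key=options.__getitem__)
            costs.append(options[i_best] - scores[t][j])
            pointers.append(i_best)
        best.append(costs)
        back.append(pointers)
    # Trace back from the cheapest final state.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[t][i] for t, i in enumerate(path)]


# The middle frame's top-scoring candidate is an octave error.
cands = [[100, 200], [102, 204], [101, 202]]
confs = [[0.9, 0.5], [0.4, 0.6], [0.9, 0.5]]
print(smooth_pitch_track(cands, confs, jump_penalty=0.0))   # [100, 204, 101]
print(smooth_pitch_track(cands, confs, jump_penalty=0.01))  # [100, 102, 101]
```

    With no penalty the result is the frame-wise confidence maximum, including the octave jump; with a small penalty the DP path stays near 100 Hz. The abstract's method combines both criteria by choosing, per frame, which one to apply.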

  10. Alternative treatment of ovarian cysts with Tribulus terrestris extract: a rat model.

    PubMed

    Dehghan, A; Esfandiari, A; Bigdeli, S Momeni

    2012-02-01

    Tribulus terrestris has long been used in traditional medicine to treat impotency and improve sexual functions in man. The aim of this study was to evaluate the efficiency of T. terrestris extract in the treatment of polycystic ovary (PCO) in Wistar rat. Estradiol valerate was injected to 15 mature Wistar rats to induce PCO. Rats were randomly divided into three groups (control, low-dose and high-dose groups) of five each and received 0, 5 and 10 mg of T. terrestris extract, respectively. Treatments began on days 50 and 61 after estradiol injection; at the same time, vaginal smear was prepared. The ovaries were removed on day 62, and histological sections were prepared accordingly. The number and diameter of corpora lutea, thickness of the theca interna layer and the number of all follicles were evaluated in both ovaries. In comparison with the control group, the number of corpora lutea and primary and secondary follicles significantly increased following T. terrestris treatment; however, the number of ovarian cysts significantly decreased. It can be concluded that T. terrestris has a luteinizing effect on ovarian cysts, which may relate to its gonadotropin-like activity; also, a high dose of the extract can efficiently remove ovarian cysts and resume ovarian activity. © 2011 Blackwell Verlag GmbH.

  11. Myostatin, a profibrotic factor and the main inhibitor of striated muscle mass, is present in the penile and vascular smooth muscle.

    PubMed

    Kovanecz, I; Masouminia, M; Gelfand, R; Vernet, D; Rajfer, J; Gonzalez-Cadavid, N F

    2017-09-01

    Myostatin is present in striated myofibers but, except for myometrial cells, has not been reported within smooth muscle cells (SMC). We investigated in the rat whether myostatin is present in SMC within the penis and the vascular wall and, if so, whether it is transcriptionally expressed and associated with the loss of corporal SMC occurring in certain forms of erectile dysfunction (ED). Myostatin protein was detected by immunohistochemistry/fluorescence and western blots in the perineal striated muscles, and also in the SMC of the penile corpora, arteries and veins, and aorta. Myostatin was found in corporal SMC cultures, and its transcriptional expression (and its receptor) was shown there by DNA microarrays. Myostatin protein was measured by western blots in the penile shaft of rats subjected to bilateral cavernosal nerve resection (BCNR), that were left untreated, or treated (45 days) with muscle-derived stem cells (MDSC), or concurrent daily low-dose sildenafil. Myostatin was not increased by BCNR (compared with sham operated animals), but was overexpressed after treatment with MDSC. This was reduced by concurrent sildenafil. The presence of myostatin in corporal and vascular SMC, and its overexpression in the corpora by MDSC therapy, may have relevance for the stem cell treatment of corporal fibrosis and ED.

  12. A bootstrapping method for development of Treebank

    NASA Astrophysics Data System (ADS)

    Zarei, F.; Basirat, A.; Faili, H.; Mirain, M.

    2017-01-01

    Using statistical approaches alongside the traditional methods of natural language processing could significantly improve both the quality and performance of several natural language processing (NLP) tasks. The effective usage of these approaches is subject to the availability of informative, accurate and detailed corpora on which the learners are trained. This article introduces a bootstrapping method for developing annotated corpora based on a complex and rich linguistically motivated elementary structure called supertag. To this end, a hybrid method for supertagging is proposed that combines generative and discriminative methods of supertagging. The method was applied to a subset of the Wall Street Journal (WSJ) in order to annotate its sentences with a set of linguistically motivated elementary structures of the English XTAG grammar, which uses a lexicalised tree-adjoining grammar formalism. The empirical results confirm that the bootstrapping method provides a satisfactory way of annotating the English sentences with the mentioned structures. The experiments show that the method could automatically annotate about 20% of WSJ with an F-measure of about 80%, which is 12% higher than the F-measure of the XTAG Treebank automatically generated by the approach proposed by Basirat and Faili [(2013). Bridge the gap between statistical and hand-crafted grammars. Computer Speech and Language, 27, 1085-1104].

  13. Type VII Collagen Expression in the Human Vitreoretinal Interface, Corpora Amylacea and Inner Retinal Layers

    PubMed Central

    Wullink, Bart; Pas, Hendri H.; Van der Worp, Roelofje J.; Kuijer, Roel; Los, Leonoor I.

    2015-01-01

    Type VII collagen, as a major component of anchoring fibrils found at basement membrane zones, is crucial in anchoring epithelial tissue layers to their underlying stroma. Recently, type VII collagen was discovered in the inner human retina by means of immunohistochemistry, while proteomic investigations demonstrated type VII collagen at the vitreoretinal interface of chicken. Because of its potential anchoring function at the vitreoretinal interface, we further assessed the presence of type VII collagen at this site. We evaluated the vitreoretinal interface of human donor eyes by means of immunohistochemistry, confocal microscopy, immunoelectron microscopy, and Western blotting. Firstly, type VII collagen was detected alongside vitreous fibers at the vitreoretinal interface. Because of its known anchoring function, it is likely that type VII collagen is involved in vitreoretinal attachment. Secondly, type VII collagen was found within cytoplasmic vesicles of inner retinal cells. These cells resided most frequently in the ganglion cell layer and inner plexiform layer. Thirdly, type VII collagen was found in astrocytic cytoplasmic inclusions, known as corpora amylacea. The intraretinal presence of type VII collagen was confirmed by Western blotting of homogenized retinal preparations. These data add to the understanding of vitreoretinal attachment, which is important for a better comprehension of common vitreoretinal attachment pathologies. PMID:26709927

  14. Distribution of thyroid hormone and thyrotropin receptors in reproductive tissues of adult female rabbits.

    PubMed

    Rodríguez-Castelán, Julia; Anaya-Hernández, Arely; Méndez-Tepepa, Maribel; Martínez-Gómez, Margarita; Castelán, Francisco; Cuevas-Romero, Estela

    2017-02-01

    Thyroid dysfunctions are related to anovulation, miscarriages, and infertility in women and laboratory animals. Mechanisms associated with these effects are unknown, although indirect or direct actions of thyroid hormones and thyrotropin could be assumed. The present study aimed to identify the distribution of thyroid hormone (TRs) and thyrotropin (TSHR) receptors in reproductive organs of female rabbits. Ovaries of virgin and pregnant rabbits, as well as the oviduct, uterus, and vagina of virgin rabbits were excised, histologically processed, and cut. Slices from these organs were used for immunohistochemical studies for TRα1-2, TRβ1, and TSHR. The presence of TRs and TSHR was found in the primordial, primary, secondary, tertiary, and Graafian follicles of virgin rabbits, as well as in the corpora lutea, corpora albicans, and wall of hemorrhagic cysts of pregnant rabbits. Oviductal regions (fimbria-infundibulum, ampulla, isthmus, and utero-tubal junction), uterus (endometrium and myometrium), and vagina (abdominal, pelvic, and perineal portions) of virgin rabbits showed anti-TRs and anti-TSHR immunoreactivity. Additionally, the distal urethra, paravaginal ganglia, levator ani and iliococcygeus muscles, dorsal nerve and body of the clitoris, perigenital skin, and prostate had TRs and TSHR. The wide presence of TRs and TSHR in female reproductive organs suggests varied effects of thyroid hormones and thyrotropin in reproduction.

  15. Physiologically Persistent Corpora lutea in Eurasian Lynx (Lynx lynx) – Longitudinal Ultrasound and Endocrine Examinations Intra-Vitam

    PubMed Central

    Painer, Johanna; Jewgenow, Katarina; Dehnhard, Martin; Arnemo, Jon M.; Linnell, John D. C.; Odden, John; Hildebrandt, Thomas B.; Goeritz, Frank

    2014-01-01

    Felids generally follow a poly-estrous reproductive strategy. Eurasian lynx (Lynx lynx) display a different pattern of reproductive cyclicity where physiologically persistent corpora lutea (CLs) induce a mono-estrous condition which results in highly seasonal reproduction. The present study was based around a sono-morphological and endocrine study of captive Eurasian lynx, and a control-study on free-ranging lynx. We verified that CLs persist after pregnancy and pseudo-pregnancy for at least a two-year period. We could show that lynx are able to enter estrus in the following year, while CLs from the previous years persisted in structure and only temporarily reduced their function for the period of estrus onset or birth, which is unique among felids. The almost constant luteal progesterone secretion (average of 5 ng/ml serum) seems to prevent folliculogenesis outside the breeding season and has converted a poly-estrous general felid cycle into a mono-estrous cycle specific for lynx. The hormonal regulation mechanism which causes lynx to have the longest CL lifespan amongst mammals remains unclear. The described non-felid like ovarian physiology appears to be a remarkably non-plastic system. The lynx's reproductive ability to adapt to environmental and anthropogenic changes needs further investigation. PMID:24599348

  16. The effects of data-driven learning activities on EFL learners' writing development.

    PubMed

    Luo, Qinqin

    2016-01-01

    Data-driven learning has proved to be an effective approach in helping learners solve various writing problems such as correcting lexical or grammatical errors, improving the use of collocations and generating ideas in writing. This article reports on an empirical study in which data-driven learning was accomplished with the assistance of the user-friendly BNCweb, and presents the evaluation of the outcome by comparing the effectiveness of BNCweb and the search engine Baidu, which is most commonly used as a reference resource by Chinese learners of English as a foreign language. The quantitative results from 48 Chinese college students revealed that the experimental group which used BNCweb performed significantly better in the post-test in terms of writing fluency and accuracy, as compared with the control group which used the search engine Baidu. However, no significant difference was found between the two groups in terms of writing complexity. The qualitative results from the interview revealed that learners generally showed a positive attitude toward the use of BNCweb but there were still some problems in using corpora in the writing process; thus the combined use of corpora and other types of reference resource was suggested as a possible way to counter the potential barriers for Chinese learners of English.

  17. A unified framework for evaluating the risk of re-identification of text de-identification tools.

    PubMed

    Scaiano, Martin; Middleton, Grant; Arbuckle, Luk; Kolhatkar, Varada; Peyton, Liam; Dowling, Moira; Gipson, Debbie S; El Emam, Khaled

    2016-10-01

    It has become regular practice to de-identify unstructured medical text for use in research using automatic methods, the goal of which is to remove patient identifying information to minimize re-identification risk. The metrics commonly used to determine if these systems are performing well do not accurately reflect the risk of a patient being re-identified. We therefore developed a framework for measuring the risk of re-identification associated with textual data releases. We apply the proposed evaluation framework to a data set from the University of Michigan Medical School. Our risk assessment results are then compared with those that would be obtained using a typical contemporary micro-average evaluation of recall in order to illustrate the difference between the proposed evaluation framework and the current baseline method. We demonstrate how this framework compares against common measures of the re-identification risk associated with an automated text de-identification process. For the probability of re-identification using our evaluation framework we obtained a mean value for direct identifiers of 0.0074 and a mean value for quasi-identifiers of 0.0022. The 95% confidence intervals for these estimates were below the relevant thresholds. The threshold for direct identifier risk was based on previously used approaches in the literature. The threshold for quasi-identifiers was determined based on the context of the data release following commonly used de-identification criteria for structured data. Our framework attempts to correct for poorly distributed evaluation corpora, accounts for the data release context, and avoids the often optimistic assumptions that are made using the more traditional evaluation approach. It therefore provides a more realistic estimate of the true probability of re-identification. This framework should be used as a basis for computing re-identification risk in order to more realistically evaluate future text de-identification tools. Copyright © 2016 The Authors. Published by Elsevier Inc. All rights reserved.
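
    The contrast the abstract draws with micro-averaged recall can be illustrated with a small sketch. This is not the authors' framework; the per-document counts below are hypothetical and the function names are made up for illustration. It only shows why a pooled recall figure can look reassuring while an individual document still leaks most of its identifiers.

```python
def micro_recall(docs):
    """Pooled (micro-averaged) recall: identifiers removed / identifiers present."""
    caught = sum(c for c, _ in docs)
    total = sum(t for _, t in docs)
    return caught / total


def per_document_recall(docs):
    """Recall computed separately for each document that contains identifiers."""
    return [c / t for c, t in docs if t > 0]


# Hypothetical counts: (identifiers removed, identifiers present) per document.
docs = [(98, 100), (99, 100), (1, 4)]

pooled = micro_recall(docs)              # dominated by the two large documents
worst = min(per_document_recall(docs))   # the short document is barely protected
print(f"micro-averaged recall: {pooled:.3f}, worst single document: {worst:.2f}")
```

    In this toy corpus the pooled figure is driven by the two long documents, so the short document with three missed identifiers out of four barely moves it.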

  18. Limited-memory fast gradient descent method for graph regularized nonnegative matrix factorization.

    PubMed

    Guan, Naiyang; Wei, Lei; Luo, Zhigang; Tao, Dacheng

    2013-01-01

    Graph regularized nonnegative matrix factorization (GNMF) decomposes a nonnegative data matrix X ∈ R^(m×n) into the product of two lower-rank nonnegative factor matrices, i.e., W ∈ R^(m×r) and H ∈ R^(r×n) (r < min{m,n}), and aims to preserve the local geometric structure of the dataset by minimizing squared Euclidean distance or Kullback-Leibler (KL) divergence between X and WH. The multiplicative update rule (MUR) is usually applied to optimize GNMF, but it suffers from the drawback of slow convergence because it intrinsically advances one step along the rescaled negative gradient direction with a non-optimal step size. Recently, a multiple step-sizes fast gradient descent (MFGD) method has been proposed for optimizing NMF which accelerates MUR by searching the optimal step-size along the rescaled negative gradient direction with Newton's method. However, the computational cost of MFGD is high because 1) the high-dimensional Hessian matrix is dense and costs too much memory; and 2) the Hessian inverse operator and its multiplication with gradient cost too much time. To overcome these deficiencies of MFGD, we propose an efficient limited-memory FGD (L-FGD) method for optimizing GNMF. In particular, we apply the limited-memory BFGS (L-BFGS) method to directly approximate the multiplication of the inverse Hessian and the gradient for searching the optimal step size in MFGD. The preliminary results on real-world datasets show that L-FGD is more efficient than both MFGD and MUR. To evaluate the effectiveness of L-FGD, we validate its clustering performance for optimizing KL-divergence based GNMF on two popular face image datasets including ORL and PIE and two text corpora including Reuters and TDT2. The experimental results confirm the effectiveness of L-FGD by comparing it with the representative GNMF solvers.
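
    As a point of reference for the baseline the abstract calls MUR, the following is a minimal sketch of the commonly cited multiplicative updates for the squared-Euclidean GNMF objective ||X - WH||² + λ·Tr(H L Hᵀ), with L = D - A. It is not the L-FGD method of the paper. The symmetric adjacency matrix A, the weight `lam`, the toy data, and all function names are assumptions made for illustration.

```python
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def gnmf_objective(X, W, H, A, lam):
    """||X - WH||_F^2 + lam * Tr(H L H^T), with L = D - A and A symmetric."""
    WH = matmul(W, H)
    fit = sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry))
    # For symmetric A, Tr(H L H^T) = 1/2 * sum_ij A_ij * ||h_i - h_j||^2,
    # where h_i is the i-th column of H (the code of sample i).
    cols = transpose(H)
    n = len(A)
    smooth = 0.5 * sum(
        A[i][j] * sum((p - q) ** 2 for p, q in zip(cols[i], cols[j]))
        for i in range(n) for j in range(n)
    )
    return fit + lam * smooth


def gnmf_mur_step(X, W, H, A, lam, eps=1e-12):
    """One round of multiplicative updates; eps guards against division by zero."""
    deg = [sum(row) for row in A]  # diagonal of the degree matrix D
    # W <- W * (X H^T) / (W H H^T)
    Ht = transpose(H)
    num_W = matmul(X, Ht)
    den_W = matmul(W, matmul(H, Ht))
    W = [[w * nu / (de + eps) for w, nu, de in zip(rw, rn, rd)]
         for rw, rn, rd in zip(W, num_W, den_W)]
    # H <- H * (W^T X + lam * H A) / (W^T W H + lam * H D)
    Wt = transpose(W)
    WtX = matmul(Wt, X)
    WtWH = matmul(matmul(Wt, W), H)
    HA = matmul(H, A)
    H = [[h * (WtX[k][j] + lam * HA[k][j]) / (WtWH[k][j] + lam * h * deg[j] + eps)
          for j, h in enumerate(row)]
         for k, row in enumerate(H)]
    return W, H


# Toy problem (hypothetical data): m = 4 features, n = 6 samples, rank r = 2.
random.seed(0)
m, n, r, lam = 4, 6, 2, 0.1
X = [[random.random() for _ in range(n)] for _ in range(m)]
# Ring graph over the samples: each sample is linked to its two neighbours.
A = [[1.0 if abs(i - j) in (1, n - 1) else 0.0 for j in range(n)] for i in range(n)]
W = [[random.random() + 0.1 for _ in range(r)] for _ in range(m)]
H = [[random.random() + 0.1 for _ in range(n)] for _ in range(r)]

before = gnmf_objective(X, W, H, A, lam)
for _ in range(50):
    W, H = gnmf_mur_step(X, W, H, A, lam)
after = gnmf_objective(X, W, H, A, lam)
```

    Each update only rescales the current factors elementwise, which keeps them nonnegative but fixes the step size implicitly; that fixed step is the limitation the abstract attributes to MUR.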

  19. Substructural Regularization With Data-Sensitive Granularity for Sequence Transfer Learning.

    PubMed

    Sun, Shichang; Liu, Hongbo; Meng, Jiana; Chen, C L Philip; Yang, Yu

    2018-06-01

    Sequence transfer learning is of interest in both academia and industry with the emergence of numerous new text domains from Twitter and other social media tools. In this paper, we put forward the data-sensitive granularity for transfer learning, and then, a novel substructural regularization transfer learning model (STLM) is proposed to preserve target domain features at substructural granularity in the light of the condition of labeled data set size. Our model is underpinned by hidden Markov model and regularization theory, where the substructural representation can be integrated as a penalty after measuring the dissimilarity of substructures between target domain and STLM with relative entropy. STLM can achieve the competing goals of preserving the target domain substructure and utilizing the observations from both the target and source domains simultaneously. The estimation of STLM is very efficient since an analytical solution can be derived as a necessary and sufficient condition. The relative usability of substructures to act as regularization parameters and the time complexity of STLM are also analyzed and discussed. Comprehensive experiments of part-of-speech tagging with both Brown and Twitter corpora fully justify that our model can make improvements on all the combinations of source and target domains.
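
    The relative-entropy penalty mentioned in the abstract can be sketched as follows. This is not the STLM estimator itself; the two distributions, the weight `lam`, and the function names are hypothetical. It only shows how the dissimilarity between a target-domain substructure (for example, one row of HMM transition probabilities) and the corresponding model distribution might be measured and turned into a penalty.

```python
import math


def relative_entropy(p, q, eps=1e-12):
    """KL divergence D(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def penalized_score(log_likelihood, target_rows, model_rows, lam):
    """Log-likelihood minus a weighted sum of per-substructure divergences."""
    penalty = sum(relative_entropy(t, m) for t, m in zip(target_rows, model_rows))
    return log_likelihood - lam * penalty


# Hypothetical transition rows: target-domain estimate vs. current model.
target_row = [0.5, 0.5]
model_row = [0.9, 0.1]
divergence = relative_entropy(target_row, model_row)
```

    A larger `lam` pulls the model toward the target-domain substructures, while a smaller one lets the pooled source and target observations dominate.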

  20. Beyond common features: The role of roles in determining similarity

    PubMed Central

    Jones, Matt; Love, Bradley C.

    2007-01-01

    Historically, accounts of object representation and perceived similarity have focused on intrinsic features. Although more recent accounts have explored how objects, scenes, and situations containing common relational structures come to be perceived as similar, less is known about how the perceived similarity of parts or objects embedded within these relational systems is affected. The current studies test the hypothesis that objects situated in common relational systems come to be perceived as more similar. Similarity increases most for objects playing the same role within a relation (e.g., predator), but also increases for objects playing different roles within the same relation (e.g., the predator or prey role in the hunts relation) regardless of whether the objects participate in the same instance of the relation. This pattern of results can be captured by extending existing models that extract meaning from text corpora so that they are sensitive to the verb-specific thematic roles that objects fill. Alternative explanations based on analogical and inferential processes are also considered, as well as the implications of the current findings to research in language processing, personality and person perception, decision making, and category learning. PMID:17094958
