Sample records for corpus-based language learning

  1. Corpus Based Authenticity Analysis of Language Teaching Course Books

    ERIC Educational Resources Information Center

    Peksoy, Emrah; Harmaoglu, Özhan

    2017-01-01

    In this study, the resemblance of the language learning course books used in Turkey to authentic language spoken by native speakers is explored by using a corpus-based approach. For this, the 10-million-word spoken part of the British National Corpus was selected as the reference corpus. After that, all language learning course books used in high…

  2. Collocations in Corpus-Based Language Learning Research: Identifying, Comparing, and Interpreting the Evidence

    ERIC Educational Resources Information Center

    Gablasova, Dana; Brezina, Vaclav; McEnery, Tony

    2017-01-01

    This article focuses on the use of collocations in language learning research (LLR). Collocations, as units of formulaic language, are becoming prominent in our understanding of language learning and use; however, while the number of corpus-based LLR studies of collocations is growing, there is still a need for a deeper understanding of factors…

  3. Using Google as a Super Corpus to Drive Written Language Learning: A Comparison with the British National Corpus

    ERIC Educational Resources Information Center

    Sha, Guoquan

    2010-01-01

    Data-driven learning (DDL), or corpus-based language learning, involves the learner in an exploratory task to discover appropriate expressions or collocates regarding his writing. However, the problematic units of meaning in each learner's writing are so diverse that conventional corpora often prove futile. The search engine Google with the…

  4. FLAX: Flexible and Open Corpus-Based Language Collections Development

    ERIC Educational Resources Information Center

    Fitzgerald, Alannah; Wu, Shaoqun; Marín, María José

    2015-01-01

    In this case study we present innovative work in building open corpus-based language collections by focusing on a description of the open-source multilingual Flexible Language Acquisition (FLAX) language project, which is an ongoing example of open materials development practices for language teaching and learning. We present language-learning…

  5. Exploring Learner Language through Corpora: Comparing and Interpreting Corpus Frequency Information

    ERIC Educational Resources Information Center

    Gablasova, Dana; Brezina, Vaclav; McEnery, Tony

    2017-01-01

    This article contributes to the debate about the appropriate use of corpus data in language learning research. It focuses on frequencies of linguistic features in language use and their comparison across corpora. The majority of corpus-based second language acquisition studies employ a comparative design in which either one or more second language…

  6. The Pedagogical Mediation of a Developmental Learner Corpus for Classroom-Based Language Instruction

    ERIC Educational Resources Information Center

    Belz, Julie A.; Vyatkina, Nina

    2008-01-01

    Although corpora have been used in language teaching for some time, few empirical studies explore their impact on learning outcomes. We provide a microgenetic account of learners' responses to corpus-driven instructional units for German modal particles and pronominal "da"-compounds. The units are based on developmental corpus data produced by…

  7. A Corpus-Based Comparative Study of "Learn" and "Acquire"

    ERIC Educational Resources Information Center

    Yang, Bei

    2016-01-01

    As an important yet intricate linguistic feature in the English language, synonymy poses a great challenge for second language learners. Using the 100-million-word British National Corpus (BNC) as data and the software Sketch Engine (SkE) as an analysis tool, this article compares the usage of "learn" and "acquire" used in natural…

  8. Cognition, Corpora, and Computing: Triangulating Research in Usage-Based Language Learning

    ERIC Educational Resources Information Center

    Ellis, Nick C.

    2017-01-01

    Usage-based approaches explore how we learn language from our experience of language. Related research thus involves the analysis of the usage from which learners learn and of learner usage as it develops. This program involves considerable data recording, transcription, and analysis, using a variety of corpus and computational techniques, many of…

  9. A Corpus-Based System of Error Detection and Revision Suggestion for Spanish Learners in Taiwan: A Case Study

    ERIC Educational Resources Information Center

    Lu, Hui-Chuan; Chu, Yu-Hsin; Chang, Cheng-Yu

    2013-01-01

    Compared with English learners, Spanish learners have fewer resources for automatic error detection and revision. Following the current integrative Computer Assisted Language Learning (CALL), we combined a corpus-based approach and CALL to create the System of Error Detection and Revision Suggestion (SEDRS) for learning Spanish. Through…

  10. Separating Fact and Fiction: The Real Story of Corpus Use in Language Teaching

    ERIC Educational Resources Information Center

    Boulton, Alex

    2013-01-01

    This paper investigates uses of corpora in language learning ("data-driven learning") through analysis of a 600K-word corpus of empirical research papers in the field. The corpus can tell us much--the authors and the countries the studies are conducted in, the types of publication, and so on. The corpus investigation itself starts with…

  11. Effects of DDL Technology on Genre Learning

    ERIC Educational Resources Information Center

    Cotos, Elena; Link, Stephanie; Huffman, Sarah

    2017-01-01

    To better understand the promising effects of data-driven learning (DDL) on language learning processes and outcomes, this study explored DDL learning events enabled by the Research Writing Tutor (RWT), a web-based platform containing an English language corpus annotated to enhance rhetorical input, a concordancer that was searchable for…

  12. Integrating Corpus-Based CALL Programs in Teaching English through Children's Literature

    ERIC Educational Resources Information Center

    Johns, Tim F.; Hsingchin, Lee; Lixun, Wang

    2008-01-01

    This paper presents particular pedagogical applications of a number of corpus-based CALL (computer assisted language learning) programs such as "CONTEXTS" and "CLOZE," "MATCHUP" and "BILINGUAL SENTENCE SHUFFLER," in the teaching of English through children's literature. An elective course in Taiwan for…

  13. Forgetting of Foreign-Language Skills: A Corpus-Based Analysis of Online Tutoring Software

    ERIC Educational Resources Information Center

    Ridgeway, Karl; Mozer, Michael C.; Bowles, Anita R.

    2017-01-01

    We explore the nature of forgetting in a corpus of 125,000 students learning Spanish using the Rosetta Stone® foreign-language instruction software across 48 lessons. Students are tested on a lesson after its initial study and are then retested after a variable time lag. We observe forgetting consistent with power function decay at a rate that…
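    As a rough illustration of what "power function decay" means in the abstract above, the sketch below evaluates a generic power-law retention curve, R(t) = a * (1 + t) ** -b. The functional form with the +1 offset, the name `power_law_retention`, and the parameter values are assumptions chosen for illustration only; they are not taken from the study.

```python
def power_law_retention(lag_days: float, a: float = 1.0, b: float = 0.5) -> float:
    """Predicted recall after `lag_days`, assuming R(t) = a * (1 + t) ** -b.

    `a` (initial recall) and `b` (decay rate) are hypothetical values,
    not estimates reported by the study.
    """
    if lag_days < 0:
        raise ValueError("lag_days must be non-negative")
    return a * (1.0 + lag_days) ** (-b)


if __name__ == "__main__":
    # Retention drops quickly at first and then flattens out,
    # which is the characteristic shape of power-law forgetting.
    for lag in (0, 1, 7, 30):
        print(f"lag={lag:>2} days -> retention={power_law_retention(lag):.3f}")
```

    Unlike an exponential curve, a power function loses a smaller proportion of what remains with each additional day, so long lags cost less than the early drop would suggest.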

  14. Evaluating a Web-Based Video Corpus through an Analysis of User Interactions

    ERIC Educational Resources Information Center

    Caws, Catherine G.

    2013-01-01

    As shown by several studies, successful integration of technology in language learning requires a holistic approach in order to scientifically understand what learners do when working with web-based technology (cf. Raby, 2007). Additionally, a growing body of research in computer assisted language learning (CALL) evaluation, design and…

  15. Corpora Processing and Computational Scaffolding for a Web-Based English Learning Environment: The CANDLE Project

    ERIC Educational Resources Information Center

    Liou, Hsien-Chin; Chang, Jason S; Chen, Hao-Jan; Lin, Chih-Cheng; Liaw, Meei-Ling; Gao, Zhao-Ming; Jang, Jyh-Shing Roger; Yeh, Yuli; Chuang, Thomas C.; You, Geeng-Neng

    2006-01-01

    This paper describes the development of an innovative web-based environment for English language learning with advanced data-driven and statistical approaches. The project uses various corpora, including a Chinese-English parallel corpus ("Sinorama") and various natural language processing (NLP) tools to construct effective English…

  16. Corpus Use in Language Learning: A Meta-Analysis

    ERIC Educational Resources Information Center

    Boulton, Alex; Cobb, Tom

    2017-01-01

    This study applied systematic meta-analytic procedures to summarize findings from experimental and quasi-experimental investigations into the effectiveness of using the tools and techniques of corpus linguistics for second language learning or use, here referred to as data-driven learning (DDL). Analysis of 64 separate studies representing 88…

  17. Application of Learner Corpora to Second Language Learning and Teaching: An Overview

    ERIC Educational Resources Information Center

    Xu, Qi

    2016-01-01

    The paper gives an overview of learner corpora and their application to second language learning and teaching. It is proposed that there are four core components in learner corpus research, namely, corpus linguistics expertise, a good background in linguistic theory, knowledge of SLA theory, and a good understanding of foreign language teaching…

  18. Corpus of High School Academic Texts (COHAT): Data-Driven, Computer Assisted Discovery in Learning Academic English

    ERIC Educational Resources Information Center

    Bohát, Róbert; Rödlingová, Beata; Horáková, Nina

    2015-01-01

    Corpus of High School Academic Texts (COHAT), currently of 150,000+ words, aims to make academic language instruction a more data-driven and student-centered discovery learning as a special type of Computer-Assisted Language Learning (CALL), emphasizing students' critical thinking and metacognition. Since 2013, high school English as an additional…

  19. Corpus Linguistics and the Design of a Response Message

    NASA Astrophysics Data System (ADS)

    Atwell, E.

    2002-01-01

    Most research related to SETI, the Search for Extra-Terrestrial Intelligence, is focussed on techniques for detection of possible incoming signals from extra-terrestrial intelligent sources (e.g. Turnbull et al. 1999), and algorithms for analysis of these signals to identify intelligent language-like characteristics (e.g. Elliott and Atwell 1999, 2000). However, another issue for research and debate is the nature of our response, should a signal arrive and be detected. The design of potentially the most significant communicative act in history should not be decided solely by astrophysicists; the Corpus Linguistics research community has a contribution to make to what is essentially a Corpus design and implementation project. Vakoch (1998) advocated that the message constructed to transmit to extraterrestrials should include a broad, representative collection of perspectives rather than a single viewpoint or genre; this should strike a chord with Corpus Linguists for whom a central principle is that a corpus must be "balanced" to be representative (Meyer 2001). One idea favoured by SETI researchers is to transmit an encyclopaedia summarising human knowledge, such as the Encyclopaedia Britannica, to give ET communicators an overview and "training set" key to analysis of subsequent messages. Furthermore, this should be sent in several versions in parallel: the text; page-images, to include illustrations left out of the text-file; and perhaps some sort of abstract linguistic representation of the text, using a functional or logic language (Ollongren 1999, Freudenthal 1960). The idea of "enriching" the message corpus with annotations at several levels should also strike a chord with Corpus Linguists who have long known that natural language exhibits highly complex multi-layering sequencing, structural and functional patterns, as difficult to model as sequences and structures found in more traditional physical and biological sciences.
    Some corpora have been annotated with several levels or layers of linguistic knowledge, for example the SEC corpus (Taylor and Knowles 1988) and the ISLE corpus (Menzel et al. 2000). Tagged and parsed corpora can be used by corpus linguists as a testbed to guide their development of grammars (e.g. Souter and Atwell 1994); and they can be used to train Natural Language Learning or data-mining models of complex sequence data (e.g. Brill 1993, Hughes 1993, Atwell 1996). Corpus linguists have a range of standards and tools for design and annotation of representative corpus resources, and experience of which annotation types are more amenable to Natural Language Learning algorithms. An Advisory panel of corpus linguists could help design and implement an extended Multi-annotated Interstellar Corpus of English, incorporating ideas from Corpus Linguistics such as:
    - Augment the Encyclopaedia Britannica with a collection of samples representing the diversity of language in real use.
    - As an additional "key", transmit a dictionary aimed at language learners which has also been a rich source for NLP
    - Supply our ET communicators with several levels of linguistic annotation, to give them a richer training set for their
    - Add translations of the English text into other human languages: Humanity should not be represented by English alone,
    This calls for a large-scale corpus annotation project, requiring an Interstellar Corpus Advisory Panel, analogous to the BNC or MATE advisory panels, to include experts in English grammar and semantics, English language learning, computational Natural Language Learning algorithms, and corpus design, implementation, annotation, standardisation, and analysis.

  20. "Yes, Your Honor!": A Corpus-Based Study of Technical Vocabulary in Discipline-Related Movies and TV Shows

    ERIC Educational Resources Information Center

    Csomay, Eniko; Petrovic, Marija

    2012-01-01

    Vocabulary is an essential element of every second/foreign language teaching and learning program. While the goal of language teaching programs is to focus on explicit vocabulary teaching to promote learning, "materials which provide visual and aural input such as movies may be conducive to incidental vocabulary learning." (Webb and Rodgers, 2009,…

  1. The Effect of Corpus-Based Activities on Verb-Noun Collocations in EFL Classes

    ERIC Educational Resources Information Center

    Ucar, Serpil; Yükselir, Ceyhun

    2015-01-01

    This study sought to reveal the impacts of corpus-based activities on verb-noun collocation learning in EFL classes. The study was carried out on two groups, experimental and control, each of which consisted of 15 students. The students were preparatory class students at the School of Foreign Languages, Osmaniye Korkut Ata University.…

  2. Motivating College Students' Learning English for Specific Purposes Courses through Corpus Building

    ERIC Educational Resources Information Center

    Wu, Lin-Fang

    2014-01-01

    This study was conducted to determine how to motivate technical college students to learn English for specific purposes (ESP) courses through corpus building and enhance their language proficiency during the coursework for their majors. This study explores corpus building skills, how to simplify ESP courses by corpus building for English as second…

  3. A Corpus-Based Evaluation of Metaphors in a Business English Textbook

    ERIC Educational Resources Information Center

    Skorczynska Sznajder, Hanna

    2010-01-01

    This study aims to evaluate the selection of metaphors in a published business English textbook using findings from a specialised corpus of written business English. While most scholars agree that metaphors should be included in English for Specific Purposes (ESP) syllabuses as a potentially problematic area in successful language learning, it is…

  4. What Does Corpus Linguistics Have to Offer to Language Assessment?

    ERIC Educational Resources Information Center

    Xi, Xiaoming

    2017-01-01

    In recent years, continuing advances in technology have increased the capacity to automate the extraction of a range of linguistic features of texts and thus have provided the impetus for the substantial growth of corpus linguistics. While corpus linguistic tools and methods have been used extensively in second language learning research, they…

  5. Retesting the Limits of Data-Driven Learning: Feedback and Error Correction

    ERIC Educational Resources Information Center

    Crosthwaite, Peter

    2017-01-01

    An increasing number of studies have looked at the value of corpus-based data-driven learning (DDL) for second language (L2) written error correction, with generally positive results. However, a potential conundrum for language teachers involved in the process is how to provide feedback on students' written production for DDL. The study looks at…

  6. Corpus Linguistics for Korean Language Learning and Teaching. NFLRC Technical Report No. 26

    ERIC Educational Resources Information Center

    Bley-Vroman, Robert, Ed.; Ko, Hyunsook, Ed.

    2006-01-01

    Dramatic advances in personal computer technology have given language teachers access to vast quantities of machine-readable text, which can be analyzed with a view toward improving the basis of language instruction. Corpus linguistics provides analytic techniques and practical tools for studying language in use. This volume includes both an…

  7. An Analysis of the Application of Wikipedia Corpus on the Lexical Learning in the Second Language Acquisition

    ERIC Educational Resources Information Center

    Shi, Jing

    2015-01-01

    Corpus linguistics has transformed linguistic research but has had only a moderate impact on ESL teaching and learning. The Wikipedia Corpus, designed by Mark Davies, is introduced in this essay. The corpus allows teachers to search Wikipedia in a powerful way: they can search by word, phrase, part of speech, and synonyms. Teachers can also find…

  8. Pronunciation Teaching Practices in Communicative Second Language Classes

    ERIC Educational Resources Information Center

    Foote, Jennifer Ann; Trofimovich, Pavel; Collins, Laura; Urzúa, Fernanda Soler

    2016-01-01

    The objective of this research was to provide longitudinal, corpus-based evidence of actual teacher behaviour with respect to the teaching of second language (L2) pronunciation in a communicative language learning context. The data involved 40 hours of videotaped lessons from three experienced teachers recorded four times at 100-hour increments…

  9. Realization of Chinese word segmentation based on deep learning method

    NASA Astrophysics Data System (ADS)

    Wang, Xuefei; Wang, Mingjiang; Zhang, Qiquan

    2017-08-01

    In recent years, with its rapid development, deep learning has been widely used in the field of natural language processing. In this paper, I use deep learning to perform Chinese word segmentation with a large-scale corpus, eliminating the need to construct additional hand-crafted features. The first step is to process the corpus and use word2vec to obtain embeddings for it, with each character represented by a 50-dimensional vector. The embeddings are then fed to a bidirectional LSTM; a linear layer is added on top of the hidden-layer output, followed by a CRF, to give the model implemented in this paper. Experimental results show that the method achieves satisfactory accuracy on the 2014 People's Daily corpus.
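    To make the final CRF step in the abstract above more concrete, here is a minimal sketch of Viterbi decoding over the B/M/E/S tag scheme that is commonly used for character-based word segmentation. Everything in it is an assumption for illustration: the abstract does not name its tag set, the per-character scores would come from the BiLSTM and linear layer rather than being typed in by hand, and a trained CRF learns transition scores, whereas this sketch only enforces hard constraints on which tag may follow which.

```python
from typing import Dict, List

TAGS = ["B", "M", "E", "S"]  # Begin / Middle / End of a word, Single-character word

# Hypothetical hard constraints: which tags may follow each tag.
ALLOWED_NEXT = {
    "B": {"M", "E"},
    "M": {"M", "E"},
    "E": {"B", "S"},
    "S": {"B", "S"},
}
NEG_INF = float("-inf")


def viterbi_bmes(emissions: List[Dict[str, float]]) -> List[str]:
    """Return the highest-scoring valid tag sequence for per-character scores."""
    if not emissions:
        return []
    # A sentence can only start with B or S.
    score = {t: (emissions[0][t] if t in ("B", "S") else NEG_INF) for t in TAGS}
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = {}, {}
        for cur in TAGS:
            # Best previous tag among those allowed to precede `cur`.
            best_prev = max(
                (p for p in TAGS if cur in ALLOWED_NEXT[p]),
                key=lambda p: score[p],
            )
            new_score[cur] = score[best_prev] + emit[cur]
            pointers[cur] = best_prev
        score = new_score
        backpointers.append(pointers)
    # A sentence can only end with E or S; then follow the backpointers.
    path = [max(("E", "S"), key=lambda t: score[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Group characters into words, closing a word at every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words
```

    Calling `tags_to_words(sentence, viterbi_bmes(scores))` turns per-character scores into a segmented word list. Decoding jointly, rather than picking the best tag for each character on its own, rules out impossible sequences such as an M directly after an S.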

  10. Learning in Parallel: Using Parallel Corpora to Enhance Written Language Acquisition at the Beginning Level

    ERIC Educational Resources Information Center

    Bluemel, Brody

    2014-01-01

    This article illustrates the pedagogical value of incorporating parallel corpora in foreign language education. It explores the development of a Chinese/English parallel corpus designed specifically for pedagogical application. The corpus tool was created to aid language learners in reading comprehension and writing development by making foreign…

  11. Yaounde French Speech Corpus

    DTIC Science & Technology

    2017-03-01

    the Center for Technology Enhanced Language Learning (CTELL), a research cell in the Department of Foreign Languages, United States Military Academy...models for automatic speech recognition (ASR), and to, thereby, investigate the utility of ASR in pedagogical technology. The corpus is a sample of...lexical resources, language technology

  12. Computational Investigations of Multiword Chunks in Language Learning.

    PubMed

    McCauley, Stewart M; Christiansen, Morten H

    2017-07-01

    Second-language learners rarely arrive at native proficiency in a number of linguistic domains, including morphological and syntactic processing. Previous approaches to understanding the different outcomes of first- versus second-language learning have focused on cognitive and neural factors. In contrast, we explore the possibility that children and adults may rely on different linguistic units throughout the course of language learning, with specific focus on the granularity of those units. Following recent psycholinguistic evidence for the role of multiword chunks in online language processing, we explore the hypothesis that children rely more heavily on multiword units in language learning than do adults learning a second language. To this end, we take an initial step toward using large-scale, corpus-based computational modeling as a tool for exploring the granularity of speakers' linguistic units. Employing a computational model of language learning, the Chunk-Based Learner, we compare the usefulness of chunk-based knowledge in accounting for the speech of second-language learners versus children and adults speaking their first language. Our findings suggest that while multiword units are likely to play a role in second-language learning, adults may learn less useful chunks, rely on them to a lesser extent, and arrive at them through different means than children learning a first language.

  13. Pedagogical Models of Concordance Use: Correlations between Concordance User Preferences

    ERIC Educational Resources Information Center

    Ballance, Oliver James

    2017-01-01

    One of the most promising avenues of research in computer-assisted language learning is the potential for language learners to make use of language corpora. However, using a corpus requires use of a corpus tool as an interface, typically a concordancer. How such a tool can be made most accessible to learners is an important issue. Specifically,…

  14. Corpus-Based versus Traditional Learning of Collocations

    ERIC Educational Resources Information Center

    Daskalovska, Nina

    2015-01-01

    One of the aspects of knowing a word is the knowledge of which words it is usually used with. Since knowledge of collocations is essential for appropriate and fluent use of language, learning collocations should have a central place in the study of vocabulary. There are different opinions about the best ways of learning collocations. This study…

  15. Listen, Listen, Listen and Listen: Building a Comprehension Corpus and Making It Comprehensible

    ERIC Educational Resources Information Center

    Mordaunt, Owen G.; Olson, Daniel W.

    2010-01-01

    Listening comprehension input is necessary for language learning and acculturation. One approach to developing listening comprehension skills is through exposure to massive amounts of naturally occurring spoken language input. But exposure to this input is not enough; learners also need to make the comprehension corpus meaningful to their learning…

  16. Corpus-Based Investigations of Language Use.

    ERIC Educational Resources Information Center

    Biber, Douglas; And Others

    1996-01-01

    Examines a representative text corpus to gain insights into language structure and use and to open new areas of linguistic inquiry. Various illustrations are presented that provide a glimpse into the value of corpus-based investigations for increasing one's understanding of language use and imparting insights important for designing effective…

  17. How Can We Use Corpus Wordlists for Language Learning? Interfaces between Computer Corpora and Expert Intervention

    ERIC Educational Resources Information Center

    Chen, Yu-Hua; Bruncak, Radovan

    2015-01-01

    With the advances in technology, wordlists retrieved from computer corpora have become increasingly popular in recent years. The lexical items in those wordlists are usually selected, according to a set of robust frequency and dispersion criteria, from large corpora of authentic and naturally occurring language. Corpus wordlists are of great value…

  18. Metaphoric Modeling of Foreign Language Teaching and Learning, with Special Reference to Teaching Philosophy Statements

    ERIC Educational Resources Information Center

    Alghbban, Mohammed I.; Ben Salamh, Sami; Maalej, Zouheir

    2017-01-01

    The current article investigates teachers' metaphoric modeling of foreign language teaching and learning at the College of Languages and Translation, King Saud University. It makes use of teaching philosophy statements as a corpus. Our objective is to analyze the underlying conceptualizations of teaching/learning, the teachers' perception of the…

  19. Data-Informed Language Learning

    ERIC Educational Resources Information Center

    Godwin-Jones, Robert

    2017-01-01

    Although data collection has been used in language learning settings for some time, it is only in recent decades that large corpora have become available, along with efficient tools for their use. Advances in natural language processing (NLP) have enabled rich tagging and annotation of corpus data, essential for their effective use in language…

  20. Formulaic Language and Collocations in German Essays: From Corpus-Driven Data to Corpus-Based Materials

    ERIC Educational Resources Information Center

    Krummes, Cedric; Ensslin, Astrid

    2015-01-01

    Whereas there exists a plethora of research on collocations and formulaic language in English, this article contributes towards a somewhat less developed area: the understanding and teaching of formulaic language in German as a foreign language. It analyses formulaic sequences and collocations in German writing (corpus-driven) and provides modern…

  1. Jointly learning word embeddings using a corpus and a knowledge base

    PubMed Central

    Bollegala, Danushka; Maehara, Takanori; Kawarabayashi, Ken-ichi

    2018-01-01

    Methods for representing the meaning of words in vector spaces purely using the information distributed in text corpora have proved to be very valuable in various text mining and natural language processing (NLP) tasks. However, these methods still disregard the valuable semantic relational structure between words in co-occurring contexts. These beneficial semantic relational structures are contained in manually-created knowledge bases (KBs) such as ontologies and semantic lexicons, where the meanings of words are represented by defining the various relationships that exist among those words. We combine the knowledge in both a corpus and a KB to learn better word embeddings. Specifically, we propose a joint word representation learning method that uses the knowledge in the KBs, and simultaneously predicts the co-occurrences of two words in a corpus context. In particular, we use the corpus to define our objective function subject to the relational constraints derived from the KB. We further utilise the corpus co-occurrence statistics to propose two novel approaches, Nearest Neighbour Expansion (NNE) and Hedged Nearest Neighbour Expansion (HNE), that dynamically expand the KB and therefore derive more constraints that guide the optimisation process. Our experimental results over a wide range of benchmark tasks demonstrate that the proposed method statistically significantly improves the accuracy of the word embeddings learnt. It outperforms a corpus-only baseline and improves on a number of previously proposed methods that incorporate corpora and KBs in both semantic similarity prediction and word analogy detection tasks. PMID:29529052

  2. Statistical Measures for Usage-Based Linguistics

    ERIC Educational Resources Information Center

    Gries, Stefan Th.; Ellis, Nick C.

    2015-01-01

    The advent of usage-/exemplar-based approaches has resulted in a major change in the theoretical landscape of linguistics, but also in the range of methodologies that are brought to bear on the study of language acquisition/learning, structure, and use. In particular, methods from corpus linguistics are now frequently used to study distributional…

  3. Lexical Properties of Slovene Sign Language: A Corpus-Based Study

    ERIC Educational Resources Information Center

    Vintar, Špela

    2015-01-01

    Slovene Sign Language (SZJ) has as yet received little attention from linguists. This article presents some basic facts about SZJ, its history, current status, and a description of the Slovene Sign Language Corpus and Pilot Grammar (SIGNOR) project, which compiled and annotated a representative corpus of SZJ. Finally, selected quantitative data…

  4. Language and Content: The Case of Law.

    ERIC Educational Resources Information Center

    Beasley, Colin J.

    A discussion of the teaching and learning of English for special purposes focuses on the interrelationship of content and language, particularly in the case of education for the legal professions. It is noted that law students must both study a large corpus of case and statute law and legal principles and learn the language of the law, with its…

  5. Developing Corpus-Based Materials to Teach Pragmatic Routines

    ERIC Educational Resources Information Center

    Bardovi-Harlig, Kathleen; Mossman, Sabrina; Vellenga, Heidi E.

    2015-01-01

    This article describes how to develop teaching materials for pragmatics based on authentic language by using a spoken corpus. The authors show how to use the corpus in conjunction with textbooks to identify pragmatic routines for speech acts and how to extract appropriate language samples and adapt them for classroom use. They demonstrate how to…

  6. Corpora in Language Teaching and Learning

    ERIC Educational Resources Information Center

    Boulton, Alex

    2017-01-01

    This timeline looks at explicit uses of corpora in foreign or second language (L2) teaching and learning, i.e. what happens when end-users explore corpus data, whether directly via concordancers or integrated into CALL programs, or indirectly with prepared printed materials. The underlying rationale is that such contact provides the massive…

  7. Recent Developments in Corpus Linguistics and Corpus-Based Research/Department of Linguistics and Modern Language Studies at the Hong Kong Institute of Education

    ERIC Educational Resources Information Center

    Xie, Qin

    2015-01-01

    Corpus linguistics has transformed the landscape of empirical research on languages in recent decades. The proliferation of corpus technology has enabled researchers worldwide to conduct research in their own geographical locations with few hindrances. It has become increasingly commonplace for researchers to compile their own corpora for specific…

  8. A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools

    PubMed Central

    2012-01-01

    Background: We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus. Results: Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data. Conclusions: The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full text publications. PMID:22901054

  9. Learning through Drama.

    ERIC Educational Resources Information Center

    Jensen, Ina; Rechis, Ruth; Luna, J. Don

    This chapter is part of a book that recounts the year's work at the Early Childhood Development Center (ECDC) at Texas A&M University-Corpus Christi. Rather than an "elitist" laboratory school for the children of university faculty, the dual-language ECDC is a collaboration between the Corpus Christi Independent School District and…

  10. An empirical generative framework for computational modeling of language acquisition.

    PubMed

    Waterfall, Heidi R; Sandbank, Ben; Onnis, Luca; Edelman, Shimon

    2010-06-01

    This paper reports progress in developing a computer model of language acquisition in the form of (1) a generative grammar that is (2) algorithmically learnable from realistic corpus data, (3) viable in its large-scale quantitative performance and (4) psychologically real. First, we describe new algorithmic methods for unsupervised learning of generative grammars from raw CHILDES data and give an account of the generative performance of the acquired grammars. Next, we summarize findings from recent longitudinal and experimental work that suggests how certain statistically prominent structural properties of child-directed speech may facilitate language acquisition. We then present a series of new analyses of CHILDES data indicating that the desired properties are indeed present in realistic child-directed speech corpora. Finally, we suggest how our computational results, behavioral findings, and corpus-based insights can be integrated into a next-generation model aimed at meeting the four requirements of our modeling framework.

  11. Corpus-Based Approaches to Language Description for Specialized Academic Writing

    ERIC Educational Resources Information Center

    Flowerdew, John

    2017-01-01

    Language description is a fundamental requirement for second language (L2) syllabus design. The greatest advances in language description in recent decades have been made with the help of electronic corpora. Such language description is the theme of this article. The article first introduces some basic concepts and principles in corpus research.…

  12. Forgetting of Foreign-Language Skills: A Corpus-Based Analysis of Online Tutoring Software.

    PubMed

    Ridgeway, Karl; Mozer, Michael C; Bowles, Anita R

    2017-05-01

    We explore the nature of forgetting in a corpus of 125,000 students learning Spanish using the Rosetta Stone® foreign-language instruction software across 48 lessons. Students are tested on a lesson after its initial study and are then retested after a variable time lag. We observe forgetting consistent with power function decay at a rate that varies across lessons but not across students. We find that lessons which are better learned initially are forgotten more slowly, a correlation which likely reflects a latent cause such as the quality or difficulty of the lesson. We obtain improved predictive accuracy of the forgetting model by augmenting it with features that encode characteristics of a student's initial study of the lesson and the activities the student engaged in between the initial and delayed tests. The augmented model can predict 23.9% of the variance in an individual's score on the delayed test. We analyze which features best explain individual performance. Copyright © 2016 Cognitive Science Society, Inc.
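
    As a rough illustration of the power-function decay mentioned in this abstract, the sketch below fits a curve of the form score = initial * (1 + lag)^(-rate) by least squares in log-log space. The functional form, the function names, and the choice of lag unit are assumptions made here for illustration; the abstract does not give the authors' exact model or features.

```python
import math


def power_retention(lag: float, initial: float, rate: float) -> float:
    """Hypothetical power-function forgetting curve: initial * (1 + lag)^(-rate)."""
    return initial * (1.0 + lag) ** (-rate)


def fit_power_decay(lags, scores):
    """Fit log(score) = log(initial) - rate * log(1 + lag) by ordinary least squares.

    Assumes every score is strictly positive so that the logarithm is defined.
    Returns the pair (initial, rate).
    """
    xs = [math.log(1.0 + lag) for lag in lags]
    ys = [math.log(score) for score in scores]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope
```

    Fitting one curve per lesson, rather than per student, would mirror the abstract's observation that the decay rate varies across lessons but not across students.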

  13. Vietnamese Document Representation and Classification

    NASA Astrophysics Data System (ADS)

    Nguyen, Giang-Son; Gao, Xiaoying; Andreae, Peter

    Vietnamese is very different from English, and little research has been done on Vietnamese document classification or, indeed, on any kind of Vietnamese language processing; only a few small corpora are available for research. We created a large Vietnamese text corpus of about 18,000 documents and manually classified them according to criteria such as topic and style, giving several classification tasks of different difficulty levels. This paper introduces a new syllable-based document representation at the morphological level of the language for efficient classification. We tested the representation on our corpus with different classification tasks using six classification algorithms and two feature selection techniques. Our experiments show that the new representation is effective for Vietnamese categorization, and suggest that the best performance can be achieved using a syllable-pair document representation, an SVM with a polynomial kernel as the learning algorithm, and information gain together with an external dictionary for feature selection.
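
    One plausible reading of a syllable-pair representation is a bag of adjacent syllable pairs, which is easy to compute because written Vietnamese separates syllables with spaces. The sketch below follows that reading; the function name and the choice of adjacent pairs are assumptions, and the abstract does not specify the authors' exact construction.

```python
from collections import Counter


def syllable_pair_features(text: str) -> Counter:
    """Count adjacent syllable pairs in a whitespace-separated text.

    Assumption: each whitespace-delimited token is one syllable, as in
    standard Vietnamese orthography.
    """
    syllables = text.lower().split()
    return Counter(zip(syllables, syllables[1:]))
```

    The resulting counts could then be filtered by a feature selection step and passed to a classifier, as the abstract describes in general terms.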

  14. Lexical Awareness and Development through Data Driven Learning: Attitudes and Beliefs of EFL Learners

    ERIC Educational Resources Information Center

    Asik, Asuman; Vural, Arzu Sarlanoglu; Akpinar, Kadriye Dilek

    2016-01-01

    Data-driven learning (DDL) has become an innovative approach developed from corpus linguistics. It plays a significant role in the progression of foreign language pedagogy, since it offers learners plentiful authentic corpora examples that make them analyze language rules with the help of online corpora and concordancers. The present study…

  15. What Data for Data-Driven Learning?

    ERIC Educational Resources Information Center

    Boulton, Alex

    2012-01-01

    Corpora have multiple affordances, not least for use by teachers and learners of a foreign language (L2) in what has come to be known as "data-driven learning" or DDL. The corpus and concordance interface were originally conceived by and for linguists, so other users need to adopt the role of "language researcher" to make the most of them. Despite…

  16. NCBI disease corpus: a resource for disease name recognition and concept normalization.

    PubMed

    Doğan, Rezarta Islamaj; Leaman, Robert; Lu, Zhiyong

    2014-02-01

    Information encoded in natural language in biomedical literature publications is only useful if efficient and reliable ways of accessing and analyzing that information are available. Natural language processing and text mining tools are therefore essential for extracting valuable information; however, the development of powerful, highly effective tools to automatically detect central biomedical concepts such as diseases is conditional on the availability of annotated corpora. This paper presents the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). Manual curation was performed using PubTator, which allowed the use of pre-annotations as a pre-step to manual annotations. Fourteen annotators were randomly paired and differing annotations were discussed for reaching a consensus in two annotation phases. In this setting, a high inter-annotator agreement was observed. Finally, all results were checked against annotations of the rest of the corpus to assure corpus-wide consistency. The public release of the NCBI disease corpus contains 6892 disease mentions, which are mapped to 790 unique disease concepts. Of these, 88% link to a MeSH identifier, while the rest contain an OMIM identifier. We were able to link 91% of the mentions to a single disease concept, while the rest are described as a combination of concepts. In order to help researchers use the corpus to design and test disease identification methods, we have prepared the corpus as training, testing and development sets. To demonstrate its utility, we conducted a benchmarking experiment where we compared three different knowledge-based disease normalization methods with a best performance in F-measure of 63.7%. These results show that the NCBI disease corpus has the potential to significantly improve the state-of-the-art in disease name recognition and normalization research, by providing a high-quality gold standard thus enabling the development of machine-learning based approaches for such tasks. The NCBI disease corpus, guidelines and other associated resources are available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/. Published by Elsevier Inc.

  17. Russian National Corpus as a Tool of Linguo-Didactic Innovation in Teaching Languages

    ERIC Educational Resources Information Center

    Ponomareva, Lyubov Dmitrievna; Churilina, Lyubov Nikolaevna; Buzhinskaya, Darya Sergeyevna; Derevskova, Elena Nikolayevna; Dorfman, Oksana Vyacheslavovna; Sokolova, Elena Petrovna

    2016-01-01

    Emphasis on universal learning activities of each student rather than acquisition of ready knowledge, as well as on how an individual masters a language necessitate the development and application of innovative technologies promoting functional-semantic and textual approaches. In the modern context, Russian language teachers, along with knowledge…

  18. Concordancers in the Design and Implementation of Foreign Language Courses.

    ERIC Educational Resources Information Center

    Polezzi, Loredana

    1994-01-01

    Discusses an interdisciplinary approach to foreign language learning that is integrated with an academic/professional curriculum as opposed to a traditional beginner's course. The use of an electronic concordance is described; the language of a pedagogic corpus is examined; and an example is given of an Italian course for Renaissance theater…

  19. Supporting English-medium pedagogy through an online corpus of science and engineering lectures

    NASA Astrophysics Data System (ADS)

    Kunioshi, Nílson; Noguchi, Judy; Tojo, Kazuko; Hayashi, Hiroko

    2016-05-01

    As English-medium instruction (EMI) spreads around the world, university teachers and students who are non-native speakers of English (NNS) need to put much effort into the delivery or reception of content. Construction of scientific meaning in the process of learning is already complex when instruction is delivered in the first language of the teachers and students, and may become even more challenging in a second language, because science education depends greatly on language. In order to identify important pedagogical functions that teachers use to deliver content and to present different ways to realise each function, a corpus of lectures related to science and engineering courses was created and analysed. NNS teachers and students in science and engineering involved in EMI higher education can obtain insights for delivering and listening to lectures from the Online Corpus of Academic Lectures (OnCAL).

  20. Task Effects on Linguistic Complexity and Accuracy: A Large-Scale Learner Corpus Analysis Employing Natural Language Processing Techniques

    ERIC Educational Resources Information Center

    Alexopoulou, Theodora; Michel, Marije; Murakami, Akira; Meurers, Detmar

    2017-01-01

    Large-scale learner corpora collected from online language learning platforms, such as the EF-Cambridge Open Language Database (EFCAMDAT), provide opportunities to analyze learner data at an unprecedented scale. However, interpreting the learner language in such corpora requires a precise understanding of tasks: How does the prompt and input of a…

  1. An Investigation of Native and Nonnative English Speakers' Levels of Written Syntactic Complexity in Asynchronous Online Discussions

    ERIC Educational Resources Information Center

    Mancilla, Rae L.; Polat, Nihat; Akcay, Ahmet O.

    2017-01-01

    This manuscript reports on a corpus-based comparison of native and nonnative graduate students' language production in an asynchronous learning environment. Using 486 discussion board postings from a five-year period (2009-2013), we analyzed the extent to which native and nonnative university students' writing differed in 10 measures of syntactic…

  2. Are Teachers Test-Oriented? A Comparative Corpus-Based Analysis of the English Entrance Exam and Junior High School English Textbooks

    ERIC Educational Resources Information Center

    Tai, Sophie; Chen, Hao-Jan

    2015-01-01

    The communicative language teaching approach has dominated English teaching and learning since the 1970s. In Taiwan, standardized and high-stakes English tests also put focus on the assessment of learners' communicative competence. While the test contents change, the modifications teachers made are superficial rather than substantial. A comparative…

  3. An annotated corpus with nanomedicine and pharmacokinetic parameters

    PubMed Central

    Lewinski, Nastassja A; Jimenez, Ivan; McInnes, Bridget T

    2017-01-01

    A vast amount of data on nanomedicines is being generated and published, and natural language processing (NLP) approaches can automate the extraction of unstructured text-based data. Annotated corpora are a key resource for NLP and information extraction methods which employ machine learning. Although corpora are available for pharmaceuticals, resources for nanomedicines and nanotechnology are still limited. To foster nanotechnology text mining (NanoNLP) efforts, we have constructed a corpus of annotated drug product inserts taken from the US Food and Drug Administration’s Drugs@FDA online database. In this work, we present the development of the Engineered Nanomedicine Database corpus to support the evaluation of nanomedicine entity extraction. The data were manually annotated for 21 entity mentions consisting of nanomedicine physicochemical characterization, exposure, and biologic response information of 41 Food and Drug Administration-approved nanomedicines. We evaluate the reliability of the manual annotations and demonstrate the use of the corpus by evaluating two state-of-the-art named entity extraction systems, OpenNLP and Stanford NER. The annotated corpus is available open source and, based on these results, guidelines and suggestions for future development of additional nanomedicine corpora are provided. PMID:29066897

  4. Data-Driven Learning and the Acquisition of Italian Collocations: From Design to Student Evaluation

    ERIC Educational Resources Information Center

    Forti, Luciana

    2017-01-01

    This paper looks at how corpus data was used to design an Italian as an L2 language learning programme and how it was evaluated by students. The study focuses on the acquisition of Italian verb-noun collocations by Chinese native students attending a ten-month-long Italian language course before enrolling at an Italian university. It describes how…

  5. Classification of health webpages as expert and non expert with a reduced set of cross-language features.

    PubMed

    Grabar, Natalia; Krivine, Sonia; Jaulent, Marie-Christine

    2007-10-11

    Making the distinction between expert and non expert health documents can help users to select the information which is more suitable for them, according to whether they are familiar or not with medical terminology. This issue is particularly important for the information retrieval area. In our work we address this problem through stylistic corpus analysis and the application of machine learning algorithms. Our hypothesis is that this distinction can be performed on the basis of a small number of features and that such features can be language and domain independent. The features used were acquired from a source corpus (Russian language, diabetes topic) and then tested on target (French language, pneumology topic) and source corpora. These cross-language features show 90% precision and 93% recall with non expert documents in source language; and 85% precision and 74% recall with expert documents in target language.
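
    For readers less familiar with the measures reported here, the sketch below shows how precision and recall are computed from classification counts, and how the two combine into the balanced F-measure. The function names and the counts in the usage note are illustrative assumptions; the abstract reports only the percentages.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int):
    """Return (precision, recall) for one class from raw classification counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

    For example, `precision_recall(9, 1, 1)` describes a classifier that labels ten documents as non expert, nine of them correctly, and misses one true non expert document.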

  6. Bootstrapping language acquisition.

    PubMed

    Abend, Omri; Kwiatkowski, Tom; Smith, Nathaniel J; Goldwater, Sharon; Steedman, Mark

    2017-07-01

    The semantic bootstrapping hypothesis proposes that children acquire their native language through exposure to sentences of the language paired with structured representations of their meaning, whose component substructures can be associated with words and syntactic structures used to express these concepts. The child's task is then to learn a language-specific grammar and lexicon based on (probably contextually ambiguous, possibly somewhat noisy) pairs of sentences and their meaning representations (logical forms). Starting from these assumptions, we develop a Bayesian probabilistic account of semantically bootstrapped first-language acquisition in the child, based on techniques from computational parsing and interpretation of unrestricted text. Our learner jointly models (a) word learning: the mapping between components of the given sentential meaning and lexical words (or phrases) of the language, and (b) syntax learning: the projection of lexical elements onto sentences by universal construction-free syntactic rules. Using an incremental learning algorithm, we apply the model to a dataset of real syntactically complex child-directed utterances and (pseudo) logical forms, the latter including contextually plausible but irrelevant distractors. Taking the Eve section of the CHILDES corpus as input, the model simulates several well-documented phenomena from the developmental literature. In particular, the model exhibits syntactic bootstrapping effects (in which previously learned constructions facilitate the learning of novel words), sudden jumps in learning without explicit parameter setting, acceleration of word-learning (the "vocabulary spurt"), an initial bias favoring the learning of nouns over verbs, and one-shot learning of words and their meanings. The learner thus demonstrates how statistical learning over structured representations can provide a unified account for these seemingly disparate phenomena. Copyright © 2017 Elsevier B.V. All rights reserved.

  7. Division of Labor in Vocabulary Structure: Insights From Corpus Analyses.

    PubMed

    Christiansen, Morten H; Monaghan, Padraic

    2016-07-01

    Psychologists have used experimental methods to study language for more than a century. However, only with the recent availability of large-scale linguistic databases has a more complete picture begun to emerge of how language is actually used, and what information is available as input to language acquisition. Analyses of such "big data" have resulted in reappraisals of key assumptions about the nature of language. As an example, we focus on corpus-based research that has shed new light on the arbitrariness of the sign: the longstanding assumption that the relationship between the sound of a word and its meaning is arbitrary. The results reveal a systematic relationship between the sound of a word and its meaning, which is stronger for early acquired words. Moreover, the analyses further uncover a systematic relationship between words and their lexical categories (nouns and verbs sound different from each other), affecting how we learn new words and use them in sentences. Together, these results point to a division of labor between arbitrariness and systematicity in sound-meaning mappings. We conclude by arguing in favor of including "big data" analyses into the language scientist's methodological toolbox. Copyright © 2015 Cognitive Science Society, Inc.

  8. Learn Locally, Act Globally: Learning Language from Variation Set Cues

    PubMed Central

    Onnis, Luca; Waterfall, Heidi R.; Edelman, Shimon

    2011-01-01

    Variation set structure — partial overlap of successive utterances in child-directed speech — has been shown to correlate with progress in children’s acquisition of syntax. We demonstrate the benefits of variation set structure directly: in miniature artificial languages, arranging a certain proportion of utterances in a training corpus in variation sets facilitated word and phrase constituent learning in adults. Our findings have implications for understanding the mechanisms of L1 acquisition by children, and for the development of more efficient algorithms for automatic language acquisition, as well as better methods for L2 instruction. PMID:19019350
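
    The notion of partial overlap between successive utterances can be made concrete with a small sketch. The version below flags adjacent utterance pairs that share at least one word without being identical; the function name, the word-level matching, and the `min_shared` threshold are assumptions for illustration, not the criterion used by the authors.

```python
def variation_set_pairs(utterances, min_shared=1):
    """Return adjacent utterance pairs that partially overlap in their words.

    A pair qualifies when the two utterances share at least `min_shared`
    distinct words and are not word-for-word identical (hypothetical criterion).
    """
    pairs = []
    for prev, curr in zip(utterances, utterances[1:]):
        prev_words = prev.lower().split()
        curr_words = curr.lower().split()
        shared = set(prev_words) & set(curr_words)
        if len(shared) >= min_shared and prev_words != curr_words:
            pairs.append((prev, curr, shared))
    return pairs
```

    Applied to a transcript of child-directed speech, the proportion of utterances that fall into such pairs gives a simple estimate of how much variation set structure the input contains.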

  9. Refining the Use of the Web (and Web Search) as a Language Teaching and Learning Resource

    ERIC Educational Resources Information Center

    Wu, Shaoqun; Franken, Margaret; Witten, Ian H.

    2009-01-01

    The web is a potentially useful corpus for language study because it provides examples of language that are contextualized and authentic, and is large and easily searchable. However, web contents are heterogeneous in the extreme, uncontrolled and hence "dirty," and exhibit features different from the written and spoken texts in other linguistic…

  10. A Corpus of Writing, Pronunciation, Reading, and Listening by Learners of English as a Foreign Language

    ERIC Educational Resources Information Center

    Kotani, Katsunori; Yoshimi, Takehiko; Nanjo, Hiroaki; Isahara, Hitoshi

    2016-01-01

    In order to develop effective teaching methods and computer-assisted language teaching systems for learners of English as a foreign language who need to study the basic linguistic competences for writing, pronunciation, reading, and listening, it is necessary to first investigate which vocabulary and grammar they have or have not yet learned.…

  11. Building Corpus-Informed Word Lists for L2 Vocabulary Learning in Nine Languages

    ERIC Educational Resources Information Center

    Charalabopoulou, Frieda; Gavrilidou, Maria; Kokkinakis, Sofie Johansson; Volodina, Elena

    2012-01-01

    Lexical competence constitutes a crucial aspect in L2 learning, since building a rich repository of words is considered indispensable for successful communication. CALL practitioners have experimented with various kinds of computer-mediated glosses to facilitate L2 vocabulary building in the context of incidental vocabulary learning. Intentional…

  12. A grammar-based semantic similarity algorithm for natural language sentences.

    PubMed

    Lee, Ming Che; Chang, Jia Wei; Hsieh, Tung Cheng

    2014-01-01

    This paper presents a grammar and semantic corpus based similarity algorithm for natural language sentences. Natural language, in opposition to "artificial language", such as computer programming languages, is the language used by the general public for daily communication. Traditional information retrieval approaches, such as vector models, LSA, HAL, or even the ontology-based approaches that extend to include concept similarity comparison instead of cooccurrence terms/words, may not always determine the perfect matching while there is no obvious relation or concept overlap between two natural language sentences. This paper proposes a sentence similarity algorithm that takes advantage of corpus-based ontology and grammatical rules to overcome the addressed problems. Experiments on two famous benchmarks demonstrate that the proposed algorithm has a significant performance improvement in sentences/short-texts with arbitrary syntax and structure.

  13. Deontic Modals in RP-US Visiting Forces Agreement (VFA): A Corpus-Based Analysis

    ERIC Educational Resources Information Center

    Dela Rosa, John Paul Obillos

    2017-01-01

    The marriage between language and the law is apparent in any legal document of whatever purpose. Hence, at present, studies on the language of the law are definitely in vogue. Grounded on Quirk et al. (1985) and Matulewska's (2010) description of deontic modality, this corpus-based linguistic study aimed at analyzing the use of deontic modals in…

  14. Lexical Borrowing from Chinese Languages in Malaysian English

    ERIC Educational Resources Information Center

    Imm, Tan Siew

    2009-01-01

    This paper explores how contact between English and Chinese has resulted in the incorporation of Chinese borrowings into the lexicon of Malaysian English (ME). Using a corpus-based approach, this study analyses a comprehensive range of borrowed features extracted from the Malaysian English Newspaper Corpus (MEN Corpus). Based on the contexts of…

  15. Building a comprehensive syntactic and semantic corpus of Chinese clinical texts.

    PubMed

    He, Bin; Dong, Bin; Guan, Yi; Yang, Jinfeng; Jiang, Zhipeng; Yu, Qiubin; Cheng, Jianyi; Qu, Chunyan

    2017-05-01

    To build a comprehensive corpus covering syntactic and semantic annotations of Chinese clinical texts with corresponding annotation guidelines and methods as well as to develop tools trained on the annotated corpus, which supplies baselines for research on Chinese texts in the clinical domain. An iterative annotation method was proposed to train annotators and to develop annotation guidelines. Then, by using annotation quality assurance measures, a comprehensive corpus was built, containing annotations of part-of-speech (POS) tags, syntactic tags, entities, assertions, and relations. Inter-annotator agreement (IAA) was calculated to evaluate the annotation quality and a Chinese clinical text processing and information extraction system (CCTPIES) was developed based on our annotated corpus. The syntactic corpus consists of 138 Chinese clinical documents with 47,426 tokens and 2612 full parsing trees, while the semantic corpus includes 992 documents that annotated 39,511 entities with their assertions and 7693 relations. IAA evaluation shows that this comprehensive corpus is of good quality, and the system modules are effective. The annotated corpus makes a considerable contribution to natural language processing (NLP) research into Chinese texts in the clinical domain. However, this corpus has a number of limitations. Some additional types of clinical text should be introduced to improve corpus coverage and active learning methods should be utilized to promote annotation efficiency. In this study, several annotation guidelines and an annotation method for Chinese clinical texts were proposed, and a comprehensive corpus with its NLP modules was constructed, providing a foundation for further study of applying NLP techniques to Chinese texts in the clinical domain. Copyright © 2017. Published by Elsevier Inc.

  16. An Empirical Generative Framework for Computational Modeling of Language Acquisition

    ERIC Educational Resources Information Center

    Waterfall, Heidi R.; Sandbank, Ben; Onnis, Luca; Edelman, Shimon

    2010-01-01

    This paper reports progress in developing a computer model of language acquisition in the form of (1) a generative grammar that is (2) algorithmically learnable from realistic corpus data, (3) viable in its large-scale quantitative performance and (4) psychologically real. First, we describe new algorithmic methods for unsupervised learning of…

  17. A Corpus-Informed Text Reconstruction Resource for Learning about the Language of Scientific Abstracts

    ERIC Educational Resources Information Center

    Hartwell, Laura M.; Jacques, Marie-Paule

    2012-01-01

    Both reading and writing abstracts require specific language skills and conceptual capacities, which may challenge advanced learners. This paper draws explicitly upon the "Emergence" and "Scientext" research projects which focused on the lexis of scientific texts in French and English. The teaching objective of the project…

  18. An eye movement corpus study of the age-of-acquisition effect.

    PubMed

    Dirix, Nicolas; Duyck, Wouter

    2017-12-01

    In the present study, we investigated the effects of word-level age of acquisition (AoA) on natural reading. Previous studies, using multiple language modalities, showed that earlier-learned words are recognized, read, spoken, and responded to faster than words learned later in life. Until now, in visual word recognition the experimental materials were limited to single-word or sentence studies. We analyzed the data of the Ghent Eye-tracking Corpus (GECO; Cop, Dirix, Drieghe, & Duyck, in press), an eyetracking corpus of participants reading an entire novel, resulting in the first eye movement megastudy of AoA effects in natural reading. We found that the ages at which specific words were learned indeed influenced reading times, above other important (correlated) lexical variables, such as word frequency and length. Shorter fixations for earlier-learned words were consistently found throughout the reading process, in both early (single-fixation durations, first-fixation durations, gaze durations) and late (total reading times) measures. Implications for theoretical accounts of AoA effects and eye movements are discussed.

  19. Applying Corpus-Based Findings to Form-Focused Instruction: The Case of Reported Speech

    ERIC Educational Resources Information Center

    Barbieri, Federica; Eckhardt, Suzanne E. B.

    2007-01-01

    Arguing that the introduction of corpus linguistics in teaching materials and the language classroom should be informed by theories and principles of SLA, this paper presents a case study illustrating how corpus-based findings on reported speech can be integrated into a form-focused model of instruction. After overviewing previous work which…

  20. The Use of Corpus Examples for Language Comprehension and Production

    ERIC Educational Resources Information Center

    Frankenberg-Garcia, Ana

    2014-01-01

    One of the many new features of English language learners' dictionaries derived from the technological developments that have taken place over recent decades is the presence of corpus-based examples to illustrate the use of words in context. However, empirical studies have generally not been able to produce conclusive evidence about their…

  1. Language experience changes subsequent learning

    PubMed Central

    Onnis, Luca; Thiessen, Erik

    2013-01-01

    What are the effects of experience on subsequent learning? We explored the effects of language-specific word order knowledge on the acquisition of sequential conditional information. Korean and English adults were engaged in a sequence learning task involving three different sets of stimuli: auditory linguistic (nonsense syllables), visual non-linguistic (nonsense shapes), and auditory non-linguistic (pure tones). The forward and backward probabilities between adjacent elements generated two equally probable and orthogonal perceptual parses of the elements, such that any significant preference at test must be due to either general cognitive biases, or prior language-induced biases. We found that language modulated parsing preferences with the linguistic stimuli only. Intriguingly, these preferences are congruent with the dominant word order patterns of each language, as corroborated by corpus analyses, and are driven by probabilistic preferences. Furthermore, although the Korean individuals had received extensive formal explicit training in English and lived in an English-speaking environment, they exhibited statistical learning biases congruent with their native language. Our findings suggest that mechanisms of statistical sequential learning are implicated in language across the lifespan, and experience with language may affect cognitive processes and later learning. PMID:23200510
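
    The forward and backward probabilities between adjacent elements mentioned in this abstract can be estimated from a sequence by simple counting. The sketch below is a generic estimator, not the authors' stimulus-generation procedure; the function name and the dictionary layout are assumptions.

```python
from collections import Counter


def transitional_probabilities(sequence):
    """Estimate forward P(y | x) and backward P(x | y) for adjacent pairs (x, y).

    Forward probability divides the pair count by how often x occurs as the
    first element of a pair; backward probability divides it by how often y
    occurs as the second element.
    """
    pair_counts = Counter(zip(sequence, sequence[1:]))
    first_counts = Counter(sequence[:-1])
    second_counts = Counter(sequence[1:])
    forward = {(x, y): c / first_counts[x] for (x, y), c in pair_counts.items()}
    backward = {(x, y): c / second_counts[y] for (x, y), c in pair_counts.items()}
    return forward, backward
```

    A learner who groups elements by high forward probability and one who groups by high backward probability can arrive at different parses of the same stream, which is the contrast the study exploits.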

  2. A Grammar-Based Semantic Similarity Algorithm for Natural Language Sentences

    PubMed Central

    Chang, Jia Wei; Hsieh, Tung Cheng

    2014-01-01

    This paper presents a grammar and semantic corpus based similarity algorithm for natural language sentences. Natural language, in opposition to “artificial language”, such as computer programming languages, is the language used by the general public for daily communication. Traditional information retrieval approaches, such as vector models, LSA, HAL, or even the ontology-based approaches that extend to include concept similarity comparison instead of cooccurrence terms/words, may not always determine the perfect matching while there is no obvious relation or concept overlap between two natural language sentences. This paper proposes a sentence similarity algorithm that takes advantage of corpus-based ontology and grammatical rules to overcome the addressed problems. Experiments on two famous benchmarks demonstrate that the proposed algorithm has a significant performance improvement in sentences/short-texts with arbitrary syntax and structure. PMID:24982952

  3. Natural-Annotation-based Unsupervised Construction of Korean-Chinese Domain Dictionary

    NASA Astrophysics Data System (ADS)

    Liu, Wuying; Wang, Lin

    2018-03-01

    The large-scale bilingual parallel resource is significant to statistical learning and deep learning in natural language processing. This paper addresses the automatic construction issue of the Korean-Chinese domain dictionary, and presents a novel unsupervised construction method based on the natural annotation in the raw corpus. We firstly extract all Korean-Chinese word pairs from Korean texts according to natural annotations, secondly transform the traditional Chinese characters into the simplified ones, and finally distill out a bilingual domain dictionary after retrieving the simplified Chinese words in an extra Chinese domain dictionary. The experimental results show that our method can automatically build multiple Korean-Chinese domain dictionaries efficiently.

  4. Developing Annotation Solutions for Online Data Driven Learning

    ERIC Educational Resources Information Center

    Perez-Paredes, Pascual; Alcaraz-Calero, Jose M.

    2009-01-01

    Although "annotation" is a widely-researched topic in Corpus Linguistics (CL), its potential role in Data Driven Learning (DDL) has not been addressed in depth by Foreign Language Teaching (FLT) practitioners. Furthermore, most of the research in the use of DDL methods pays little attention to annotation in the design and implementation…

  5. Unlearning Overgenerated "Be" through Data-Driven Learning in the Secondary EFL Classroom

    ERIC Educational Resources Information Center

    Moon, Soyeon; Oh, Sun-Young

    2018-01-01

    This paper reports on the cognitive and affective benefits of data-driven learning (DDL), in which Korean EFL learners at the secondary level notice and unlearn their "overgenerated 'be'" by comparing native English-speaker and learner corpora with guided induction. To select the target language item and compile learner-corpus-based…

  6. A Cognitive Neural Architecture Able to Learn and Communicate through Natural Language.

    PubMed

    Golosio, Bruno; Cangelosi, Angelo; Gamotina, Olesya; Masala, Giovanni Luca

    2015-01-01

    Communicative interactions involve a kind of procedural knowledge that is used by the human brain for processing verbal and nonverbal inputs and for language production. Although considerable work has been done on modeling human language abilities, it has been difficult to bring them together into a comprehensive tabula rasa system compatible with current knowledge of how verbal information is processed in the brain. This work presents a cognitive system, entirely based on a large-scale neural architecture, which was developed to shed light on the procedural knowledge involved in language elaboration. The main component of this system is the central executive, which is a supervising system that coordinates the other components of the working memory. In our model, the central executive is a neural network that takes as input the neural activation states of the short-term memory and yields as output mental actions, which control the flow of information among the working memory components through neural gating mechanisms. The proposed system is capable of learning to communicate through natural language starting from tabula rasa, without any a priori knowledge of the structure of phrases, meaning of words, role of the different classes of words, only by interacting with a human through a text-based interface, using an open-ended incremental learning process. It is able to learn nouns, verbs, adjectives, pronouns and other word classes, and to use them in expressive language. The model was validated on a corpus of 1587 input sentences, based on literature on early language assessment, at the level of a child of about 4 years of age, and produced 521 output sentences, expressing a broad range of language processing functionalities.

  7. Evaluating Corpus Literacy Training for Pre-Service Language Teachers: Six Case Studies

    ERIC Educational Resources Information Center

    Heather, Julian; Helt, Marie

    2012-01-01

    Corpus literacy is the ability to use corpora--large, principled databases of spoken and written language--for language analysis and instruction. While linguists have emphasized the importance of corpus training in teacher preparation programs, few studies have investigated the process of initiating teachers into corpus literacy with the result…

  8. Comparability of a Paper-Based Language Test and a Computer-Based Language Test.

    ERIC Educational Resources Information Center

    Choi, Inn-Chull; Kim, Kyoung Sung; Boo, Jaeyool

    2003-01-01

    Utilizing the Test of English Proficiency, developed by Seoul National University (TEPS), examined comparability between the paper-based language test and the computer-based language test based on content and construct validation employing content analyses based on corpus linguistic techniques in addition to such statistical analyses as…

  9. Effects of Corpus-Aided Language Learning in the EFL Grammar Classroom: A Case Study of Students' Learning Attitudes and Teachers' Perceptions in Taiwan

    ERIC Educational Resources Information Center

    Lin, Ming Huei

    2016-01-01

    This study employed a blended approach to form an extensive assessment of the pedagogical suitability of data-driven learning (DDL) in Taiwan's EFL grammar classrooms. On the one hand, the study quantitatively investigated the effects of DDL compared with that of a traditional deductive approach on the learning motivation and self-efficacy of…

  10. Statistical Learning of Two Artificial Languages Presented Successively: How Conscious?

    PubMed Central

    Franco, Ana; Cleeremans, Axel; Destrebecqz, Arnaud

    2011-01-01

    Statistical learning is assumed to occur automatically and implicitly, but little is known about the extent to which the representations acquired over training are available to conscious awareness. In this study, we focus on whether the knowledge acquired in a statistical learning situation is available to conscious control. Participants were first exposed to an artificial language presented auditorily. Immediately thereafter, they were exposed to a second artificial language. Both languages were composed of the same corpus of syllables and differed only in the transitional probabilities. We first determined that both languages were equally learnable (Experiment 1) and that participants could learn the two languages and differentiate between them (Experiment 2). Then, in Experiment 3, we used an adaptation of the Process-Dissociation Procedure (Jacoby, 1991) to explore whether participants could consciously manipulate the acquired knowledge. Results suggest that statistical information can be used to parse and differentiate between two different artificial languages, and that the resulting representations are available to conscious control. PMID:21960981
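
    The transitional probabilities mentioned in this abstract can be illustrated with a minimal sketch. It assumes the usual forward definition, P(next | current) = count(current, next) / count(current), which the abstract itself does not spell out. The syllable stream and the function name are hypothetical and are not taken from the study.

```python
from collections import Counter


def transitional_probabilities(syllables):
    """Forward transitional probability for each adjacent syllable pair.

    Assumed definition: P(next | current) = count(current, next) / count(current),
    where count(current) only counts positions that have a successor.
    """
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


# Hypothetical toy stream built from two made-up "words", bu-pa and do-ti.
stream = ["bu", "pa", "do", "ti", "bu", "pa", "bu", "pa", "do", "ti"]
tps = transitional_probabilities(stream)

# Within-word transitions (bu -> pa) come out higher than transitions that
# cross a word boundary (pa -> do, pa -> bu).
for (first, second), p in sorted(tps.items()):
    print(f"{first} -> {second}: {p:.2f}")
```

    Two artificial languages built from the same syllables, as described above, would yield different tables from this function because only the ordering of the syllables differs.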

  11. On the creation of a clinical gold standard corpus in Spanish: Mining adverse drug reactions.

    PubMed

    Oronoz, Maite; Gojenola, Koldo; Pérez, Alicia; de Ilarraza, Arantza Díaz; Casillas, Arantza

    2015-08-01

    The advances achieved in Natural Language Processing make it possible to automatically mine information from electronically created documents. Many Natural Language Processing methods that extract information from texts make use of annotated corpora, but these are scarce in the clinical domain due to legal and ethical issues. In this paper we present the creation of the IxaMed-GS gold standard composed of real electronic health records written in Spanish and manually annotated by experts in pharmacology and pharmacovigilance. The experts mainly annotated entities related to diseases and drugs, but also relationships between entities indicating adverse drug reaction events. To help the experts in the annotation task, we adapted a general corpus linguistic analyzer to the medical domain. The quality of the annotation process in the IxaMed-GS corpus has been assessed by measuring the inter-annotator agreement, which was 90.53% for entities and 82.86% for events. In addition, the corpus has been used for the automatic extraction of adverse drug reaction events using machine learning. Copyright © 2015 Elsevier Inc. All rights reserved.

  12. Corpus Linguistics and Language Testing: Navigating Uncharted Waters

    ERIC Educational Resources Information Center

    Egbert, Jesse

    2017-01-01

    The use of corpora and corpus linguistic methods in language testing research is increasing at an accelerated pace. The growing body of language testing research that uses corpus linguistic data is a testament to their utility in test development and validation. Although there are many reasons to be optimistic about the future of using corpus data…

  13. Developing Interactional Competence by Using TV Series in "English as an Additional Language" Classrooms

    ERIC Educational Resources Information Center

    Sert, Olcay

    2009-01-01

    This paper uses a combined methodology to analyse the conversations in supplementary audio-visual materials to be implemented in language teaching classrooms in order to enhance the Interactional Competence (IC) of the learners. Based on a corpus of 90.000 words (Coupling Corpus), the author tries to reveal the potentials of using TV series in …

  14. Hedges Used in Business Emails: A Corpus Study on the Language Strategy of International Business Communication Online

    ERIC Educational Resources Information Center

    Yue, Siwei; Wang, Xuefei

    2014-01-01

    Based on a corpus of 296 authentic business emails produced in computer-mediated business communication from 7 Chinese international trade enterprises, this paper addresses the language strategy applied in CMC (Computer-mediated Communication) by examining the use of hedges. With the emergence of internet, a wider range of hedges are applied…

  15. Corpus-based Customization for an Ontology

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    2010-09-14

    CCAT scans a corpus of text for terms, and computes lexical similarity between corpus terms and taxonomy terms. Based on a set of metrics and a learning algorithm, the system inserts corpus terms into the taxonomy. Conversely, terms from the taxonomy are disambiguated based on the text in the corpus. Unused terms are discarded, and infrequently used senses of terms are collapsed to make the taxonomy more manageable.

  16. Named Entity Recognition in Chinese Clinical Text Using Deep Neural Network.

    PubMed

    Wu, Yonghui; Jiang, Min; Lei, Jianbo; Xu, Hua

    2015-01-01

    Rapid growth in electronic health records (EHRs) use has led to an unprecedented expansion of available clinical data in electronic formats. However, much of the important healthcare information is locked in the narrative documents. Therefore Natural Language Processing (NLP) technologies, e.g., Named Entity Recognition, which identifies the boundaries and types of entities, have been extensively studied to unlock important clinical information in free text. In this study, we investigated a novel deep learning method to recognize clinical entities in Chinese clinical documents using the minimal feature engineering approach. We developed a deep neural network (DNN) to generate word embeddings from a large unlabeled corpus through unsupervised learning and another DNN for the NER task. The experimental results showed that the DNN with word embeddings trained from the large unlabeled corpus outperformed the state-of-the-art CRF model in the minimal feature engineering setting, achieving the highest F1-score of 0.9280. Further analysis showed that word embeddings derived through unsupervised learning from a large unlabeled corpus remarkably improved over the DNN with randomized embedding, denoting the usefulness of unsupervised feature learning.

  17. Knowledge-Driven Event Extraction in Russian: Corpus-Based Linguistic Resources

    PubMed Central

    Solovyev, Valery; Ivanov, Vladimir

    2016-01-01

    Automatic event extraction from text is an important step in knowledge acquisition and knowledge base population. Manual work in the development of an extraction system is indispensable either in corpus annotation or in vocabularies and pattern creation for a knowledge-based system. Recent works have been focused on adaptation of existing systems (for extraction from English texts) to new domains. Event extraction in other languages was not studied due to the lack of resources and algorithms necessary for natural language processing. In this paper we define a set of linguistic resources that are necessary in development of a knowledge-based event extraction system in Russian: a vocabulary of subordination models, a vocabulary of event triggers, and a vocabulary of Frame Elements that are basic building blocks for semantic patterns. We propose a set of methods for creation of such vocabularies in Russian and other languages using Google Books NGram Corpus. The methods are evaluated in development of an event extraction system for Russian. PMID:26955386

  18. Language experience changes subsequent learning.

    PubMed

    Onnis, Luca; Thiessen, Erik

    2013-02-01

    What are the effects of experience on subsequent learning? We explored the effects of language-specific word order knowledge on the acquisition of sequential conditional information. Korean and English adults were engaged in a sequence learning task involving three different sets of stimuli: auditory linguistic (nonsense syllables), visual non-linguistic (nonsense shapes), and auditory non-linguistic (pure tones). The forward and backward probabilities between adjacent elements generated two equally probable and orthogonal perceptual parses of the elements, such that any significant preference at test must be due to either general cognitive biases, or prior language-induced biases. We found that language modulated parsing preferences with the linguistic stimuli only. Intriguingly, these preferences are congruent with the dominant word order patterns of each language, as corroborated by corpus analyses, and are driven by probabilistic preferences. Furthermore, although the Korean individuals had received extensive formal explicit training in English and lived in an English-speaking environment, they exhibited statistical learning biases congruent with their native language. Our findings suggest that mechanisms of statistical sequential learning are implicated in language across the lifespan, and experience with language may affect cognitive processes and later learning. Copyright © 2012 Elsevier B.V. All rights reserved.

  19. Analysing Culture and Interculture in Saudi EFL Textbooks: A Corpus Linguistic Approach

    ERIC Educational Resources Information Center

    Almujaiwel, Sultan

    2018-01-01

    This paper combines corpus processing tools to investigate the cultural elements of Saudi education of English as a foreign language (EFL). The latest Saudi EFL textbooks (2016 onwards) are available in researchable PDF formats. This helps process them through corpus search software tools. The method adopted is based on analysing 20 cultural…

  20. The Nature and Scope of Student Search Strategies in Using a Web Derived Corpus for Writing

    ERIC Educational Resources Information Center

    Franken, Margaret

    2014-01-01

    The use of online language corpora in L2 teaching and learning is gaining momentum largely because corpora are an easily accessed source of language input that potentially provide rich and authentic lexico-grammatical data. This can be of particular use for students' writing as its incorporation can enhance the appearance of native-like fluency.…

  1. Analysis of Textbooks for Teaching Arabic as a Foreign Language in Terms of the Cultural Curriculum

    ERIC Educational Resources Information Center

    Lewicka, Magdalena; Waszau, Anna

    2017-01-01

    The subject of this paper is embedded in the context of the issues of cultural and religious studies on the grounds of the contemporary glottodidactics, since it contains the characteristics of the selected textbooks for learning Arabic as a foreign language in the aspect of the content of the cultural thematic corpus. Three various textbooks,…

  2. Word frequency cues word order in adults: cross-linguistic evidence

    PubMed Central

    Gervain, Judit; Sebastián-Gallés, Núria; Díaz, Begoña; Laka, Itziar; Mazuka, Reiko; Yamane, Naoto; Nespor, Marina; Mehler, Jacques

    2013-01-01

    One universal feature of human languages is the division between grammatical functors and content words. From a learnability point of view, functors might provide entry points or anchors into the syntactic structure of utterances due to their high frequency. Despite its potentially universal scope, this hypothesis has not yet been tested on typologically different languages and on populations of different ages. Here we report a corpus study and an artificial grammar learning experiment testing the anchoring hypothesis in Basque, Japanese, French, and Italian adults. We show that adults are sensitive to the distribution of functors in their native language and use them when learning new linguistic material. However, compared to infants' performance on a similar task, adults exhibit a slightly different behavior, matching the frequency distributions of their native language more closely than infants do. This finding bears on the issue of the continuity of language learning mechanisms. PMID:24106483

  3. Learning grammatical categories from distributional cues: flexible frames for language acquisition.

    PubMed

    St Clair, Michelle C; Monaghan, Padraic; Christiansen, Morten H

    2010-09-01

    Numerous distributional cues in the child's environment may potentially assist in language learning, but what cues are useful to the child and when are these cues utilised? We propose that the most useful source of distributional cue is a flexible frame surrounding the word, where the language learner integrates information from the preceding and the succeeding word for grammatical categorisation. In corpus analyses of child-directed speech together with computational models of category acquisition, we show that these flexible frames are computationally advantageous for language learning, as they benefit from the coverage of bigram information across a large proportion of the language environment as well as exploiting the enhanced accuracy of trigram information. Flexible frames are also consistent with the developmental trajectory of children's sensitivity to different sources of distributional information, and they are therefore a useful and usable information source for supporting the acquisition of grammatical categories. 2010 Elsevier B.V. All rights reserved.
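
    The contrast drawn in this abstract between bigram and trigram information can be sketched as a context-extraction step. This is only an illustration under stated assumptions: the function name, the dictionary keys and the toy utterance are hypothetical, and the sketch does not reproduce the authors' computational models of category acquisition.

```python
from collections import defaultdict


def frame_contexts(tokens):
    """Collect distributional contexts for every word that has both neighbours.

    "preceding" and "succeeding" hold the two bigram contexts that a flexible
    frame would combine; "fixed" holds the stricter trigram context
    (preceding word, succeeding word) for comparison.
    """
    contexts = defaultdict(
        lambda: {"preceding": set(), "succeeding": set(), "fixed": set()}
    )
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word]["preceding"].add(prev)
        contexts[word]["succeeding"].add(nxt)
        contexts[word]["fixed"].add((prev, nxt))
    return contexts


# Hypothetical child-directed-style utterances, flattened into one token list.
tokens = "you want the ball you want the dog".split()
contexts = frame_contexts(tokens)
print(contexts["the"])
```

    In this toy input, "the" has a single preceding context but two succeeding ones, which shows how the two bigram sources can carry different amounts of information about the same word.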

  4. Corpus Approaches to Language Ideology

    ERIC Educational Resources Information Center

    Vessey, Rachelle

    2017-01-01

    This paper outlines how corpus linguistics--and more specifically the corpus-assisted discourse studies approach--can add useful dimensions to studies of language ideology. First, it is argued that the identification of words of high, low, and statistically significant frequency can help in the identification and exploration of language ideologies…

  5. Semantic annotation of consumer health questions.

    PubMed

    Kilicoglu, Halil; Ben Abacha, Asma; Mrabet, Yassine; Shooshan, Sonya E; Rodriguez, Laritza; Masterton, Kate; Demner-Fushman, Dina

    2018-02-06

    Consumers increasingly use online resources for their health information needs. While current search engines can address these needs to some extent, they generally do not take into account that most health information needs are complex and can only fully be expressed in natural language. Consumer health question answering (QA) systems aim to fill this gap. A major challenge in developing consumer health QA systems is extracting relevant semantic content from the natural language questions (question understanding). To develop effective question understanding tools, question corpora semantically annotated for relevant question elements are needed. In this paper, we present a two-part consumer health question corpus annotated with several semantic categories: named entities, question triggers/types, question frames, and question topic. The first part (CHQA-email) consists of relatively long email requests received by the U.S. National Library of Medicine (NLM) customer service, while the second part (CHQA-web) consists of shorter questions posed to MedlinePlus search engine as queries. Each question has been annotated by two annotators. The annotation methodology is largely the same between the two parts of the corpus; however, we also explain and justify the differences between them. Additionally, we provide information about corpus characteristics, inter-annotator agreement, and our attempts to measure annotation confidence in the absence of adjudication of annotations. The resulting corpus consists of 2614 questions (CHQA-email: 1740, CHQA-web: 874). Problems are the most frequent named entities, while treatment and general information questions are the most common question types. Inter-annotator agreement was generally modest: question types and topics yielded highest agreement, while the agreement for more complex frame annotations was lower. Agreement in CHQA-web was consistently higher than that in CHQA-email. Pairwise inter-annotator agreement proved most useful in estimating annotation confidence. To our knowledge, our corpus is the first focusing on annotation of uncurated consumer health questions. It is currently used to develop machine learning-based methods for question understanding. We make the corpus publicly available to stimulate further research on consumer health QA.

  6. Approaching the Linguistic Complexity

    NASA Astrophysics Data System (ADS)

    Drożdż, Stanisław; Kwapień, Jarosław; Orczyk, Adam

    We analyze the rank-frequency distributions of words in selected English and Polish texts. We compare scaling properties of these distributions in both languages. We also study a few small corpora of Polish literary texts and find that for a corpus consisting of texts written by different authors the basic scaling regime is broken more strongly than in the case of comparable corpus consisting of texts written by the same author. Similarly, for a corpus consisting of texts translated into Polish from other languages the scaling regime is broken more strongly than for a comparable corpus of native Polish texts. Moreover, based on the British National Corpus, we consider the rank-frequency distributions of the grammatically basic forms of words (lemmas) tagged with their proper part of speech. We find that these distributions do not scale if each part of speech is analyzed separately. The only part of speech that independently develops a trace of scaling is verbs.
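
    A rank-frequency distribution of the kind analyzed in this abstract can be computed with a short sketch. It assumes that "scaling" refers to a power-law (Zipf-like) relation, so that the slope of log frequency against log rank is the quantity of interest. The function names and the least-squares fit are illustrative choices, not the authors' procedure.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Word frequencies sorted in descending order; index 0 corresponds to rank 1."""
    return sorted(Counter(tokens).values(), reverse=True)


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two ranks. A slope close to -1 would correspond to the
    classic Zipf relation, frequency proportional to 1 / rank.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Hypothetical toy text; a real analysis would use a full corpus.
frequencies = rank_frequency("a b a c a b".split())
print(frequencies, loglog_slope(frequencies))
```

    Applying the same two functions separately to the tokens of each part of speech would give one slope per category, which is the kind of comparison the abstract describes for lemmas in the British National Corpus.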

  7. The Effectiveness of Using Corpus-Based Materials in Vocabulary Teaching

    ERIC Educational Resources Information Center

    Paker, Turan; Özcan, Yeliz Ergül

    2017-01-01

    Our study aimed at finding out the effectiveness of corpus-based vocabulary teaching activities as well as students' attitudes towards concordance-based materials when corpus-based tasks in English vocabulary learning are used. The study was conducted in a preparatory school in a private university. The participants were 28 intermediate level…

  8. How a Corpus-Based Study of the Factors which Influence Collocation Can Help in the Teaching of Business English

    ERIC Educational Resources Information Center

    Walker, Crayton

    2011-01-01

    In this paper I use two case studies to show how corpus linguistics can be used to help in the teaching of business English. Senior managers in global companies often find themselves having to do their job in a foreign language. Given that language is one of the key tools of management, the senior managers are normally very keen to develop a…

  9. Subtitle-Based Word Frequencies as the Best Estimate of Reading Behavior: The Case of Greek

    PubMed Central

    Dimitropoulou, Maria; Duñabeitia, Jon Andoni; Avilés, Alberto; Corral, José; Carreiras, Manuel

    2010-01-01

    Previous evidence has shown that word frequencies calculated from corpora based on film and television subtitles can readily account for reading performance, since the language used in subtitles greatly approximates everyday language. The present study examines this issue in a society with increased exposure to subtitle reading. We compiled SUBTLEX-GR, a subtitled-based corpus consisting of more than 27 million Modern Greek words, and tested to what extent subtitle-based frequency estimates and those taken from a written corpus of Modern Greek account for the lexical decision performance of young Greek adults who are exposed to subtitle reading on a daily basis. Results showed that SUBTLEX-GR frequency estimates effectively accounted for participants’ reading performance in two different visual word recognition experiments. More importantly, different analyses showed that frequencies estimated from a subtitle corpus explained the obtained results significantly better than traditional frequencies derived from written corpora. PMID:21833273

  10. Distributional Language Learning: Mechanisms and Models of Category Formation.

    PubMed

    Aslin, Richard N; Newport, Elissa L

    2014-09-01

    In the past 15 years, a substantial body of evidence has confirmed that a powerful distributional learning mechanism is present in infants, children, adults and (at least to some degree) in nonhuman animals as well. The present article briefly reviews this literature and then examines some of the fundamental questions that must be addressed for any distributional learning mechanism to operate effectively within the linguistic domain. In particular, how does a naive learner determine the number of categories that are present in a corpus of linguistic input and what distributional cues enable the learner to assign individual lexical items to those categories? Contrary to the hypothesis that distributional learning and category (or rule) learning are separate mechanisms, the present article argues that these two seemingly different processes---acquiring specific structure from linguistic input and generalizing beyond that input to novel exemplars---actually represent a single mechanism. Evidence in support of this single-mechanism hypothesis comes from a series of artificial grammar-learning studies that not only demonstrate that adults can learn grammatical categories from distributional information alone, but that the specific patterning of distributional information among attested utterances in the learning corpus enables adults to generalize to novel utterances or to restrict generalization when unattested utterances are consistently absent from the learning corpus. Finally, a computational model of distributional learning that accounts for the presence or absence of generalization is reviewed and the implications of this model for linguistic-category learning are summarized.

  11. Planned experiments and corpus based research play a complementary role. Comment on "Dependency distance: A new perspective on syntactic patterns in natural languages" by Haitao Liu et al.

    NASA Astrophysics Data System (ADS)

    Vasishth, Shravan

    2017-07-01

    This interesting and informative review by Liu and colleagues [17] in this issue covers the full spectrum of research on the idea that in natural language, dependency distance tends to be small. The authors discuss two distinct research threads: experimental work from psycholinguistics on online processes in comprehension and production, and text-corpus studies of dependency length distributions.

  12. Explaining Quantitative Variation in the Rate of Optional Infinitive Errors across Languages: A Comparison of MOSAIC and the Variational Learning Model

    ERIC Educational Resources Information Center

    Freudenthal, Daniel; Pine, Julian; Gobet, Fernando

    2010-01-01

    In this study, we use corpus analysis and computational modelling techniques to compare two recent accounts of the OI stage: Legate & Yang's (2007) Variational Learning Model and Freudenthal, Pine & Gobet's (2006) Model of Syntax Acquisition in Children. We first assess the extent to which each of these accounts can explain the level of OI errors…

  13. An Investigation of Language Teachers' Explorations of the Use of Corpus Tools in the English for Academic Purposes (EAP) Class

    ERIC Educational Resources Information Center

    Bunting, John David

    2013-01-01

    Despite claims that the use of corpus tools can have a major impact in language classrooms (e.g., Conrad, 2000, 2004; Davies, 2004; O'Keefe, McCarthy, & Carter, 2007; Sinclair, 2004b; Tsui, 2004), many language teachers express apparent apathy or even resistance towards adding corpus tools to their repertoire (Cortes, 2013b). This study…

  14. Academic writing in a corpus of 4th grade science notebooks: An analysis of student language use and adult expectations of the genres of school science

    NASA Astrophysics Data System (ADS)

    Esquinca, Alberto

    This is a study of language use in the context of an inquiry-based science curriculum in which conceptual understanding ratings are used to split texts into groups of "successful" and "unsuccessful" texts. "Successful" texts could include known features of science language. The data sources are 420 texts generated by students in 14 classrooms from three school districts, culled from a prior study on the effectiveness of science notebooks to assess understanding, together with the aforementioned ratings. In science notebooks, students write in the process of learning (here, a unit on electricity). The analytical framework is systemic functional linguistics (Halliday and Matthiessen, 2004; Eggins, 2004), specifically the concepts of genre, register and nominalization. Genre classification involves an analysis of the purpose and register features in the text (Schleppegrell, 2004). The use of features of the scientific academic register, namely the use of relational processes and nominalization (Halliday and Martin, 1993), requires transitivity analysis and noun analysis. Transitivity analysis, consisting of the identification of the process type, is conducted on 4737 ranking clauses. A manual count of each noun used in the corpus allows for a typology of nouns. Four school science genres (procedures, procedural recounts, reports and explanations) are found. Most texts (85.4%) are factual, and 14.1% are classified as explanations, the analytical genre. Logistic regression analysis indicates that there is no significant probability that the texts classified as explanation are placed in the group of "successful" texts. In addition, material process clauses predominate in the corpus, followed by relational process clauses. Results of a logistic regression analysis indicate that there is a significant probability (Chi square = 15.23, p < .0001) that texts with a high rate of relational processes are placed in the group of "successful" texts. In addition, 59.5% of 6511 nouns are references to physical materials, followed by references to abstract concepts (35.54%). Only two of the concept nouns were found to be nominalized referents in definition model sentences. In sum, the corpus has recognizable genres and features science language, and relational processes are more prevalent in "successful" texts. However, the pervasive feature of science language, nominalization, is scarce.

  15. Reflections on the Grammatical Category of "Before" "After" and "Since" Introducing Non-Finite "-ing" Clauses: A Corpus Approach

    ERIC Educational Resources Information Center

    He, Qingshun

    2016-01-01

    English language learners may be confused in identifying the grammatical category of such conjunctive expressions as "before," "after" and "since" introducing non-finite "-ing" clauses. In this article, we will conduct a corpus-based investigation of hypotactic conjunctions and conjunctive prepositions…

  16. An Infinite Mixture Model for Coreference Resolution in Clinical Notes

    PubMed Central

    Liu, Sijia; Liu, Hongfang; Chaudhary, Vipin; Li, Dingcheng

    2016-01-01

    It is widely acknowledged that natural language processing is indispensable to process electronic health records (EHRs). However, poor performance in relation detection tasks, such as coreference (linguistic expressions pertaining to the same entity/event), may affect the quality of EHR processing. Hence, there is a critical need to advance the research for relation detection from EHRs. Most of the clinical coreference resolution systems are based on either supervised machine learning or rule-based methods. The need for a manually annotated corpus hampers the use of such systems at large scale. In this paper, we present an infinite mixture model method using definite sampling to resolve coreferent relations among mentions in clinical notes. A similarity measure function is proposed to determine the coreferent relations. Our system achieved a 0.847 F-measure on the i2b2 2011 coreference corpus. These promising results and the unsupervised nature of the method make it possible to apply the system in big-data clinical settings. PMID:27595047

  17. Quantum neural network based machine translator for Hindi to English.

    PubMed

    Narayan, Ravi; Singh, V P; Chakraverty, S

    2014-01-01

    This paper presents the machine learning based machine translation system for Hindi to English, which learns the semantically correct corpus. The quantum neural based pattern recognizer is used to recognize and learn the pattern of corpus, using the information of part of speech of individual word in the corpus, like a human. The system performs the machine translation using its knowledge gained during the learning by inputting the pair of sentences of Devnagri-Hindi and English. To analyze the effectiveness of the proposed approach, 2600 sentences have been evaluated during simulation and evaluation. The accuracy achieved on BLEU score is 0.7502, on NIST score is 6.5773, on ROUGE-L score is 0.9233, and on METEOR score is 0.5456, which is significantly higher in comparison with Google Translation and Bing Translation for Hindi to English Machine Translation.

  18. Learned vocal and breathing behavior in an enculturated gorilla.

    PubMed

    Perlman, Marcus; Clark, Nathaniel

    2015-09-01

    We describe the repertoire of learned vocal and breathing-related behaviors (VBBs) performed by the enculturated gorilla Koko. We examined a large video corpus of Koko and observed 439 VBBs spread across 161 bouts. Our analysis shows that Koko exercises voluntary control over the performance of nine distinctive VBBs, which involve variable coordination of her breathing, larynx, and supralaryngeal articulators like the tongue and lips. Each of these behaviors is performed in the context of particular manual action routines and gestures. Based on these and other findings, we suggest that vocal learning and the ability to exercise volitional control over vocalization, particularly in a multimodal context, might have figured relatively early into the evolution of language, with some rudimentary capacity in place at the time of our last common ancestor with great apes.

  19. Learning a generative probabilistic grammar of experience: a process-level model of language acquisition.

    PubMed

    Kolodny, Oren; Lotem, Arnon; Edelman, Shimon

    2015-03-01

    We introduce a set of biologically and computationally motivated design choices for modeling the learning of language, or of other types of sequential, hierarchically structured experience and behavior, and describe an implemented system that conforms to these choices and is capable of unsupervised learning from raw natural-language corpora. Given a stream of linguistic input, our model incrementally learns a grammar that captures its statistical patterns, which can then be used to parse or generate new data. The grammar constructed in this manner takes the form of a directed weighted graph, whose nodes are recursively (hierarchically) defined patterns over the elements of the input stream. We evaluated the model in seventeen experiments, grouped into five studies, which examined, respectively, (a) the generative ability of grammar learned from a corpus of natural language, (b) the characteristics of the learned representation, (c) sequence segmentation and chunking, (d) artificial grammar learning, and (e) certain types of structure dependence. The model's performance largely vindicates our design choices, suggesting that progress in modeling language acquisition can be made on a broad front, ranging from issues of generativity to the replication of human experimental findings, by bringing biological and computational considerations, as well as lessons from prior efforts, to bear on the modeling approach. Copyright © 2014 Cognitive Science Society, Inc.

  20. A Contribution to the Method of Studying Anglicisms in European Languages.

    ERIC Educational Resources Information Center

    Filipovic, Rudolf

    1974-01-01

    Language contact and word borrowing can best be studied in the behavior of bilingual speakers. To establish the universals in language contact and borrowing we must work on a rich and representative corpus. English will be the only lending language, and various European languages the receivers. To narrow the corpus, certain characteristics are…

  1. A Corpus-Based Study of Theme and Thematic Progression in English and Russian Non-Translated Texts and in Russian Translated Texts

    ERIC Educational Resources Information Center

    Alekseyenko, Nataliya V.

    2013-01-01

    The present study is a corpus-based comparative investigation of Theme and thematic progression in English and Russian. While monolingual thematic studies have a long history in Linguistics, comparative studies are relatively few, in particular for the given language pair. In addition to filling the existing gap in the field of Translation…

  2. Juvenile zebra finches learn the underlying structural regularities of their fathers’ song

    PubMed Central

    Menyhart, Otília; Kolodny, Oren; Goldstein, Michael H.; DeVoogd, Timothy J.; Edelman, Shimon

    2015-01-01

    Natural behaviors, such as foraging, tool use, social interaction, birdsong, and language, exhibit branching sequential structure. Such structure should be learnable if it can be inferred from the statistics of early experience. We report that juvenile zebra finches learn such sequential structure in song. Song learning in finches has been extensively studied, and it is generally believed that young males acquire song by imitating tutors (Zann, 1996). Variability in the order of elements in an individual’s mature song occurs, but the degree to which variation in a zebra finch’s song follows statistical regularities has not been quantified, as it has typically been dismissed as production error (Sturdy et al., 1999). Allowing for the possibility that such variation in song is non-random and learnable, we applied a novel analytical approach, based on graph-structured finite-state grammars, to each individual’s full corpus of renditions of songs. This method does not assume syllable-level correspondence between individuals. We find that song variation can be described by probabilistic finite-state graph grammars that are individually distinct, and that the graphs of juveniles are more similar to those of their fathers than to those of other adult males. This grammatical learning is a new parallel between birdsong and language. Our method can be applied across species and contexts to analyze complex variable learned behaviors, as distinct as foraging, tool use, and language. PMID:26005428

  3. Quantum Neural Network Based Machine Translator for Hindi to English

    PubMed Central

    Singh, V. P.; Chakraverty, S.

    2014-01-01

    This paper presents the machine learning based machine translation system for Hindi to English, which learns the semantically correct corpus. The quantum neural based pattern recognizer is used to recognize and learn the pattern of corpus, using the information of part of speech of individual word in the corpus, like a human. The system performs the machine translation using its knowledge gained during the learning by inputting the pair of sentences of Devnagri-Hindi and English. To analyze the effectiveness of the proposed approach, 2600 sentences have been evaluated during simulation and evaluation. The accuracy achieved on BLEU score is 0.7502, on NIST score is 6.5773, on ROUGE-L score is 0.9233, and on METEOR score is 0.5456, which is significantly higher in comparison with Google Translation and Bing Translation for Hindi to English Machine Translation. PMID:24977198

  4. Corpus Planning for the Southern Peruvian Quechua Language.

    ERIC Educational Resources Information Center

    Coronel-Molina, Serafin M.

    1997-01-01

    The discussion of corpus planning for the Southern Quechua language variety of Peru examines issues of graphization, standardization, modernization, and renovation of Quechua in the face of increasing domination by the Spanish language. The efforts of three major groups of linguists and other scholars working on language planning in Peru, and the…

  5. Corpus-Based Optimization of Language Models Derived from Unification Grammars

    NASA Technical Reports Server (NTRS)

    Rayner, Manny; Hockey, Beth Ann; James, Frankie; Bratt, Harry; Bratt, Elizabeth O.; Gawron, Mark; Goldwater, Sharon; Dowding, John; Bhagat, Amrita

    2000-01-01

    We describe a technique which makes it feasible to improve the performance of a language model derived from a manually constructed unification grammar, using low-quality untranscribed speech data and a minimum of human annotation. The method is evaluated on a medium-vocabulary spoken language command and control task.

  6. Social Media and Language Processing: How Facebook and Twitter Provide the Best Frequency Estimates for Studying Word Recognition.

    PubMed

    Herdağdelen, Amaç; Marelli, Marco

    2017-05-01

    Corpus-based word frequencies are one of the most important predictors in language processing tasks. Frequencies based on conversational corpora (such as movie subtitles) are shown to better capture the variance in lexical decision tasks compared to traditional corpora. In this study, we show that frequencies computed from social media are currently the best frequency-based estimators of lexical decision reaction times (up to 3.6% increase in explained variance). The results are robust (observed for Twitter- and Facebook-based frequencies on American English and British English datasets) and are still substantial when we control for corpus size. © 2016 The Authors. Cognitive Science published by Wiley Periodicals, Inc. on behalf of Cognitive Science Society.

  7. Playing with Word Endings: Morphological Variation in the Learning of Russian Noun Inflections

    ERIC Educational Resources Information Center

    Kempe, Vera; Brooks, Patricia J.; Mironova, Natalija; Pershukova, Angelina; Fedorova, Olga

    2007-01-01

    This paper documents the occurrence of form variability through diminutive "wordplay", and examines whether this variability facilitates or hinders morphology acquisition in a richly inflected language. First, in a longitudinal speech corpus of eight Russian mothers conversing with their children (1.6-3.6), and with an adult, the use of diminutive…

  8. Using a Stance Corpus to Learn about Effective Authorial Stance-Taking: A Textlinguistic Approach

    ERIC Educational Resources Information Center

    Chang, Peichin

    2012-01-01

    Presenting a persuasive authorial stance is a major challenge for second language (L2) writers in writing academic research. Failure to present an effective authorial stance often results in poor evaluation, which compromises a writer's research potential. This study proposes a "textlinguistic" approach to advanced academic writing to complement a…

  9. From Shared Contexts to Syntactic Categories: The Role of Distributional Information in Learning Linguistic Form-Classes

    ERIC Educational Resources Information Center

    Reeder, Patricia A.; Newport, Elissa L.; Aslin, Richard N.

    2013-01-01

    A fundamental component of language acquisition involves organizing words into grammatical categories. Previous literature has suggested a number of ways in which this categorization task might be accomplished. Here we ask whether the patterning of the words in a corpus of linguistic input ("distributional information") is sufficient, along with a…

  10. Can Learning a Foreign Language Foster Analytic Thinking? Evidence from Chinese EFL Learners' Writings.

    PubMed

    Jiang, Jingyang; Ouyang, Jinghui; Liu, Haitao

    2016-01-01

    Language is not only the representation of thinking, but also shapes thinking. Studies on bilinguals suggest that a foreign language plays an important and unconscious role in thinking. In this study, the software Linguistic Inquiry and Word Count 2007 was used to investigate whether the learning of English as a foreign language (EFL) can foster Chinese high school students' English analytic thinking (EAT) through the analysis of their English writings with our self-built corpus. It was found that: (1) learning English can foster Chinese learners' EAT. Chinese EFL learners' ability of making distinctions, degree of cognitive complexity and degree of thinking activeness have all improved along with the increase of their English proficiency and their age; (2) there exist differences between Chinese EFL learners' EAT and that of English native speakers, i.e., English native speakers are better in the ability of making distinctions and degree of thinking activeness. These findings suggest that the best EFL learners in high schools have gained native-like analytic thinking through six years' English learning and are able to switch their cognitive styles as needed.

  11. An Individual Subjectivist Critique of the Use of Corpus Linguistics to Inform Pedagogical Materials

    ERIC Educational Resources Information Center

    Richards, Kendall; Pilcher, Nick

    2016-01-01

    Corpus linguistics, or the gathering together of language into a body for analysis and development of materials, is claimed to be an assured, established method (or field) that valuably informs pedagogical materials and knowledge of language (e.g. Ädel 2010; Gardner & Nesi, 2013). The fundamental validity of corpus linguistics is rarely, if…

  12. The Wildcat Corpus of Native- and Foreign-Accented English: Communicative Efficiency across Conversational Dyads with Varying Language Alignment Profiles

    PubMed Central

    Van Engen, Kristin J.; Baese-Berk, Melissa; Baker, Rachel E.; Choi, Arim; Kim, Midam; Bradlow, Ann R.

    2012-01-01

    This paper describes the development of the Wildcat Corpus of native- and foreign-accented English, a corpus containing scripted and spontaneous speech recordings from 24 native speakers of American English and 52 non-native speakers of English. The core element of this corpus is a set of spontaneous speech recordings, for which a new method of eliciting dialogue-based, laboratory-quality speech recordings was developed (the Diapix task). Dialogues between two native speakers of English, between two non-native speakers of English (with either shared or different L1s), and between one native and one non-native speaker of English are included and analyzed in terms of general measures of communicative efficiency. The overall finding was that pairs of native talkers were most efficient, followed by mixed native/non-native pairs and non-native pairs with shared L1. Non-native pairs with different L1s were least efficient. These results support the hypothesis that successful speech communication depends both on the alignment of talkers to the target language and on the alignment of talkers to one another in terms of native language background. PMID:21313992

  13. A Sample Corpus Integration in Language Teacher Education through Coursebook Evaluation

    ERIC Educational Resources Information Center

    Asik, Asuman

    2017-01-01

    Interest in the use of corpora in language teaching has increased in the past two decades. Many corpora have been utilized for several purposes in language classrooms directly or indirectly. In spite of the increasing awareness towards the use of corpora and the corpus tools, language teacher education programs still do not include corpus…

  14. CRIE: An automated analyzer for Chinese texts.

    PubMed

    Sung, Yao-Ting; Chang, Tao-Hsing; Lin, Wei-Chun; Hsieh, Kuan-Sheng; Chang, Kuo-En

    2016-12-01

    Textual analysis has been applied to various fields, such as discourse analysis, corpus studies, text leveling, and automated essay evaluation. Several tools have been developed for analyzing texts written in alphabetic languages such as English and Spanish. However, currently there is no tool available for analyzing Chinese-language texts. This article introduces a tool for the automated analysis of simplified and traditional Chinese texts, called the Chinese Readability Index Explorer (CRIE). Composed of four subsystems and incorporating 82 multilevel linguistic features, CRIE is able to conduct the major tasks of segmentation, syntactic parsing, and feature extraction. Furthermore, the integration of linguistic features with machine learning models enables CRIE to provide leveling and diagnostic information for texts in language arts, texts for learning Chinese as a foreign language, and texts with domain knowledge. The usage and validation of the functions provided by CRIE are also introduced.

  15. How arbitrary is language?

    PubMed Central

    Monaghan, Padraic; Shillcock, Richard C.; Christiansen, Morten H.; Kirby, Simon

    2014-01-01

    It is a long established convention that the relationship between sounds and meanings of words is essentially arbitrary—typically the sound of a word gives no hint of its meaning. However, there are numerous reported instances of systematic sound–meaning mappings in language, and this systematicity has been claimed to be important for early language development. In a large-scale corpus analysis of English, we show that sound–meaning mappings are more systematic than would be expected by chance. Furthermore, this systematicity is more pronounced for words involved in the early stages of language acquisition and reduces in later vocabulary development. We propose that the vocabulary is structured to enable systematicity in early language learning to promote language acquisition, while also incorporating arbitrariness for later language in order to facilitate communicative expressivity and efficiency. PMID:25092667

  16. Automated Classification of Radiology Reports for Acute Lung Injury: Comparison of Keyword and Machine Learning Based Natural Language Processing Approaches.

    PubMed

    Solti, Imre; Cooke, Colin R; Xia, Fei; Wurfel, Mark M

    2009-11-01

    This paper compares the performance of keyword and machine learning-based chest x-ray report classification for Acute Lung Injury (ALI). ALI mortality is approximately 30 percent. High mortality is, in part, a consequence of delayed manual chest x-ray classification. An automated system could reduce the time to recognize ALI and lead to reductions in mortality. For our study, 96 and 857 chest x-ray reports in two corpora were labeled by domain experts for ALI. We developed a keyword and a Maximum Entropy-based classification system. Word unigram and character n-grams provided the features for the machine learning system. The Maximum Entropy algorithm with character 6-gram achieved the highest performance (Recall=0.91, Precision=0.90 and F-measure=0.91) on the 857-report corpus. This study has shown that for the classification of ALI chest x-ray reports, the machine learning approach is superior to the keyword based system and achieves comparable results to highest performing physician annotators.

  17. Automated Classification of Radiology Reports for Acute Lung Injury: Comparison of Keyword and Machine Learning Based Natural Language Processing Approaches

    PubMed Central

    Solti, Imre; Cooke, Colin R.; Xia, Fei; Wurfel, Mark M.

    2010-01-01

    This paper compares the performance of keyword and machine learning-based chest x-ray report classification for Acute Lung Injury (ALI). ALI mortality is approximately 30 percent. High mortality is, in part, a consequence of delayed manual chest x-ray classification. An automated system could reduce the time to recognize ALI and lead to reductions in mortality. For our study, 96 and 857 chest x-ray reports in two corpora were labeled by domain experts for ALI. We developed a keyword and a Maximum Entropy-based classification system. Word unigram and character n-grams provided the features for the machine learning system. The Maximum Entropy algorithm with character 6-gram achieved the highest performance (Recall=0.91, Precision=0.90 and F-measure=0.91) on the 857-report corpus. This study has shown that for the classification of ALI chest x-ray reports, the machine learning approach is superior to the keyword based system and achieves comparable results to highest performing physician annotators. PMID:21152268
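
    The abstract above names word unigrams and character n-grams as the features fed to the classifier. A minimal sketch of that kind of feature extraction follows; the normalization steps (lowercasing, whitespace collapsing), the function names, and the sample report text are assumptions made for illustration and are not taken from the study.

```python
from collections import Counter


def word_unigrams(text: str) -> Counter:
    """Count whitespace-separated, lowercased word tokens."""
    return Counter(text.lower().split())


def char_ngrams(text: str, n: int = 6) -> Counter:
    """Count overlapping character n-grams.

    The text is lowercased and runs of whitespace are collapsed to one
    space first (an assumed normalization, not one stated in the abstract).
    Strings shorter than n yield an empty Counter.
    """
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


# Hypothetical report fragment, invented for this example.
report = "Bilateral  infiltrates noted"
unigram_features = word_unigrams(report)
sixgram_features = char_ngrams(report, n=6)
```

    Either Counter can be passed to a feature vectorizer of the reader's choice; the classifier itself is out of scope for this sketch.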

  18. The Effect of Data-Driven Approach to Teaching Vocabulary on Iranian Students' Learning of English Vocabulary

    ERIC Educational Resources Information Center

    Barabadi, Elyas; Khajavi, Yaser

    2017-01-01

    Corpus-based data-driven learning (DDL) is an innovation in teaching and learning new vocabulary for EFL students. Using teacher-prepared materials obtained from COCA corpus, the goal of the present study is to compare DDL and traditional methods of teaching vocabulary like consultation of dictionary or a grammar book. As such, two intact classes…

  19. Using Edit Distance to Analyse Errors in a Natural Language to Logic Translation Corpus

    ERIC Educational Resources Information Center

    Barker-Plummer, Dave; Dale, Robert; Cox, Richard; Romanczuk, Alex

    2012-01-01

    We have assembled a large corpus of student submissions to an automatic grading system, where the subject matter involves the translation of natural language sentences into propositional logic. Of the 2.3 million translation instances in the corpus, 286,000 (approximately 12%) are categorized as being in error. We want to understand the nature of…
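
    The title of this record refers to edit distance as the tool for comparing erroneous translations with correct ones. The truncated abstract does not say which variant is used, so the following is only a sketch of the classic Levenshtein distance over token sequences; the token lists shown are hypothetical and the authors' actual measure may differ (for example, it may operate on formula trees).

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance: minimum number of single-element insertions,
    deletions, and substitutions needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # deleting the first i elements of a yields the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical tokenized formulas: a student answer versus a reference answer.
student = ["A", "&", "B"]
reference = ["A", "|", "B"]
distance = edit_distance(student, reference)
```

    Working on token lists rather than raw strings keeps multi-character symbols from being counted as several edits.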

  20. Linguistic Corpora and Language Teaching.

    ERIC Educational Resources Information Center

    Murison-Bowie, Simon

    1996-01-01

    Examines issues raised by corpus linguistics concerning the description of language. The article argues that it is necessary to start from correct descriptions of linguistic units and the contexts in which they occur. Corpus linguistics has joined with language teaching by sharing a recognition of the importance of a larger, schematic view of…

  1. Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches.

    PubMed

    Grouin, Cyril; Zweigenbaum, Pierre

    2013-01-01

    In this paper, we present a comparison of two approaches to automatically de-identify medical records written in French: a rule-based system and a machine-learning based system using a conditional random fields (CRF) formalism. Both systems have been designed to process nine identifiers in a corpus of medical records in cardiology. We performed two evaluations: first, on 62 documents in cardiology, and on 10 documents in foetopathology - produced by optical character recognition (OCR) - to evaluate the robustness of our systems. We achieved a 0.843 (rule-based) and 0.883 (machine-learning) exact match overall F-measure in cardiology. While the rule-based system allowed us to achieve good results on nominative (first and last names) and numerical data (dates, phone numbers, and zip codes), the machine-learning approach performed best on more complex categories (postal addresses, hospital names, medical devices, and towns). On the foetopathology corpus, although our systems have not been designed for this corpus and despite OCR character recognition errors, we obtained promising results: a 0.681 (rule-based) and 0.638 (machine-learning) exact-match overall F-measure. This demonstrates that existing tools can be applied to process new documents of lower quality.

  2. SyllabO+: A new tool to study sublexical phenomena in spoken Quebec French.

    PubMed

    Bédard, Pascale; Audet, Anne-Marie; Drouin, Patrick; Roy, Johanna-Pascale; Rivard, Julie; Tremblay, Pascale

    2017-10-01

    Sublexical phonotactic regularities in language have a major impact on language development, as well as on speech processing and production throughout the entire lifespan. To understand the impact of phonotactic regularities on speech and language functions at the behavioral and neural levels, it is essential to have access to oral language corpora to study these complex phenomena in different languages. Yet, probably because of their complexity, oral language corpora remain less common than written language corpora. This article presents the first corpus and database of spoken Quebec French syllables and phones: SyllabO+. This corpus contains phonetic transcriptions of over 300,000 syllables (over 690,000 phones) extracted from recordings of 184 healthy adult native Quebec French speakers, ranging in age from 20 to 97 years. To ensure the representativeness of the corpus, these recordings were made in both formal and familiar communication contexts. Phonotactic distributional statistics (e.g., syllable and co-occurrence frequencies, percentages, percentile ranks, transition probabilities, and pointwise mutual information) were computed from the corpus. An open-access online application to search the database was developed, and is available at www.speechneurolab.ca/syllabo. In this article, we present a brief overview of the corpus, as well as the syllable and phone databases, and we discuss their practical applications in various fields of research, including cognitive neuroscience, psycholinguistics, neurolinguistics, experimental psychology, phonetics, and phonology. Nonacademic practical applications are also discussed, including uses in speech-language pathology.
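
    Two of the statistics listed in this abstract, transition probabilities and pointwise mutual information, can be illustrated with a short sketch. The formulas below are the standard textbook definitions, P(b | a) = count(a, b) / count(a as first element) and PMI(a, b) = log2(P(a, b) / (P(a) P(b))); whether SyllabO+ computes them in exactly this way is an assumption, and the syllable sequences are invented toy data.

```python
import math
from collections import Counter


def phonotactic_stats(utterances):
    """Return per-bigram count, forward transition probability, and PMI (base 2).

    `utterances` is a list of syllable lists; bigrams never cross utterance
    boundaries.
    """
    unigrams = Counter()
    bigrams = Counter()
    for syllables in utterances:
        unigrams.update(syllables)
        bigrams.update(zip(syllables, syllables[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    # How often each syllable occurs as the first member of a bigram.
    first_counts = Counter()
    for (first, _), count in bigrams.items():
        first_counts[first] += count

    stats = {}
    for (a, b), count in bigrams.items():
        p_joint = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        stats[(a, b)] = {
            "count": count,
            "transition_prob": count / first_counts[a],
            "pmi": math.log2(p_joint / (p_a * p_b)),
        }
    return stats


# Invented toy data; not drawn from the SyllabO+ corpus.
toy_utterances = [["ba", "na", "na"], ["ba", "to"]]
toy_stats = phonotactic_stats(toy_utterances)
```

    In the toy data "ba" is followed once by "na" and once by "to", so each of those transitions gets probability 0.5.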

  3. Supporting English-Medium Pedagogy through an Online Corpus of Science and Engineering Lectures

    ERIC Educational Resources Information Center

    Kunioshi, Nílson; Noguchi, Judy; Tojo, Kazuko; Hayashi, Hiroko

    2016-01-01

    As English-medium instruction (EMI) spreads around the world, university teachers and students who are non-native speakers of English (NNS) need to put much effort into the delivery or reception of content. Construction of scientific meaning in the process of learning is already complex when instruction is delivered in the first language of the…

  4. Learners' Writing Skills in French: Corpus Consultation and Learner Evaluation

    ERIC Educational Resources Information Center

    O'Sullivan, Ide; Chambers, Angela

    2006-01-01

    While the use of corpora and concordancing in the language-learning environment began as early as 1969 (McEnery & Wilson, 1997, p. 12), it was the work in the 1980s of Tim Johns (1986) and others which brought it to public attention. Important developments occurred in the 1990s, beginning with publications advocating the use of corpora and…

  5. Hello, Who is Calling?: Can Words Reveal the Social Nature of Conversations?

    PubMed

    Stark, Anthony; Shafran, Izhak; Kaye, Jeffrey

    2012-01-01

    This study aims to infer the social nature of conversations from their content automatically. To place this work in context, our motivation stems from the need to understand how social disengagement affects cognitive decline or depression among older adults. For this purpose, we collected a comprehensive and naturalistic corpus comprising all the incoming and outgoing telephone calls from 10 subjects over the duration of a year. As a first step, we learned a binary classifier to filter out business related conversation, achieving an accuracy of about 85%. This classification task provides a convenient tool to probe the nature of telephone conversations. We evaluated the utility of openings and closings in differentiating personal calls, and find that empirical results on a large corpus do not support the hypotheses by Schegloff and Sacks that personal conversations are marked by unique closing structures. For classifying different types of social relationships such as family vs other, we investigated features related to language use (entropy), hand-crafted dictionary (LIWC) and topics learned using unsupervised latent Dirichlet models (LDA). Our results show that the posteriors over topics from LDA provide consistently higher accuracy (60-81%) compared to LIWC or language use features in distinguishing different types of conversations.

  6. Non-Empirically Based Teaching Materials Can Be Positively Misleading: A Case of Modal Auxiliary Verbs in Malaysian English Language Textbooks

    ERIC Educational Resources Information Center

    Khojasteh, Laleh; Kafipour, Reza

    2012-01-01

    Using a corpus approach, a growing number of researchers have blamed textbooks for neglecting important information on the use of grammatical structures in natural English. Likewise, the prescribed Malaysian English textbooks used in schools are reportedly prepared through a process of material development that involves intuition. Hence, a corpus-based…

  7. The Hebrew CHILDES corpus: transcription and morphological analysis

    PubMed Central

    Albert, Aviad; MacWhinney, Brian; Nir, Bracha

    2014-01-01

    We present a corpus of transcribed spoken Hebrew that reflects spoken interactions between children and adults. The corpus is an integral part of the CHILDES database, which distributes similar corpora for over 25 languages. We introduce a dedicated transcription scheme for the spoken Hebrew data that is sensitive to both the phonology and the standard orthography of the language. We also introduce a morphological analyzer that was specifically developed for this corpus. The analyzer adequately covers the entire corpus, producing detailed correct analyses for all tokens. Evaluation on a new corpus reveals high coverage as well. Finally, we describe a morphological disambiguation module that selects the correct analysis of each token in context. The result is a high-quality morphologically-annotated CHILDES corpus of Hebrew, along with a set of tools that can be applied to new corpora. PMID:25419199

  8. How do children acquire early grammar and build multiword utterances? A corpus study of French children aged 2 to 4.

    PubMed

    Le Normand, M T; Moreno-Torres, I; Parisse, C; Dellatolas, G

    2013-01-01

    In the last 50 years, researchers have debated over the lexical or grammatical nature of children's early multiword utterances. Due to methodological limitations, the issue remains controversial. This corpus study explores the effect of grammatical, lexical, and pragmatic categories on mean length of utterances (MLU). A total of 312 speech samples from high-low socioeconomic status (SES) French-speaking children aged 2-4 years were annotated with a part-of-speech-tagger. Multiple regression analyses show that grammatical categories, particularly the most frequent subcategories, were the best predictors of MLU both across age and SES groups. These findings support the view that early language learning is guided by grammatical rather than by lexical words. This corpus research design can be used for future cross-linguistic and cross-pathology studies. © 2012 The Authors. Child Development © 2012 Society for Research in Child Development, Inc.

  9. Unsupervised learning of natural languages

    PubMed Central

    Solan, Zach; Horn, David; Ruppin, Eytan; Edelman, Shimon

    2005-01-01

    We address the problem, fundamental to linguistics, bioinformatics, and certain other disciplines, of using corpora of raw symbolic sequential data to infer underlying rules that govern their production. Given a corpus of strings (such as text, transcribed speech, chromosome or protein sequence data, sheet music, etc.), our unsupervised algorithm recursively distills from it hierarchically structured patterns. The adios (automatic distillation of structure) algorithm relies on a statistical method for pattern extraction and on structured generalization, two processes that have been implicated in language acquisition. It has been evaluated on artificial context-free grammars with thousands of rules, on natural languages as diverse as English and Chinese, and on protein data correlating sequence with function. This unsupervised algorithm is capable of learning complex syntax, generating grammatical novel sentences, and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics. PMID:16087885

  10. Unsupervised learning of natural languages.

    PubMed

    Solan, Zach; Horn, David; Ruppin, Eytan; Edelman, Shimon

    2005-08-16

    We address the problem, fundamental to linguistics, bioinformatics, and certain other disciplines, of using corpora of raw symbolic sequential data to infer underlying rules that govern their production. Given a corpus of strings (such as text, transcribed speech, chromosome or protein sequence data, sheet music, etc.), our unsupervised algorithm recursively distills from it hierarchically structured patterns. The adios (automatic distillation of structure) algorithm relies on a statistical method for pattern extraction and on structured generalization, two processes that have been implicated in language acquisition. It has been evaluated on artificial context-free grammars with thousands of rules, on natural languages as diverse as English and Chinese, and on protein data correlating sequence with function. This unsupervised algorithm is capable of learning complex syntax, generating grammatical novel sentences, and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics.

  11. A Corpus-Based Approach for Automatic Thai Unknown Word Recognition Using Boosting Techniques

    NASA Astrophysics Data System (ADS)

    Techo, Jakkrit; Nattee, Cholwich; Theeramunkong, Thanaruk

    While classification techniques can be applied to automatic unknown word recognition in a language without word boundaries, they face the problem of unbalanced datasets, in which the number of positive unknown word candidates is much smaller than that of negative candidates. To solve this problem, this paper presents a corpus-based approach that introduces a so-called group-based ranking evaluation technique into ensemble learning in order to generate a sequence of classification models that later collaborate to select the most probable unknown word from multiple candidates. Given a classification model, the group-based ranking evaluation (GRE) is applied to construct a training dataset for learning the succeeding model, by weighting each of its candidates according to their ranks and correctness when the candidates of an unknown word are considered as one group. A number of experiments have been conducted on a large Thai medical text to evaluate the performance of the proposed group-based ranking evaluation approach, namely V-GRE, compared to the conventional naïve Bayes classifier and our vanilla version without ensemble learning. As a result, the proposed method achieves an accuracy of 90.93±0.50% when the first rank is selected and 97.26±0.26% when the top-ten candidates are considered, that is, improvements of 8.45% and 6.79% over the conventional record-based naïve Bayes classifier and the vanilla version. A further result, obtained by applying only the best features, shows 93.93±0.22% and up to 98.85±0.15% accuracy for top-1 and top-10, respectively. These are improvements of 3.97% and 9.78% over naïve Bayes and the vanilla version. Finally, an error analysis is given.

  12. The Use of Epistemic Markers as a Means of Hedging and Boosting in the Discourse of L1 and L2 Speakers of Modern Greek: A Corpus-Based Study in Informal Letter-Writing

    ERIC Educational Resources Information Center

    Efstathiadi, Lia

    2010-01-01

    The paper investigates the semantic area of Epistemic Modality in Modern Greek, by means of a corpus-based research. A comparative, quantitative study was performed between written corpora (informal letter-writing) of non-native informants with various language backgrounds and Greek native speakers. A number of epistemic markers were selected for…

  13. A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC

    PubMed Central

    Clematide, Simon; Akhondi, Saber A; van Mulligen, Erik M; Rebholz-Schuhmann, Dietrich

    2015-01-01

    Objective: To create a multilingual gold-standard corpus for biomedical concept recognition. Materials and methods: We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified Medical Language System and covering a wide range of semantic groups. To reduce the annotation workload, automatically generated preannotations were provided. Individual annotations were automatically harmonized and then adjudicated, and cross-language consistency checks were carried out to arrive at the final annotations. Results: The number of final annotations was 5530. Inter-annotator agreement scores indicate good agreement (median F-score 0.79), and are similar to those between individual annotators and the gold standard. The automatically generated harmonized annotation set for each language performed equally well as the best annotator for that language. Discussion: The use of automatic preannotations, harmonized annotations, and parallel corpora helped to keep the manual annotation efforts manageable. The inter-annotator agreement scores provide a reference standard for gauging the performance of automatic annotation techniques. Conclusion: To our knowledge, this is the first gold-standard corpus for biomedical concept recognition in languages other than English. Other distinguishing features are the wide variety of semantic groups that are being covered, and the diversity of text genres that were annotated. PMID:25948699
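
    The agreement figure reported in this abstract is an F-score between annotation sets. The sketch below shows one common way such a score can be computed for a pair of annotators; it assumes exact matching of (start, end, concept) triples, which the abstract does not specify, and the offsets and concept identifiers are hypothetical.

```python
def f_score(reference, candidate):
    """F-score (harmonic mean of precision and recall) between two annotation sets.

    Assumption: two annotations agree only if their (start, end, concept)
    triples are identical.
    """
    reference, candidate = set(reference), set(candidate)
    true_positives = len(reference & candidate)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(candidate)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: (start offset, end offset, concept id).
annotator_a = {(0, 7, "C1"), (12, 20, "C2"), (25, 30, "C3")}
annotator_b = {(0, 7, "C1"), (12, 20, "C2"), (40, 45, "C4"), (50, 55, "C5")}
agreement = f_score(annotator_a, annotator_b)
print(round(agreement, 3))
```

    A median over all annotator pairs could then be taken with `statistics.median`.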

  14. Redundancy and reduction: Speakers manage syntactic information density

    PubMed Central

    Florian Jaeger, T.

    2010-01-01

    A principle of efficient language production based on information theoretic considerations is proposed: Uniform Information Density predicts that language production is affected by a preference to distribute information uniformly across the linguistic signal. This prediction is tested against data from syntactic reduction. A single multilevel logit model analysis of naturally distributed data from a corpus of spontaneous speech is used to assess the effect of information density on complementizer that-mentioning, while simultaneously evaluating the predictions of several influential alternative accounts: availability, ambiguity avoidance, and dependency processing accounts. Information density emerges as an important predictor of speakers’ preferences during production. As information is defined in terms of probabilities, it follows that production is probability-sensitive, in that speakers’ preferences are affected by the contextual probability of syntactic structures. The merits of a corpus-based approach to the study of language production are discussed as well. PMID:20434141
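
    The abstract states that information is defined in terms of probabilities. A minimal sketch of that relationship, assuming the standard information-theoretic definition (information in bits is the negative base-2 logarithm of a unit's contextual probability), is given below; the probabilities are invented for illustration and are not taken from the study.

```python
import math


def surprisal_bits(probability):
    """Information content, in bits, of an event with the given probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Hypothetical contextual probabilities of a complement clause onset.
predictable = surprisal_bits(0.5)      # expected continuation
unexpected = surprisal_bits(0.125)     # less expected continuation
print(predictable, unexpected)
```

    Under this definition a less probable continuation carries more information, which is the quantity Uniform Information Density predicts speakers will spread out, for instance by mentioning an optional "that".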

  15. Word Order in Russian Sign Language

    ERIC Educational Resources Information Center

    Kimmelman, Vadim

    2012-01-01

    In this paper the results of an investigation of word order in Russian Sign Language (RSL) are presented. A small corpus of narratives based on comic strips by nine native signers was analyzed and a picture-description experiment (based on Volterra et al. 1984) was conducted with six native signers. The results are the following: the most frequent…

  16. A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries.

    PubMed

    Jiang, Min; Chen, Yukun; Liu, Mei; Rosenbloom, S Trent; Mani, Subramani; Denny, Joshua C; Xu, Hua

    2011-01-01

    The authors' goal was to develop and evaluate machine-learning-based approaches to extracting clinical entities (including medical problems, tests, and treatments, as well as their asserted status) from hospital discharge summaries written using natural language. This project was part of the 2010 Center of Informatics for Integrating Biology and the Bedside/Veterans Affairs (VA) natural-language-processing challenge. The authors implemented a machine-learning-based named entity recognition system for clinical text and systematically evaluated the contributions of different types of features and ML algorithms, using a training corpus of 349 annotated notes. Based on the results from training data, the authors developed a novel hybrid clinical entity extraction system, which integrated heuristic rule-based modules with the ML-based named entity recognition module. The authors applied the hybrid system to the concept extraction and assertion classification tasks in the challenge and evaluated its performance using a test data set with 477 annotated notes. Standard measures including precision, recall, and F-measure were calculated using the evaluation script provided by the Center of Informatics for Integrating Biology and the Bedside/VA challenge organizers. The overall performance for all three types of clinical entities and all six types of assertions across 477 annotated notes was considered as the primary metric in the challenge. Systematic evaluation on the training set showed that Conditional Random Fields outperformed Support Vector Machines, and semantic information from existing natural-language-processing systems largely improved performance, although contributions from different types of features varied. The authors' hybrid entity extraction system achieved a maximum overall F-score of 0.8391 for concept extraction (ranked second) and 0.9313 for assertion classification (ranked fourth, but not statistically different than the first three systems) on the test data set in the challenge.

  17. Using a Corpus-Informed Pedagogical Intervention to Develop Language Awareness toward Appropriate Lexicogrammatical Choices

    ERIC Educational Resources Information Center

    Fernandez, Julieta; Yuldashev, Aziz

    2015-01-01

    The corpus-informed pedagogical intervention described in this article was developed for an advanced English as a Second Language (ESL) course designed for prospective International Teaching Assistants (ITAs) and implemented over the course of two class periods. Its primary goal was to offer students opportunities to gain language awareness of…

  18. Language Assessment and the Inseparability of Lexis and Grammar: Focus on the Construct of Speaking

    ERIC Educational Resources Information Center

    Römer, Ute

    2017-01-01

    This paper aims to connect recent corpus research on phraseology with current language testing practice. It discusses how corpora and corpus-analytic techniques can illuminate central aspects of speech and help in conceptualizing the notion of lexicogrammar in second language speaking assessment. The description of speech and some of its core…

  19. Can Learning a Foreign Language Foster Analytic Thinking?—Evidence from Chinese EFL Learners' Writings

    PubMed Central

    Jiang, Jingyang; Ouyang, Jinghui; Liu, Haitao

    2016-01-01

    Language is not only the representation of thinking, but also shapes thinking. Studies on bilinguals suggest that a foreign language plays an important and unconscious role in thinking. In this study, the software Linguistic Inquiry and Word Count 2007 was used to investigate whether the learning of English as a foreign language (EFL) can foster Chinese high school students’ English analytic thinking (EAT) through the analysis of their English writings with our self-built corpus. It was found that: (1) learning English can foster Chinese learners’ EAT. Chinese EFL learners’ ability of making distinctions, degree of cognitive complexity and degree of thinking activeness have all improved along with the increase of their English proficiency and their age; (2) there are differences between Chinese EFL learners’ EAT and that of English native speakers, i.e., English native speakers are better in the ability of making distinctions and degree of thinking activeness. These findings suggest that the best EFL learners in high schools have gained native-like analytic thinking through six years’ English learning and are able to switch their cognitive styles as needed. PMID:27741270

  20. Don Hammill: A Personal Perspective on the Field of Learning Disabilities, 3-Tier, and RTI

    ERIC Educational Resources Information Center

    Intervention in School and Clinic, 2010

    2010-01-01

    Don D. Hammill received all of his formal education in Texas schools, culminating in a doctorate in educational psychology-special education from the University of Texas at Austin in 1963. He had previously served as a teacher in the Corpus Christi (Texas) public schools and as a speech and language therapist in the Deer Park (Texas) public…

  1. Minority Languages in the Linguistic Landscape of Tourism: The Case of Catalan in Mallorca

    ERIC Educational Resources Information Center

    Bruyèl-Olmedo, Antonio; Juan-Garau, Maria

    2015-01-01

    The relationship between language and tourism is still an incipient area of enquiry. Within it, the study of the linguistic landscape (LL) of holiday destinations affords useful information on the role that minority languages play in the tourist-host interplay, which has received scant attention. Based on a 736-picture corpus, the paper addresses…

  2. An intelligent tutoring system that generates a natural language dialogue using dynamic multi-level planning.

    PubMed

    Woo, Chong Woo; Evens, Martha W; Freedman, Reva; Glass, Michael; Shim, Leem Seop; Zhang, Yuemei; Zhou, Yujian; Michael, Joel

    2006-09-01

    The objective of this research was to build an intelligent tutoring system capable of carrying on a natural language dialogue with a student who is solving a problem in physiology. Previous experiments have shown that students need practice in qualitative causal reasoning to internalize new knowledge and to apply it effectively and that they learn by putting their ideas into words. Analysis of a corpus of 75 hour-long tutoring sessions carried on in keyboard-to-keyboard style by two professors of physiology at Rush Medical College tutoring first-year medical students provided the rules used in tutoring strategies and tactics, parsing, and text generation. The system presents the student with a perturbation to the blood pressure, asks for qualitative predictions of the changes produced in seven important cardiovascular variables, and then launches a dialogue to correct any errors and to probe for possible misconceptions. The natural language understanding component uses a cascade of finite-state machines. The generation is based on lexical functional grammar. Results of experiments with pretests and posttests have shown that using the system for an hour produces significant learning gains and also that even this brief use improves the student's ability to solve problems more than reading textual material on the topic. Student surveys tell us that students like the system and feel that they learn from it. The system is now in regular use in the first-year physiology course at Rush Medical College. We conclude that the CIRCSIM-Tutor system demonstrates that intelligent tutoring systems can implement effective natural language dialogue with current language technology.

  3. Quantitative Investigations in Hungarian Phonotactics and Syllable Structure

    ERIC Educational Resources Information Center

    Grimes, Stephen M.

    2010-01-01

    This dissertation investigates statistical properties of segment collocation and syllable geometry of the Hungarian language. A corpus and dictionary based approach to studying language phonologies is outlined. In order to conduct research on Hungarian, a phonological lexicon was created by compiling existing dictionaries and corpora and using a…

  4. [Behavioral and cognitive profile of corpus callosum agenesia - Review].

    PubMed

    Lábadi, Beatrix; Beke, Anna Maria

    2016-11-30

    Agenesis of the corpus callosum is a relatively frequent congenital cerebral malformation that includes dysplasia and total or partial absence of the corpus callosum. It can occur in isolated form, without accompanying somatic or central nervous system abnormalities, or it can be associated with other central nervous system malformations. The behavioral and cognitive outcome is more favorable for patients with isolated agenesis of the corpus callosum than for those with the syndromic form. The aim of this study is to review recent research on behavioral and social-cognitive functions in individuals with agenesis of the corpus callosum. Developmental delay is common, especially in higher-order cognitive and social functions. An internet database search was performed to identify publications on the subject. Fifty-five publications in English met the criteria. These studies reported deficits in language, social cognition and emotions in individuals with agenesis of the corpus callosum, which is known as primary corpus callosum syndrome. The results indicate that individuals with agenesis of the corpus callosum have deficiencies in the social-cognitive domain (recognition of emotions, weakness in paralinguistic aspects of language and mentalizing abilities). The impaired social cognition can be manifested in behavioral problems like autism and attention deficit hyperactivity disorder.

  5. The words children hear: Picture books and the statistics for language learning

    PubMed Central

    Montag, Jessica L.; Jones, Michael N.; Smith, Linda B.

    2015-01-01

    Young children learn language from the speech they hear. Previous work suggests that the statistical diversity of words and of linguistic contexts is associated with better language outcomes. One potential source of lexical diversity is the text of picture books that caregivers read aloud to children. Many parents begin reading to their children shortly after birth, so this is potentially an important source of linguistic input for many children. We constructed a corpus of 100 children’s picture books and compared word type and token counts to a matched sample of child-directed speech. Overall, the picture books contained more unique word types than the child-directed speech. Further, individual picture books generally contained more unique word types than length-matched, child-directed conversations. The text of picture books may be an important source of vocabulary for young children, and these findings suggest a mechanism that underlies the language benefits associated with reading to children. PMID:26243292
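
    The comparison described in this abstract rests on word type and token counts. The sketch below illustrates the distinction on a toy text; the tokenization rule (lowercased runs of letters and apostrophes) is an assumption made here for illustration, not the procedure used by the authors.

```python
import re


def type_token_counts(text):
    """Return (number of unique word types, number of word tokens).

    Assumption: a token is a lowercased run of ASCII letters or apostrophes.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)), len(tokens)


# Invented sample text, not drawn from the picture-book corpus.
types, tokens = type_token_counts("The cat sat. The cat ran!")
print(types, tokens)
```

    Because type counts grow with sample size, a comparison between sources is only meaningful on length-matched samples, as the abstract notes.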

  6. The Words Children Hear: Picture Books and the Statistics for Language Learning.

    PubMed

    Montag, Jessica L; Jones, Michael N; Smith, Linda B

    2015-09-01

    Young children learn language from the speech they hear. Previous work suggests that greater statistical diversity of words and of linguistic contexts is associated with better language outcomes. One potential source of lexical diversity is the text of picture books that caregivers read aloud to children. Many parents begin reading to their children shortly after birth, so this is potentially an important source of linguistic input for many children. We constructed a corpus of 100 children's picture books and compared word type and token counts in that sample and a matched sample of child-directed speech. Overall, the picture books contained more unique word types than the child-directed speech. Further, individual picture books generally contained more unique word types than length-matched, child-directed conversations. The text of picture books may be an important source of vocabulary for young children, and these findings suggest a mechanism that underlies the language benefits associated with reading to children. © The Author(s) 2015.

  7. Social Media and Language Processing: How Facebook and Twitter Provide the Best Frequency Estimates for Studying Word Recognition

    ERIC Educational Resources Information Center

    Herdagdelen, Amaç; Marelli, Marco

    2017-01-01

    Corpus-based word frequencies are one of the most important predictors in language processing tasks. Frequencies based on conversational corpora (such as movie subtitles) are shown to better capture the variance in lexical decision tasks compared to traditional corpora. In this study, we show that frequencies computed from social media are…

  8. Language Planning: Corpus Planning.

    ERIC Educational Resources Information Center

    Baldauf, Richard B., Jr.

    1989-01-01

    Focuses on the historical and sociolinguistic studies that illuminate corpus planning processes. These processes are broken down and discussed under two categories: those related to the establishment of norms, referred to as codification, and those related to the extension of the linguistic functions of language, referred to as elaboration. (60…

  9. A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC.

    PubMed

    Kors, Jan A; Clematide, Simon; Akhondi, Saber A; van Mulligen, Erik M; Rebholz-Schuhmann, Dietrich

    2015-09-01

    To create a multilingual gold-standard corpus for biomedical concept recognition. We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified Medical Language System and covering a wide range of semantic groups. To reduce the annotation workload, automatically generated preannotations were provided. Individual annotations were automatically harmonized and then adjudicated, and cross-language consistency checks were carried out to arrive at the final annotations. The number of final annotations was 5530. Inter-annotator agreement scores indicate good agreement (median F-score 0.79), and are similar to those between individual annotators and the gold standard. The automatically generated harmonized annotation set for each language performed equally well as the best annotator for that language. The use of automatic preannotations, harmonized annotations, and parallel corpora helped to keep the manual annotation efforts manageable. The inter-annotator agreement scores provide a reference standard for gauging the performance of automatic annotation techniques. To our knowledge, this is the first gold-standard corpus for biomedical concept recognition in languages other than English. Other distinguishing features are the wide variety of semantic groups that are being covered, and the diversity of text genres that were annotated. © The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association.

  10. Exploiting Language Models to Classify Events from Twitter

    PubMed Central

    Vo, Duc-Thuan; Hai, Vo Thuan; Ock, Cheol-Young

    2015-01-01

    Classifying events is challenging in Twitter because tweet texts contain a large amount of temporal data with a lot of noise and various kinds of topics. In this paper, we propose a method to classify events from Twitter. We first find the distinguishing terms between tweets in events and measure their similarities with learning language models such as ConceptNet and a latent Dirichlet allocation method for selectional preferences (LDA-SP), which have been widely studied based on large text corpora within computational linguistic relations. The relationship of term words in tweets is discovered by checking them under each model. We then propose a method to compute the similarity between tweets based on tweets' features, including common term words and relationships among their distinguishing term words. This similarity is explicit and convenient to apply in k-nearest neighbor techniques for classification. We carried out experiments on the Edinburgh Twitter Corpus to show that our method achieves competitive results for classifying events. PMID:26451139
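
    To make the final classification step concrete, the sketch below applies k-nearest neighbor voting over a tweet-to-tweet similarity. It is not the authors' similarity measure: Jaccard overlap of term sets is swapped in as a simple stand-in for the ConceptNet- and LDA-SP-based relations, and the tweets, terms and labels are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard overlap of two term collections (stand-in similarity)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_classify(query_terms, labeled, k=3):
    """Label a tweet by majority vote among its k most similar labeled tweets."""
    ranked = sorted(labeled, key=lambda item: jaccard(query_terms, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled tweets, each reduced to a set of term words.
labeled_tweets = [
    ({"earthquake", "magnitude", "tsunami"}, "disaster"),
    ({"flood", "rescue", "earthquake"}, "disaster"),
    ({"goal", "match", "league"}, "sports"),
    ({"match", "final", "score"}, "sports"),
]
predicted = knn_classify({"earthquake", "rescue", "magnitude"}, labeled_tweets, k=3)
print(predicted)
```

    Any other symmetric similarity function could be passed in place of `jaccard` without changing the voting logic.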

  11. A computational cognitive model of syntactic priming.

    PubMed

    Reitter, David; Keller, Frank; Moore, Johanna D

    2011-01-01

    The psycholinguistic literature has identified two syntactic adaptation effects in language production: rapidly decaying short-term priming and long-lasting adaptation. To explain both effects, we present an ACT-R model of syntactic priming based on a wide-coverage, lexicalized syntactic theory that explains priming as facilitation of lexical access. In this model, two well-established ACT-R mechanisms, base-level learning and spreading activation, account for long-term adaptation and short-term priming, respectively. Our model simulates incremental language production and in a series of modeling studies, we show that it accounts for (a) the inverse frequency interaction; (b) the absence of a decay in long-term priming; and (c) the cumulativity of long-term adaptation. The model also explains the lexical boost effect and the fact that it only applies to short-term priming. We also present corpus data that verify a prediction of the model, that is, that the lexical boost affects all lexical material, rather than just heads. Copyright © 2011 Cognitive Science Society, Inc.

  12. Computer-assisted Lemmatisation of a Cornish Text Corpus for Lexicographical Purposes

    ERIC Educational Resources Information Center

    Mills, Jon

    2002-01-01

    This project sets out to discover and develop techniques for the lemmatisation of a historical corpus of the Cornish language in order that a lemmatised dictionary macrostructure can be generated from the corpus. The system should be capable of uniquely identifying every lexical item that is attested in the corpus. A survey of published and…

  13. A Semi-Supervised Learning Approach to Enhance Health Care Community–Based Question Answering: A Case Study in Alcoholism

    PubMed Central

    Klabjan, Diego; Jonnalagadda, Siddhartha Reddy

    2016-01-01

    Background: Community-based question answering (CQA) sites play an important role in addressing health information needs. However, a significant number of posted questions remain unanswered. Automatically answering the posted questions can provide a useful source of information for Web-based health communities. Objective: In this study, we developed an algorithm to automatically answer health-related questions based on past questions and answers (QA). We also aimed to understand which information embedded within Web-based health content provides good features for identifying valid answers. Methods: Our proposed algorithm uses information retrieval techniques to identify candidate answers from resolved QA. To rank these candidates, we implemented a semi-supervised learning algorithm that extracts the best answer to a question. We assessed this approach on a curated corpus from Yahoo! Answers and compared against a rule-based string similarity baseline. Results: On our dataset, the semi-supervised learning algorithm has an accuracy of 86.2%. Unified medical language system–based (health related) features used in the model enhance the algorithm’s performance by approximately 8%. A reasonably high rate of accuracy is obtained given that the data are considerably noisy. Important features distinguishing a valid answer from an invalid answer include text length, number of stop words contained in a test question, a distance between the test question and other questions in the corpus, and a number of overlapping health-related terms between questions. Conclusions: Overall, our automated QA system based on historical QA pairs is shown to be effective according to the dataset in this case study. It is developed for general use in the health care domain, which can also be applied to other CQA sites. PMID:27485666

  14. Cultivating Effective Corpus Use by Language Learners

    ERIC Educational Resources Information Center

    Kennedy, Claire; Miceli, Tiziana

    2017-01-01

    While there is widespread agreement on the expected benefits of hands-on access to corpora for language learners, reports abound of the difficulties involved in realising those benefits in practice. A particular focus of discussion is the challenge of transferring the skills of the corpus linguist to learners, so that they can explore this type of…

  15. John Sinclair (1933-2007): The Search for Units of Meaning--Sinclair on Empirical Semantics

    ERIC Educational Resources Information Center

    Stubbs, Michael

    2009-01-01

    John McHardy Sinclair has made major contributions to applied linguistics in three related areas: language in education, discourse analysis, and corpus-assisted lexicography. This article discusses the far-reaching implications for language description of this third area. The corpus-assisted search methodology provides empirical evidence for an…

  16. Tashkeela: Novel corpus of Arabic vocalized texts, data for auto-diacritization systems.

    PubMed

    Zerrouki, Taha; Balla, Amar

    2017-04-01

    Arabic diacritics are often omitted in Arabic script. This omission is a handicap for new learners reading Arabic, for text-to-speech conversion systems, and for the reading and semantic analysis of Arabic texts. Automatic diacritization systems are the best solution to this issue, but such automation needs resources such as diacritized texts to train and evaluate the systems. In this paper, we describe our corpus of Arabic diacritized texts. This corpus is called Tashkeela. It can be used as a linguistic resource for natural language processing tasks such as automatic diacritization, disambiguation, and feature and data extraction. The corpus is freely available; it contains 75 million fully vocalized words, mainly from 97 books of classical and modern Arabic. The corpus was collected from manually vocalized texts using a web crawling process.

  17. Natural language processing pipelines to annotate BioC collections with an application to the NCBI disease corpus

    PubMed Central

    Comeau, Donald C.; Liu, Haibin; Islamaj Doğan, Rezarta; Wilbur, W. John

    2014-01-01

    BioC is a new format and associated code libraries for sharing text and annotations. We have implemented BioC natural language preprocessing pipelines in two popular programming languages: C++ and Java. The current implementations interface with the well-known MedPost and Stanford natural language processing tool sets. The pipeline functionality includes sentence segmentation, tokenization, part-of-speech tagging, lemmatization and sentence parsing. These pipelines can be easily integrated along with other BioC programs into any BioC compliant text mining systems. As an application, we converted the NCBI disease corpus to BioC format, and the pipelines have successfully run on this corpus to demonstrate their functionality. Code and data can be downloaded from http://bioc.sourceforge.net. Database URL: http://bioc.sourceforge.net PMID:24935050

  18. A human language corpus for interstellar message construction

    NASA Astrophysics Data System (ADS)

    Elliott, John

    2011-02-01

    The aim of HuLCC (the human language chorus corpus), is to provide a resource of sufficient size to facilitate inter-language analysis by incorporating languages from all the major language families: for the first time all aspects of typology will be incorporated within a single corpus, adhering to a consistent grammatical classification and granularity, which historically adopt a plethora of disparate schemes. An added feature will be the inclusion of a common text element, which will be translated across all languages, to provide a precise comparable thread for detailed linguistic analysis for translation strategies and a mechanism by which these mappings can be explicitly achieved. Methods developed to solve unambiguous mappings across these languages can then be adopted for any subsequent message authored by the SETI community. Initially, it is planned to provide at least 20,000 words for each chosen language, as this amount of text exceeds the point where randomly generated text can be disambiguated from natural language and is of sufficient size useful for message transmission [1] (Elliot, 2002). This paper details the design of this resource, which ultimately will be made available to SETI upon its completion, and discusses issues 'core' to any message construction.

  19. Vocabulary Size Research at Victoria University of Wellington, New Zealand

    ERIC Educational Resources Information Center

    Nation, Paul; Coxhead, Averil

    2014-01-01

    The English Language Institute (now the School of Linguistics and Applied Language Studies) at Victoria University of Wellington has a long history of corpus-based vocabulary research, especially after the arrival of the second director of the institute, H. V. George, and the appointment of Helen Barnard, whom George knew in India. George's…

  20. What Do We Want EAP Teaching Materials for?

    ERIC Educational Resources Information Center

    Harwood, Nigel

    2005-01-01

    This paper explores the various anti-textbook arguments in the literature to determine their relevance to the field of EAP. I distinguish between what I call a "strong" and a "weak" anti-textbook line, then review the corpus-based studies which compare the language EAP textbooks teach with corpora of the language academic writers use. After…

  1. Talking about German Verb Particles Identified in Concordance Lines--From Spontaneous to Expert-Like Metatalk

    ERIC Educational Resources Information Center

    Schaeffer-Lacroix, Eva

    2016-01-01

    Johns reports in his text "Kibbitzing one-to-ones" (1997) that corpus-informed metatalk with a foreign language expert helps apprentice writers to make progress in independent text revision. Expecting this progress to be based on the development of expert-like ways to observe language features, I integrated Johns' so-called kibbitzing…

  2. The Wildcat Corpus of Native- and Foreign-Accented English: Communicative Efficiency across Conversational Dyads with Varying Language Alignment Profiles

    ERIC Educational Resources Information Center

    Van Engen, Kristin J.; Baese-Berk, Melissa; Baker, Rachel E.; Choi, Arim; Kim, Midam; Bradlow, Ann R.

    2010-01-01

    This paper describes the development of the Wildcat Corpus of native- and foreign-accented English, a corpus containing scripted and spontaneous speech recordings from 24 native speakers of American English and 52 non-native speakers of English. The core element of this corpus is a set of spontaneous speech recordings, for which a new method of…

  3. Assessing the use of multiple sources in student essays.

    PubMed

    Hastings, Peter; Hughes, Simon; Magliano, Joseph P; Goldman, Susan R; Lawless, Kimberly

    2012-09-01

    The present study explored different approaches for automatically scoring student essays that were written on the basis of multiple texts. Specifically, these approaches were developed to classify whether or not important elements of the texts were present in the essays. The first was a simple pattern-matching approach called "multi-word" that allowed for flexible matching of words and phrases in the sentences. The second technique was latent semantic analysis (LSA), which was used to compare student sentences to original source sentences using its high-dimensional vector-based representation. Finally, the third was a machine-learning technique, support vector machines, which learned a classification scheme from the corpus. The results of the study suggested that the LSA-based system was superior for detecting the presence of explicit content from the texts, but the multi-word pattern-matching approach was better for detecting inferences outside or across texts. These results suggest that the best approach for analyzing essays of this nature should draw upon multiple natural language processing approaches.

  4. Natural language processing pipelines to annotate BioC collections with an application to the NCBI disease corpus.

    PubMed

    Comeau, Donald C; Liu, Haibin; Islamaj Doğan, Rezarta; Wilbur, W John

    2014-01-01

    BioC is a new format and associated code libraries for sharing text and annotations. We have implemented BioC natural language preprocessing pipelines in two popular programming languages: C++ and Java. The current implementations interface with the well-known MedPost and Stanford natural language processing tool sets. The pipeline functionality includes sentence segmentation, tokenization, part-of-speech tagging, lemmatization and sentence parsing. These pipelines can be easily integrated along with other BioC programs into any BioC compliant text mining systems. As an application, we converted the NCBI disease corpus to BioC format, and the pipelines have successfully run on this corpus to demonstrate their functionality. Code and data can be downloaded from http://bioc.sourceforge.net. Database URL: http://bioc.sourceforge.net. © The Author(s) 2014. Published by Oxford University Press.

  5. A Computer Analysis Study of the Word Style in Love-songs of Tshang yang Gya tsho

    NASA Astrophysics Data System (ADS)

    Yonghong, Li; SunTing; Lei, Guo; Hongzhi, Yu

    Taking the 124 love-songs of Tshang yang Gya tsho as the object of study and using corpus-based statistical methods, this paper has set up principles of vocabulary segmentation and built a Tibetan love-songs corpus and a Tibetan-Chinese grammar separation lexicon corpus. It then carried out quantitative research on the achievement of the "love-songs" in the language arts from three aspects: the length of the vocabulary items, the frequency of the vocabulary items, and the distribution of the number of terms in the verses and the songs. In addition, it introduced a new research idea and method for the study of Tibetan literature.

  6. Replication Research in Pedagogical Approaches to Formulaic Sequences: Jones & Haywood (2004) and Alali & Schmitt (2012)

    ERIC Educational Resources Information Center

    Coxhead, Averil

    2018-01-01

    Research into the formulaic nature of language has grown in size and scale in the last 20 years or more, much of it based in corpus studies and involving the identification and categorisation of formulas. Research suggests that there are benefits for second and foreign language learners recognising formulaic sequences when listening and reading,…

  7. A Usage-Based Investigation of L2 Lexical Acquisition: The Role of Input and Output

    ERIC Educational Resources Information Center

    Crossley, Scott; Kyle, Kristopher; Salsbury, Thomas

    2016-01-01

    This study investigates relations between second language (L2) lexical input and output in terms of word information properties (i.e., lexical salience; Ellis, 2006a). The data for this study come from a longitudinal corpus of naturalistic spoken data between L2 learners and first language (L1) interlocutors collected over a year's time. The…

  8. Automated Measurement of Syntactic Complexity in Corpus-Based L2 Writing Research and Implications for Writing Assessment

    ERIC Educational Resources Information Center

    Lu, Xiaofei

    2017-01-01

    Research investigating corpora of English learners' language raises new questions about how syntactic complexity is defined theoretically and operationally for second language (L2) writing assessment. I show that syntactic complexity is important in construct definitions and L2 writing rating scales as well as in L2 writing research. I describe…

  9. Aspects of a Grammar of Makary Kotoko (Chadic, Cameroon)

    ERIC Educational Resources Information Center

    Allison, Sean David

    2012-01-01

    Makary Kotoko (MK), a Central Chadic B language, is spoken in the north of Cameroon just south of Lake Chad. Published works on MK to date include about a dozen articles on different aspects of the grammar of the language, primarily by H. Tourneux. The present work, which is based on a substantial corpus of recorded texts, is a systematic…

  10. A Corpus-Based Evaluation on Two Different English for Nursing Purposes (ENP) Course Books

    ERIC Educational Resources Information Center

    Mohamad, Alif Fairus Nor; Puteh, Sharifah Nor

    2017-01-01

    It is difficult for most second language learners in Malaysia to function proficiently in the English language due to limited vocabulary knowledge. It has also been challenging for TESL graduates to fit in as ENP teachers due to the lack of specialized vocabulary knowledge in the nursing field. Thus, a course book has always been a highly…

  11. Exploring Culture-Related Content in the COCA with Task-Based Activities in the EFL Classroom

    ERIC Educational Resources Information Center

    Lopes, António

    2013-01-01

    The Corpus of Contemporary American English (COCA) at the Brigham Young University website has been used in the English as a Foreign language (EFL) classroom to help learners better understand how language works at different levels of analysis and also to develop their writing skills. However, it also allows learners to explore culture-related…

  12. Machine learning-based coreference resolution of concepts in clinical documents

    PubMed Central

    Ware, Henry; Mullett, Charles J; El-Rawas, Oussama

    2012-01-01

    Objective Coreference resolution of concepts, although a very active area in the natural language processing community, has not yet been widely applied to clinical documents. Accordingly, the 2011 i2b2 competition focusing on this area is a timely and useful challenge. The objective of this research was to collate coreferent chains of concepts from a corpus of clinical documents. These concepts are in the categories of person, problems, treatments, and tests. Design A machine learning approach based on graphical models was employed to cluster coreferent concepts. Features selected were divided into domain independent and domain specific sets. Training was done with the i2b2 provided training set of 489 documents with 6949 chains. Testing was done on 322 documents. Results The learning engine, using the un-weighted average of three different measurement schemes, resulted in an F measure of 0.8423 where no domain specific features were included and 0.8483 where the feature set included both domain independent and domain specific features. Conclusion Our machine learning approach is a promising solution for recognizing coreferent concepts, which in turn is useful for practical applications such as the assembly of problem and medication lists from clinical documents. PMID:22582205
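
    The reported score is an unweighted average of F measures from three measurement schemes. The abstract does not name the schemes or give their individual scores, so the sketch below uses placeholder scheme names and made-up precision and recall values purely to show the arithmetic.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the balanced F measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def unweighted_average(scores):
    """Plain arithmetic mean: every scheme counts equally."""
    return sum(scores) / len(scores)


# Hypothetical (precision, recall) pairs for three placeholder schemes.
scheme_results = {
    "scheme_a": (0.90, 0.80),
    "scheme_b": (0.85, 0.85),
    "scheme_c": (0.80, 0.90),
}
per_scheme_f = [f_measure(p, r) for p, r in scheme_results.values()]
overall = unweighted_average(per_scheme_f)
```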

  13. The Effects of Utilizing Corpus Resources to Correct Collocation Errors in L2 Writing--Students' Performance, Corpus Use and Perceptions

    ERIC Educational Resources Information Center

    Wu, Yi-ju

    2016-01-01

    Data-Driven Learning (DDL), in which learners "confront [themselves] directly with the corpus data" (Johns, 2002, p. 108), has been shown to be effective in collocation learning in L2 writing. Nevertheless, there have been only a few research studies of this type examining the relationship between English proficiency and corpus consultation.…

  14. Using UAM Corpustool to Explore the Language of Evaluation in Interview Program

    ERIC Educational Resources Information Center

    Hu, Chunyu; Tan, Jinlin

    2017-01-01

    As an interactional encounter between a journalist and one or more newsworthy public figures, an interview program is a special type of discourse that is full of evaluative language. This paper sets out to explore evaluation in interview programs from the perspective of appraisal system. The corpus software used in this study is UAM CorpusTool…

  15. Utilizing Lexical Data from a Web-Derived Corpus to Expand Productive Collocation Knowledge

    ERIC Educational Resources Information Center

    Wu, Shaoqun; Witten, Ian H.; Franken, Margaret

    2010-01-01

    Collocations are of great importance for second language learners, and a learner's knowledge of them plays a key role in producing language fluently (Nation, 2001: 323). In this article we describe and evaluate an innovative system that uses a Web-derived corpus and digital library software to produce a vast concordance and present it in a way…

  16. Variability in Chinese as a Foreign Language Learners' Development of the Chinese Numeral Classifier System

    ERIC Educational Resources Information Center

    Zhang, Jie; Lu, Xiaofei

    2013-01-01

    This study examined variability in Chinese as a Foreign Language (CFL) learners' development of the Chinese numeral classifier system from a dynamic systems approach. Our data consisted of a longitudinal corpus of 657 essays written by CFL learners at lower and higher intermediate levels and a corpus of 100 essays written by native speakers (NSs)…

  17. A Simple Czech and English Probabilistic Tagger: A Comparison.

    ERIC Educational Resources Information Center

    Hladka, Barbora; Hajic, Jan

    An experiment compared the tagging of two languages: Czech, a highly inflected language with a high degree of ambiguity, and English. For Czech, the corpus was one gathered in the 1970s at the Czechoslovak Academy of Sciences; for English, it was the Wall Street Journal corpus. Results indicate 81.53 percent accuracy for Czech and 96.83 percent…

  18. A Comparison of the Effectiveness of EFL Students' Use of Dictionaries and an Online Corpus for the Enhancement of Revision Skills

    ERIC Educational Resources Information Center

    Mueller, Charles M.; Jacobsen, Natalia D.

    2016-01-01

    Qualitative research focusing primarily on advanced-proficiency second language (L2) learners suggests that online corpora can function as useful reference tools for language learners, especially when addressing phraseological issues. However, the feasibility and effectiveness of online corpus consultation for learners at a basic level of L2…

  19. Terminology model discovery using natural language processing and visualization techniques.

    PubMed

    Zhou, Li; Tao, Ying; Cimino, James J; Chen, Elizabeth S; Liu, Hongfang; Lussier, Yves A; Hripcsak, George; Friedman, Carol

    2006-12-01

    Medical terminologies are important for unambiguous encoding and exchange of clinical information. The traditional manual method of developing terminology models is time-consuming and limited in the number of phrases that a human developer can examine. In this paper, we present an automated method for developing medical terminology models based on natural language processing (NLP) and information visualization techniques. Surgical pathology reports were selected as the testing corpus for developing a pathology procedure terminology model. The use of a general NLP processor for the medical domain, MedLEE, provides an automated method for acquiring semantic structures from a free text corpus and sheds light on a new high-throughput method of medical terminology model development. The use of an information visualization technique supports the summarization and visualization of the large quantity of semantic structures generated from medical documents. We believe that a general method based on NLP and information visualization will facilitate the modeling of medical terminologies.

  20. A semi-supervised learning framework for biomedical event extraction based on hidden topics.

    PubMed

    Zhou, Deyu; Zhong, Dayou

    2015-05-01

    Scientists have devoted decades of efforts to understanding the interaction between proteins or RNA production. The information might empower the current knowledge on drug reactions or the development of certain diseases. Nevertheless, due to the lack of explicit structure, literature in life science, one of the most important sources of this information, prevents computer-based systems from accessing. Therefore, biomedical event extraction, automatically acquiring knowledge of molecular events in research articles, has attracted community-wide efforts recently. Most approaches are based on statistical models, requiring large-scale annotated corpora to precisely estimate models' parameters. However, it is usually difficult to obtain in practice. Therefore, employing un-annotated data based on semi-supervised learning for biomedical event extraction is a feasible solution and attracts more interests. In this paper, a semi-supervised learning framework based on hidden topics for biomedical event extraction is presented. In this framework, sentences in the un-annotated corpus are elaborately and automatically assigned with event annotations based on their distances to these sentences in the annotated corpus. More specifically, not only the structures of the sentences, but also the hidden topics embedded in the sentences are used for describing the distance. The sentences and newly assigned event annotations, together with the annotated corpus, are employed for training. Experiments were conducted on the multi-level event extraction corpus, a golden standard corpus. Experimental results show that more than 2.2% improvement on F-score on biomedical event extraction is achieved by the proposed framework when compared to the state-of-the-art approach. The results suggest that by incorporating un-annotated data, the proposed framework indeed improves the performance of the state-of-the-art event extraction system and the similarity between sentences might be precisely described by hidden topics and structures of the sentences. Copyright © 2015 Elsevier B.V. All rights reserved.

  1. Optimization of internet content filtering-Combined with KNN and OCAT algorithms

    NASA Astrophysics Data System (ADS)

    Guo, Tianze; Wu, Lingjing; Liu, Jiaming

    2018-04-01

    Faced with rampant illegal content on the Internet, the traditional ways of filtering information, keyword recognition and manual screening, are producing increasingly poor results. This paper therefore uses the OCAT algorithm nested with the KNN classification algorithm to construct a corpus training library that can dynamically learn and update, so that the filter corpus can keep pace with the constantly updated illegal content on the network, including text and pictures, and thus better filter and investigate illegal content and its source. Future research will focus on simplifying the updating of the recognition and comparison algorithms and on optimizing the learning ability of the corpus, in order to improve filtering efficiency and save time and resources.

  2. Pattern and Meaning across Genres and Disciplines: An Exploratory Study

    ERIC Educational Resources Information Center

    Groom, Nicholas

    2005-01-01

    Work in corpus linguistics has led to the development of a theory of language as "phraseology" [Hunston, S., & Francis, G. (1999). "Pattern grammar: A corpus-driven approach to the lexical grammar of English." Amsterdam: John Benjamins. Sinclair, J. M. (1991). "Corpus, concordance, collocation." Oxford: Oxford University Press. Sinclair, J. M.…

  3. Lexical bundles in an advanced INTOCSU writing class and engineering texts: A functional analysis

    NASA Astrophysics Data System (ADS)

    Alquraishi, Mohammed Abdulrahman

    The purpose of this study is to investigate the functions of lexical bundles in two corpora: a corpus of engineering academic texts and a corpus of IEP advanced writing class texts. This study is concerned with the nature of formulaic language in Pathway IEPs and engineering texts, and whether those types of texts show similar or distinctive formulaic functions. Moreover, the study looked into lexical bundles found in an engineering 1.26 million-word corpus and an ESL 65000-word corpus using a concordancing program. The study then analyzed the functions of those lexical bundles and compared them statistically using chi-square tests. Additionally, the results of this investigation showed 236 unique frequent lexical bundles in the engineering corpus and 37 bundles in the pathway corpus. Also, the study identified several differences between the density and functions of lexical bundles in the two corpora. These differences were evident in the distribution of functions of lexical bundles and the minimal overlap of lexical bundles found in the two corpora. The results of this study call for more attention to formulaic language at ESP and EAP programs.
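
    The two steps named in the abstract, finding frequent lexical bundles and comparing their functional distributions with a chi-square test, can be sketched as follows. This is an illustration only: the bundle length of four words, the frequency threshold, and the function counts are assumptions, since the abstract does not state them, and the code computes only the chi-square statistic, not a p-value.

```python
from collections import Counter


def lexical_bundles(tokens, n=4, min_count=2):
    """Return the n-word sequences that occur at least min_count times."""
    grams = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return {gram: count for gram, count in grams.items() if count >= min_count}


def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


tokens = "on the other hand we see on the other hand".split()
bundles = lexical_bundles(tokens)

# Hypothetical counts of bundles per function category in two corpora
# (rows: corpora, columns: two function categories).
statistic = chi_square([[10, 20], [20, 10]])
```

    With the toy sentence, the only sequence that repeats is "on the other hand". A real study would run the same counting over the full corpora and normalize frequencies for corpus size before comparing.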

  4. The differential role of phonological and distributional cues in grammatical categorisation.

    PubMed

    Monaghan, Padraic; Chater, Nick; Christiansen, Morten H

    2005-06-01

    Recognising the grammatical categories of words is a necessary skill for the acquisition of syntax and for on-line sentence processing. The syntactic and semantic context of the word contribute as cues for grammatical category assignment, but phonological cues, too, have been implicated as important sources of information. The value of phonological and distributional cues has not, with very few exceptions, been empirically assessed. This paper presents a series of analyses of phonological cues and distributional cues and their potential for distinguishing grammatical categories of words in corpus analyses. The corpus analyses indicated that phonological cues were more reliable for less frequent words, whereas distributional information was most valuable for high frequency words. We tested this prediction in an artificial language learning experiment, where the distributional and phonological cues of categories of nonsense words were varied. The results corroborated the corpus analyses. For high-frequency nonwords, distributional information was more useful, whereas for low-frequency words there was more reliance on phonological cues. The results indicate that phonological and distributional cues contribute differentially towards grammatical categorisation.

  5. Strategies for human-driven robot comprehension of spatial descriptions by older adults in a robot fetch task.

    PubMed

    Carlson, Laura; Skubic, Marjorie; Miller, Jared; Huo, Zhiyu; Alexenko, Tatiana

    2014-07-01

    This contribution presents a corpus of spatial descriptions and describes the development of a human-driven spatial language robot system for their comprehension. The domain of application is an eldercare setting in which an assistive robot is asked to "fetch" an object for an elderly resident based on a natural language spatial description given by the resident. In Part One, we describe a corpus of naturally occurring descriptions elicited from a group of older adults within a virtual 3D home that simulates the eldercare setting. We contrast descriptions elicited when participants offered descriptions to a human versus robot avatar, and under instructions to tell the addressee how to find the target versus where the target is. We summarize the key features of the spatial descriptions, including their dynamic versus static nature and the perspective adopted by the speaker. In Part Two, we discuss critical cognitive and perceptual processing capabilities necessary for the robot to establish a common ground with the human user and perform the "fetch" task. Based on the collected corpus, we focus here on resolving the perspective ambiguity and recognizing furniture items used as landmarks in the descriptions. Taken together, the work presented here offers the key building blocks of a robust system that takes as input natural spatial language descriptions and produces commands that drive the robot to successfully fetch objects within our eldercare scenario. Copyright © 2014 Cognitive Science Society, Inc.

  6. Disambiguating the species of biomedical named entities using natural language parsers

    PubMed Central

    Wang, Xinglong; Tsujii, Jun'ichi; Ananiadou, Sophia

    2010-01-01

    Motivation: Text mining technologies have been shown to reduce the laborious work involved in organizing the vast amount of information hidden in the literature. One challenge in text mining is linking ambiguous word forms to unambiguous biological concepts. This article reports on a comprehensive study on resolving the ambiguity in mentions of biomedical named entities with respect to model organisms and presents an array of approaches, with focus on methods utilizing natural language parsers. Results: We build a corpus for organism disambiguation where every occurrence of protein/gene entity is manually tagged with a species ID, and evaluate a number of methods on it. Promising results are obtained by training a machine learning model on syntactic parse trees, which is then used to decide whether an entity belongs to the model organism denoted by a neighbouring species-indicating word (e.g. yeast). The parser-based approaches are also compared with a supervised classification method and results indicate that the former are a more favorable choice when domain portability is of concern. The best overall performance is obtained by combining the strengths of syntactic features and supervised classification. Availability: The corpus and demo are available at http://www.nactem.ac.uk/deca_details/start.cgi, and the software is freely available as U-Compare components (Kano et al., 2009): NaCTeM Species Word Detector and NaCTeM Species Disambiguator. U-Compare is available at http://-compare.org/ Contact: xinglong.wang@manchester.ac.uk PMID:20053840

  7. Predicting Patterns of Grammatical Complexity across Language Exam Task Types and Proficiency Levels

    ERIC Educational Resources Information Center

    Biber, Douglas; Gray, Bethany; Staples, Shelley

    2016-01-01

    In the present article, we explore the extent to which previous research on register variation can be used to predict spoken/written task-type variation as well as differences across score levels in the context of a major standardized language exam (TOEFL iBT). Specifically, we carry out two sets of linguistic analyses based on a large corpus of…

  8. ContextD: an algorithm to identify contextual properties of medical terms in a Dutch clinical corpus.

    PubMed

    Afzal, Zubair; Pons, Ewoud; Kang, Ning; Sturkenboom, Miriam C J M; Schuemie, Martijn J; Kors, Jan A

    2014-11-29

    In order to extract meaningful information from electronic medical records, such as signs and symptoms, diagnoses, and treatments, it is important to take into account the contextual properties of the identified information: negation, temporality, and experiencer. Most work on automatic identification of these contextual properties has been done on English clinical text. This study presents ContextD, an adaptation of the English ConText algorithm to the Dutch language, and a Dutch clinical corpus. We created a Dutch clinical corpus containing four types of anonymized clinical documents: entries from general practitioners, specialists' letters, radiology reports, and discharge letters. Using a Dutch list of medical terms extracted from the Unified Medical Language System, we identified medical terms in the corpus with exact matching. The identified terms were annotated for negation, temporality, and experiencer properties. To adapt the ConText algorithm, we translated English trigger terms to Dutch and added several general and document specific enhancements, such as negation rules for general practitioners' entries and a regular expression based temporality module. The ContextD algorithm utilized 41 unique triggers to identify the contextual properties in the clinical corpus. For the negation property, the algorithm obtained an F-score from 87% to 93% for the different document types. For the experiencer property, the F-score was 99% to 100%. For the historical and hypothetical values of the temporality property, F-scores ranged from 26% to 54% and from 13% to 44%, respectively. The ContextD showed good performance in identifying negation and experiencer property values across all Dutch clinical document types. Accurate identification of the temporality property proved to be difficult and requires further work. The anonymized and annotated Dutch clinical corpus can serve as a useful resource for further algorithm development.
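
    The trigger-based idea behind ConText-style negation detection can be sketched very roughly as below. This is a strong simplification and not the ContextD implementation: the two Dutch trigger words, the fixed look-back window, and the function name are assumptions for illustration, and the abstract does not list the actual triggers or rules.

```python
# Hypothetical, deliberately tiny trigger list ("geen" = no, "niet" = not).
# The triggers ContextD actually uses are not given in the abstract.
NEGATION_TRIGGERS = {"geen", "niet"}


def is_negated(tokens, term_index, window=5):
    """Return True if a negation trigger appears shortly before the term.

    tokens      -- the sentence as a list of words
    term_index  -- position of the already-identified medical term
    window      -- how many preceding tokens to inspect (an assumption)
    """
    start = max(0, term_index - window)
    preceding = tokens[start:term_index]
    return any(token.lower() in NEGATION_TRIGGERS for token in preceding)


negated = is_negated("De patiënt heeft geen koorts".split(), 4)
affirmed = is_negated("De patiënt heeft koorts".split(), 3)
```

    In the first sentence the term "koorts" (fever) follows "geen" and is flagged as negated; in the second it is not. Temporality and experiencer would need their own trigger sets and, as the abstract notes, proved harder to identify.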

  9. Cross-language information retrieval using PARAFAC2.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bader, Brett William; Chew, Peter; Abdelali, Ahmed

    A standard approach to cross-language information retrieval (CLIR) uses Latent Semantic Analysis (LSA) in conjunction with a multilingual parallel aligned corpus. This approach has been shown to be successful in identifying similar documents across languages - or more precisely, retrieving the most similar document in one language to a query in another language. However, the approach has severe drawbacks when applied to a related task, that of clustering documents 'language-independently', so that documents about similar topics end up closest to one another in the semantic space regardless of their language. The problem is that documents are generally more similar to other documents in the same language than they are to documents in a different language, but on the same topic. As a result, when using multilingual LSA, documents will in practice cluster by language, not by topic. We propose a novel application of PARAFAC2 (which is a variant of PARAFAC, a multi-way generalization of the singular value decomposition [SVD]) to overcome this problem. Instead of forming a single multilingual term-by-document matrix which, under LSA, is subjected to SVD, we form an irregular three-way array, each slice of which is a separate term-by-document matrix for a single language in the parallel corpus. The goal is to compute an SVD for each language such that V (the matrix of right singular vectors) is the same across all languages. Effectively, PARAFAC2 imposes the constraint, not present in standard LSA, that the 'concepts' in all documents in the parallel corpus are the same regardless of language. Intuitively, this constraint makes sense, since the whole purpose of using a parallel corpus is that exactly the same concepts are expressed in the translations. We tested this approach by comparing the performance of PARAFAC2 with standard LSA in solving a particular CLIR problem. From our results, we conclude that PARAFAC2 offers a very promising alternative to LSA not only for multilingual document clustering, but also for solving other problems in cross-language information retrieval.

  10. Aligning Greek-English parallel texts

    NASA Astrophysics Data System (ADS)

    Galiotou, Eleni; Koronakis, George; Lazari, Vassiliki

    2015-02-01

    In this paper, we discuss issues concerning the alignment of parallel texts written in languages with different alphabets based on an experiment of aligning texts from the proceedings of the European Parliament in Greek and English. First, we describe our implementation of the k-vec algorithm and its application to the bilingual corpus. Then the output of the algorithm is used as a starting point for an alignment procedure at a sentence level which also takes into account mark-ups of meta-information. The results of the implementation are compared to those of the application of the Church and Gale alignment algorithm on the Europarl corpus. The conclusions of this comparison can give useful insights as for the efficiency of alignment algorithms when applied to the particular bilingual corpus.
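
    The core idea usually associated with the k-vec algorithm is to cut each text of a parallel pair into K segments, record for every word the segments in which it occurs, and score word pairs whose occurrence patterns coincide. The sketch below illustrates that idea under stated assumptions: the equal-size segmentation, the mutual-information score, and the toy sentences are illustrative choices and are not taken from the paper's implementation.

```python
import math


def occurrence_vector(tokens, word, k):
    """Binary vector: 1 if `word` occurs in the i-th of k equal segments."""
    size = max(1, math.ceil(len(tokens) / k))
    return [int(word in tokens[i * size:(i + 1) * size]) for i in range(k)]


def mutual_information(vec_s, vec_t):
    """Pointwise mutual information of two occurrence vectors (log base 2)."""
    k = len(vec_s)
    both = sum(1 for s, t in zip(vec_s, vec_t) if s and t)
    if both == 0:
        return float("-inf")  # the words never share a segment
    p_s = sum(vec_s) / k
    p_t = sum(vec_t) / k
    return math.log2((both / k) / (p_s * p_t))


# Toy parallel texts (hypothetical), split into K = 2 segments.
english = "the parliament votes today the council meets now".split()
greek = "το κοινοβούλιο ψηφίζει σήμερα το συμβούλιο συνεδριάζει τώρα".split()

parliament = occurrence_vector(english, "parliament", 2)
council = occurrence_vector(english, "council", 2)
koinovoulio = occurrence_vector(greek, "κοινοβούλιο", 2)

good_pair = mutual_information(parliament, koinovoulio)
bad_pair = mutual_information(council, koinovoulio)
```

    Because the comparison uses only segment positions, it does not depend on the two languages sharing an alphabet, which is the property that matters for a Greek-English pair. High-scoring word pairs can then serve as anchor points for a sentence-level alignment step.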

  11. An Analysis of Stative Verbs Used with the Progressive Aspect in Corpus-Informed Textbooks

    ERIC Educational Resources Information Center

    Belli, Serap Atasever

    2018-01-01

    This study was designed to investigate whether contemporary corpus-informed grammar textbooks written for English language learners and teachers presented the progressive use of stative verbs and if yes, which stative verbs were presented to occur with the progressive aspect and for which functions they took this aspect. A corpus of six electronic…

  12. Investigating L2 Spoken English through the Role Play Learner Corpus

    ERIC Educational Resources Information Center

    Nava, Andrea; Pedrazzini, Luciana

    2011-01-01

    We describe an exploratory study carried out within the University of Milan, Department of English, the aim of which was to analyse features of the spoken English of first-year Modern Languages undergraduates. We compiled a learner corpus, the "Role Play" corpus, which consisted of 69 role-play interactions in English carried out by…

  13. The Brazilian Portuguese Lexicon: An Instrument for Psycholinguistic Research

    PubMed Central

    Estivalet, Gustavo L.; Meunier, Fanny

    2015-01-01

    In this article, we present the Brazilian Portuguese Lexicon, a new word-based corpus for psycholinguistic and computational linguistic research in Brazilian Portuguese. We describe the corpus development, the specific characteristics on the internet site and database for user access. We also perform distributional analyses of the corpus and comparisons to other current databases. Our main objective was to provide a large, reliable, and useful word-based corpus with a dynamic, easy-to-use, and intuitive interface with free internet access for word and word-criteria searches. We used the Núcleo Interinstitucional de Linguística Computacional’s corpus as the basic data source and developed the Brazilian Portuguese Lexicon by deriving and adding metalinguistic and psycholinguistic information about Brazilian Portuguese words. We obtained a final corpus with more than 30 million word tokens, 215 thousand word types and 25 categories of information about each word. This corpus was made available on the internet via a free-access site with two search engines: a simple search and a complex search. The simple engine basically searches for a list of words, while the complex engine accepts all types of criteria in the corpus categories. The output result presents all entries found in the corpus with the criteria specified in the input search and can be downloaded as a .csv file. We created a module in the results that delivers basic statistics about each search. The Brazilian Portuguese Lexicon also provides a pseudoword engine and specific tools for linguistic and statistical analysis. Therefore, the Brazilian Portuguese Lexicon is a convenient instrument for stimulus search, selection, control, and manipulation in psycholinguistic experiments, and it is also a powerful database for computational linguistics research and language modeling related to lexicon distribution, functioning, and behavior. PMID:26630138

  14. The Brazilian Portuguese Lexicon: An Instrument for Psycholinguistic Research.

    PubMed

    Estivalet, Gustavo L; Meunier, Fanny

    2015-01-01

    In this article, we present the Brazilian Portuguese Lexicon, a new word-based corpus for psycholinguistic and computational linguistic research in Brazilian Portuguese. We describe the corpus development, the specific characteristics on the internet site and database for user access. We also perform distributional analyses of the corpus and comparisons to other current databases. Our main objective was to provide a large, reliable, and useful word-based corpus with a dynamic, easy-to-use, and intuitive interface with free internet access for word and word-criteria searches. We used the Núcleo Interinstitucional de Linguística Computacional's corpus as the basic data source and developed the Brazilian Portuguese Lexicon by deriving and adding metalinguistic and psycholinguistic information about Brazilian Portuguese words. We obtained a final corpus with more than 30 million word tokens, 215 thousand word types and 25 categories of information about each word. This corpus was made available on the internet via a free-access site with two search engines: a simple search and a complex search. The simple engine basically searches for a list of words, while the complex engine accepts all types of criteria in the corpus categories. The output result presents all entries found in the corpus with the criteria specified in the input search and can be downloaded as a .csv file. We created a module in the results that delivers basic statistics about each search. The Brazilian Portuguese Lexicon also provides a pseudoword engine and specific tools for linguistic and statistical analysis. Therefore, the Brazilian Portuguese Lexicon is a convenient instrument for stimulus search, selection, control, and manipulation in psycholinguistic experiments, and it is also a powerful database for computational linguistics research and language modeling related to lexicon distribution, functioning, and behavior.

  15. Language Policy and Planning in South America.

    ERIC Educational Resources Information Center

    Hornberger, Nancy H.

    1994-01-01

    A discussion of language policy formation and planning in South America focuses on the highland indigenous sectors and covers the following: colonial languages; immigrant languages; and indigenous languages, including planning, acquisition planning, and corpus planning. (Contains 83 references.) (LB)

  16. Identification of four class emotion from Indonesian spoken language using acoustic and lexical features

    NASA Astrophysics Data System (ADS)

    Kasyidi, Fatan; Puji Lestari, Dessi

    2018-03-01

    One important aspect of human-to-human communication is understanding the emotion of each party. Recently, interaction between humans and computers has continued to develop, especially affective interaction, in which emotion recognition is an important component. This paper presents our extended work on emotion recognition in spoken Indonesian to identify four main classes of emotion (Happy, Sad, Angry, and Contentment) using a combination of acoustic/prosodic features and lexical features. We construct an emotion speech corpus from Indonesian television talk shows, where the situations are as close as possible to natural situations. After constructing the emotion speech corpus, the acoustic/prosodic and lexical features are extracted to train the emotion model. We employ machine learning algorithms such as Support Vector Machine (SVM), Naive Bayes, and Random Forest to find the best model. On the test data, the best model, an SVM with an RBF kernel, achieves an F-measure of 0.447 using only the acoustic/prosodic features and 0.488 using both acoustic/prosodic and lexical features to recognize the four emotion classes.

  17. Language with Character: A Stratified Corpus Comparison of Individual Differences in E-Mail Communication

    ERIC Educational Resources Information Center

    Oberlander, Jon; Gill, Alastair J.

    2006-01-01

    To what extent does the wording and syntactic form of people's writing reflect their personalities? Using a bottom-up stratified corpus comparison, rather than the top-down content analysis techniques that have been used before, we examine a corpus of e-mail messages elicited from individuals of known personality, as measured by the Eysenck…

  18. Natural Language-based Machine Learning Models for the Annotation of Clinical Radiology Reports.

    PubMed

    Zech, John; Pain, Margaret; Titano, Joseph; Badgeley, Marcus; Schefflein, Javin; Su, Andres; Costa, Anthony; Bederson, Joshua; Lehar, Joseph; Oermann, Eric Karl

    2018-05-01

    Purpose To compare different methods for generating features from radiology reports and to develop a method to automatically identify findings in these reports. Materials and Methods In this study, 96 303 head computed tomography (CT) reports were obtained. The linguistic complexity of these reports was compared with that of alternative corpora. Head CT reports were preprocessed, and machine-analyzable features were constructed by using bag-of-words (BOW), word embedding, and Latent Dirichlet allocation-based approaches. Ultimately, 1004 head CT reports were manually labeled for findings of interest by physicians, and a subset of these were deemed critical findings. Lasso logistic regression was used to train models for physician-assigned labels on 602 of 1004 head CT reports (60%) using the constructed features, and the performance of these models was validated on a held-out 402 of 1004 reports (40%). Models were scored by area under the receiver operating characteristic curve (AUC), and aggregate AUC statistics were reported for (a) all labels, (b) critical labels, and (c) the presence of any critical finding in a report. Sensitivity, specificity, accuracy, and F1 score were reported for the best performing model's (a) predictions of all labels and (b) identification of reports containing critical findings. Results The best-performing model (BOW with unigrams, bigrams, and trigrams plus average word embeddings vector) had a held-out AUC of 0.966 for identifying the presence of any critical head CT finding and an average 0.957 AUC across all head CT findings. Sensitivity and specificity for identifying the presence of any critical finding were 92.59% (175 of 189) and 89.67% (191 of 213), respectively. Average sensitivity and specificity across all findings were 90.25% (1898 of 2103) and 91.72% (18 351 of 20 007), respectively. 
Simpler BOW methods achieved results competitive with those of more sophisticated approaches, with an average AUC for presence of any critical finding of 0.951 for unigram BOW versus 0.966 for the best-performing model. The Yule I of the head CT corpus was 34, markedly lower than that of the Reuters corpus (at 103) or I2B2 discharge summaries (at 271), indicating lower linguistic complexity. Conclusion Automated methods can be used to identify findings in radiology reports. The success of this approach benefits from the standardized language of these reports. With this method, a large labeled corpus can be generated for applications such as deep learning. © RSNA, 2018 Online supplemental material is available for this article.
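
    The percentages reported in this abstract follow directly from the counts given in parentheses. The short check below recomputes them; the helper name and variable names are chosen here for illustration, and only the counts come from the abstract.

```python
def percentage(numerator, denominator):
    """Return numerator / denominator as a percentage rounded to two decimals."""
    return round(100 * numerator / denominator, 2)


# Presence of any critical finding (held-out reports).
sensitivity_critical = percentage(175, 189)   # critical reports correctly flagged
specificity_critical = percentage(191, 213)   # non-critical reports correctly passed

# Averages across all findings, as reported.
sensitivity_all = percentage(1898, 2103)
specificity_all = percentage(18351, 20007)

# The two denominators for the critical-finding task add up to the held-out set.
held_out_reports = 189 + 213
```

    The denominators 189 and 213 sum to 402, the size of the held-out set, and 602 + 402 gives the 1004 manually labeled reports.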

  19. MorphoSaurus--design and evaluation of an interlingua-based, cross-language document retrieval engine for the medical domain.

    PubMed

    Markó, K; Schulz, S; Hahn, U

    2005-01-01

    We propose an interlingua-based indexing approach to account for the particular challenges that arise in the design and implementation of cross-language document retrieval systems for the medical domain. Documents, as well as queries, are mapped to a language-independent conceptual layer on which retrieval operations are performed. We contrast this approach with the direct translation of German queries to English ones which, subsequently, are matched against English documents. We evaluate both approaches, interlingua-based and direct translation, on a large medical document collection, the OHSUMED corpus. A substantial benefit for interlingua-based document retrieval using German queries on English texts is found, which amounts to 93% of the (monolingual) English baseline. Most state-of-the-art cross-language information retrieval systems translate user queries to the language(s) of the target documents. In contra-distinction to this approach, translating both documents and user queries into a language-independent, concept-like representation format is more beneficial to enhance cross-language retrieval performance.

  20. Literacy Practices in Computer-Mediated Communication in Hong Kong.

    ERIC Educational Resources Information Center

    Lee, Carmen

    2002-01-01

    Examines linguistic features of text-based computer-mediated communication (CMC) in Hong Kong. The study is based on a 70,000-word corpus of electronic mail and ICQ instant messaging texts, which were collected from students in Hong Kong. Identified language-specific features that may be seen as new literacy practices within the theoretical…

  1. Acceptability of Dative Argument Structure in Spanish: Assessing Semantic and Usage-Based Factors

    ERIC Educational Resources Information Center

    Reali, Florencia

    2017-01-01

    Multiple constraints, including semantic, lexical, and usage-based factors, have been shown to influence dative alternation across different languages. This work explores whether fine-grained statistics and semantic properties of the verb affect the acceptability of dative constructions in Spanish. First, a corpus analysis reveals that verbs of…

  2. Lexical Variation and Change in British Sign Language

    PubMed Central

    Stamp, Rose; Schembri, Adam; Fenlon, Jordan; Rentelis, Ramas; Woll, Bencie; Cormier, Kearsy

    2014-01-01

    This paper presents results from a corpus-based study investigating lexical variation in BSL. An earlier study investigating variation in BSL numeral signs found that younger signers were using a decreasing variety of regionally distinct variants, suggesting that levelling may be taking place. Here, we report findings from a larger investigation looking at regional lexical variants for colours, countries, numbers and UK placenames elicited as part of the BSL Corpus Project. Age, school location and language background were significant predictors of lexical variation, with younger signers using a more levelled variety. This change appears to be happening faster in particular sub-groups of the deaf community (e.g., signers from hearing families). Also, we find that for the names of some UK cities, signers from outside the region use a different sign than those who live in the region. PMID:24759673

  3. Extracting microRNA-gene relations from biomedical literature using distant supervision

    PubMed Central

    Lamurias, Andre; Clarke, Luka A.; Couto, Francisco M.

    2017-01-01

    Many biomedical relation extraction approaches are based on supervised machine learning, requiring an annotated corpus. Distant supervision aims at training a classifier by combining a knowledge base with a corpus, reducing the amount of manual effort necessary. This is particularly useful for biomedicine because many databases and ontologies have been made available for many biological processes, while the availability of annotated corpora is still limited. We studied the extraction of microRNA-gene relations from text. MicroRNA regulation is an important biological process due to its close association with human diseases. The proposed method, IBRel, is based on distantly supervised multi-instance learning. We evaluated IBRel on three datasets, and the results were compared with a co-occurrence approach as well as a supervised machine learning algorithm. While supervised learning outperformed on two of those datasets, IBRel obtained an F-score 28.3 percentage points higher on the dataset for which there was no training set developed specifically. To demonstrate the applicability of IBRel, we used it to extract 27 miRNA-gene relations from recently published papers about cystic fibrosis. Our results demonstrate that our method can be successfully used to extract relations from literature about a biological process without an annotated corpus. The source code and data used in this study are available at https://github.com/AndreLamurias/IBRel. PMID:28263989

  4. Extracting microRNA-gene relations from biomedical literature using distant supervision.

    PubMed

    Lamurias, Andre; Clarke, Luka A; Couto, Francisco M

    2017-01-01

    Many biomedical relation extraction approaches are based on supervised machine learning, requiring an annotated corpus. Distant supervision aims at training a classifier by combining a knowledge base with a corpus, reducing the amount of manual effort necessary. This is particularly useful for biomedicine because many databases and ontologies have been made available for many biological processes, while the availability of annotated corpora is still limited. We studied the extraction of microRNA-gene relations from text. MicroRNA regulation is an important biological process due to its close association with human diseases. The proposed method, IBRel, is based on distantly supervised multi-instance learning. We evaluated IBRel on three datasets, and the results were compared with a co-occurrence approach as well as a supervised machine learning algorithm. While supervised learning outperformed on two of those datasets, IBRel obtained an F-score 28.3 percentage points higher on the dataset for which there was no training set developed specifically. To demonstrate the applicability of IBRel, we used it to extract 27 miRNA-gene relations from recently published papers about cystic fibrosis. Our results demonstrate that our method can be successfully used to extract relations from literature about a biological process without an annotated corpus. The source code and data used in this study are available at https://github.com/AndreLamurias/IBRel.

  5. Combining active learning and semi-supervised learning techniques to extract protein interaction sentences.

    PubMed

    Song, Min; Yu, Hwanjo; Han, Wook-Shin

    2011-11-24

    Protein-protein interaction (PPI) extraction has been a focal point of many biomedical research and database curation tools. Both Active Learning and Semi-supervised SVMs have recently been applied to extract PPI automatically. In this paper, we explore combining the AL with the SSL to improve the performance of the PPI task. We propose a novel PPI extraction technique called PPISpotter by combining Deterministic Annealing-based SSL and an AL technique to extract protein-protein interaction. In addition, we extract a comprehensive set of features from MEDLINE records by Natural Language Processing (NLP) techniques, which further improve the SVM classifiers. In our feature selection technique, syntactic, semantic, and lexical properties of text are incorporated into feature selection that boosts the system performance significantly. By conducting experiments with three different PPI corpuses, we show that PPISpotter is superior to the other techniques incorporated into semi-supervised SVMs such as Random Sampling, Clustering, and Transductive SVMs by precision, recall, and F-measure. Our system is a novel, state-of-the-art technique for efficiently extracting protein-protein interaction pairs.

  6. A unified approach for development of Urdu Corpus for OCR and demographic purpose

    NASA Astrophysics Data System (ADS)

    Choudhary, Prakash; Nain, Neeta; Ahmed, Mushtaq

    2015-02-01

    This paper presents a methodology for developing an Urdu handwritten text image corpus and for applying corpus linguistics to OCR and to information retrieval from handwritten documents. Compared to other scripts, Urdu script is somewhat complicated for data entry: entering a single character requires a combination of multiple keystrokes. Here, a mixed approach is proposed and demonstrated for building an Urdu corpus for OCR and demographic data collection. The demographic part of the database could be used to train a system to fetch data automatically, which would simplify the manual data processing currently involved in data collection from input forms such as Passport, Ration Card, Voting Card, AADHAR, Driving licence, Indian Railway Reservation, and Census data. This would increase the participation of the Urdu language community in understanding and benefiting from Government schemes. To make the database available and applicable across a broad area of corpus linguistics, we propose a methodology for data collection, mark-up, digital transcription, and XML metadata information for benchmarking.

  7. White matter structure changes as adults learn a second language.

    PubMed

    Schlegel, Alexander A; Rudelson, Justin J; Tse, Peter U

    2012-08-01

    Traditional models hold that the plastic reorganization of brain structures occurs mainly during childhood and adolescence, leaving adults with limited means to learn new knowledge and skills. Research within the last decade has begun to overturn this belief, documenting changes in the brain's gray and white matter as healthy adults learn simple motor and cognitive skills [Lövdén, M., Bodammer, N. C., Kühn, S., Kaufmann, J., Schütze, H., Tempelmann, C., et al. Experience-dependent plasticity of white-matter microstructure extends into old age. Neuropsychologia, 48, 3878-3883, 2010; Taubert, M., Draganski, B., Anwander, A., Müller, K., Horstmann, A., Villringer, A., et al. Dynamic properties of human brain structure: Learning-related changes in cortical areas and associated fiber connections. The Journal of Neuroscience, 30, 11670-11677, 2010; Scholz, J., Klein, M. C., Behrens, T. E. J., & Johansen-Berg, H. Training induces changes in white-matter architecture. Nature Neuroscience, 12, 1370-1371, 2009; Draganski, B., Gaser, C., Busch, V., Schuirer, G., Bogdahn, U., & May, A. Changes in grey matter induced by training. Nature, 427, 311-312, 2004]. Although the significance of these changes is not fully understood, they reveal a brain that remains plastic well beyond early developmental periods. Here we investigate the role of adult structural plasticity in the complex, long-term learning process of foreign language acquisition. We collected monthly diffusion tensor imaging scans of 11 English speakers who took a 9-month intensive course in written and spoken Modern Standard Chinese as well as from 16 control participants who did not study a language. We show that white matter reorganizes progressively across multiple sites as adults study a new language. Language learners exhibited progressive changes in white matter tracts associated with traditional left hemisphere language areas and their right hemisphere analogs. 
Surprisingly, the most significant changes occurred in frontal lobe tracts crossing the genu of the corpus callosum, a region not generally included in current neural models of language processing. These results indicate that plasticity of white matter plays an important role in adult language learning and additionally demonstrate the potential of longitudinal diffusion tensor imaging as a new tool to yield insights into cognitive processes.

  8. Pulling It Together: Using Integrative Assignments as Empirical Direct Measures of Student Learning for Learning Community Program Assessment

    ERIC Educational Resources Information Center

    Huerta, Juan Carlos; Sperry, Rita

    2013-01-01

    This article outlines a systematic and manageable method for learning community program assessment based on collecting empirical direct measures of student learning. Developed at Texas A&M University--Corpus Christi where all full-time, first-year students are in learning communities, the approach ties integrative assignment design to a rubric…

  9. Language Planning Orientations and Bilingual Education in Peru.

    ERIC Educational Resources Information Center

    Hornberger, Nancy H.

    1988-01-01

    Considers the status and corpus planning aspects of three of Peru's Quechua policies in light of the language planning orientations of language-as-problem, language-as-right, and language-as-resource. Current Quechua/Spanish bilingual education recognizes the rights of Quechua speakers and the role of the language as a national resource.…

  10. Gendered Language in Interactive Discourse

    ERIC Educational Resources Information Center

    Hussey, Karen A.; Katz, Albert N.; Leith, Scott A.

    2015-01-01

    Over two studies, we examined the nature of gendered language in interactive discourse. In the first study, we analyzed gendered language from a chat corpus to see whether tokens of gendered language proposed in the gender-as-culture hypothesis (Maltz and Borker in "Language and social identity." Cambridge University Press, Cambridge, pp…

  11. Language Policy and Pedagogy: Essays in Honor of A. Ronald Walton.

    ERIC Educational Resources Information Center

    Lambert, Richard D., Ed.; Shohamy, Elana, Ed.

    This edited volume brings together 14 diverse articles dealing with various aspects of language policy and pedagogy. Chapter titles include the following: "Language Practice, Language Ideology, and Language Policy" (Bernard Spolsky and Elana Shohamy); "The Status Agenda in Corpus Planning" (Joshua A. Fishman); "The Way…

  12. Working Together: Contributions of Corpus Analyses and Experimental Psycholinguistics to Understanding Conversation.

    PubMed

    Meyer, Antje S; Alday, Phillip M; Decuyper, Caitlin; Knudsen, Birgit

    2018-01-01

    As conversation is the most important way of using language, linguists and psychologists should combine forces to investigate how interlocutors deal with the cognitive demands arising during conversation. Linguistic analyses of corpora of conversation are needed to understand the structure of conversations, and experimental work is indispensable for understanding the underlying cognitive processes. We argue that joint consideration of corpus and experimental data is most informative when the utterances elicited in a lab experiment match those extracted from a corpus in relevant ways. This requirement to compare like with like seems obvious but is not trivial to achieve. To illustrate this approach, we report two experiments where responses to polar (yes/no) questions were elicited in the lab and the response latencies were compared to gaps between polar questions and answers in a corpus of conversational speech. We found, as expected, that responses were given faster when they were easy to plan and planning could be initiated earlier than when they were harder to plan and planning was initiated later. Overall, in all but one condition, the latencies were longer than one would expect based on the analyses of corpus data. We discuss the implication of this partial match between the data sets and more generally how corpus and experimental data can best be combined in studies of conversation.

  13. The Role of Guided Induction in Paper-Based Data-Driven Learning

    ERIC Educational Resources Information Center

    Smart, Jonathan

    2014-01-01

    This study examines the role of guided induction as an instructional approach in paper-based data-driven learning (DDL) in the context of an ESL grammar course during an intensive English program at an American public university. Specifically, it examines whether corpus-informed grammar instruction is more effective through inductive, data-driven…

  14. Semantic ambiguity effects on traditional Chinese character naming: A corpus-based approach.

    PubMed

    Chang, Ya-Ning; Lee, Chia-Ying

    2017-11-09

    Words are considered semantically ambiguous if they have more than one meaning and can be used in multiple contexts. A number of recent studies have provided objective ambiguity measures by using a corpus-based approach and have demonstrated ambiguity advantages in both naming and lexical decision tasks. Although the predictive power of objective ambiguity measures has been examined in several alphabetic language systems, the effects in logographic languages remain unclear. Moreover, most ambiguity measures do not explicitly address how the various contexts associated with a given word relate to each other. To explore these issues, we computed the contextual diversity (Adelman, Brown, & Quesada, Psychological Science, 17; 814-823, 2006) and semantic ambiguity (Hoffman, Lambon Ralph, & Rogers, Behavior Research Methods, 45; 718-730, 2013) of traditional Chinese single-character words based on the Academia Sinica Balanced Corpus, where contextual diversity was used to evaluate the present semantic space. We then derived a novel ambiguity measure, namely semantic variability, by computing the distance properties of the distinct clusters grouped by the contexts that contained a given word. We demonstrated that semantic variability was superior to semantic diversity in accounting for the variance in naming response times, suggesting that considering the substructure of the various contexts associated with a given word can provide a relatively fine scale of ambiguity information for a word. All of the context and ambiguity measures for 2,418 Chinese single-character words are provided as supplementary materials.
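
    To make the distinction between raw frequency and contextual diversity concrete, the sketch below counts, for each word, how often it occurs overall and in how many distinct documents it appears, which is the usual definition of contextual diversity. It does not implement the semantic variability measure introduced in the abstract, whose details are not given there; the function names and the toy English documents are placeholders chosen here.

```python
from collections import Counter


def word_frequency(documents):
    """Total number of occurrences of each word across all documents."""
    freq = Counter()
    for doc in documents:
        freq.update(doc)
    return freq


def contextual_diversity(documents):
    """Number of distinct documents (contexts) in which each word occurs."""
    diversity = Counter()
    for doc in documents:
        diversity.update(set(doc))  # count each word once per document
    return diversity


# Toy tokenized documents (hypothetical data).
documents = [
    ["bank", "river", "bank"],
    ["bank", "money"],
    ["river", "boat"],
    ["bank", "loan"],
]

freq = word_frequency(documents)
diversity = contextual_diversity(documents)
```

    Here "bank" occurs four times but in only three documents, so the two measures diverge as soon as a word is repeated within a single context.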

  15. Using Network-Based Language Analysis to Bridge Expertise and Cultivate Sensitivity to Differentiated Language Use in Interdisciplinary Geoscience Research

    NASA Astrophysics Data System (ADS)

    Hannah, M. A.; Simeone, M.

    2017-12-01

    On interdisciplinary teams, expertise is varied, as is evidenced by differences in team members' language use. Developing strategies to combine that expertise and bridge differentiated language practices is especially difficult between geoscience subdisciplines, as researchers assume they use a shared language (vocabulary, jargon, codes, linguistic styles). In our paper, we discuss a network-based approach used to identify varied expertise and language practices between geoscientists (n=29) on an NSF team funded to study how deep and surface Earth processes worked together to give rise to the Great Oxygenation Event. We describe how we modeled the team's expertise from a language corpus consisting of 220 oxygen-related terms frequently used by team members and then compared their understanding of the terms to develop interventions to bridge the team's expertise. Corpus terms were identified via team member interviews, observations of members' interactions at research meetings, and discourse analysis of members' publications. Comparisons of members' language use were based on a Likert scale survey that asked members to assess how they understood a term; how frequently they used a term; and whether they conceptualized a term as an object or process. Rather than use our method as a communication audit tool (Zwijze-Koning & de Jong, 2015), teams can proactively use it in a project's early stages to assess the contours of the team's differentiated expertise and show where specialized knowledge resides in the team, where latent or non-obvious expertise exists, where expertise overlaps, and where gaps are in the team's knowledge. With this information, teams can make evidence-based recommendations to forward their work such as allocating resources; identifying and empowering members to serve as connectors and lead cross-functional project initiatives; and developing strategies to avoid communication barriers. 
The method also generates models for teaching language sensitivity to subdisciplinary colleagues by making visible the nuanced ways they use language to organize and communicate their research. Ultimately, understanding the impact of differentiated language use is an unmet need in Earth science research, and our method offers a unique way to visualize and understand how such use impacts team communication.

  16. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus.

    PubMed

    Stubbs, Amber; Uzuner, Özlem

    2015-12-01

    The 2014 i2b2/UTHealth natural language processing shared task featured a track focused on the de-identification of longitudinal medical records. For this track, we de-identified a set of 1304 longitudinal medical records describing 296 patients. This corpus was de-identified under a broad interpretation of the HIPAA guidelines using double-annotation followed by arbitration, rounds of sanity checking, and proof reading. The average token-based F1 measure for the annotators compared to the gold standard was 0.927. The resulting annotations were used both to de-identify the data and to set the gold standard for the de-identification track of the 2014 i2b2/UTHealth shared task. All annotated private health information were replaced with realistic surrogates automatically and then read over and corrected manually. The resulting corpus is the first of its kind made available for de-identification research. This corpus was first used for the 2014 i2b2/UTHealth shared task, during which the systems achieved a mean F-measure of 0.872 and a maximum F-measure of 0.964 using entity-based micro-averaged evaluations. Copyright © 2015 Elsevier Inc. All rights reserved.
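
    For readers unfamiliar with the token-based F1 measure quoted in this abstract, the sketch below computes precision, recall and F1 over sets of annotated tokens. It assumes each annotation is identified by a (record id, token index) pair and that a match requires exact agreement; the shared task's actual matching rules are not given in the abstract, and the function name and toy annotations are hypothetical.

```python
def token_f1(gold, predicted):
    """Return (precision, recall, F1) for two sets of annotated tokens."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy annotations: (record id, token index) pairs marked as private information.
gold = {("rec1", 3), ("rec1", 4), ("rec1", 9), ("rec2", 0)}
predicted = {("rec1", 3), ("rec1", 4), ("rec2", 0), ("rec2", 7)}

precision, recall, f1 = token_f1(gold, predicted)
```

    An entity-based evaluation, which the abstract reports for the participating systems, would instead compare whole annotated spans, so a partially matched name would count as an error rather than as a partial hit.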

  17. Numerical morphology supports early number word learning: Evidence from a comparison of young Mandarin and English learners

    PubMed Central

    Le Corre, Mathieu; Li, Peggy; Huang, Becky H.; Jia, Gisela; Carey, Susan

    2016-01-01

    Previous studies showed that children learning a language with an obligatory singular/plural distinction (Russian and English) learn the meaning of the number word for one earlier than children learning Japanese, a language without obligatory number morphology (Barner, Libenson, Cheung, & Takasaki, 2009; Sarnecka, Kamenskaya, Yamana, Ogura, & Yudovina, 2007). This can be explained by differences in number morphology, but it can also be explained by many other differences between the languages and the environments of the children who were compared. The present study tests the hypothesis that the morphological singular/plural distinction supports the early acquisition of the meaning of the number word for one by comparing young English learners to age and SES matched young Mandarin Chinese learners. Mandarin does not have obligatory number morphology but is more similar to English than Japanese in many crucial respects. Corpus analyses show that, compared to English learners, Mandarin learners hear number words more frequently, are more likely to hear number words followed by a noun, and are more likely to hear number words in contexts where they denote a cardinal value. Two tasks show that, despite these advantages, Mandarin learners learn the meaning of the number word for one three to six months later than do English learners. These results provide the strongest evidence to date that prior knowledge of the numerical meaning of the distinction between singular and plural supports the acquisition of the meaning of the number word for one. PMID:27423486

  18. Une approche plurilingue pour faciliter l'inclusion scolaire : engagement et dynamique pédagogique

    NASA Astrophysics Data System (ADS)

    Ribierre-Dubile, Nathalie

    2017-08-01

    A multilingual approach to facilitating inclusion: Educational commitment and dynamics - In the school context, the first steps in the process of learning a foreign language require commitment and motivation on the part of the learner, as well as a commitment from the teacher to include all students. This raises questions about the inclusiveness of education and the educational achievement of multilingual and/or immigrant students in predominantly monolingual classes. The author draws on a corpus of research to explore a number of parameters involved in the implementation of a multilingual and inclusive approach. She links the foundations of a multilingual approach to the institutional framework and the positions of the actors in the didactic relationship, as well as in their relationship to languages. The article then gives an overview of the characteristics of the metacognitive strategies employed by multilingual learners and, in conclusion, proposes some innovative methods to go beyond the monolingual principles in learning and foster exchanges that are both multicultural and multilingual.

  19. SCOPIC Design and Overview

    ERIC Educational Resources Information Center

    Barth, Danielle; Evans, Nicholas

    2017-01-01

    This paper provides an overview of the design and motivation for creating the Social Cognition Parallax Interview Corpus (SCOPIC), an open-ended, accessible corpus that balances the need for language-specific annotation with typologically-calibrated markup. SCOPIC provides richly annotated data, focusing on functional categories relevant to social…

  20. Experiments in automatic word class and word sense identification for information retrieval

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gauch, S.; Futrelle, R.P.

    Automatic identification of related words and automatic detection of word senses are two long-standing goals of researchers in natural language processing. Word class information and word sense identification may enhance the performance of information retrieval systems. Large online corpora and increased computational capabilities make new techniques based on corpus linguistics feasible. Corpus-based analysis is especially needed for corpora from specialized fields for which no electronic dictionaries or thesauri exist. The methods described here use a combination of mutual information and word context to establish word similarities. Then, unsupervised classification is done using clustering in the word space, identifying word classes without pretagging. We also describe an extension of the method to handle the difficult problems of disambiguation and of determining part-of-speech and semantic information for low-frequency words. The method is powerful enough to produce high-quality results on a small corpus of 200,000 words from abstracts in a field of molecular biology.
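
    The mutual-information statistic mentioned in this abstract can be illustrated with a minimal sketch. It assumes pointwise mutual information over adjacent word pairs; the function name `pmi_table`, the bigram window, and the toy token list are illustrative assumptions, not the authors' actual method or data.

```python
import math
from collections import Counter


def pmi_table(tokens):
    """Pointwise mutual information (base 2) for each adjacent word pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)        # number of word positions
    n_bi = len(tokens) - 1     # number of adjacent pairs
    table = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI > 0 means the pair co-occurs more often than chance predicts.
        table[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return table


# Toy corpus (hypothetical), far smaller than the 200,000 words in the abstract.
tokens = "the gene encodes a protein and the gene binds a protein".split()
scores = pmi_table(tokens)
```

    Pairs with high scores are candidates for related words; a clustering step such as the one the abstract describes would then operate on vectors built from these scores.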

  1. Active learning-based information structure analysis of full scientific articles and two applications for biomedical literature review.

    PubMed

    Guo, Yufan; Silins, Ilona; Stenius, Ulla; Korhonen, Anna

    2013-06-01

    Techniques that are capable of automatically analyzing the information structure of scientific articles could be highly useful for improving information access to biomedical literature. However, most existing approaches rely on supervised machine learning (ML) and substantial labeled data that are expensive to develop and apply to different sub-fields of biomedicine. Recent research shows that minimal supervision is sufficient for fairly accurate information structure analysis of biomedical abstracts. However, is it realistic for full articles given their high linguistic and informational complexity? We introduce and release a novel corpus of 50 biomedical articles annotated according to the Argumentative Zoning (AZ) scheme, and investigate active learning with one of the most widely used ML models, Support Vector Machines (SVM), on this corpus. Additionally, we introduce two novel applications that use AZ to support real-life literature review in biomedicine via question answering and summarization. We show that active learning with SVM trained on 500 labeled sentences (6% of the corpus) performs surprisingly well with an accuracy of 82%, just 2% lower than fully supervised learning. In our question answering task, biomedical researchers find relevant information significantly faster from AZ-annotated than unannotated articles. In the summarization task, sentences extracted from particular zones are significantly more similar to gold standard summaries than those extracted from particular sections of full articles. These results demonstrate that active learning of full articles' information structure is indeed realistic and the accuracy is high enough to support real-life literature review in biomedicine. The annotated corpus, our AZ classifier and the two novel applications are available at http://www.cl.cam.ac.uk/yg244/12bioinfo.html

  2. Natural Language Processing Techniques for Extracting and Categorizing Finding Measurements in Narrative Radiology Reports.

    PubMed

    Sevenster, M; Buurman, J; Liu, P; Peters, J F; Chang, P J

    2015-01-01

    Accumulating quantitative outcome parameters may contribute to constructing a healthcare organization in which outcomes of clinical procedures are reproducible and predictable. In imaging studies, measurements are the principal category of quantitative parameters. The purpose of this work is to develop and evaluate two natural language processing engines that extract finding and organ measurements from narrative radiology reports and to categorize extracted measurements by their "temporality". The measurement extraction engine is developed as a set of regular expressions. The engine was evaluated against a manually created ground truth. Automated categorization of measurement temporality is defined as a machine learning problem. A ground truth was manually developed based on a corpus of radiology reports. A maximum entropy model was created using features that characterize the measurement itself and its narrative context. The model was evaluated in a ten-fold cross validation protocol. The measurement extraction engine has precision 0.994 and recall 0.991. Accuracy of the measurement classification engine is 0.960. The work contributes to machine understanding of radiology reports and may find application in software applications that process medical data.
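
    A regular-expression measurement extractor of the kind this abstract describes might look like the sketch below. The pattern, the function name `extract_measurements`, the supported units (mm, cm), and the sample sentence are all hypothetical; the authors' actual expressions are not given in the abstract.

```python
import re

# Hypothetical pattern: one to three numbers separated by "x", followed by a unit.
MEASUREMENT = re.compile(
    r"(\d+(?:\.\d+)?(?:\s*x\s*\d+(?:\.\d+)?){0,2})\s*(mm|cm)\b",
    re.IGNORECASE,
)


def extract_measurements(report):
    """Return a list of (values, unit) pairs found in a report sentence."""
    results = []
    for match in MEASUREMENT.finditer(report):
        # Split "1.2 x 0.8" into its individual dimensions.
        parts = re.split(r"\s*x\s*", match.group(1), flags=re.IGNORECASE)
        values = [float(p) for p in parts]
        results.append((values, match.group(2).lower()))
    return results


# Invented example sentence, not taken from the study's corpus.
report = "Nodule in the right lower lobe measures 1.2 x 0.8 cm, previously 9 mm."
found = extract_measurements(report)
```

    A temporality classifier such as the one in the abstract would then take each extracted pair together with its surrounding words (for example "previously") as features.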

  3. Rearticulating the Case for Micro Language Planning in a Language Ecology Context

    ERIC Educational Resources Information Center

    Baldauf, Richard B., Jr.

    2006-01-01

    Language planning is normally thought of in terms of large-scale, usually national planning, often undertaken by governments and meant to influence, if not change, ways of speaking or literacy practices within a society. It normally encompasses four aspects: status planning (about society), corpus planning (about language), language-in-education…

  4. EFL Students' Perceptions of Corpus-Tools as Writing References

    ERIC Educational Resources Information Center

    Lai, Shu-Li

    2015-01-01

    A number of studies have suggested the potential of corpus tools in vocabulary learning. However, there are still some concerns. Corpus tools might be too complicated to use; example sentences retrieved from corpus tools might be too difficult to understand; processing a large number of sample sentences could be challenging and time-consuming;…

  5. Pediatric traumatic brain injury: language outcomes and their relationship to the arcuate fasciculus.

    PubMed

    Liégeois, Frédérique J; Mahony, Kate; Connelly, Alan; Pigdon, Lauren; Tournier, Jacques-Donald; Morgan, Angela T

    2013-12-01

    Pediatric traumatic brain injury (TBI) may result in long-lasting language impairments alongside dysarthria, a motor-speech disorder. Whether this co-morbidity is due to the functional links between speech and language networks, or to widespread damage affecting both motor and language tracts, remains unknown. Here we investigated language function and diffusion metrics (using diffusion-weighted tractography) within the arcuate fasciculus, the uncinate fasciculus, and the corpus callosum in 32 young people after TBI (approximately half with dysarthria) and age-matched healthy controls (n=17). Only participants with dysarthria showed impairments in language, affecting sentence formulation and semantic association. In the whole TBI group, sentence formulation was best predicted by combined corpus callosum and left arcuate volumes, suggesting this "dual blow" seriously reduces the potential for functional reorganisation. Word comprehension was predicted by fractional anisotropy in the right arcuate. The co-morbidity between dysarthria and language deficits therefore seems to be the consequence of multiple tract damage. Copyright © 2013 Elsevier Inc. All rights reserved.

  6. The syntactic complexity of Russian relative clauses

    PubMed Central

    Fedorenko, Evelina; Gibson, Edward

    2012-01-01

    Although syntactic complexity has been investigated across dozens of studies, the available data still greatly underdetermine relevant theories of processing difficulty. Memory-based and expectation-based theories make opposite predictions regarding fine-grained time course of processing difficulty in syntactically constrained contexts, and each class of theory receives support from results on some constructions in some languages. Here we report four self-paced reading experiments on the online comprehension of Russian relative clauses together with related corpus studies, taking advantage of Russian’s flexible word order to disentangle predictions of competing theories. We find support for key predictions of memory-based theories in reading times at RC verbs, and for key predictions of expectation-based theories in processing difficulty at RC-initial accusative noun phrase (NP) objects, which corpus data suggest should be highly unexpected. These results suggest that a complete theory of syntactic complexity must integrate insights from both expectation-based and memory-based theories. PMID:24711687

  7. Sharing a Multimodal Corpus to Study Webcam-Mediated Language Teaching

    ERIC Educational Resources Information Center

    Guichon, Nicolas

    2017-01-01

    This article proposes a methodology to create a multimodal corpus that can be shared with a group of researchers in order to analyze synchronous online pedagogical interactions. Epistemological aspects involved in studying online interactions from a multimodal and semiotic perspective are addressed. Then, issues and challenges raised by corpus…

  8. Languages cool as they expand: Allometric scaling and the decreasing need for new words

    PubMed Central

    Petersen, Alexander M.; Tenenbaum, Joel N.; Havlin, Shlomo; Stanley, H. Eugene; Perc, Matjaž

    2012-01-01

    We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two scaling regimes characterize the word frequency distributions, with only the more common words obeying the classic Zipf law. Using corpora of unprecedented size, we test the allometric scaling relation between the corpus size and the vocabulary size of growing languages to demonstrate a decreasing marginal need for new words, a feature that is likely related to the underlying correlations between words. We calculate the annual growth fluctuations of word use which has a decreasing trend as the corpus size increases, indicating a slowdown in linguistic evolution following language expansion. This “cooling pattern” forms the basis of a third statistical regularity, which unlike the Zipf and the Heaps law, is dynamical in nature. PMID:23230508

  9. Languages cool as they expand: Allometric scaling and the decreasing need for new words

    NASA Astrophysics Data System (ADS)

    Petersen, Alexander M.; Tenenbaum, Joel N.; Havlin, Shlomo; Stanley, H. Eugene; Perc, Matjaž

    2012-12-01

    We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two scaling regimes characterize the word frequency distributions, with only the more common words obeying the classic Zipf law. Using corpora of unprecedented size, we test the allometric scaling relation between the corpus size and the vocabulary size of growing languages to demonstrate a decreasing marginal need for new words, a feature that is likely related to the underlying correlations between words. We calculate the annual growth fluctuations of word use which has a decreasing trend as the corpus size increases, indicating a slowdown in linguistic evolution following language expansion. This “cooling pattern” forms the basis of a third statistical regularity, which unlike the Zipf and the Heaps law, is dynamical in nature.

  10. Deviations in the Zipf and Heaps laws in natural languages

    NASA Astrophysics Data System (ADS)

    Bochkarev, Vladimir V.; Lerner, Eduard Yu; Shevlyakova, Anna V.

    2014-03-01

    This paper is devoted to verifying the empirical Zipf and Heaps laws in natural languages using Google Books Ngram corpus data. The connection between the Zipf and Heaps laws, the latter of which predicts a power-law dependence of the vocabulary size on the text size, is discussed. In fact, the Heaps exponent in this dependence varies as the text corpus grows. To explain this, the obtained results are compared with a probability model of text generation. Quasi-periodic variations with characteristic time periods of 60-100 years were also found.
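
    The Heaps law referred to in the records above states that vocabulary size V grows as a power of text size n, V(n) = K * n^beta. The following sketch estimates beta as the least-squares slope of log V against log n. The function names and the toy token list are illustrative assumptions; the studies listed here work with Google Books Ngram data and may fit the exponent differently.

```python
import math


def vocabulary_growth(tokens):
    """Return (text_size, vocabulary_size) after each token is read."""
    seen = set()
    points = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        points.append((i, len(seen)))
    return points


def heaps_exponent(points):
    """Least-squares slope of log(vocabulary size) against log(text size)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Toy text (hypothetical); real estimates need corpora of many millions of words.
tokens = "a b a c b d a e".split()
beta = heaps_exponent(vocabulary_growth(tokens))
```

    Fitting the slope separately on successive slices of a growing corpus is one simple way to see the exponent drift that the abstract reports.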

  11. Biomedical information retrieval across languages.

    PubMed

    Daumke, Philipp; Markó, Kornél; Poprat, Michael; Schulz, Stefan; Klar, Rüdiger

    2007-06-01

    This work presents a new dictionary-based approach to biomedical cross-language information retrieval (CLIR) that addresses many of the general and domain-specific challenges in current CLIR research. Our method is based on a multilingual lexicon that was generated partly manually and partly automatically, and currently covers six European languages. It contains morphologically meaningful word fragments, termed subwords. Using subwords instead of entire words significantly reduces the number of lexical entries necessary to sufficiently cover a specific language and domain. Mediation between queries and documents is based on these subwords as well as on lists of word-n-grams that are generated from large monolingual corpora and constitute possible translation units. The translations are then sent to a standard Internet search engine. This process makes our approach an effective tool for searching the biomedical content of the World Wide Web in different languages. We evaluate this approach using the OHSUMED corpus, a large medical document collection, within a cross-language retrieval setting.

  12. Self-Repair in Oral Production by Intermediate Chinese Learners of English

    ERIC Educational Resources Information Center

    Liu, Jiangtao

    2009-01-01

    For various reasons, second language learners modify their speech by means of self-repair. This study, based on a small-scale corpus, shows the patterns and features of self-repairs by intermediate Chinese learners of English. The results suggest that intermediate Chinese learners of English more frequently make repairs than advanced Chinese…

  13. Gender Difference in the Use of Thought Representation--A Corpus-Based Study

    ERIC Educational Resources Information Center

    Riissanen, Anne; Watson, Greg

    2014-01-01

    This study investigates potential differences in language use between genders, by applying a modified model of thought representation. Our hypothesis is that women use more direct forms of thought representation than men in modern spoken British English. Women are said to favour "private speech" that creates intimacy and…

  14. A Corpus-Based Study on Turkish Spoken Productions of Bilingual Adults

    ERIC Educational Resources Information Center

    Agçam, Reyhan; Bulut, Adem

    2016-01-01

    The current study investigated whether monolingual adult speakers of Turkish and bilingual adult speakers of Arabic and Turkish significantly differ regarding their spoken productions in Turkish. Accordingly, two groups of undergraduate students studying Turkish Language and Literature at a state university in Turkey were presented two videos on a…

  15. Analysis of Metadiscourse Markers in Academic Written Discourse Produced by Turkish Researchers

    ERIC Educational Resources Information Center

    Duruk, Eda

    2017-01-01

    This study aims at examining the frequency of interpersonal metadiscourse markers in academic written discourse and investigating the way Turkish writers use interpersonal metadiscourse, namely in MA dissertations from one major academic field: English language teaching (ELT). Corpus-based research is applied by examining a total of 20…

  16. The Acquisition of Jamaican Creole: Null Subject Phenomenon

    ERIC Educational Resources Information Center

    De Lisser, Tamirand Nnena; Durrleman, Stephanie; Rizzi, Luigi; Shlonsky, Ur

    2016-01-01

    This article provides the first systematic analysis of early subject omission in a creole language. Basing our analysis on a longitudinal corpus of natural production of Jamaican Creole (JC), we observe that early subject drop is robustly attested for several months. Early subject omission is basically confined to the clause initial position,…

  17. Fine-Grained Linguistic Soft Constraints on Statistical Natural Language Processing Models

    DTIC Science & Technology

    2009-01-01

    4 Monolingually-Derived Phrasal Paraphrase Generation for Statistical Machine Translation ... 4.4 Spanish-English (S2E) results ... 4.5 Gains from using larger monolingual corpora for... 4.2 Visual example of a phrasal distributional profile ... 4.3 Monolingual corpus-based distributional

  18. Speech rhythm analysis with decomposition of the amplitude envelope: characterizing rhythmic patterns within and across languages.

    PubMed

    Tilsen, Sam; Arvaniti, Amalia

    2013-07-01

    This study presents a method for analyzing speech rhythm using empirical mode decomposition of the speech amplitude envelope, which allows for extraction and quantification of syllabic- and supra-syllabic time-scale components of the envelope. The method of empirical mode decomposition of a vocalic energy amplitude envelope is illustrated in detail, and several types of rhythm metrics derived from this method are presented. Spontaneous speech extracted from the Buckeye Corpus is used to assess the effect of utterance length on metrics, and it is shown how metrics representing variability in the supra-syllabic time-scale components of the envelope can be used to identify stretches of speech with targeted rhythmic characteristics. Furthermore, the envelope-based metrics are used to characterize cross-linguistic differences in speech rhythm in the UC San Diego Speech Lab corpus of English, German, Greek, Italian, Korean, and Spanish speech elicited in read sentences, read passages, and spontaneous speech. The envelope-based metrics exhibit significant effects of language and elicitation method that argue for a nuanced view of cross-linguistic rhythm patterns.

  19. Motivation in Language Planning and Language Policy. Multilingual Matters 119.

    ERIC Educational Resources Information Center

    Ager, Dennis

    The aim of this book is to investigate the motives for action on language behavior, whether this means corpus, status, or acquisition planning. It examines such questions as why individuals, groups, and governments try to influence their own or others' language behavior or language attitudes, and what drives authorities to try to control, favor,…

  20. Working Together: Contributions of Corpus Analyses and Experimental Psycholinguistics to Understanding Conversation

    PubMed Central

    Meyer, Antje S.; Alday, Phillip M.; Decuyper, Caitlin; Knudsen, Birgit

    2018-01-01

    As conversation is the most important way of using language, linguists and psychologists should combine forces to investigate how interlocutors deal with the cognitive demands arising during conversation. Linguistic analyses of corpora of conversation are needed to understand the structure of conversations, and experimental work is indispensable for understanding the underlying cognitive processes. We argue that joint consideration of corpus and experimental data is most informative when the utterances elicited in a lab experiment match those extracted from a corpus in relevant ways. This requirement to compare like with like seems obvious but is not trivial to achieve. To illustrate this approach, we report two experiments where responses to polar (yes/no) questions were elicited in the lab and the response latencies were compared to gaps between polar questions and answers in a corpus of conversational speech. We found, as expected, that responses were given faster when they were easy to plan and planning could be initiated earlier than when they were harder to plan and planning was initiated later. Overall, in all but one condition, the latencies were longer than one would expect based on the analyses of corpus data. We discuss the implication of this partial match between the data sets and more generally how corpus and experimental data can best be combined in studies of conversation. PMID:29706919

  1. Applying Active Learning to Assertion Classification of Concepts in Clinical Text

    PubMed Central

    Chen, Yukun; Mani, Subramani; Xu, Hua

    2012-01-01

    Supervised machine learning methods for clinical natural language processing (NLP) research require a large number of annotated samples, which are very expensive to build because of the involvement of physicians. Active learning, an approach that actively samples from a large pool, provides an alternative solution. Its major goal in classification is to reduce the annotation effort while maintaining the quality of the predictive model. However, few studies have investigated its uses in clinical NLP. This paper reports an application of active learning to a clinical text classification task: to determine the assertion status of clinical concepts. The annotated corpus for the assertion classification task in the 2010 i2b2/VA Clinical NLP Challenge was used in this study. We implemented several existing and newly developed active learning algorithms and assessed their uses. The outcome is reported in the global ALC score, based on the Area under the average Learning Curve of the AUC (Area Under the Curve) score. Results showed that when the same number of annotated samples was used, active learning strategies could generate better classification models (best ALC – 0.7715) than the passive learning method (random sampling) (ALC – 0.7411). Moreover, to achieve the same classification performance, active learning strategies required fewer samples than the random sampling method. For example, to achieve an AUC of 0.79, the random sampling method used 32 samples, while our best active learning algorithm required only 12 samples, a reduction of 62.5% in manual annotation effort. PMID:22127105
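
    The pool-based selection step and the reported saving can be sketched as follows. The sketch assumes uncertainty sampling, a standard active learning strategy; the abstract does not say which algorithms were used, and the function names and pool probabilities are hypothetical. Only the sample counts 32 and 12 come from the abstract.

```python
def most_uncertain(probabilities, k=1):
    """Indices of the k pool items whose positive-class probability is nearest 0.5."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:k]


def annotation_reduction(passive_samples, active_samples):
    """Fraction of annotation effort saved relative to random sampling."""
    return (passive_samples - active_samples) / passive_samples


# Hypothetical model confidences for five unlabeled clinical concepts.
pool_probs = [0.95, 0.52, 0.10, 0.40, 0.75]
query = most_uncertain(pool_probs, k=2)   # items to send to the annotator next

# Figures from the abstract: 32 samples (random) versus 12 (active) for AUC 0.79.
saving = annotation_reduction(32, 12)
```

    With the abstract's figures, (32 - 12) / 32 = 0.625, which matches the reported 62.5% reduction in manual annotation effort.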

  2. Numerical morphology supports early number word learning: Evidence from a comparison of young Mandarin and English learners.

    PubMed

    Le Corre, Mathieu; Li, Peggy; Huang, Becky H; Jia, Gisela; Carey, Susan

    2016-08-01

    Previous studies showed that children learning a language with an obligatory singular/plural distinction (Russian and English) learn the meaning of the number word for one earlier than children learning Japanese, a language without obligatory number morphology (Barner, Libenson, Cheung, & Takasaki, 2009; Sarnecka, Kamenskaya, Yamana, Ogura, & Yudovina, 2007). This can be explained by differences in number morphology, but it can also be explained by many other differences between the languages and the environments of the children who were compared. The present study tests the hypothesis that the morphological singular/plural distinction supports the early acquisition of the meaning of the number word for one by comparing young English learners to age and SES matched young Mandarin Chinese learners. Mandarin does not have obligatory number morphology but is more similar to English than Japanese in many crucial respects. Corpus analyses show that, compared to English learners, Mandarin learners hear number words more frequently, are more likely to hear number words followed by a noun, and are more likely to hear number words in contexts where they denote a cardinal value. Two tasks show that, despite these advantages, Mandarin learners learn the meaning of the number word for one three to six months later than do English learners. These results provide the strongest evidence to date that prior knowledge of the numerical meaning of the distinction between singular and plural supports the acquisition of the meaning of the number word for one. Copyright © 2016. Published by Elsevier Inc.

  3. Tracking Learners' Progress: Adopting a Dual "Corpus cum Experimental Data" Approach

    ERIC Educational Resources Information Center

    Meunier, Fanny; Littre, Damien

    2013-01-01

    The article discusses the potential of combining learner corpus research with experimental studies in order to fine-tune the understanding of learner language development. It illustrates the complementarity of the two methodological approaches with data from an ongoing study of the acquisition of the English tense and aspect system by French…

  4. On the Application of Corpus of Contemporary American English in Vocabulary Instruction

    ERIC Educational Resources Information Center

    Yusu, Xu

    2014-01-01

    The development of corpus linguistics has laid a theoretical foundation and provided technical support for breaking the bottleneck in traditional vocabulary instruction in China. Corpora allow access to authentic data and show frequency patterns of words and grammar constructions. Such patterns can be used to improve language materials or to directly…

  5. Generation of an annotated reference standard for vaccine adverse event reports.

    PubMed

    Foster, Matthew; Pandey, Abhishek; Kreimeyer, Kory; Botsis, Taxiarchis

    2018-07-05

    As part of a collaborative project between the US Food and Drug Administration (FDA) and the Centers for Disease Control and Prevention for the development of a web-based natural language processing (NLP) workbench, we created a corpus of 1000 Vaccine Adverse Event Reporting System (VAERS) reports annotated for 36,726 clinical features, 13,365 temporal features, and 22,395 clinical-temporal links. This paper describes the final corpus, as well as the methodology used to create it, so that clinical NLP researchers outside FDA can evaluate the utility of the corpus to aid their own work. The creation of this standard went through four phases: pre-training, pre-production, production-clinical feature annotation, and production-temporal annotation. The pre-production phase used a double annotation followed by adjudication strategy to refine and finalize the annotation model while the production phases followed a single annotation strategy to maximize the number of reports in the corpus. An analysis of 30 reports randomly selected as part of a quality control assessment yielded accuracies of 0.97, 0.96, and 0.83 for clinical features, temporal features, and clinical-temporal associations, respectively and speaks to the quality of the corpus. Copyright © 2018 Elsevier Ltd. All rights reserved.

  6. Text exposure predicts spoken production of complex sentences in eight and twelve year old children and adults

    PubMed Central

    Montag, Jessica L.; MacDonald, Maryellen C.

    2015-01-01

    There is still much debate about the nature of the experiential and maturational changes that take place during childhood to bring about the sophisticated language abilities of an adult. The present study investigated text exposure as a possible source of linguistic experience that plays a role in the development of adult-like language abilities. Corpus analyses of object and passive relative clauses (Object: The book that the woman carried; Passive: The book that was carried by the woman) established the frequencies of these sentence types in child-directed speech and children's literature. We found that relative clauses of either type were more frequent in the written corpus, and that the ratio of passive to object relatives was much higher in the written corpus as well. This analysis suggests that passive relative clauses are much more frequent in a child's linguistic environment if they have high rates of text exposure. We then elicited object and passive relative clauses using a picture-description production task with eight and twelve year old children and adults. Both group and individual differences were consistent with the corpus analyses, such that older individuals and individuals with more text exposure produced more passive relative clauses. These findings suggest that the qualitatively different patterns of text versus speech may be an important source of linguistic experience for the development of adult-like language behavior. PMID:25844625

  7. Working Papers in Educational Linguistics, 1996-1997.

    ERIC Educational Resources Information Center

    Furumoto, Mitchell A., Ed.; And Others

    1997-01-01

    Reports of language research in the 1996 issue include: "Corpus Planning for the Southern Peruvian Quechua Language" (Serafin M. Coronel-Molina); "Foreign Language Planning in U.S. Higher Education: The Case of a Graduate Business Program" (Mitchell A. Furumoto); "Charting New Directions: Of Communication in a Social…

  8. Similarity-Based Recommendation of New Concepts to a Terminology

    PubMed Central

    Chandar, Praveen; Yaman, Anil; Hoxha, Julia; He, Zhe; Weng, Chunhua

    2015-01-01

    Terminologies can suffer from poor concept coverage due to delays in addition of new concepts. This study tests a similarity-based approach to recommending concepts from a text corpus to a terminology. Our approach involves extraction of candidate concepts from a given text corpus, which are represented using a set of features. The model learns the important features to characterize a concept and recommends new concepts to a terminology. Further, we propose a cost-effective evaluation methodology to estimate the effectiveness of terminology enrichment methods. To test our methodology, we use the clinical trial eligibility criteria free-text as an example text corpus to recommend concepts for SNOMED CT. We computed precision at various rank intervals to measure the performance of the methods. Results indicate that our automated algorithm is an effective method for concept recommendation. PMID:26958170
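
    The abstract above reports precision at various rank intervals. As a minimal sketch of how precision at rank k is commonly computed (not the authors' evaluation code; the function name `precision_at_k` and the candidate identifiers are hypothetical), the fraction of the top-k recommended concepts judged relevant can be calculated as follows:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked[:k]
    hits = sum(1 for item in top_k if item in relevant)
    # Dividing by k (not by len(top_k)) penalizes lists shorter than k.
    return hits / k


# Hypothetical ranked candidate concepts and made-up relevance judgments.
ranked_candidates = ["c1", "c2", "c3", "c4", "c5"]
judged_relevant = {"c1", "c3", "c4"}

for k in (1, 2, 5):
    print(k, precision_at_k(ranked_candidates, judged_relevant, k))
```

    Reporting the value at several cut-offs shows whether relevant concepts are concentrated near the top of the ranking.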

  9. Examining the relationship between comprehension and production processes in code-switched language

    PubMed Central

    Guzzardo Tamargo, Rosa E.; Valdés Kroff, Jorge R.; Dussias, Paola E.

    2016-01-01

    We employ code-switching (the alternation of two languages in bilingual communication) to test the hypothesis, derived from experience-based models of processing (e.g., Boland, Tanenhaus, Carlson, & Garnsey, 1989; Gennari & MacDonald, 2009), that bilinguals are sensitive to the combinatorial distributional patterns derived from production and that they use this information to guide processing during the comprehension of code-switched sentences. An analysis of spontaneous bilingual speech confirmed the existence of production asymmetries involving two auxiliary + participle phrases in Spanish–English code-switches. A subsequent eye-tracking study with two groups of bilingual code-switchers examined the consequences of the differences in distributional patterns found in the corpus study for comprehension. Participants’ comprehension costs mirrored the production patterns found in the corpus study. Findings are discussed in terms of the constraints that may be responsible for the distributional patterns in code-switching production and are situated within recent proposals of the links between production and comprehension. PMID:28670049

  10. Examining the relationship between comprehension and production processes in code-switched language.

    PubMed

    Guzzardo Tamargo, Rosa E; Valdés Kroff, Jorge R; Dussias, Paola E

    2016-08-01

    We employ code-switching (the alternation of two languages in bilingual communication) to test the hypothesis, derived from experience-based models of processing (e.g., Boland, Tanenhaus, Carlson, & Garnsey, 1989; Gennari & MacDonald, 2009), that bilinguals are sensitive to the combinatorial distributional patterns derived from production and that they use this information to guide processing during the comprehension of code-switched sentences. An analysis of spontaneous bilingual speech confirmed the existence of production asymmetries involving two auxiliary + participle phrases in Spanish-English code-switches. A subsequent eye-tracking study with two groups of bilingual code-switchers examined the consequences of the differences in distributional patterns found in the corpus study for comprehension. Participants' comprehension costs mirrored the production patterns found in the corpus study. Findings are discussed in terms of the constraints that may be responsible for the distributional patterns in code-switching production and are situated within recent proposals of the links between production and comprehension.

  11. Immersion Education and Cognitive Strategies: Can the Obstacle Be the Advantage in a Multilingual Society?

    ERIC Educational Resources Information Center

    Maillat, Didier; Serra, Cecilia

    2009-01-01

    This paper focusses on the teaching of non-linguistic subject matters in a second or third language through bilingual education. We investigate how this specific educational framework influences the development of linguistic competence as well as disciplinary knowledge. Based on a large-scale corpus of classroom interactions collected in bilingual…

  12. Lexicogrammar in the International Construction Industry: A Corpus-Based Case Study of Japanese-Hong-Kongese On-Site Interactions in English

    ERIC Educational Resources Information Center

    Handford, Michael; Matous, Petr

    2011-01-01

    The purpose of this research is to identify and interpret statistically significant lexicogrammatical items that are used in on-site spoken communication in the international construction industry, initially through comparisons with reference corpora of everyday spoken and business language. Several data sources, including audio and video…

  13. Developing a Maori Language Mathematics Lexicon: Challenges for Corpus Planning in Indigenous Language Contexts

    ERIC Educational Resources Information Center

    Trinick, Tony; May, Stephen

    2013-01-01

    Over the last 25 years, there has been significant modernisation and elaboration of the Maori language mathematics lexicon and register to support the teaching of (Western) mathematics as a component of Maori-medium schooling. These developments are situated within the wider Maori language revitalisation movement in Aotearoa/New Zealand, of which…

  14. Representations of Language Education in Canadian Newspapers

    ERIC Educational Resources Information Center

    Vessey, Rachelle

    2017-01-01

    This article examines the salience and content of representations of language education in a corpus of English- and French-Canadian newspapers. Findings suggest that English-Canadian newspapers foreground official-language education issues, in which public schools are represented as the primary means by which Canadians can gain equal access to…

  15. An Avatar-Based Italian Sign Language Visualization System

    NASA Astrophysics Data System (ADS)

    Falletto, Andrea; Prinetto, Paolo; Tiotto, Gabriele

    In this paper, we present an experimental system that supports the translation from Italian to Italian Sign Language (ISL) of the deaf and its visualization through a virtual character. Our objective is to develop a complete platform useful for any application and reusable on several platforms including Web, Digital Television and offline text translation. The system relies on a database that stores both a corpus of Italian words and words coded in the ISL notation system. An interface for the insertion of data is implemented, that allows future extensions and integrations.

  16. All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning.

    PubMed

    Airola, Antti; Pyysalo, Sampo; Björne, Jari; Pahikkala, Tapio; Ginter, Filip; Salakoski, Tapio

    2008-11-19

    Automated extraction of protein-protein interactions (PPI) is an important and widely studied task in biomedical text mining. We propose a graph kernel based approach for this task. In contrast to earlier approaches to PPI extraction, the introduced all-paths graph kernel has the capability to make use of full, general dependency graphs representing the sentence structure. We evaluate the proposed method on five publicly available PPI corpora, providing the most comprehensive evaluation done for a machine learning based PPI-extraction system. We additionally perform a detailed evaluation of the effects of training and testing on different resources, providing insight into the challenges involved in applying a system beyond the data it was trained on. Our method is shown to achieve state-of-the-art performance with respect to comparable evaluations, with 56.4 F-score and 84.8 AUC on the AImed corpus. We show that the graph kernel approach performs on state-of-the-art level in PPI extraction, and note the possible extension to the task of extracting complex interactions. Cross-corpus results provide further insight into how the learning generalizes beyond individual corpora. Further, we identify several pitfalls that can make evaluations of PPI-extraction systems incomparable, or even invalid. These include incorrect cross-validation strategies and problems related to comparing F-score results achieved on different evaluation resources. Recommendations for avoiding these pitfalls are provided.

  17. Evaluating Lexical Coverage in Simple English Wikipedia Articles: A Corpus-Driven Study

    ERIC Educational Resources Information Center

    Hendry, Clinton; Sheepy, Emily

    2017-01-01

    Simple English Wikipedia is a user-contributed online encyclopedia intended for young readers and readers whose first language is not English. We compiled a corpus of the entirety of Simple English Wikipedia as of June 20th, 2017. We used lexical frequency profiling tools to investigate the vocabulary size needed to comprehend Simple English…

  18. The Development of Second Language Writing Complexity in Groups and Individuals: A Longitudinal Learner Corpus Study

    ERIC Educational Resources Information Center

    Vyatkina, Nina

    2012-01-01

    This study explores the development of multiple dimensions of linguistic complexity in the writing of beginning learners of German both as a group and as individuals. The data come from an annotated, longitudinal learner corpus. The development of lexicogrammatical complexity is explored at 2 intersections: (a) between cross-sectional trendlines…

  19. "We Have about Seven Minutes for Questions": The Discussion Sessions from a Specialized Conference

    ERIC Educational Resources Information Center

    Wulff, Stefanie; Swales, John M.; Keller, Kristen

    2009-01-01

    This paper discusses the "John Swales Conference Corpus" (JSCC), which contains the lectures and discussion sessions from an applied linguistics conference held in 2006 at the University of Michigan. This corpus constitutes a useful resource in that it provides insights into the language of a narrowly defined academic community.…

  20. A Study of Composition/Correction System with Corpus Retrieval Function

    ERIC Educational Resources Information Center

    Liu, Song; Liu, Peng; Urano, Yoshiyori

    2013-01-01

    Practice and research in computer- and network-based composition education have become increasingly active. Through an online composition system, a large number of written texts produced by students and teachers can be collected. This kind of information is called a learner corpus, which is important in second language education because the…

  1. Young People's Everyday Literacies: The Language Features of Instant Messaging

    ERIC Educational Resources Information Center

    Haas, Christina; Takayoshi, Pamela

    2011-01-01

    In this article, we examine writing in the context of new communication technologies as a kind of everyday literacy. Using an inductive approach developed from grounded theory, we analyzed a 32,000-word corpus of college students' Instant Messaging (IM) exchanges. Through our analysis of this corpus, we identify a fifteen-item taxonomy of IM…

  2. A School Healthcare Program for Low Income Families of Very Young Children.

    ERIC Educational Resources Information Center

    Joyce, Esperanza Villanueva

    This chapter is part of a book that recounts the year's work at the Early Childhood Development Center (ECDC) at Texas A & M University-Corpus Christi. Rather than an "elitist" laboratory school for the children of university faculty, the dual-language ECDC is a collaboration between the Corpus Christi Independent School District and…

  3. Well, Now, Okey Dokey: English Discourse Markers in Spanish Language Medical Consultations

    PubMed Central

    Vickers, Caroline H.; Goble, Ryan

    2013-01-01

    The purpose of this paper is to examine use of English discourse markers in otherwise Spanish language consultations. Data is derived from an audio-recorded corpus of Spanish language consultations that took place in a small community clinic in the United States as well as post-consultation interviews with patients and providers. Through quantification of the use of discourse markers in the corpus and discourse analysis of transcripts, we demonstrate that English-speaking dominant medical providers use English discourse markers more frequently and with a broader range of functions than do Spanish-speaking dominant medical providers and patients. We argue that such use of English discourse markers serves to exacerbate the power relationship between providers and patients even though the use of English discourse markers does not cause overt miscommunication in the ongoing interaction. Implications for providers who use a second language in their medical consultations are discussed. PMID:24347670

  4. Morpheme matching based text tokenization for a scarce resourced language.

    PubMed

    Rehman, Zobia; Anwar, Waqas; Bajwa, Usama Ijaz; Xuan, Wang; Chaoying, Zhou

    2013-01-01

    Text tokenization is a fundamental pre-processing step for almost all information processing applications. This task is nontrivial for scarce-resourced languages such as Urdu, as there is inconsistent use of space between words. In this paper a morpheme matching based approach is proposed for Urdu text tokenization, along with some other algorithms to solve the additional issues of boundary detection of compound words, affixation, reduplication, names and abbreviations. This study resulted in 97.28% precision, 93.71% recall, and 95.46% F1-measure when tokenizing a corpus of 57,000 words using a morpheme list with 6,400 entries.
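
    To illustrate the general idea of matching text against a morpheme list, the following is a minimal sketch of greedy forward longest matching. It is an assumption chosen for illustration, not the algorithm from the paper, and it uses a toy Latin-script morpheme list rather than Urdu data; the function name `max_match` is hypothetical.

```python
def max_match(text, morphemes):
    """Greedy longest-match segmentation of a string written without spaces."""
    # Longest entry in the list bounds the window we need to try.
    longest = max([len(m) for m in morphemes] + [1])
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a listed morpheme matches.
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in the list matches.
            if piece in morphemes or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


toy_morphemes = {"the", "cat", "sat"}  # hypothetical toy list
print(max_match("thecatsat", toy_morphemes))
print(max_match("thexcat", toy_morphemes))
```

    Greedy matching of this kind can mis-segment ambiguous strings, which is one reason a practical tokenizer needs additional handling for compounds, affixes, names and abbreviations.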

  5. Morpheme Matching Based Text Tokenization for a Scarce Resourced Language

    PubMed Central

    Rehman, Zobia; Anwar, Waqas; Bajwa, Usama Ijaz; Xuan, Wang; Chaoying, Zhou

    2013-01-01

    Text tokenization is a fundamental pre-processing step for almost all information processing applications. This task is nontrivial for scarce-resourced languages such as Urdu, as there is inconsistent use of space between words. In this paper a morpheme matching based approach is proposed for Urdu text tokenization, along with some other algorithms to solve the additional issues of boundary detection of compound words, affixation, reduplication, names and abbreviations. This study resulted in 97.28% precision, 93.71% recall, and 95.46% F1-measure when tokenizing a corpus of 57,000 words using a morpheme list with 6,400 entries. PMID:23990871
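
    The three figures reported above are mutually consistent if the F1-measure is taken as the standard harmonic mean of precision and recall (an assumption, since the abstract does not state the formula). A short check, with a hypothetical helper `f1_score`:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, expressed as fractions.
f1 = f1_score(0.9728, 0.9371)
print(f"{f1 * 100:.2f}%")  # prints 95.46%
```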

  6. A Digital Liquid State Machine With Biologically Inspired Learning and Its Application to Speech Recognition.

    PubMed

    Zhang, Yong; Li, Peng; Jin, Yingyezhe; Choe, Yoonsuck

    2015-11-01

    This paper presents a bioinspired digital liquid-state machine (LSM) for low-power very-large-scale-integration (VLSI)-based machine learning applications. To the best of the authors' knowledge, this is the first work that employs a bioinspired spike-based learning algorithm for the LSM. With the proposed online learning, the LSM extracts information from input patterns on the fly without needing intermediate data storage as required in offline learning methods such as ridge regression. The proposed learning rule is local such that each synaptic weight update is based only upon the firing activities of the corresponding presynaptic and postsynaptic neurons without incurring global communications across the neural network. Compared with the backpropagation-based learning, the locality of computation in the proposed approach lends itself to efficient parallel VLSI implementation. We use subsets of the TI46 speech corpus to benchmark the bioinspired digital LSM. To reduce the complexity of the spiking neural network model without performance degradation for speech recognition, we study the impacts of synaptic models on the fading memory of the reservoir and hence the network performance. Moreover, we examine the tradeoffs between synaptic weight resolution, reservoir size, and recognition performance and present techniques to further reduce the overhead of hardware implementation. Our simulation results show that in terms of isolated word recognition evaluated using the TI46 speech corpus, the proposed digital LSM rivals the state-of-the-art hidden Markov-model-based recognizer Sphinx-4 and outperforms all other reported recognizers including the ones that are based upon the LSM or neural networks.

  7. A Customized Attention-Based Long Short-Term Memory Network for Distant Supervised Relation Extraction.

    PubMed

    He, Dengchao; Zhang, Hongjun; Hao, Wenning; Zhang, Rui; Cheng, Kai

    2017-07-01

    Distant supervision, a widely applied approach in the field of relation extraction, can automatically generate large amounts of labeled training data with minimal manual effort. However, the labeled training corpus may contain many false positives, which would hurt the performance of relation extraction. Moreover, in traditional feature-based distant supervised approaches, extraction models adopt human-designed features with natural language processing, which may also cause poor performance. To address these two shortcomings, we propose a customized attention-based long short-term memory network. Our approach adopts word-level attention to achieve better data representation for relation extraction without manually designed features to perform distant supervision instead of fully supervised relation extraction, and it utilizes instance-level attention to tackle the problem of false-positive data. Experimental results demonstrate that our proposed approach is effective and achieves better performance than traditional methods.

  8. Task-Based EFL Language Teaching with Procedural Information Design in a Technical Writing Context

    ERIC Educational Resources Information Center

    Roy, Debopriyo

    2017-01-01

    Task-based language learning (TBLL) has heavily influenced syllabus design, classroom teaching, and learner assessment in a foreign or second language teaching context. In this English as foreign language (EFL) learning environment, the paper discussed an innovative language learning pedagogy based on design education and technical writing. In…

  9. Development and Use of a Corpus Tailored for Legal English Learning

    ERIC Educational Resources Information Center

    Skier, Jason; Vibulphol, Jutarat

    2016-01-01

    While corpus linguistics has been applied towards many specific academic purposes, reports are few regarding its use to facilitate learning of legal English by non-native English speakers. Specialized corpora are required because legal English often differs significantly from ordinary usage, with words such as bar, motion, and hearing having…

  10. Beyond the Four Walls: Community-Based Learning and Languages

    ERIC Educational Resources Information Center

    O'Connor, Anne

    2012-01-01

    At a time when languages in universities are under pressure, community-based learning language courses can have many positive benefits: they can increase interest in language learning, they can foster greater engagement with learning, and they can encourage active learning, creativity and teamwork. These courses, which link the classroom and the…

  11. Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text

    PubMed Central

    2013-01-01

    Background: Most of the institutional and research information in the biomedical domain is available in the form of English text. Even in countries where English is an official language, such as the United States, language can be a barrier for accessing biomedical information for non-native speakers. Recent progress in machine translation suggests that this technique could help make English texts accessible to speakers of other languages. However, the lack of adequate specialized corpora needed to train statistical models currently limits the quality of automatic translations in the biomedical domain. Results: We show how a large-sized parallel corpus can automatically be obtained for the biomedical domain, using the MEDLINE database. The corpus generated in this work comprises article titles obtained from MEDLINE and abstract text automatically retrieved from journal websites, which substantially extends the corpora used in previous work. After assessing the quality of the corpus for two language pairs (English/French and English/Spanish) we use the Moses package to train a statistical machine translation model that outperforms previous models for automatic translation of biomedical text. Conclusions: We have built translation data sets in the biomedical domain that can easily be extended to other languages available in MEDLINE. These sets can successfully be applied to train statistical machine translation models. While further progress should be made by incorporating out-of-domain corpora and domain-specific lexicons, we believe that this work improves the automatic translation of biomedical texts. PMID:23631733

  12. How the Potawatomi Language Lives: A Grammar of Potawatomi

    ERIC Educational Resources Information Center

    Lockwood, Hunter Thompson

    2017-01-01

    This dissertation is a descriptive grammar of Potawatomi, a critically endangered Algonquian language now only spoken as a first language by a handful of elders in northern Wisconsin. Throughout, the goal is to present an authoritative linguistic description of Potawatomi by drawing on direct elicitation, a corpus of new texts gathered in close…

  13. Taking Advantage of the "Big Mo"--Momentum in Everyday English and Swedish and in Physics Teaching

    ERIC Educational Resources Information Center

    Haglund, Jesper; Jeppsson, Fredrik; Ahrenberg, Lars

    2015-01-01

    Science education research suggests that our everyday intuitions of motion and interaction of physical objects fit well with how physicists use the term "momentum". Corpus linguistics provides an easily accessible approach to study language in different domains, including everyday language. Analysis of language samples from English text…

  14. Empirical Distributional Semantics: Methods and Biomedical Applications

    PubMed Central

    Cohen, Trevor; Widdows, Dominic

    2009-01-01

    Over the past fifteen years, a range of methods have been developed that are able to learn human-like estimates of the semantic relatedness between terms from the way in which these terms are distributed in a corpus of unannotated natural language text. These methods have also been evaluated in a number of applications in the cognitive science, computational linguistics and the information retrieval literatures. In this paper, we review the available methodologies for derivation of semantic relatedness from free text, as well as their evaluation in a variety of biomedical and other applications. Recent methodological developments, and their applicability to several existing applications are also discussed. PMID:19232399

  15. Cross-lingual neighborhood effects in generalized lexical decision and natural reading.

    PubMed

    Dirix, Nicolas; Cop, Uschi; Drieghe, Denis; Duyck, Wouter

    2017-06-01

    The present study assessed intra- and cross-lingual neighborhood effects, using both a generalized lexical decision task and an analysis of a large-scale bilingual eye-tracking corpus (Cop, Dirix, Drieghe, & Duyck, 2016). Using new neighborhood density and frequency measures, the general lexical decision task yielded an inhibitory cross-lingual neighborhood density effect on reading times of second language words, replicating van Heuven, Dijkstra, and Grainger (1998). Reaction times for native language words were not influenced by neighborhood density or frequency but error rates showed cross-lingual neighborhood effects depending on target word frequency. The large-scale eye movement corpus confirmed effects of cross-lingual neighborhood on natural reading, even though participants were reading a novel in a unilingual context. Especially second language reading and to a lesser extent native language reading were influenced by lexical candidates from the nontarget language, although these effects in natural reading were largely facilitatory. These results offer strong and direct support for bilingual word recognition models that assume language-independent lexical access. (PsycINFO Database Record (c) 2017 APA, all rights reserved).

  16. The concept of compassion within UK media generated discourse: A corpus informed analysis.

    PubMed

    Bond, Carmel; Stacey, Gemma; Field-Richards, Sarah; Callaghan, Patrick; Keeley, Philip; Lymn, Joanne; Redsell, Sarah; Spiby, Helen

    2018-04-27

    To examine how the concept of compassion is socially constructed within UK discourse, in response to recommendations that aspiring nurses gain care experience prior to entering nurse education. Following a report of significant failings in care, the UK government proposed prior care experience for aspiring nurses as a strategy to enhance compassion amongst the profession. Media reporting of this generated substantial online discussion, which formed the data for this research. There is a need to define how compassion is constructed through language, as a limited understanding exists of what compassion means in healthcare. This is important for any meaningful evaluation of quality, compassionate practices. A corpus-informed discourse analysis. A 62,626-word corpus of data was analysed using Laurence Anthony's software 'AntConc', a free corpus analysis toolkit. Frequent words were retrieved and used as a focal point for further analysis. Concordance lines were computed and analysed in the context in which frequent word-types occurred. Patterns of language were revealed and interpreted through researcher immersion. Findings identified that compassion was frequently described in various ways as a natural characteristic attribute. A pattern of language also referred to compassion as something that was not able to be taught, but could be developed through the repetition of behaviours observed in practice learning. In the context of compassion, the word-type 'nurse' was used positively. This paper adds to important debates highlighting how compassion is constructed and defined in the context of nursing. Compassion is constructed as both an individual, personal trait and a professional behaviour to be learnt. Educational design could include effective interpersonal skills training, which may help enhance and develop compassion from within the nursing profession. Likewise, ways of thinking, behaving and communicating should also be addressed by established practitioners in order to maintain compassionate interactions between professionals as well as nurse-patient relationships. Future research should focus on how compassionate practice is defined by both health professionals and patients. In order to maintain nursing as an attractive profession to join, it is important that nurses are viewed as compassionate. This holds implications for professional morale, associated with the continued retention and recruitment of the future workforce. Existing ideologies within the practice placement, the prior care experience environment, as well as the educational and organisational design are crucial factors to consider, in terms of their influences on the expression of compassion in practice. This article is protected by copyright. All rights reserved.
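
    The concordance step mentioned above can be pictured with a small keyword-in-context (KWIC) sketch. This is an illustrative assumption only: it is not how the concordancer used in the study is implemented, the function name `concordance` is hypothetical, and the sample sentence is invented rather than taken from the study corpus.

```python
import re


def concordance(text, keyword, width=3):
    """Return (left context, keyword, right context) for each keyword hit."""
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            # Up to `width` tokens on either side of the hit.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented sample text, not drawn from the study data.
sample = "A good nurse shows compassion. Every nurse can learn."
for left, hit, right in concordance(sample, "nurse", width=2):
    print(f"{left:>20} | {hit} | {right}")
```

    Reading the aligned contexts side by side is what lets an analyst judge whether a frequent word such as 'nurse' tends to appear in positive or negative surroundings.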

  17. The Effect of the Integration of Corpora in Reading Comprehension Classrooms on English as a Foreign Language Learners' Vocabulary Development

    ERIC Educational Resources Information Center

    Gordani, Yahya

    2013-01-01

    This study used a randomized pretest-posttest control group design to examine the effect of the integration of corpora in general English courses on the students' vocabulary development. To enhance the learners' lexical repertoire and thereby improve their reading comprehension, an online corpus-based approach was integrated into 42 hours of…

  18. A Corpus-Based Investigation of Critical Reflective Practice and Context in Early Career Teacher Settings

    ERIC Educational Resources Information Center

    Murphy, Bróna

    2015-01-01

    Reflective practice is at the core of teacher education programmes and is highly regarded as an essential component in the education of new and experienced teachers. Given the recent interest in language use and the role of discourse in articulating knowledge of one's practice, this paper focuses on how two groups of early career teachers from…

  19. The Decline and Fall of English in Hong Kong's Legislative Council

    ERIC Educational Resources Information Center

    Evans, Stephen

    2014-01-01

    This article presents the findings of a corpus-based study of the use of English vis-à-vis Cantonese and Putonghua in Hong Kong's Legislative Council in the past four decades. The objective of the study was to track the changing fortunes of the three languages in a key government institution during a period of unprecedented political, economic and…

  20. An Automatic Collocation Writing Assistant for Taiwanese EFL Learners: A Case of Corpus-Based NLP Technology

    ERIC Educational Resources Information Center

    Chang, Yu-Chia; Chang, Jason S.; Chen, Hao-Jan; Liou, Hsien-Chin

    2008-01-01

    Previous work in the literature reveals that EFL learners were deficient in collocations that are a hallmark of near native fluency in learner's writing. Among different types of collocations, the verb-noun (V-N) one was found to be particularly difficult to master, and learners' first language was also found to heavily influence their collocation…

  1. Design of a Low-Cost Adaptive Question Answering System for Closed Domain Factoid Queries

    ERIC Educational Resources Information Center

    Toh, Huey Ling

    2010-01-01

    Closed domain question answering (QA) systems achieve precision and recall at the cost of complex language processing techniques to parse the answer corpus. We propose a "query-based" model for indexing answers in a closed domain factoid QA system. Further, we use a phrase term inference method for improving the ranking order of related questions.…

  2. Using Corpus Linguistics to Examine the Extrapolation Inference in the Validity Argument for a High-Stakes Speaking Assessment

    ERIC Educational Resources Information Center

    LaFlair, Geoffrey T.; Staples, Shelley

    2017-01-01

    Investigations of the validity of a number of high-stakes language assessments are conducted using an argument-based approach, which requires evidence for inferences that are critical to score interpretation (Chapelle, Enright, & Jamieson, 2008b; Kane, 2013). The current study investigates the extrapolation inference for a high-stakes test of…

  3. Conjunctive Cohesion in English Language EU Documents--A Corpus-Based Analysis and Its Implications

    ERIC Educational Resources Information Center

    Trebits, Anna

    2009-01-01

    This paper reports the findings of a study which forms part of a larger-scale research project investigating the use of English in the documents of the European Union (EU). The documents of the EU show various features of texts written for legal, business and other specific purposes. Moreover, the translation services of the EU institutions often…

  4. Experiments with Cross-Language Information Retrieval on a Health Portal for Psychology and Psychotherapy.

    PubMed

    Andrenucci, Andrea

    2016-01-01

    Few studies have been performed within cross-language information retrieval (CLIR) in the field of psychology and psychotherapy. The aim of this paper is to analyze and assess the quality of available query translation methods for CLIR on a health portal for psychology. A test base of 100 user queries, 50 Multi Word Units (WUs) and 50 Single WUs, was used. Swedish was the source language and English the target language. Query translation methods based on machine translation (MT) and dictionary look-up were utilized in order to submit query translations to two search engines: Google Site Search and Quick Ask. Standard IR evaluation measures and a qualitative analysis were utilized to assess the results. The lexicon extracted with word alignment of the portal's parallel corpus provided better statistical results among dictionary look-ups. Google Translate provided more linguistically correct translations overall and also delivered better retrieval results in MT.

  5. Reflective Outcomes of Convergent and Divergent Group Tasking in the Online Learning Environment

    ERIC Educational Resources Information Center

    Hawkes, Mark

    2007-01-01

    Using collaborative critical reflection as an index, this study examines the asynchronous and face-to-face discourse of 28 suburban Chicago elementary teachers developing problem based learning (PBL) curriculum. Statistical analysis of the corpus produced by the 2 mediums shows that the asynchronous online network emerges as the medium of choice…

  6. Detection of Common Errors in Turkish EFL Students' Writing through a Corpus Analytic Approach

    ERIC Educational Resources Information Center

    Demirel, Elif Tokdemir

    2017-01-01

    The present study aims to explore Turkish EFL students' major writing difficulties by analyzing the frequent writing errors in academic essays. Accordingly, the study examined errors in a corpus of 150 academic essays written by Turkish EFL students studying at the Department of English Language and Literature at a public university in Turkey. The…

  7. The Effects of Using Corpora on Revision Tasks in L2 Writing with Coded Error Feedback

    ERIC Educational Resources Information Center

    Tono, Yukio; Satake, Yoshiho; Miura, Aika

    2014-01-01

    This study reports on the results of classroom research investigating the effects of corpus use in the process of revising compositions in English as a foreign language. Our primary aim was to investigate the relationship between the information extracted from corpus data and how that information actually helped in revising different types of…

  8. Book Choices for Culturally and Linguistically Diverse (CLD) Parents: Strategies for Sharing Books in Bilingual Homes.

    ERIC Educational Resources Information Center

    Ratliff, Joanne L.; Montague, Nicole S.

    This chapter is part of a book that recounts the year's work at the Early Childhood Development Center (ECDC) at Texas A & M University-Corpus Christi. Rather than an "elitist" laboratory school for the children of university faculty, the dual-language ECDC is a collaboration between the Corpus Christi Independent School District and…

  9. English Collocation Learning through Corpus Data: On-Line Concordance and Statistical Information

    ERIC Educational Resources Information Center

    Ohtake, Hiroshi; Fujita, Nobuyuki; Kawamoto, Takeshi; Morren, Brian; Ugawa, Yoshihiro; Kaneko, Shuji

    2012-01-01

    We developed an English Collocations On Demand system offering on-line corpus and concordance information to help Japanese researchers acquire a better command of English collocation patterns. The Life Science Dictionary Corpus consists of approximately 90,000,000 words collected from life science related research papers published in academic…

  10. Assessment of disease named entity recognition on a corpus of annotated sentences.

    PubMed

    Jimeno, Antonio; Jimenez-Ruiz, Ernesto; Lee, Vivian; Gaudan, Sylvain; Berlanga, Rafael; Rebholz-Schuhmann, Dietrich

    2008-04-11

    In recent years, the recognition of semantic types from the biomedical scientific literature has been focused on named entities like protein and gene names (PGNs) and gene ontology terms (GO terms). Other semantic types like diseases have not received the same level of attention. Different solutions have been proposed to identify disease named entities in the scientific literature. While matching the terminology with language patterns suffers from low recall (e.g., Whatizit) other solutions make use of morpho-syntactic features to better cover the full scope of terminological variability (e.g., MetaMap). Currently, MetaMap that is provided from the National Library of Medicine (NLM) is the state of the art solution for the annotation of concepts from UMLS (Unified Medical Language System) in the literature. Nonetheless, its performance has not yet been assessed on an annotated corpus. In addition, little effort has been invested so far to generate an annotated dataset that links disease entities in text to disease entries in a database, thesaurus or ontology and that could serve as a gold standard to benchmark text mining solutions. As part of our research work, we have taken a corpus that has been delivered in the past for the identification of associations of genes to diseases based on the UMLS Metathesaurus and we have reprocessed and re-annotated the corpus. We have gathered annotations for disease entities from two curators, analyzed their disagreement (0.51 in the kappa-statistic) and composed a single annotated corpus for public use. Thereafter, three solutions for disease named entity recognition including MetaMap have been applied to the corpus to automatically annotate it with UMLS Metathesaurus concepts. The resulting annotations have been benchmarked to compare their performance. The annotated corpus is publicly available at ftp://ftp.ebi.ac.uk/pub/software/textmining/corpora/diseases and can serve as a benchmark to other systems. 
In addition, we found that dictionary look-up already provides competitive results, indicating that the use of disease terminology is highly standardized throughout the terminologies and the literature. MetaMap generates precise results at the expense of insufficient recall, while our statistical method obtains better recall at a lower precision rate. Even better results in terms of precision are achieved by combining at least two of the three methods, but this approach again lowers recall. Altogether, our analysis gives a better understanding of the complexity of disease annotations in the literature. MetaMap and the dictionary based approach are available through the Whatizit web service infrastructure (Rebholz-Schuhmann D, Arregui M, Gaudan S, Kirsch H, Jimeno A: Text processing through Web services: Calling Whatizit. Bioinformatics 2008, 24:296-298).
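
    The inter-curator agreement reported above (0.51 in the kappa statistic) can be illustrated with a minimal sketch of Cohen's kappa for two annotators. The function name, the labels, and the toy annotations below are illustrative assumptions, not data from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of six text spans by two curators.
curator_1 = ["disease", "disease", "other", "other", "disease", "other"]
curator_2 = ["disease", "other", "other", "other", "disease", "disease"]
kappa = cohen_kappa(curator_1, curator_2)
print(round(kappa, 3))
```

    Kappa corrects raw agreement for the agreement expected by chance, which is why two curators who agree on most items can still score well below 1.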

  11. Exploring Emotive Verbs in Persian and English Short Stories: A Contrastive Sociopragmatic Approach.

    PubMed

    Karimi, Keihaneh; Biria, Reza

    2017-04-01

    Current developments in the areas of discourse analysis and cross-cultural studies have led to an increased interest in the way people of different cultures express their affections on various occasions. Individuals learn how to regulate their emotional reactions according to sociocultural norms of behavior defined by the cultures to which they belong. Accordingly, this article aimed to investigate the linguistic expression of emotions in English and Persian short stories in order to fathom out the impact of culture on the way feelings are expressed cross-culturally. For this purpose, a corpus of eight different English and Persian short stories, four in each language, was selected based on a purposive sampling method. Then, using Devon's (The origin of emotions, 2006) typology of emotions, different types of emotive verbs were selected as the unit of analysis. Finally, the frequency and percentage values of emotive verb tokens used in these stories were carefully tabulated in terms of types and their respective metalinguistic categories introduced by Wierzbicka (Emotions across languages and cultures: diversity and universals, Cambridge University Press, Cambridge, 1999). The results obtained from the analysis of the targeted corpora reflected that English and Persian writers employ different types of emotive verbs in expressing their feelings. Essentially, the findings of the present study may have important implications for language teachers, material developers, and course designers.

  12. Applications of Cognitive Load Theory to Multimedia-Based Foreign Language Learning: An Overview

    ERIC Educational Resources Information Center

    Chen, I-Jung; Chang, Chi-Cheng; Lee, Yen-Chang

    2009-01-01

    This article reviews the multimedia instructional design literature based on cognitive load theory (CLT) in the context of foreign language learning. Multimedia are of particular importance in language learning materials because they incorporate text, image, and sound, thus offering an integrated learning experience of the four language skills…

  13. Effects of Verb Semantics and Proficiency in Second Language Use of Constructional Knowledge

    ERIC Educational Resources Information Center

    Kim, Hyunwoo; Rah, Yangon

    2016-01-01

    This study investigates the influence of the semantic heaviness of verbs (i.e., heavy or light verbs) and language proficiency on second language (L2) learners' use of constructional information in a sentence-sorting task and a corpus analysis. Previous studies employing a sentence-sorting task demonstrated that advanced L2 learners sorted English…

  14. English and French Journal Abstracts in the Language Sciences: Three Exploratory Studies

    ERIC Educational Resources Information Center

    Van Bonn, Sarah; Swales, John M.

    2007-01-01

    This article compares French and English academic article abstracts from the language sciences in an attempt to understand how and why language choice might affect this part-genre--both in actual use and according to authors' linguistic and rhetorical perceptions. Two corpora are used: Corpus A consists of abstracts from a French linguistics…

  15. Linguistic Prescriptivism in Letters to the Editor

    ERIC Educational Resources Information Center

    Lukac, Morana

    2016-01-01

    The public's concern with the fate of the standard language has been well documented in the history of the complaint tradition. The print media have for centuries featured letters to the editor on questions of language use. This study examines a corpus of 258 language-related letters to the editor published in the English-speaking print media. By…

  16. Term Planning in Oriya: Problems and Prospects.

    ERIC Educational Resources Information Center

    Mohanty, Panchanan; Pattanayak, Subrat Kalyan

    1996-01-01

    Discusses the failed attempt to cultivate Oriya as a modern language and to make it the official language of Orissa, India, in the 19th century. The reasons for this failure are a lack of language consciousness among the Oriyas, the pro-English attitude of the Oriya bureaucrats, and an irresolute corpus planning exercise. (eight references)…

  17. Chomsky's Universal Grammar and Halliday's Systemic Functional Linguistics: An Appraisal and a Compromise

    ERIC Educational Resources Information Center

    Bavali, Mohammad; Sadighi, Firooz

    2008-01-01

    Recent developments in theories of language (grammars) seem to share a number of tenets which mark a drastic shift from traditional disentangled descriptions of language: emphasis on a big number of discrete grammatical rules or a corpus of structure patterns has given way to a more unitary, explanatory powerful description of language informed by…

  18. Splenium Development and Early Spoken Language in Human Infants

    ERIC Educational Resources Information Center

    Swanson, Meghan R.; Wolff, Jason J.; Elison, Jed T.; Gu, Hongbin; Hazlett, Heather C.; Botteron, Kelly; Styner, Martin; Paterson, Sarah; Gerig, Guido; Constantino, John; Dager, Stephen; Estes, Annette; Vachet, Clement; Piven, Joseph

    2017-01-01

    The association between developmental trajectories of language-related white matter fiber pathways from 6 to 24 months of age and individual differences in language production at 24 months of age was investigated. The splenium of the corpus callosum, a fiber pathway projecting through the posterior hub of the default mode network to occipital…

  19. Making a Minimalist Approach to Codeswitching Work: Adding the Matrix Language.

    ERIC Educational Resources Information Center

    Jake, Janice L.; Myers-Scotton, Carol; Gross, Steven

    2002-01-01

    Discusses the Matrix Language Frame model. Analysis of noun phrases in a Spanish-English corpus illustrates this compatibility and shows how recent minimalist proposals can explain the distribution of nouns and determiners in the data if they adopt the notion of matrix language as bilingual instantiation of structural uniformity in a CP.…

  20. Morphosyntactic annotation of CHILDES transcripts

    PubMed Central

    Sagae, Kenji; Davis, Eric; Lavie, Alon; MacWhinney, Brian; Wintner, Shuly

    2014-01-01

    Corpora of child language are essential for research in child language acquisition and psycholinguistics. Linguistic annotation of the corpora provides researchers with better means for exploring the development of grammatical constructions and their usage. We describe a project whose goal is to annotate the English section of the CHILDES database with grammatical relations in the form of labeled dependency structures. We have produced a corpus of over 18,800 utterances (approximately 65,000 words) with manually curated gold-standard grammatical relation annotations. Using this corpus, we have developed a highly accurate data-driven parser for the English CHILDES data, which we used to automatically annotate the remainder of the English section of CHILDES. We have also extended the parser to Spanish, and are currently working on supporting more languages. The parser and the manually and automatically annotated data are freely available for research purposes. PMID:20334720

  1. Metaphor, Metonymy, and Their Interaction in the Production of Semantic Approximations by Monolingual Children: A Corpus Analysis

    ERIC Educational Resources Information Center

    Pérez-Hernández, Lorena; Duvignau, Karine

    2016-01-01

    The present study looks into the largely unexplored territory of the cognitive underpinnings of semantic approximations in child language. The analysis of a corpus of 233 semantic approximations produced by 101 monolingual French-speaking children from 1;8 to 4;2 years of age leads to a classification of a significant number of them as instances…

  2. Computer simulation as an important approach to explore language universal. Comment on "Dependency distance: a new perspective on syntactic patterns in natural languages" by Haitao Liu et al.

    NASA Astrophysics Data System (ADS)

    Lu, Qian

    2017-07-01

    Exploring language universals is one of the major goals of linguistic research, which is largely devoted to answering the "Platonic questions" in linguistics, that is, what linguistic knowledge is and how it is acquired and used. However, if solely guided by linguistic intuition, it is very difficult for syntactic studies to answer these questions, or to achieve abstractions in the scientific sense. This suggests that linguistic analyses based on probability theory may provide effective ways to investigate language universals in terms of biological motivations or cognitive psychological mechanisms. With the view that "Language is a human-driven system", Liu, Xu & Liang's review [1] pointed out that dependency distance minimization (DDM), which has been corroborated by big data analysis of corpora, may be a language universal shaped in language evolution, a universal that has a profound effect on syntactic patterns.
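
    As a rough illustration of the quantity behind dependency distance minimization, the sketch below computes the mean dependency distance of one sentence, assuming the common definition of dependency distance as the absolute difference between the linear positions of a word and its head. The head-index encoding, the function name, and the example parse are assumptions made for illustration and are not taken from the record.

```python
def mean_dependency_distance(heads):
    """Mean absolute distance between each word and its syntactic head.

    heads[i] is the 1-based position of the head of word i + 1;
    0 marks the root, which has no head and is skipped.
    """
    distances = [
        abs(position - head)
        for position, head in enumerate(heads, start=1)
        if head != 0
    ]
    if not distances:
        return 0.0
    return sum(distances) / len(distances)


# Assumed parse of "The boy ate an apple":
# The -> boy, boy -> ate, ate = root, an -> apple, apple -> ate
heads = [2, 3, 0, 5, 3]
mdd = mean_dependency_distance(heads)
print(mdd)
```

    Averaging this value over the sentences of a treebank gives the corpus-level measure that such studies compare across languages.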

  3. DNorm: disease name normalization with pairwise learning to rank.

    PubMed

    Leaman, Robert; Islamaj Dogan, Rezarta; Lu, Zhiyong

    2013-11-15

    Despite the central role of diseases in biomedical research, there have been far fewer attempts to automatically determine which diseases are mentioned in a text (the task of disease name normalization, DNorm) compared with other normalization tasks in biomedical text mining research. In this article we introduce the first machine learning approach for DNorm, using the NCBI disease corpus and the MEDIC vocabulary, which combines MeSH® and OMIM. Our method is a high-performing and mathematically principled framework for learning similarities between mentions and concept names directly from training data. The technique is based on pairwise learning to rank, which has not previously been applied to the normalization task but has proven successful in large optimization problems for information retrieval. We compare our method with several techniques based on lexical normalization and matching, MetaMap and Lucene. Our algorithm achieves 0.782 micro-averaged F-measure and 0.809 macro-averaged F-measure, an increase over the highest performing baseline method of 0.121 and 0.098, respectively. The source code for DNorm is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/DNorm, along with a web-based demonstration and links to the NCBI disease corpus. Results on PubMed abstracts are available in PubTator: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator .
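
    The difference between the micro-averaged and macro-averaged F-measures quoted above can be shown with a small sketch. It assumes that evaluation counts (true positives, false positives, false negatives) are available per group, for example per abstract; the grouping, the function names, and the counts are hypothetical and do not reproduce the DNorm evaluation.

```python
def f_measure(tp, fp, fn):
    """Balanced F-measure from true positive, false positive, false negative counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_macro_f(counts):
    """counts: list of (tp, fp, fn) tuples, one per group (e.g. per abstract)."""
    if not counts:
        raise ValueError("need at least one group of counts")
    # Micro-average: pool the counts first, then compute a single F-measure.
    micro = f_measure(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )
    # Macro-average: compute F per group, then take the unweighted mean.
    macro = sum(f_measure(*c) for c in counts) / len(counts)
    return micro, macro


# Hypothetical counts for two abstracts of very different size.
counts = [(8, 2, 2), (1, 1, 3)]
micro, macro = micro_macro_f(counts)
print(round(micro, 3), round(macro, 3))
```

    Micro-averaging is dominated by groups with many mentions, while macro-averaging weights every group equally, so the two figures can diverge in either direction.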

  4. The Blackbird Whistling or Just After? Vygotsky's Tool and Sign as an Analytic for Writing

    ERIC Educational Resources Information Center

    Imbrenda, Jon-Philip

    2016-01-01

    Based on Vygotsky's theory of the interplay of the tool and sign functions of language, this study presents a textual analysis of a corpus of student-authored texts to illuminate aspects of development evidenced through the dialectical tension of tool and sign. Data were drawn from a series of reflective memos I authored during a seminar for new…

  5. Callosal tracts and patterns of hemispheric dominance: a combined fMRI and DTI study.

    PubMed

    Häberling, Isabelle S; Badzakova-Trajkov, Gjurgjica; Corballis, Michael C

    2011-01-15

    Left-hemispheric dominance for language and right-hemispheric dominance for spatial processing are distinctive characteristics of the human brain. However, variations of these hemispheric asymmetries have been observed, with a minority showing crowding of both functions to the same hemisphere or even a mirror reversal of the typical lateralization pattern. Here, we used diffusion tensor imaging and functional magnetic imaging to investigate the role of the corpus callosum in participants with atypical hemispheric dominance. The corpus callosum was segmented according to the projection site of the underlying fibre tracts. Analyses of the microstructure of the identified callosal segments revealed that atypical hemispheric dominance for language was associated with high anisotropic diffusion through the corpus callosum as a whole. This effect was most evident in participants with crowding of both functions to the right. The enhanced anisotropic diffusion in atypical hemispheric dominance implies that in these individuals the two hemispheres are more heavily interconnected. Copyright © 2010 Elsevier Inc. All rights reserved.

  6. A study of active learning methods for named entity recognition in clinical text.

    PubMed

    Chen, Yukun; Lasko, Thomas A; Mei, Qiaozhu; Denny, Joshua C; Xu, Hua

    2015-12-01

    Named entity recognition (NER), a sequential labeling task, is one of the fundamental tasks for building clinical natural language processing (NLP) systems. Machine learning (ML) based approaches can achieve good performance, but they often require large amounts of annotated samples, which are expensive to build due to the requirement of domain experts in annotation. Active learning (AL), a sample selection approach integrated with supervised ML, aims to minimize the annotation cost while maximizing the performance of ML-based models. In this study, our goal was to develop and evaluate both existing and new AL methods for a clinical NER task to identify concepts of medical problems, treatments, and lab tests from the clinical notes. Using the annotated NER corpus from the 2010 i2b2/VA NLP challenge that contained 349 clinical documents with 20,423 unique sentences, we simulated AL experiments using a number of existing and novel algorithms in three different categories including uncertainty-based, diversity-based, and baseline sampling strategies. They were compared with the passive learning that uses random sampling. Learning curves that plot performance of the NER model against the estimated annotation cost (based on number of sentences or words in the training set) were generated to evaluate different active learning and the passive learning methods and the area under the learning curve (ALC) score was computed. Based on the learning curves of F-measure vs. number of sentences, uncertainty sampling algorithms outperformed all other methods in ALC. Most diversity-based methods also performed better than random sampling in ALC. To achieve an F-measure of 0.80, the best method based on uncertainty sampling could save 66% annotations in sentences, as compared to random sampling. For the learning curves of F-measure vs. number of words, uncertainty sampling methods again outperformed all other methods in ALC. 
To achieve 0.80 in F-measure, in comparison to random sampling, the best uncertainty based method saved 42% annotations in words. But the best diversity based method reduced only 7% annotation effort. In the simulated setting, AL methods, particularly uncertainty-sampling based approaches, seemed to significantly save annotation cost for the clinical NER task. The actual benefit of active learning in clinical NER should be further evaluated in a real-time setting. Copyright © 2015 Elsevier Inc. All rights reserved.
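
    The uncertainty-based selection step described in this abstract can be sketched with the least-confidence criterion, one common form of uncertainty sampling. The sketch assumes that a model has already assigned each unlabeled sentence a probability distribution over candidate labelings; the function name and the toy probabilities are illustrative and are not the algorithms evaluated in the study.

```python
def least_confidence_batch(probabilities, batch_size):
    """Return the indices of the batch_size most uncertain items.

    probabilities[i] is the model's predicted distribution for item i.
    Uncertainty is 1 minus the probability of the most likely outcome.
    Ties are broken by the lower index so the result is deterministic.
    """
    scored = [(1.0 - max(dist), index) for index, dist in enumerate(probabilities)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:batch_size]]


# Hypothetical predictions for a pool of four unlabeled sentences.
pool = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.60, 0.30, 0.10],
    [0.34, 0.33, 0.33],
]
selected = least_confidence_batch(pool, 2)
print(selected)
```

    In an active learning loop, the selected sentences would be sent for annotation, added to the training set, and the model retrained before the next selection round; random sampling replaces the scoring step in the passive baseline.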

  7. Anatomical entity mention recognition at literature scale

    PubMed Central

    Pyysalo, Sampo; Ananiadou, Sophia

    2014-01-01

    Motivation: Anatomical entities ranging from subcellular structures to organ systems are central to biomedical science, and mentions of these entities are essential to understanding the scientific literature. Despite extensive efforts to automatically analyze various aspects of biomedical text, there have been only few studies focusing on anatomical entities, and no dedicated methods for learning to automatically recognize anatomical entity mentions in free-form text have been introduced. Results: We present AnatomyTagger, a machine learning-based system for anatomical entity mention recognition. The system incorporates a broad array of approaches proposed to benefit tagging, including the use of Unified Medical Language System (UMLS)- and Open Biomedical Ontologies (OBO)-based lexical resources, word representations induced from unlabeled text, statistical truecasing and non-local features. We train and evaluate the system on a newly introduced corpus that substantially extends on previously available resources, and apply the resulting tagger to automatically annotate the entire open access scientific domain literature. The resulting analyses have been applied to extend services provided by the Europe PubMed Central literature database. Availability and implementation: All tools and resources introduced in this work are available from http://nactem.ac.uk/anatomytagger. Contact: sophia.ananiadou@manchester.ac.uk Supplementary Information: Supplementary data are available at Bioinformatics online. PMID:24162468

  8. Automatically Detecting Authors’ Native Language

    DTIC Science & Technology

    2011-03-01

    exploring stylistic idiosyncrasies in the author’s writing [15]. Koppel used the data from International Corpus of Learner English version 1, which is... stylistic feature sets such as function words, letter n-grams, and errors and idiosyncrasies [15]. 1. Function words: 400 specific function words were...language on the choice of written second language words. Proceedings of the Workshop on Cognitive Aspects of Computation Language Acquisition, pp. 9–16

  9. Practice and Progression in Second Language Research Methods

    ERIC Educational Resources Information Center

    Mackey, Alison

    2014-01-01

    Since its inception, the field of second language research has utilized methods from a number of areas, including general linguistics, psychology, education, sociology, anthropology and, recently, neuroscience and corpus linguistics. As the questions and objectives expand, researchers are increasingly pushing methodological boundaries to gain a…

  10. Multilingualism in indigenous mathematics education: an epistemic matter

    NASA Astrophysics Data System (ADS)

    Parra, Aldo; Trinick, Tony

    2017-12-01

    An investigation into an aspect of indigenous education provides the opportunity to forefront an epistemological discussion about mathematical knowledge. This paper analyses indigenous peoples' educational experiences in Colombia and Aotearoa/New Zealand of mathematics education, focusing on, among other things, sociolinguistic issues such as language planning. In these experiences, researchers, teachers and local communities, working together, elaborated their respective languages to create a corpus of lexicon that has enabled the teaching of Western mathematics. An analysis using decolonial theory is made, showing how this corpus development works to enable the teaching of [Western] mathematics resulted in investigations into culture, language and mathematics that revealed an interplay among knowledge and power. Such analysis raises issues about the epistemology of mathematics and the politics of knowledge, analogous with current discussions on multilingualism in mathematics education and in ethnomathematics. The paper concludes that mathematics educators can explore and take advantage of the sociolinguistic and epistemological issues that arise when an indigenous language is elaborated in a short period of time in comparison to other languages which have been developed incrementally over hundreds of years and thus much more difficult to critique.

  11. Towards a Semantic Lexicon for Biological Language Processing

    DOE PAGES

    Verspoor, Karin

    2005-01-01

    This paper explores the use of the resources in the National Library of Medicine's Unified Medical Language System (UMLS) for the construction of a lexicon useful for processing texts in the field of molecular biology. A lexicon is constructed from overlapping terms in the UMLS SPECIALIST lexicon and the UMLS Metathesaurus to obtain both morphosyntactic and semantic information for terms, and the coverage of a domain corpus is assessed. Over 77% of tokens in the domain corpus are found in the constructed lexicon, validating the lexicon's coverage of the most frequent terms in the domain and indicating that the constructed lexicon is potentially an important resource for biological text processing.
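
    The token-level coverage figure reported above can be computed along the following lines. This is a minimal sketch assuming whitespace tokenization and case-insensitive matching; the function name, the toy lexicon, and the sentence are hypothetical, and the original study may have matched terms differently.

```python
def token_coverage(tokens, lexicon):
    """Share of corpus tokens (not distinct types) that appear in the lexicon."""
    if not tokens:
        return 0.0
    known = {entry.lower() for entry in lexicon}
    hits = sum(1 for token in tokens if token.lower() in known)
    return hits / len(tokens)


# Hypothetical domain sentence and lexicon.
tokens = "the kinase phosphorylates the receptor protein".split()
lexicon = {"the", "kinase", "protein", "receptor"}
coverage = token_coverage(tokens, lexicon)
print(round(coverage, 3))
```

    Because coverage is counted over tokens, frequent words weigh more heavily than rare ones, which is why a lexicon can cover most running text while still missing many distinct terms.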

  12. Language Evolution by Iterated Learning with Bayesian Agents

    ERIC Educational Resources Information Center

    Griffiths, Thomas L.; Kalish, Michael L.

    2007-01-01

    Languages are transmitted from person to person and generation to generation via a process of iterated learning: people learn a language from other people who once learned that language themselves. We analyze the consequences of iterated learning for learning algorithms based on the principles of Bayesian inference, assuming that learners compute…

  13. Turkish Language Teachers' Stance Taking Movements in the Discourse on Globalization and Language

    ERIC Educational Resources Information Center

    Coskun, Ibrahim

    2013-01-01

    This study investigates how Turkish teachers take and give stances in the discourse on globalization and language by using linguistic resources. According to the findings obtained through the discourse analysis of the corpus that consisted of 36 h of recording of the discussion among 4 teachers with 5 to 10 years of teaching experience, the…

  14. Sociolinguistic Variation and Change in British Sign Language Number Signs: Evidence of Leveling?

    ERIC Educational Resources Information Center

    Stamp, Rose; Schembri, Adam; Fenlon, Jordan; Rentelis, Ramas

    2015-01-01

    This article presents findings from the first major study to investigate lexical variation and change in British Sign Language (BSL) number signs. As part of the BSL Corpus Project, number sign variants were elicited from 249 deaf signers from eight sites throughout the UK. Age, school location, and language background were found to be significant…

  15. Integrating Learner Corpora and Natural Language Processing: A Crucial Step towards Reconciling Technological Sophistication and Pedagogical Effectiveness

    ERIC Educational Resources Information Center

    Granger, Sylviane; Kraif, Olivier; Ponton, Claude; Antoniadis, Georges; Zampa, Virginie

    2007-01-01

    Learner corpora, electronic collections of spoken or written data from foreign language learners, offer unparalleled access to many hitherto uncovered aspects of learner language, particularly in their error-tagged format. This article aims to demonstrate the role that the learner corpus can play in CALL, particularly when used in conjunction with…

  16. A Rights-Based Approach to Science Literacy Using Local Languages: Contextualising Inquiry-Based Learning in Africa

    ERIC Educational Resources Information Center

    Babaci-Wilhite, Zehlia

    2017-01-01

    This article addresses the importance of teaching and learning science in local languages. The author argues that acknowledging local knowledge and using local languages in science education while emphasising inquiry-based learning improve teaching and learning science. She frames her arguments with the theory of inquiry, which draws on…

  17. Vocabulary, Grammar, Sex, and Aging

    ERIC Educational Resources Information Center

    Moscoso del Prado Martín, Fermín

    2017-01-01

    Understanding the changes in our language abilities along the lifespan is a crucial step for understanding the aging process both in normal and in abnormal circumstances. Besides controlled experimental tasks, it is equally crucial to investigate language in unconstrained conversation. I present an information-theoretical analysis of a corpus of…

  18. Event Recognition Based on Deep Learning in Chinese Texts

    PubMed Central

    Zhang, Yajun; Liu, Zongtian; Zhou, Wen

    2016-01-01

    Event recognition is the most fundamental and critical task in event-based natural language processing systems. Existing event recognition methods based on rules and shallow neural networks have certain limitations. For example, extracting features using methods based on rules is difficult; methods based on shallow neural networks converge too quickly to a local minimum, resulting in low recognition precision. To address these problems, we propose the Chinese emergency event recognition model based on deep learning (CEERM). Firstly, we use a word segmentation system to segment sentences. According to event elements labeled in the CEC 2.0 corpus, we classify words into five categories: trigger words, participants, objects, time and location. Each word is vectorized according to the following six feature layers: part of speech, dependency grammar, length, location, distance between trigger word and core word and trigger word frequency. We obtain deep semantic features of words by training a feature vector set using a deep belief network (DBN), then analyze those features in order to identify trigger words by means of a back propagation neural network. Extensive testing shows that the CEERM achieves excellent recognition performance, with a maximum F-measure value of 85.17%. Moreover, we propose the dynamic-supervised DBN, which adds supervised fine-tuning to a restricted Boltzmann machine layer by monitoring its training performance. Test analysis reveals that the new DBN improves recognition performance and effectively controls the training time. Although the F-measure increases to 88.11%, the training time increases by only 25.35%. PMID:27501231

  19. Event Recognition Based on Deep Learning in Chinese Texts.

    PubMed

    Zhang, Yajun; Liu, Zongtian; Zhou, Wen

    2016-01-01

    Event recognition is the most fundamental and critical task in event-based natural language processing systems. Existing event recognition methods based on rules and shallow neural networks have certain limitations. For example, extracting features using methods based on rules is difficult; methods based on shallow neural networks converge too quickly to a local minimum, resulting in low recognition precision. To address these problems, we propose the Chinese emergency event recognition model based on deep learning (CEERM). Firstly, we use a word segmentation system to segment sentences. According to event elements labeled in the CEC 2.0 corpus, we classify words into five categories: trigger words, participants, objects, time and location. Each word is vectorized according to the following six feature layers: part of speech, dependency grammar, length, location, distance between trigger word and core word and trigger word frequency. We obtain deep semantic features of words by training a feature vector set using a deep belief network (DBN), then analyze those features in order to identify trigger words by means of a back propagation neural network. Extensive testing shows that the CEERM achieves excellent recognition performance, with a maximum F-measure value of 85.17%. Moreover, we propose the dynamic-supervised DBN, which adds supervised fine-tuning to a restricted Boltzmann machine layer by monitoring its training performance. Test analysis reveals that the new DBN improves recognition performance and effectively controls the training time. Although the F-measure increases to 88.11%, the training time increases by only 25.35%.

  20. Revisiting syntactic development in deaf and hearing children from a dependency approach. Comment on "Dependency distance: a new perspective on syntactic patterns in natural languages" by Haitao Liu et al.

    NASA Astrophysics Data System (ADS)

    Yan, Jingqi

    2017-07-01

    Linguists are always endeavoring to discover universal rules to explain the language phenomena and interrelations [1]. Through a handful of corpus-based studies and a vast body of supporting evidence from psychological experiments, Liu, Xu and Liang [2] arrive at a conclusion on a general tendency toward dependency distance minimization (DDM) and relate this linguistic universal to the constraints of memory. Dependency distance (DD) is hereby introduced as a linguistic property, with quantitative features of frequency, and profound cognitive grounding as well. However, since the authors do not include language development, in this comment, I would like to discuss some future prospects from this perspective.

  1. Corpus annotation for mining biomedical events from literature

    PubMed Central

    Kim, Jin-Dong; Ohta, Tomoko; Tsujii, Jun'ichi

    2008-01-01

    Background Advanced Text Mining (TM) such as semantic enrichment of papers, event or relation extraction, and intelligent Question Answering have increasingly attracted attention in the bio-medical domain. For such attempts to succeed, text annotation from the biological point of view is indispensable. However, due to the complexity of the task, semantic annotation has never been tried on a large scale, apart from relatively simple term annotation. Results We have completed a new type of semantic annotation, event annotation, which is an addition to the existing annotations in the GENIA corpus. The corpus has already been annotated with POS (Parts of Speech), syntactic trees, terms, etc. The new annotation was made on half of the GENIA corpus, consisting of 1,000 Medline abstracts. It contains 9,372 sentences in which 36,114 events are identified. The major challenges during event annotation were (1) to design a scheme of annotation which meets specific requirements of text annotation, (2) to achieve biology-oriented annotation which reflects biologists' interpretation of text, and (3) to ensure the homogeneity of annotation quality across annotators. To meet these challenges, we introduced new concepts such as Single-facet Annotation and Semantic Typing, which have collectively contributed to successful completion of a large scale annotation. Conclusion The resulting event-annotated corpus is the largest and one of the best in quality among similar annotation efforts. We expect it to become a valuable resource for NLP (Natural Language Processing)-based TM in the bio-medical domain. PMID:18182099

  2. SUBTLEX-CH: Chinese Word and Character Frequencies Based on Film Subtitles

    PubMed Central

    Cai, Qing; Brysbaert, Marc

    2010-01-01

    Background Word frequency is the most important variable in language research. However, despite the growing interest in the Chinese language, there are only a few sources of word frequency measures available to researchers, and the quality is less than what researchers in other languages are used to. Methodology Following recent work by New, Brysbaert, and colleagues in English, French and Dutch, we assembled a database of word and character frequencies based on a corpus of film and television subtitles (46.8 million characters, 33.5 million words). In line with what has been found in the other languages, the new word and character frequencies explain significantly more of the variance in Chinese word naming and lexical decision performance than measures based on written texts. Conclusions Our results confirm that word frequencies based on subtitles are a good estimate of daily language exposure and capture much of the variance in word processing efficiency. In addition, our database is the first to include information about the contextual diversity of the words and to provide good frequency estimates for multi-character words and the different syntactic roles in which the words are used. The word frequencies are freely available for research purposes. PMID:20532192

  3. The Secret Is in the Sound

    PubMed Central

    Christiansen, Morten H.; Onnis, Luca; Hockema, Stephen A.

    2009-01-01

    When learning language young children are faced with many seemingly formidable challenges, including discovering words embedded in a continuous stream of sounds and determining what role these words play in syntactic constructions. We suggest that knowledge of phoneme distributions may play a crucial part in helping children segment words and determine their lexical category, and propose an integrated model of how children might go from unsegmented speech to lexical categories. We corroborated this theoretical model using a two-stage computational analysis of a large corpus of English child-directed speech. First, we used transition probabilities between phonemes to find words in unsegmented speech. Second, we used distributional information about word edges—the beginning and ending phonemes of words—to predict whether the segmented words from the first stage were nouns, verbs, or something else. The results indicate that discovering lexical units and their associated syntactic category in child-directed speech is possible by attending to the statistics of single phoneme transitions and word-initial and final phonemes. Thus, we suggest that a core computational principle in language acquisition is that the same source of information is used to learn about different aspects of linguistic structure. PMID:19371361
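
    The first stage described above (finding words from transition probabilities) can be illustrated with a minimal sketch. It is not the authors' implementation: letters stand in for phonemes, the toy utterances are invented, and placing a boundary wherever the probability drops below a fixed `threshold` is an assumption made here for simplicity. The function names are hypothetical.

```python
from collections import Counter


def transition_probs(utterances):
    """Estimate P(next symbol | current symbol) from unsegmented utterances."""
    pair_counts = Counter()
    first_counts = Counter()
    for u in utterances:
        for a, b in zip(u, u[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


def segment(utterance, probs, threshold):
    """Split an utterance wherever the transition probability is below threshold.

    The fixed threshold is an illustrative assumption, not the published method.
    """
    if not utterance:
        return []
    words, current = [], utterance[0]
    for a, b in zip(utterance, utterance[1:]):
        if probs.get((a, b), 0.0) < threshold:
            words.append(current)  # low predictability: close the current word
            current = b
        else:
            current += b
    words.append(current)
    return words


# Toy corpus: the invented "words" ba, du, ki concatenated in varying orders.
toy = ["badu", "duki", "kiba", "baki", "duba", "kidu"]
probs = transition_probs(toy)
print(segment("kibadu", probs, threshold=0.75))
```

    In this toy corpus, transitions inside a word are fully predictable while transitions across word edges are not, which is the contrast the segmentation step relies on.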

  4. Internal and external dynamics in language: evidence from verb regularity in a historical corpus of English.

    PubMed

    Cuskley, Christine F; Pugliese, Martina; Castellano, Claudio; Colaiori, Francesca; Loreto, Vittorio; Tria, Francesca

    2014-01-01

    Human languages are rule governed, but almost invariably these rules have exceptions in the form of irregularities. Since rules in language are efficient and productive, the persistence of irregularity is an anomaly. How does irregularity linger in the face of internal (endogenous) and external (exogenous) pressures to conform to a rule? Here we address this problem by taking a detailed look at simple past tense verbs in the Corpus of Historical American English. The data show that the language is open, with many new verbs entering. At the same time, existing verbs might tend to regularize or irregularize as a consequence of internal dynamics, but overall, the amount of irregularity sustained by the language stays roughly constant over time. Despite continuous vocabulary growth, and presumably, an attendant increase in expressive power, there is no corresponding growth in irregularity. We analyze the set of irregulars, showing they may adhere to a set of minority rules, allowing for increased stability of irregularity over time. These findings contribute to the debate on how language systems become rule governed, and how and why they sustain exceptions to rules, providing insight into the interplay between the emergence and maintenance of rules and exceptions in language.

  5. Internal and External Dynamics in Language: Evidence from Verb Regularity in a Historical Corpus of English

    PubMed Central

    Cuskley, Christine F.; Pugliese, Martina; Castellano, Claudio; Colaiori, Francesca; Loreto, Vittorio; Tria, Francesca

    2014-01-01

    Human languages are rule governed, but almost invariably these rules have exceptions in the form of irregularities. Since rules in language are efficient and productive, the persistence of irregularity is an anomaly. How does irregularity linger in the face of internal (endogenous) and external (exogenous) pressures to conform to a rule? Here we address this problem by taking a detailed look at simple past tense verbs in the Corpus of Historical American English. The data show that the language is open, with many new verbs entering. At the same time, existing verbs might tend to regularize or irregularize as a consequence of internal dynamics, but overall, the amount of irregularity sustained by the language stays roughly constant over time. Despite continuous vocabulary growth, and presumably, an attendant increase in expressive power, there is no corresponding growth in irregularity. We analyze the set of irregulars, showing they may adhere to a set of minority rules, allowing for increased stability of irregularity over time. These findings contribute to the debate on how language systems become rule governed, and how and why they sustain exceptions to rules, providing insight into the interplay between the emergence and maintenance of rules and exceptions in language. PMID:25084006

  6. "To Whom It May Concern": A Study on the Use of Lexical Bundles in Email Writing Tasks in an English Proficiency Test

    ERIC Educational Resources Information Center

    Li, Zhi; Volkov, Alex

    2017-01-01

    Lexical bundles are worthy of attention in both teaching and testing writing as they function as basic building blocks of discourse. This corpus-based study focuses on the rated writing responses to the email tasks in the Canadian English Language Proficiency Index Program® General test (CELPIP-General) and explores the extent to which lexical…

  7. The Association of Macro- and Microstructure of the Corpus Callosum and Language Lateralisation

    ERIC Educational Resources Information Center

    Westerhausen, Rene; Kreuder, Frank; Sequeira, Sarah Dos Santos; Walter, Christof; Woerner, Wolfgang; Wittling, Ralf Arne; Schweiger, Elisabeth; Wittling, Werner

    2006-01-01

    The present study aimed to examine how differences in functional lateralisation of language are related to interindividual variations in interhemispheric connectivity. Utilising an fMRI silent word-generation paradigm, 89 left- and right-handed subjects were subdivided into four lateralisation subgroups. Applying morphological and diffusion-tensor…

  8. Acquisition of Multiple Questions in English, Russian, and Malayalam

    ERIC Educational Resources Information Center

    Grebenyova, Lydia

    2011-01-01

    This article presents the results of four studies exploring the acquisition of the language-specific syntactic and semantic properties of multiple interrogatives in English, Russian, and Malayalam, languages that behave differently with respect to the syntax and semantics of multiple interrogatives. A corpus analysis investigated the frequency of…

  9. Language and Development in FG Syndrome with Callosal Agenesis.

    ERIC Educational Resources Information Center

    McCardle, Peggy; Wilson, Bruce

    1993-01-01

    The FG syndrome is characterized by unusual facies; sudden infant death; developmental delay; and abnormalities of the cardiac, gastrointestinal, and central nervous systems. Serial evaluations of one case with isolated agenesis of the corpus callosum found consistent patterns over time in specific language impairments in syntactic and…

  10. Adverb Code-Switching among Miami's Haitian Creole-English Second Generation

    ERIC Educational Resources Information Center

    Hebblethwaite, Benjamin

    2010-01-01

    The findings for adverbs and adverbial phrases in a naturalistic corpus of Miami Haitian Creole-English code-switching show that one language, Haitian Creole, asymmetrically supplies the grammatical frame while the other language, English, asymmetrically supplies mixed lexical categories like adverbs. Traces of code-switching with an English frame…

  11. Generating structure from experience: A retrieval-based model of language processing.

    PubMed

    Johns, Brendan T; Jones, Michael N

    2015-09-01

    Standard theories of language generally assume that some abstraction of linguistic input is necessary to create higher level representations of linguistic structures (e.g., a grammar). However, the importance of individual experiences with language has recently been emphasized by both usage-based theories (Tomasello, 2003) and grounded and situated theories (e.g., Zwaan & Madden, 2005). Following the usage-based approach, we present a formal exemplar model that stores instances of sentences across a natural language corpus, applying recent advances from models of semantic memory. In this model, an exemplar memory is used to generate expectations about the future structure of sentences, using a mechanism for prediction in language processing (Altmann & Mirković, 2009). The model successfully captures a broad range of behavioral effects: reduced relative clause processing (Reali & Christiansen, 2007), the role of contextual constraint (Rayner & Well, 1996), and event knowledge activation (Ferretti, Kutas, & McRae, 2007), among others. We further demonstrate how perceptual knowledge could be integrated into this exemplar-based framework, with the goal of grounding language processing in perception. Finally, we illustrate how an exemplar memory system could have been used in the cultural evolution of language. The model provides evidence that an impressive amount of language processing may be bottom-up in nature, built on the storage and retrieval of individual linguistic experiences. (c) 2015 APA, all rights reserved.

  12. Deep PDF parsing to extract features for detecting embedded malware.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Munson, Miles Arthur; Cross, Jesse S.

    2011-09-01

    The number of PDF files with embedded malicious code has risen significantly in the past few years. This is due to the portability of the file format, the ways Adobe Reader recovers from corrupt PDF files, the addition of many multimedia and scripting extensions to the file format, and many format properties the malware author may use to disguise the presence of malware. Current research focuses on executable, MS Office, and HTML formats. In this paper, several features and properties of PDF files are identified. Features are extracted using an instrumented open source PDF viewer. The feature descriptions of benign and malicious PDFs can be used to construct a machine learning model for detecting possible malware in future PDF files. The detection rate of PDF malware by current antivirus software is very low. A PDF file is easy to edit and manipulate because it is a text format, providing a low barrier to malware authors. Analyzing PDF files for malware is nonetheless difficult because of (a) the complexity of the formatting language, (b) the parsing idiosyncrasies in Adobe Reader, and (c) undocumented correction techniques employed in Adobe Reader. In May 2011, Esparza demonstrated that PDF malware could be hidden from 42 of 43 antivirus packages by combining multiple obfuscation techniques [4]. One reason current antivirus software fails is the ease of varying byte sequences in PDF malware, thereby rendering conventional signature-based virus detection useless. The compression and encryption functions produce sequences of bytes that are each functions of multiple input bytes. As a result, padding the malware payload with some whitespace before compression/encryption can change many of the bytes in the final payload. In this study we analyzed a corpus of 2591 benign and 87 malicious PDF files. While this corpus is admittedly small, it allowed us to test a system for collecting indicators of embedded PDF malware. We will call these indicators features throughout the rest of this report. The features are extracted using an instrumented PDF viewer, and are the inputs to a prediction model that scores the likelihood of a PDF file containing malware. The prediction model is constructed from a sample of labeled data by a machine learning algorithm (specifically, decision tree ensemble learning). Preliminary experiments show that the model is able to detect half of the PDF malware in the corpus with zero false alarms. We conclude the report with suggestions for extending this work to detect a greater variety of PDF malware.

  13. Semi Automatic Ontology Instantiation in the domain of Risk Management

    NASA Astrophysics Data System (ADS)

    Makki, Jawad; Alquier, Anne-Marie; Prince, Violaine

    One of the challenging tasks in the context of Ontological Engineering is to automatically or semi-automatically support the process of Ontology Learning and Ontology Population from semi-structured documents (texts). In this paper we describe a Semi-Automatic Ontology Instantiation method from natural language text, in the domain of Risk Management. This method is composed of three steps: 1) annotation with part-of-speech tags, 2) semantic relation instance extraction, and 3) the ontology instantiation process. It is based on combined NLP techniques, using human intervention between steps 2 and 3 for control and validation. Since it relies heavily on linguistic knowledge, it is not domain dependent, which is a good feature for portability between the different fields of risk management application. The proposed methodology uses the ontology of the PRIMA project (supported by the European community) as a Generic Domain Ontology and populates it via an available corpus. A first validation of the approach is done through an experiment with Chemical Fact Sheets from the Environmental Protection Agency.

  14. "Languages in the Workplace": Embedding Employability in the Foreign Language Undergraduate Curriculum

    ERIC Educational Resources Information Center

    Organ, Alison

    2017-01-01

    This case study examines student perceptions of the experiential value of a work placement carried out as part of a languages degree programme. The data for the case study consists of a corpus of 67 reports submitted from 2011 to 2015, reflecting on placements carried out in Europe, Japan, the UK and the US. The data offers a student view of the…

  15. Beliefs about Learning English as a Second Language among Native Groups in Rural Sabah, Malaysia

    ERIC Educational Resources Information Center

    Krishnasamy, Hariharan N.; Veloo, Arsaythamby; Lu, Ho Fui

    2013-01-01

    This paper identifies differences between the three ethnic groups, namely, Kadazans/Dusuns, Bajaus, and other minority ethnic groups on the beliefs about learning English as a second language based on the five variables, that is, language aptitude, language learning difficulty, language learning and communicating strategies, nature of language…

  16. Topic Modeling of NASA Space System Problem Reports: Research in Practice

    NASA Technical Reports Server (NTRS)

    Layman, Lucas; Nikora, Allen P.; Meek, Joshua; Menzies, Tim

    2016-01-01

    Problem reports at NASA are similar to bug reports: they capture defects found during test, post-launch operational anomalies, and document the investigation and corrective action of the issue. These artifacts are a rich source of lessons learned for NASA, but are expensive to analyze since problem reports are comprised primarily of natural language text. We apply topic modeling to a corpus of NASA problem reports to extract trends in testing and operational failures. We collected 16,669 problem reports from six NASA space flight missions and applied Latent Dirichlet Allocation topic modeling to the document corpus. We analyze the most popular topics within and across missions, and how popular topics changed over the lifetime of a mission. We find that hardware material and flight software issues are common during the integration and testing phase, while ground station software and equipment issues are more common during the operations phase. We identify a number of challenges in topic modeling for trend analysis: 1) that the process of selecting the topic modeling parameters lacks definitive guidance, 2) defining semantically-meaningful topic labels requires nontrivial effort and domain expertise, 3) topic models derived from the combined corpus of the six missions were biased toward the larger missions, and 4) topics must be semantically distinct as well as cohesive to be useful. Nonetheless, topic modeling can identify problem themes within missions and across mission lifetimes, providing useful feedback to engineers and project managers.

  17. Presentation-Practice-Production and Task-Based Learning in the Light of Second Language Learning Theories.

    ERIC Educational Resources Information Center

    Ritchie, Graeme

    2003-01-01

    Features of presentation-practice-production (PPP) and task-based learning (TBL) models for language teaching are discussed with reference to language learning theories. Pre-selection of target structures, use of controlled repetition, and explicit grammar instruction in a PPP lesson are given. Suggests TBL approaches afford greater learning…

  18. Collaborative Tasks in Wiki-Based Environment in EFL Learning

    ERIC Educational Resources Information Center

    Zou, Bin; Wang, Dongshuo; Xing, Minjie

    2016-01-01

    Wikis provide users with opportunities to post and edit messages to collaborate in the language learning process. Many studies have offered findings to show positive impact of Wiki-based language learning for learners. This paper explores the effect of collaborative task in error correction for English as a Foreign Language learning in an online…

  19. An Infinite Game in a Finite Setting: Visualizing Foreign Language Teaching and Learning in America.

    ERIC Educational Resources Information Center

    Mantero, Miguel

    According to contemporary thought and foundational research, this paper presents various elements of the foreign language teaching profession and language learning environment in the United States as either product-driven or process-based. It is argued that a process-based approach to language teaching and learning benefits not only second…

  20. Domain Adaption of Parsing for Operative Notes

    PubMed Central

    Wang, Yan; Pakhomov, Serguei; Ryan, James O.; Melton, Genevieve B.

    2016-01-01

    Background Full syntactic parsing of clinical text as a part of clinical natural language processing (NLP) is critical for a wide range of applications, such as identification of adverse drug reactions, patient cohort identification, and gene interaction extraction. Several robust syntactic parsers are publicly available to produce linguistic representations for sentences. However, these existing parsers are mostly trained on general English text and often require adaptation for optimal performance on clinical text. Our objective was to adapt an existing general English parser for the clinical text of operative reports via lexicon augmentation, statistics adjusting, and grammar rules modification based on a set of biomedical text. Method The Stanford unlexicalized probabilistic context-free grammar (PCFG) parser lexicon was expanded with SPECIALIST lexicon along with statistics collected from a limited set of operative notes tagged with two POS taggers (GENIA tagger and MedPost). The most frequently occurring verb entries of the SPECIALIST lexicon were adjusted based on manual review of verb usage in operative notes. Stanford parser grammar production rules were also modified based on linguistic features of operative reports. An analogous approach was then applied to the GENIA corpus to test the generalizability of this approach to biomedical text. Results The new unlexicalized PCFG parser extended with the extra lexicon from SPECIALIST along with accurate statistics collected from an operative note corpus tagged with GENIA POS tagger improved the parser performance by 2.26% from 87.64% to 89.90%. There was a progressive improvement with the addition of multiple approaches. Most of the improvement occurred with lexicon augmentation combined with statistics from the operative notes corpus. Application of this approach on the GENIA corpus showed that parsing performance was boosted by 3.81% with a simple new grammar and the addition of the GENIA corpus lexicon. Conclusion Using statistics collected from clinical text tagged with POS taggers along with proper modification of grammars and lexicons of an unlexicalized PCFG parser can improve parsing performance. PMID:25661593

  1. Training Language Teachers to Sustain Self-Directed Language Learning: An Exploration of Advisers' Experiences on a Web-Based Open Virtual Learning Environment

    ERIC Educational Resources Information Center

    Bailly, Sophie; Ciekanski, Maud; Guély-Costa, Eglantine

    2013-01-01

    This article describes the rationale for pedagogical, technological and organizational choices in the design of a web-based and open virtual learning environment (VLE) promoting and sustaining self-directed language learning. Based on the last forty years of research on learner autonomy at the CRAPEL according to Holec's definition (1988), we…

  2. Does the road go up the mountain? Fictive motion between linguistic conventions and cognitive motivations.

    PubMed

    Stosic, Dejan; Fagard, Benjamin; Sarda, Laure; Colin, Camille

    2015-09-01

    Fictive motion (FM) characterizes the use of dynamic expressions to describe static scenes. This phenomenon is crucial in terms of cognitive motivations for language use; several explanations have been proposed to account for it, among which mental simulation (Talmy in Toward a cognitive semantics, vol 1. MIT Press, Cambridge, 2000) and visual scanning (Matlock in Studies in linguistic motivation. Mouton de Gruyter, Berlin and New York, pp 221-248, 2004a). The aims of this paper were to test these competing explanations and identify language-specific constraints. To do this, we compared the linguistic strategies for expressing several types of static configurations in four languages, French, Italian, German and Serbian, with an experimental set-up (59 participants). The experiment yielded significant differences for motion-affordance versus no motion-affordance, for all four languages. Significant differences between languages included mean frequency of FM expressions. In order to refine the picture, and more specifically to disentangle the respective roles of language-specific conventions and language-independent (i.e. possibly cognitive) motivations, we completed our study with a corpus approach (besides the four initial languages, we added English and Polish). The corpus study showed low frequency of FM across languages, but a higher frequency and translation ratio for some FM types--among which those best accounted for by enactive perception. The importance of enactive perception could thus explain both the universality of FM and the fact that language-specific conventions appear mainly in very specific contexts--the ones furthest from enaction.

  3. Chi-square-based scoring function for categorization of MEDLINE citations.

    PubMed

    Kastrin, A; Peterlin, B; Hristovski, D

    2010-01-01

    Text categorization has been used in biomedical informatics for identifying documents containing relevant topics of interest. We developed a simple method that uses a chi-square-based scoring function to determine the likelihood of MEDLINE citations containing genetically relevant topics. Our procedure requires construction of a genetic and a nongenetic domain document corpus. We used MeSH descriptors assigned to MEDLINE citations for this categorization task. We compared frequencies of MeSH descriptors between the two corpora by applying a chi-square test. A MeSH descriptor was considered to be a positive indicator if its relative observed frequency in the genetic domain corpus was greater than its relative observed frequency in the nongenetic domain corpus. The output of the proposed method is a list of scores for all the citations, with the highest score given to those citations containing MeSH descriptors typical for the genetic domain. Validation was done on a set of 734 manually annotated MEDLINE citations. It achieved predictive accuracy of 0.87 with 0.69 recall and 0.64 precision. We evaluated the method by comparing it to three machine-learning algorithms (support vector machines, decision trees, naïve Bayes). Although the differences were not statistically significant, the results showed that our chi-square scoring performs as well as the compared machine-learning algorithms. We suggest that the chi-square scoring is an effective solution to help categorize MEDLINE citations. The algorithm is implemented in the BITOLA literature-based discovery support system as a preprocessor for gene symbol disambiguation process.
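
    A rough sketch of how such a chi-square-based score could be computed is given below. The abstract does not state the exact scoring formula, so summing signed chi-square statistics over a citation's descriptors is an assumption made here, and the function names and toy counts are hypothetical rather than taken from the BITOLA system.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def descriptor_weights(genetic_counts, nongenetic_counts, n_genetic, n_nongenetic):
    """Signed chi-square per descriptor.

    The sign is positive when the descriptor's relative frequency is higher in
    the genetic corpus (the positive-indicator rule from the abstract).
    """
    weights = {}
    for term in set(genetic_counts) | set(nongenetic_counts):
        g = genetic_counts.get(term, 0)
        ng = nongenetic_counts.get(term, 0)
        chi2 = chi_square_2x2(g, n_genetic - g, ng, n_nongenetic - ng)
        sign = 1.0 if g / n_genetic > ng / n_nongenetic else -1.0
        weights[term] = sign * chi2
    return weights


def score_citation(descriptors, weights):
    """Assumed scoring rule: sum the signed weights of the citation's descriptors."""
    return sum(weights.get(term, 0.0) for term in descriptors)


# Invented toy counts: number of citations in each corpus carrying a descriptor.
weights = descriptor_weights({"Mutation": 10}, {"Diet": 10}, n_genetic=10, n_nongenetic=10)
print(score_citation(["Mutation", "Humans"], weights))
```

    Descriptors unseen in either corpus contribute nothing to the score in this sketch, which is one of several reasonable choices.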

  4. Emerging Approach of Natural Language Processing in Opinion Mining: A Review

    NASA Astrophysics Data System (ADS)

    Kim, Tai-Hoon

    Natural language processing (NLP) is a subfield of artificial intelligence and computational linguistics. It studies the problems of automated generation and understanding of natural human languages. This paper outlines a framework for using computer and natural language techniques to help learners at various levels learn foreign languages in a computer-based learning environment. We propose some ideas for using the computer as a practical tool for learning a foreign language, where most of the courseware is generated automatically. We then describe how to build computer-based learning tools, discuss their effectiveness, and conclude with some possibilities for using on-line resources.

  5. DNorm: disease name normalization with pairwise learning to rank

    PubMed Central

    Leaman, Robert; Islamaj Doğan, Rezarta; Lu, Zhiyong

    2013-01-01

    Motivation: Despite the central role of diseases in biomedical research, there have been much fewer attempts to automatically determine which diseases are mentioned in a text—the task of disease name normalization (DNorm)—compared with other normalization tasks in biomedical text mining research. Methods: In this article we introduce the first machine learning approach for DNorm, using the NCBI disease corpus and the MEDIC vocabulary, which combines MeSH® and OMIM. Our method is a high-performing and mathematically principled framework for learning similarities between mentions and concept names directly from training data. The technique is based on pairwise learning to rank, which has not previously been applied to the normalization task but has proven successful in large optimization problems for information retrieval. Results: We compare our method with several techniques based on lexical normalization and matching, MetaMap and Lucene. Our algorithm achieves 0.782 micro-averaged F-measure and 0.809 macro-averaged F-measure, an increase over the highest performing baseline method of 0.121 and 0.098, respectively. Availability: The source code for DNorm is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/DNorm, along with a web-based demonstration and links to the NCBI disease corpus. Results on PubMed abstracts are available in PubTator: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator Contact: zhiyong.lu@nih.gov PMID:23969135
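
    The difference between the micro-averaged and macro-averaged F-measure reported above can be shown with a short worked sketch. It assumes the usual definitions (micro pools the counts before computing F, macro averages the per-group F values); the grouping unit and the counts below are invented for illustration and are not DNorm results.

```python
def f_measure(tp, fp, fn):
    """F1 from raw counts; equals 2PR / (P + R) when both are defined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f(groups):
    """Pool true positives, false positives and false negatives, then compute F once."""
    tp = sum(g[0] for g in groups)
    fp = sum(g[1] for g in groups)
    fn = sum(g[2] for g in groups)
    return f_measure(tp, fp, fn)


def macro_f(groups):
    """Compute F per group, then take the unweighted mean."""
    return sum(f_measure(*g) for g in groups) / len(groups)


# Invented (tp, fp, fn) counts for two groups of very different size.
groups = [(8, 2, 2), (1, 1, 3)]
print(round(micro_f(groups), 3), round(macro_f(groups), 3))
```

    Because micro-averaging weights every decision equally and macro-averaging weights every group equally, the two figures diverge whenever groups differ in size or difficulty.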

  6. Teacher-Student Interaction and Learning.

    ERIC Educational Resources Information Center

    Hall, Joan Kelly; Walsh, Meghan

    2002-01-01

    Reviews literature on recent developments in teacher-student interaction and language learning. Based on a sociocultural perspective of language and learning, draws from three types of classrooms: first language, second language, and foreign language. Attention is given to studies that investigate the specific means used in teacher-student…

  7. Integrating Culture into Language Teaching and Learning: Learner Outcomes

    ERIC Educational Resources Information Center

    Nguyen, Trang Thi Thuy

    2017-01-01

    This paper discusses the issue of learner outcomes in learning culture as part of their language learning. First, some brief discussion on the role of culture in language teaching and learning, as well as on culture contents in language lessons is presented. Based on a detailed review of previous literature related to culture in language teaching…

  8. Generation of co-speech gestures based on spatial imagery from the right-hemisphere: evidence from split-brain patients.

    PubMed

    Kita, Sotaro; Lausberg, Hedda

    2008-02-01

    It has been claimed that the linguistically dominant (left) hemisphere is obligatorily involved in production of spontaneous speech-accompanying gestures (Kimura, 1973a, 1973b; Lavergne and Kimura, 1987). We examined this claim for the gestures that are based on spatial imagery: iconic gestures with observer viewpoint (McNeill, 1992) and abstract deictic gestures (McNeill, et al. 1993). We observed gesture production in three patients with complete section of the corpus callosum in commissurotomy or callosotomy (two with left-hemisphere language, and one with bilaterally represented language) and nine healthy control participants. All three patients produced spatial-imagery gestures with the left-hand as well as with the right-hand. However, unlike healthy controls and the split-brain patient with bilaterally represented language, the two patients with left-hemispheric language dominance coordinated speech and spatial-imagery gestures more poorly in the left-hand than in the right-hand. It is concluded that the linguistically non-dominant (right) hemisphere alone can generate co-speech gestures based on spatial imagery, just as the left-hemisphere can.

  9. A method for named entity normalization in biomedical articles: application to diseases and plants.

    PubMed

    Cho, Hyejin; Choi, Wonjun; Lee, Hyunju

    2017-10-13

    In biomedical articles, a named entity recognition (NER) technique that identifies entity names from texts is an important element for extracting biological knowledge from articles. After NER is applied to articles, the next step is to normalize the identified names into standard concepts (i.e., disease names are mapped to the National Library of Medicine's Medical Subject Headings disease terms). In biomedical articles, many entity normalization methods rely on domain-specific dictionaries for resolving synonyms and abbreviations. However, the dictionaries are not comprehensive except for some entities such as genes. In recent years, biomedical articles have accumulated rapidly, and neural network-based algorithms that incorporate a large amount of unlabeled data have shown considerable success in several natural language processing problems. In this study, we propose an approach for normalizing biological entities, such as disease names and plant names, by using word embeddings to represent semantic spaces. For diseases, training data from the National Center for Biotechnology Information (NCBI) disease corpus and unlabeled data from PubMed abstracts were used to construct word representations. For plants, a training corpus that we manually constructed and unlabeled PubMed abstracts were used to represent word vectors. We showed that the proposed approach performed better than the use of only the training corpus or only the unlabeled data and showed that the normalization accuracy was improved by using our model even when the dictionaries were not comprehensive. We obtained F-scores of 0.808 and 0.690 for normalizing the NCBI disease corpus and manually constructed plant corpus, respectively. We further evaluated our approach using a data set in the disease normalization task of the BioCreative V challenge. When only the disease corpus was used as a dictionary, our approach significantly outperformed the best system of the task. 
    The proposed approach shows robust performance for normalizing biological entities. The manually constructed plant corpus and the proposed model are available at http://gcancer.org/plant and http://gcancer.org/normalization, respectively.
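    The record above describes normalizing entity mentions by representing them in a word-embedding space. As a rough sketch of that general idea only, the code below averages word vectors for a mention and picks the concept name with the highest cosine similarity. The two-dimensional vectors, the concept identifiers, and all function names are invented for illustration; the authors' actual model and training procedure are not reproduced here.

```python
import math
from typing import Dict, List, Optional

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def embed(text: str, emb: Dict[str, Vector]) -> Optional[Vector]:
    """Average the vectors of the known tokens in `text`; None if none are known."""
    vecs = [emb[tok] for tok in text.lower().split() if tok in emb]
    if not vecs:
        return None
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def normalize(mention: str, concepts: Dict[str, str],
              emb: Dict[str, Vector]) -> Optional[str]:
    """Return the ID of the concept whose name is closest to the mention."""
    m = embed(mention, emb)
    if m is None:
        return None
    best_id, best_score = None, -1.0
    for concept_id, name in concepts.items():
        c = embed(name, emb)
        if c is None:
            continue
        score = cosine(m, c)
        if score > best_score:
            best_id, best_score = concept_id, score
    return best_id


# Hypothetical toy embeddings and concept dictionary (not real data).
toy_emb: Dict[str, Vector] = {
    "tumor": [0.90, 0.10],
    "neoplasm": [0.85, 0.15],
    "fever": [0.10, 0.90],
    "pyrexia": [0.12, 0.88],
}
toy_concepts = {"CONCEPT_A": "neoplasm", "CONCEPT_B": "fever"}

print(normalize("tumor", toy_concepts, toy_emb))
print(normalize("pyrexia", toy_concepts, toy_emb))
```

    A nearest-neighbour lookup like this can match a synonym that is absent from the dictionary, which is the situation the abstract highlights when dictionaries are not comprehensive.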

  10. On the Conventionalization of Mouth Actions in Australian Sign Language.

    PubMed

    Johnston, Trevor; van Roekel, Jane; Schembri, Adam

    2016-03-01

    This study investigates the conventionalization of mouth actions in Australian Sign Language. Signed languages were once thought of as simply manual languages because the hands produce the signs which individually and in groups are the symbolic units most easily equated with the words, phrases and clauses of spoken languages. However, it has long been acknowledged that non-manual activity, such as movements of the body, head and the face, plays a very important role. In this context, mouth actions that occur while communicating in signed languages have posed a number of questions for linguists: are the silent mouthings of spoken language words simply borrowings from the respective majority community spoken language(s)? Are those mouth actions that are not silent mouthings of spoken words conventionalized linguistic units proper to each signed language, culturally linked semi-conventional gestural units shared by signers with members of the majority speaking community, or even gestures and expressions common to all humans? We use a corpus-based approach to gather evidence of the extent of the use of mouth actions in naturalistic Australian Sign Language--making comparisons with other signed languages where data is available--and the form/meaning pairings that these mouth actions instantiate.

  11. Learner Corpora: The Missing Link in EAP Pedagogy

    ERIC Educational Resources Information Center

    Gilquin, Gaetanelle; Granger, Sylviane; Paquot, Magali

    2007-01-01

    This article deals with the place of learner corpora, i.e. corpora containing authentic language data produced by learners of a foreign/second language, in English for academic purposes (EAP) pedagogy and sets out to demonstrate that they have a valuable contribution to make to the field. Following an initial brief introduction to corpus-based…

  12. Research in Foreign Language Education in Portugal (2006-2011): Its Transformative Potential

    ERIC Educational Resources Information Center

    Vieira, Flávia; Moreira, Maria Alfredo; Peralta, Helena

    2014-01-01

    This article reviews a selective corpus of empirical and theoretical texts on foreign language pedagogy and teacher education, produced in Portugal between 2006 and 2011. A descriptive and interpretative approach is adopted to inquire into the transformative potential of research, with a focus on its scope, purposes, conceptual and methodological…

  13. Is Instant Messaging the Same in Every Language? A Basque Perspective

    ERIC Educational Resources Information Center

    Cenoz, Jasone; Bereziartua, Garbiñe

    2016-01-01

    This study focuses on computer mediated communication (CMC) in instant messaging using the Basque language in a context where exposure to English is very limited outside the classroom. This context provides an opportunity to analyze the universality of linguistic features identified in CMC in English. The corpus consists of 54 naturalistic dyadic…

  14. Laughing and Smiling to Manage Trouble in French-Language Classroom Interaction

    ERIC Educational Resources Information Center

    Petitjean, Cécile; González-Martínez, Esther

    2015-01-01

    This article deals with communicative functions of laughter and smiling in the classroom studied using a conversation analytical approach. Analysing a corpus of video-recorded French first-language lessons, we show how students sequentially organise laughter and smiling, and use them to preempt, solve or assess a problematic action. We also focus…

  15. Chinese Learners' Acquisition of English Verbs: A Corpus-Driven Approach

    ERIC Educational Resources Information Center

    Wang, Linxiao; Jo, Hie-myung

    2012-01-01

    Limited research has investigated advanced language learners' acquisition of English verbs. The current study examines and compares the acquisition pattern of English verbs among Chinese second language (L2) learners at both intermediate and advanced levels to answer the following questions: (1) Do L2 learners acquire regular verbs and irregular…

  16. Gender Representation in Japanese EFL Textbooks--A Corpus Study

    ERIC Educational Resources Information Center

    Lee, Jackie F. K.

    2018-01-01

    This paper seeks to investigate whether the Japanese government's attempt to promote a 'gender-equal' society in recent decades and the improved status of women are reflected in patterns of gender representation in Japanese English as a foreign language textbooks. The study made an analysis of four popular series of English language textbooks…

  17. Measuring Second Language Acquisition. Studies in Language Education, No. 6.

    ERIC Educational Resources Information Center

    Cooper, Thomas C.

    This research project was designed to analyze by quantitative methods a corpus of writing produced by four groups of American college students enrolled in German courses and by one group of professional German writers. Analysis was undertaken in order to determine whether or not significant quantitative differences in the use of selected syntactic…

  18. Eye Movement Patterns in Natural Reading: A Comparison of Monolingual and Bilingual Reading of a Novel

    PubMed Central

    Cop, Uschi; Drieghe, Denis; Duyck, Wouter

    2015-01-01

    Introduction and Method: This paper presents a corpus of sentence level eye movement parameters for unbalanced bilingual first language (L1) and second-language (L2) reading and monolingual reading of a complete novel (56,000 words). We present important sentence-level basic eye movement parameters of both bilingual and monolingual natural reading extracted from this large data corpus. Results and Conclusion: Bilingual L2 reading patterns show longer sentence reading times (20%), more fixations (21%), shorter saccades (12%) and less word skipping (4.6%), than L1 reading patterns. Regression rates are the same for L1 and L2 reading. These results could indicate, analogous to a previous simulation with the E-Z reader model in the literature, that it is primarily the speeding up of lexical access that drives both L1 and L2 reading development. Bilingual L1 reading does not differ in any major way from monolingual reading. This contrasts with predictions made by the weaker links account, which predicts a bilingual disadvantage in language processing caused by divided exposure between languages. PMID:26287379

  19. The Text Retrieval Conferences (TRECs)

    DTIC Science & Technology

    1998-10-01

    perform a monolingual run in the target language to act as a baseline. Thirteen groups participated in the TREC-6 CLIR track. Three major...language; the use of machine-readable bilingual dictionaries or other existing linguistic resources; and the use of corpus resources to train or...formance for each method. In general, the best cross-language performance was between 50%-75% as effective as a quality monolingual run. The TREC-7

  20. Extracting rate changes in transcriptional regulation from MEDLINE abstracts.

    PubMed

    Liu, Wenting; Miao, Kui; Li, Guangxia; Chang, Kuiyu; Zheng, Jie; Rajapakse, Jagath C

    2014-01-01

    Time delays are important factors that are often neglected in gene regulatory network (GRN) inference models. Validating time delays from knowledge bases is a challenge since the vast majority of biological databases do not record temporal information of gene regulations. Biological knowledge and facts on gene regulations are typically extracted from bio-literature with specialized methods that depend on the regulation task. In this paper, we mine evidences for time delays related to the transcriptional regulation of yeast from the PubMed abstracts. Since the vast majority of abstracts lack quantitative time information, we can only collect qualitative evidences of time delays. Specifically, the speed-up or delay in transcriptional regulation rate can provide evidences for time delays (shorter or longer) in GRN. Thus, we focus on deriving events related to rate changes in transcriptional regulation. A corpus of yeast regulation related abstracts was manually labeled with such events. In order to capture these events automatically, we create an ontology of sub-processes that are likely to result in transcription rate changes by combining textual patterns and biological knowledge. We also propose effective feature extraction methods based on the created ontology to identify the direct evidences with specific details of these events. Our ontologies outperform existing state-of-the-art gene regulation ontologies in the automatic rule learning method applied to our corpus. The proposed deterministic ontology rule-based method can achieve comparable performance to the automatic rule learning method based on decision trees. This demonstrates the effectiveness of our ontology in identifying rate-changing events. We also tested the effectiveness of the proposed feature mining methods on detecting direct evidence of events. Experimental results show that the machine learning method on these features achieves an F1-score of 71.43%. 
    The manually labeled corpus of events relating to rate changes in transcriptional regulation for yeast is available at https://sites.google.com/site/wentingntu/data. The created ontologies summarized both biological causes of rate changes in transcriptional regulation and corresponding positive and negative textual patterns from the corpus. They are demonstrated to be effective in identifying rate-changing events, which shows the benefits of combining textual patterns and biological knowledge on extracting complex biological events.

  1. Attitudes toward Task-Based Language Learning: A Study of College Korean Language Learners

    ERIC Educational Resources Information Center

    Pyun, Danielle Ooyoung

    2013-01-01

    This study explores second/foreign language (L2) learners' attitudes toward task-based language learning (TBLL) and how these attitudes relate to selected learner variables, namely anxiety, integrated motivation, instrumental motivation, and self-efficacy. Ninety-one college students of Korean as a foreign language, who received task-based…

  2. Tableau's Influence on the Oral Language Skills of Students with Language-Based Learning Disabilities

    ERIC Educational Resources Information Center

    Anderson, Alida; Berry, Katherine A.

    2017-01-01

    This study examined the influence of tableau on the expressive language skills of three students with language-based learning disabilities in inclusive urban fourth-grade English language arts (ELA) classroom settings. Data were collected on linguistic productivity, specificity, and narrative cohesion through analysis of students' responses to…

  3. Language, reading, and math learning profiles in an epidemiological sample of school age children.

    PubMed

    Archibald, Lisa M D; Oram Cardy, Janis; Joanisse, Marc F; Ansari, Daniel

    2013-01-01

    Dyscalculia, dyslexia, and specific language impairment (SLI) are relatively specific developmental learning disabilities in math, reading, and oral language, respectively, that occur in the context of average intellectual capacity and adequate environmental opportunities. Past research has been dominated by studies focused on single impairments despite the widespread recognition that overlapping and comorbid deficits are common. The present study took an epidemiological approach to study the learning profiles of a large school age sample in language, reading, and math. Both general learning profiles reflecting good or poor performance across measures and specific learning profiles involving either weak language, weak reading, weak math, or weak math and reading were observed. These latter four profiles characterized 70% of children with some evidence of a learning disability. Low scores in phonological short-term memory characterized clusters with a language-based weakness whereas low or variable phonological awareness was associated with the reading (but not language-based) weaknesses. The low math only group did not show these phonological deficits. These findings may suggest different etiologies for language-based deficits in language, reading, and math, reading-related impairments in reading and math, and isolated math disabilities.

  4. Language, Reading, and Math Learning Profiles in an Epidemiological Sample of School Age Children

    PubMed Central

    Archibald, Lisa M. D.; Oram Cardy, Janis; Joanisse, Marc F.; Ansari, Daniel

    2013-01-01

    Dyscalculia, dyslexia, and specific language impairment (SLI) are relatively specific developmental learning disabilities in math, reading, and oral language, respectively, that occur in the context of average intellectual capacity and adequate environmental opportunities. Past research has been dominated by studies focused on single impairments despite the widespread recognition that overlapping and comorbid deficits are common. The present study took an epidemiological approach to study the learning profiles of a large school age sample in language, reading, and math. Both general learning profiles reflecting good or poor performance across measures and specific learning profiles involving either weak language, weak reading, weak math, or weak math and reading were observed. These latter four profiles characterized 70% of children with some evidence of a learning disability. Low scores in phonological short-term memory characterized clusters with a language-based weakness whereas low or variable phonological awareness was associated with the reading (but not language-based) weaknesses. The low math only group did not show these phonological deficits. These findings may suggest different etiologies for language-based deficits in language, reading, and math, reading-related impairments in reading and math, and isolated math disabilities. PMID:24155959

  5. A Program That Acquires Language Using Positive and Negative Feedback.

    ERIC Educational Resources Information Center

    Brand, James

    1987-01-01

    Describes the language learning program "Acquire," which is a sample of grammar induction. It is a learning algorithm based on a pattern-matching scheme, using both a positive and negative network to reduce overgeneration. Language learning programs may be useful as tutorials for learning the syntax of a foreign language. (Author/LMO)

  6. Semi-automated ontology generation and evolution

    NASA Astrophysics Data System (ADS)

    Stirtzinger, Anthony P.; Anken, Craig S.

    2009-05-01

    Extending the notion of data models or object models, ontology can provide rich semantic definition not only to the meta-data but also to the instance data of domain knowledge, making these semantic definitions available in machine readable form. However, the generation of an effective ontology is a difficult task involving considerable labor and skill. This paper discusses an Ontology Generation and Evolution Processor (OGEP) aimed at automating this process, only requesting user input when un-resolvable ambiguous situations occur. OGEP directly attacks the main barrier which prevents automated (or self learning) ontology generation: the ability to understand the meaning of artifacts and the relationships the artifacts have to the domain space. OGEP leverages existing lexical to ontological mappings in the form of WordNet, and Suggested Upper Merged Ontology (SUMO) integrated with a semantic pattern-based structure referred to as the Semantic Grounding Mechanism (SGM) and implemented as a Corpus Reasoner. The OGEP processing is initiated by a Corpus Parser performing a lexical analysis of the corpus, reading in a document (or corpus) and preparing it for processing by annotating words and phrases. After the Corpus Parser is done, the Corpus Reasoner uses the parts of speech output to determine the semantic meaning of a word or phrase. The Corpus Reasoner is the crux of the OGEP system, analyzing, extrapolating, and evolving data from free text into cohesive semantic relationships. The Semantic Grounding Mechanism provides a basis for identifying and mapping semantic relationships. By blending together the WordNet lexicon and SUMO ontological layout, the SGM is given breadth and depth in its ability to extrapolate semantic relationships between domain entities. The combination of all these components results in an innovative approach to user assisted semantic-based ontology generation. 
    This paper will describe the OGEP technology in the context of the architectural components referenced above and identify a potential technology transition path to Scott AFB's Tanker Airlift Control Center (TACC) which serves as the Air Operations Center (AOC) for the Air Mobility Command (AMC).

  7. Challenges in the association of human single nucleotide polymorphism mentions with unique database identifiers

    PubMed Central

    2011-01-01

    Background: Most information on genomic variations and their associations with phenotypes is covered exclusively in scientific publications rather than in structured databases. These texts commonly describe variations using natural language; database identifiers are seldom mentioned. This complicates the retrieval of variations and associated articles, as well as information extraction, e.g. the search for biological implications. To overcome these challenges, procedures to map textual mentions of variations to database identifiers need to be developed. Results: This article describes a workflow for normalization of variation mentions, i.e. their association with unique database identifiers. Common pitfalls in the interpretation of single nucleotide polymorphism (SNP) mentions are highlighted and discussed. The developed normalization procedure achieves a precision of 98.1% and a recall of 67.5% for unambiguous association of variation mentions with dbSNP identifiers on a text corpus based on 296 MEDLINE abstracts containing 527 mentions of SNPs. The annotated corpus is freely available at http://www.scai.fraunhofer.de/snp-normalization-corpus.html. Conclusions: Comparable approaches usually focus on variations mentioned on the protein sequence and neglect problems for other SNP mentions. The results presented here indicate that normalizing SNPs described on DNA level is more difficult than the normalization of SNPs described on protein level. The challenges associated with normalization are exemplified with ambiguities and errors, which occur in this corpus. PMID:21992066
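    The abstract above gives precision and recall separately. As a small worked sketch, the two can be combined into the balanced F-measure (their harmonic mean). The resulting value is derived here from the two reported figures for illustration; it is not a number stated in the abstract, and the function name is an assumption.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract (98.1% and 67.5%).
f1 = f1_score(0.981, 0.675)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.800"
```

    The harmonic mean sits closer to the lower of the two inputs, so the high precision only partly compensates for the lower recall.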

  8. Understanding the Development of a Hybrid Practice of Inquiry-Based Science Instruction and Language Development: A Case Study of One Teacher's Journey Through Reflections on Classroom Practice

    NASA Astrophysics Data System (ADS)

    Capitelli, Sarah; Hooper, Paula; Rankin, Lynn; Austin, Marilyn; Caven, Gennifer

    2016-04-01

    This qualitative case study looks closely at an elementary teacher who participated in professional development experiences that helped her develop a hybrid practice of using inquiry-based science to teach both science content and English language development (ELD) to her students, many of whom are English language learners (ELLs). This case study examines the teacher's reflections on her teaching and her students' learning as she engaged her students in science learning and supported their developing language skills. It explicates the professional learning experiences that supported the development of this hybrid practice. Closely examining the pedagogical practice and reflections of a teacher who is developing an inquiry-based approach to both science learning and language development can provide insights into how teachers come to integrate their professional development experiences with their classroom expertise in order to create a hybrid inquiry-based science ELD practice. This qualitative case study contributes to the emerging scholarship on the development of teacher practice of inquiry-based science instruction as a vehicle for both science instruction and ELD for ELLs. This study demonstrates how an effective teaching practice that supports both the science and language learning of students can develop from ongoing professional learning experiences that are grounded in current perspectives about language development and that immerse teachers in an inquiry-based approach to learning and instruction. Additionally, this case study also underscores the important role that professional learning opportunities can play in supporting teachers in developing a deeper understanding of the affordances that inquiry-based science can provide for language development.

  9. Technology and Second Language Learning

    ERIC Educational Resources Information Center

    Lin, Li Li

    2009-01-01

    Current technology provides new opportunities to increase the effectiveness of language learning and teaching. Incorporating well-organized and effective technology into second language learning and teaching for improving students' language proficiency has been refined by researchers and educators for many decades. Based on the rapidly changing…

  10. Clinical Named Entity Recognition Using Deep Learning Models.

    PubMed

    Wu, Yonghui; Jiang, Min; Xu, Jun; Zhi, Degui; Xu, Hua

    2017-01-01

    Clinical Named Entity Recognition (NER) is a critical natural language processing (NLP) task to extract important concepts (named entities) from clinical narratives. Researchers have extensively investigated machine learning models for clinical NER. Recently, there have been increasing efforts to apply deep learning models to improve the performance of current clinical NER systems. This study examined two popular deep learning architectures, the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN), to extract concepts from clinical texts. We compared the two deep neural network architectures with three baseline Conditional Random Fields (CRFs) models and two state-of-the-art clinical NER systems using the i2b2 2010 clinical concept extraction corpus. The evaluation results showed that the RNN model trained with the word embeddings achieved a new state-of-the-art performance (a strict F1 score of 85.94%) for the defined clinical NER task, outperforming the best-reported system that used both manually defined and unsupervised learning features. This study demonstrates the advantage of using deep neural network architectures for clinical concept extraction, including distributed feature representation, automatic feature learning, and long-term dependencies capture. This is one of the first studies to compare the two widely used deep learning models and demonstrate the superior performance of the RNN model for clinical NER.

  11. Prediction task guided representation learning of medical codes in EHR.

    PubMed

    Cui, Liwen; Xie, Xiaolei; Shen, Zuojun

    2018-06-18

    There have been rapidly growing applications using machine learning models for predictive analytics in Electronic Health Records (EHR) to improve the quality of hospital services and the efficiency of healthcare resource utilization. A fundamental and crucial step in developing such models is to convert medical codes in EHR to feature vectors. These medical codes are used to represent diagnoses or procedures. Their vector representations have a tremendous impact on the performance of machine learning models. Recently, some researchers have utilized representation learning methods from Natural Language Processing (NLP) to learn vector representations of medical codes. However, most previous approaches are unsupervised, i.e. the generation of medical code vectors is independent from prediction tasks. Thus, the obtained feature vectors may be inappropriate for a specific prediction task. Moreover, unsupervised methods often require a lot of samples to obtain reliable results, but most practical problems have very limited patient samples. In this paper, we develop a new method called Prediction Task Guided Health Record Aggregation (PTGHRA), which aggregates health records guided by prediction tasks, to construct training corpus for various representation learning models. Compared with unsupervised approaches, representation learning models integrated with PTGHRA yield a significant improvement in predictive capability of generated medical code vectors, especially for limited training samples. Copyright © 2018. Published by Elsevier Inc.

  12. Clinical Named Entity Recognition Using Deep Learning Models

    PubMed Central

    Wu, Yonghui; Jiang, Min; Xu, Jun; Zhi, Degui; Xu, Hua

    2017-01-01

    Clinical Named Entity Recognition (NER) is a critical natural language processing (NLP) task to extract important concepts (named entities) from clinical narratives. Researchers have extensively investigated machine learning models for clinical NER. Recently, there have been increasing efforts to apply deep learning models to improve the performance of current clinical NER systems. This study examined two popular deep learning architectures, the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN), to extract concepts from clinical texts. We compared the two deep neural network architectures with three baseline Conditional Random Fields (CRFs) models and two state-of-the-art clinical NER systems using the i2b2 2010 clinical concept extraction corpus. The evaluation results showed that the RNN model trained with the word embeddings achieved a new state-of-the-art performance (a strict F1 score of 85.94%) for the defined clinical NER task, outperforming the best-reported system that used both manually defined and unsupervised learning features. This study demonstrates the advantage of using deep neural network architectures for clinical concept extraction, including distributed feature representation, automatic feature learning, and long-term dependencies capture. This is one of the first studies to compare the two widely used deep learning models and demonstrate the superior performance of the RNN model for clinical NER. PMID:29854252

  13. What to Follow "Make" and What to Follow "Do"--Corpus-Based Study on the De-Lexical Use of "Make" and "Do" in Native Speakers' and Chinese Students' Writing

    ERIC Educational Resources Information Center

    Fu, Zhuqin

    2006-01-01

    To many Chinese students, learning words such as "make" and "do" seems a piece of cake, yet learning how to use them appropriately is another case. This paper aims to investigate Chinese learners' use of the verbs "make" and "do", two major representatives of high-frequency words from the perspective of…

  14. Automatic Semantic Orientation of Adjectives for Indonesian Language Using PMI-IR and Clustering

    NASA Astrophysics Data System (ADS)

    Riyanti, Dewi; Arif Bijaksana, M.; Adiwijaya

    2018-03-01

    We present our work in the area of sentiment analysis for the Indonesian language. We focus on building automatic semantic orientation using available resources in Indonesian. In this research we used an Indonesian corpus of 9 million words from kompas.txt and tempo.txt that was manually tagged and annotated with a part-of-speech tagset. We then constructed a dataset by taking all the adjectives from the corpus and removing the adjectives with no orientation. The resulting set contained 923 adjectives. The system includes several steps, such as text pre-processing and clustering. The text pre-processing aims to increase accuracy. Finally, the clustering method classifies each word into its related sentiment, positive or negative. With improvements to the text pre-processing, an accuracy of 72% can be achieved.
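    The record above names PMI-IR as the basis for semantic orientation but does not give a formula. The sketch below assumes the commonly cited formulation in which a word's orientation is the log ratio of its co-occurrence with a positive seed versus a negative seed, normalized by the seeds' own frequencies. The seed words "baik" and "buruk", the hit counts, the smoothing constant, and the function names are all hypothetical; the authors' exact procedure may differ.

```python
import math


def so_pmi(near_pos: float, near_neg: float,
           pos_hits: float, neg_hits: float,
           smoothing: float = 0.01) -> float:
    """Semantic orientation as a log2 ratio of seed co-occurrence counts.

    near_pos / near_neg: how often the word co-occurs with the positive /
    negative seed; pos_hits / neg_hits: how often each seed occurs overall.
    `smoothing` avoids division by zero when a co-occurrence count is zero.
    """
    numerator = (near_pos + smoothing) * neg_hits
    denominator = (near_neg + smoothing) * pos_hits
    return math.log2(numerator / denominator)


def orientation(score: float) -> str:
    """Map a semantic-orientation score to a coarse sentiment label."""
    return "positive" if score > 0 else "negative"


# Made-up counts for an adjective, with hypothetical seeds "baik" (good)
# and "buruk" (bad) that each occur 1000 times in the corpus.
score = so_pmi(near_pos=80, near_neg=20, pos_hits=1000, neg_hits=1000)
print(orientation(score))
```

    Scores computed this way could also serve as a feature for the clustering step that the abstract mentions, although the abstract does not say how the two are combined.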

  15. Text generation from Taiwanese Sign Language using a PST-based language model for augmentative communication.

    PubMed

    Wu, Chung-Hsien; Chiu, Yu-Hsien; Guo, Chi-Shiang

    2004-12-01

    This paper proposes a novel approach to the generation of Chinese sentences from ill-formed Taiwanese Sign Language (TSL) for people with hearing impairments. First, a sign icon-based virtual keyboard is constructed to provide a visualized interface to retrieve sign icons from a sign database. A proposed language model (LM), based on a predictive sentence template (PST) tree, integrates a statistical variable n-gram LM and linguistic constraints to deal with the translation problem from ill-formed sign sequences to grammatical written sentences. The PST tree trained by a corpus collected from the deaf schools was used to model the correspondence between signed and written Chinese. In addition, a set of phrase formation rules, based on trigger pair category, was derived for sentence pattern expansion. These approaches improved the efficiency of text generation and the accuracy of word prediction and, therefore, improved the input rate. For the assessment of practical communication aids, a reading-comprehension training program with ten profoundly deaf students was undertaken in a deaf school in Tainan, Taiwan. Evaluation results show that the literacy aptitude test and subjective satisfactory level are significantly improved.

  16. Term Familiarity to indicate Perceived and Actual Difficulty of Text in Medical Digital Libraries.

    PubMed

    Leroy, Gondy; Endicott, James E

    2011-10-01

    With increasing text digitization, digital libraries can personalize materials for individuals with different education levels and language skills. To this end, documents need meta-information describing their difficulty level. Previous attempts at such labeling used readability formulas but the formulas have not been validated with modern texts and their outcome is seldom associated with actual difficulty. We focus on medical texts and are developing new, evidence-based meta-tags that are associated with perceived and actual text difficulty. This work describes a first tag, term familiarity, which is based on term frequency in the Google corpus. We evaluated its feasibility to serve as a tag by looking at a document corpus (N=1,073) and found that terms in blogs or journal articles displayed unexpected but significantly different scores. Term familiarity was then applied to texts and results from a previous user study (N=86) and could better explain differences for perceived and actual difficulty.

  17. Extracting salient sublexical units from written texts: “Emophon,” a corpus-based approach to phonological iconicity

    PubMed Central

    Aryani, Arash; Jacobs, Arthur M.; Conrad, Markus

    2013-01-01

    A growing body of literature in psychology, linguistics, and the neurosciences has paid increasing attention to the understanding of the relationships between phonological representations of words and their meaning: a phenomenon also known as phonological iconicity. In this article, we investigate how a text's intended emotional meaning, particularly in literature and poetry, may be reflected at the level of sublexical phonological salience and the use of foregrounded elements. To extract such elements from a given text, we developed a probabilistic model to predict the exceeding of a confidence interval for specific sublexical units concerning their frequency of occurrence within a given text contrasted with a reference linguistic corpus for the German language. Implementing this model in a computational application, we provide a text analysis tool which automatically delivers information about sublexical phonological salience allowing researchers, inter alia, to investigate effects of the sublexical emotional tone of texts based on current findings on phonological iconicity. PMID:24101907

  18. Website Analysis as a Tool for Task-Based Language Learning and Higher Order Thinking in an EFL Context

    ERIC Educational Resources Information Center

    Roy, Debopriyo

    2014-01-01

    Besides focusing on grammar, writing skills, and web-based language learning, researchers in "CALL" and second language acquisition have also argued for the importance of promoting higher-order thinking skills in ESL (English as Second Language) and EFL (English as Foreign Language) classrooms. There is solid evidence supporting the…

  19. An Online Task-Based Language Learning Environment: Is It Better for Advanced- or Intermediate-Level Second Language Learners?

    ERIC Educational Resources Information Center

    Arslanyilmaz, Abdurrahman

    2012-01-01

    This study investigates the relationship of language proficiency to language production and negotiation of meaning that non-native speakers (NNSs) produced in an online task-based language learning (TBLL) environment. Fourteen NNS-NNS dyads collaboratively completed four communicative tasks, using an online TBLL environment specifically designed…

  20. Overcoming Learning Time and Space Constraints through Technological Tool

    ERIC Educational Resources Information Center

    Zarei, Nafiseh; Hussin, Supyan; Rashid, Taufik

    2015-01-01

    Today the use of technological tools has become an evolution in language learning and language acquisition. Many instructors and lecturers believe that integrating Web-based learning tools into language courses allows pupils to become active learners during learning process. This study investigates how the Learning Management Blog (LMB) overcomes…

  1. Strategies for Better Learning of English Grammar: Chinese vs. Thais

    ERIC Educational Resources Information Center

    Supakorn, Patnarin; Feng, Min; Limmun, Wanida

    2018-01-01

    The success of language learning significantly depends on multiple sets of complex factors; among these are language-learning strategies of which learners in different countries may show different preferences. Needed areas of language learning strategy research include, among others, the strategy of grammar learning and the context-based approach…

  2. Translation of Japanese Noun Compounds at Super-Function Based MT System

    NASA Astrophysics Data System (ADS)

    Zhao, Xin; Ren, Fuji; Kuroiwa, Shingo

    Noun compounds are a frequently encountered construction in natural language processing (NLP), consisting of a sequence of two or more nouns which functions syntactically as one noun. The translation of noun compounds has become a major issue in Machine Translation (MT) due to their frequency of occurrence and high productivity. In our previous studies on Super-Function Based Machine Translation (SFBMT), we found that noun compounds are very frequently used and difficult to translate correctly; the overgeneration of noun compounds can be dangerous, as it may introduce ambiguity in the translation. In this paper, we discuss the challenges in handling Japanese noun compounds in an SFBMT system and present a shallow method for translating noun compounds using a word-level translation dictionary and a target-language monolingual corpus.

  3. FipsOrtho: A Spell Checker for Learners of French

    ERIC Educational Resources Information Center

    L'Haire, Sebastien

    2007-01-01

    This paper presents FipsOrtho, a spell checker targeted at learners of French, and a corpus of learners' errors which has been gathered to test the system and to get a sample of specific language learners' errors. Spell checkers are a standard feature of many software products, however they are not designed for specific language learners' errors.…

  4. Language Planning and Personal Naming in Lithuania

    ERIC Educational Resources Information Center

    Ramoniene, Meilute

    2007-01-01

    This paper deals with the issues of language planning and naming in Lithuania since the restoration of independence in 1990. The aim of the paper is to analyse the challenges of corpus planning with the focus on the use and standardisation of personal names. The paper first presents the historical context of the formation of names in Lithuania and…

  5. Making and Breaking the Rules: Lexical Creativity in the Alternative Music Scene

    ERIC Educational Resources Information Center

    Rua, Paula Lopez

    2010-01-01

    This article delves into the connections between language as a rule-governed system of communication and music as a means to express subcultural ideologies and satisfy collective needs. By resorting to the morphological analysis of a corpus of names of alternative music artists, the article evinces that language is a manipulable code through which…

  6. Quantitative Research Methods, Study Quality, and Outcomes: The Case of Interaction Research

    ERIC Educational Resources Information Center

    Plonsky, Luke; Gass, Susan

    2011-01-01

    This article constitutes the first empirical assessment of methodological quality in second language acquisition (SLA). We surveyed a corpus of 174 studies (N = 7,951) within the tradition of research on second-language interaction, one of the longest and most influential traditions of inquiry in SLA. Each report was coded for methodological…

  7. Cultural Conceptualisations in Learning English as an L2: Examples from Persian-Speaking Learners

    ERIC Educational Resources Information Center

    Sharifian, Farzad

    2013-01-01

    Traditionally, many studies of second language acquisition (SLA) were based on the assumption that learning a new language mainly involves learning a set of grammatical rules, lexical items, and certain new sounds and sound combinations. However, for many second language learners, learning a second language may involve contact and interactions…

  8. Language Learning and Its Impact on the Brain: Connecting Language Learning with the Mind through Content-Based Instruction

    ERIC Educational Resources Information Center

    Kennedy, Teresa J.

    2006-01-01

    Cognitive sciences are discovering many things that educators have always intuitively known about language learning. However, the important point is actively using this new information to improve both students learning and current teaching practices. The implications of neuroscience for educational reform regarding second language (L2) learning…

  9. Synchronous and Asynchronous E-Language Learning: A Case Study of Virtual University of Pakistan

    ERIC Educational Resources Information Center

    Perveen, Ayesha

    2016-01-01

    This case study evaluated the impact of synchronous and asynchronous E-Language Learning activities (ELL-ivities) in an E-Language Learning Environment (ELLE) at Virtual University of Pakistan. The purpose of the study was to assess e-language learning analytics based on the constructivist approach of collaborative construction of knowledge. The…

  10. Corpus callosal atrophy and associations with cognitive impairment in Parkinson disease

    PubMed Central

    Bledsoe, Ian O.; Merkitch, Doug; Dinh, Vy; Bernard, Bryan; Stebbins, Glenn T.

    2017-01-01

    Objective: To investigate atrophy of the corpus callosum on MRI in Parkinson disease (PD) and its relationship to cognitive impairment. Methods: One hundred patients with PD and 24 healthy control participants underwent clinical and neuropsychological evaluations and structural MRI brain scans. Participants with PD were classified as cognitively normal (PD-NC; n = 28), having mild cognitive impairment (PD-MCI; n = 47), or having dementia (PDD; n = 25) by Movement Disorder Society criteria. Cognitive domain (attention/working memory, executive function, memory, language, visuospatial function) z scores were calculated. With the use of FreeSurfer image processing, volumes for total corpus callosum and its subsections (anterior, midanterior, central, midposterior, posterior) were computed and normalized by total intracranial volume. Callosal volumes were compared between participants with PD and controls and among PD cognitive groups, covarying for age, sex, and PD duration and with multiple comparison corrections. Regression analyses were performed to evaluate relationships between callosal volumes and performance in cognitive domains. Results: Participants with PD had reduced corpus callosum volumes in midanterior and central regions compared to healthy controls. Participants with PDD demonstrated decreased callosal volumes involving multiple subsections spanning anterior to posterior compared to participants with PD-MCI and PD-NC. Regional callosal atrophy predicted cognitive domain performance such that central volumes were associated with the attention/working memory domain; midposterior volumes with executive function, language, and memory domains; and posterior volumes with memory and visuospatial domains. Conclusions: Notable volume loss occurs in the corpus callosum in PD, with specific neuroanatomic distributions in PDD and relationships of regional atrophy to different cognitive domains. 
    Callosal volume loss may contribute to clinical manifestations of PD cognitive impairment. PMID:28235816

  11. A Dual-Route Model that Learns to Pronounce English Words

    NASA Technical Reports Server (NTRS)

    Remington, Roger W.; Miller, Craig S.; Null, Cynthia H. (Technical Monitor)

    1995-01-01

    This paper describes a model that learns to pronounce English words. Learning occurs in two modules: 1) a rule-based module that constructs pronunciations by phonetic analysis of the letter string, and 2) a whole-word module that learns to associate subsets of letters to the pronunciation, without phonetic analysis. In a simulation on a corpus of over 300 words the model produced pronunciation latencies consistent with the effects of word frequency and orthographic regularity observed in human data. Implications of the model for theories of visual word processing and reading instruction are discussed.

  12. Structural brain and neuropsychometric changes associated with pediatric bipolar disorder with psychosis.

    PubMed

    James, Anthony; Hough, Morgan; James, Susan; Burge, Linda; Winmill, Louise; Nijhawan, Sunita; Matthews, Paul M; Zarei, Mojtaba

    2011-02-01

    To identify neuropsychological and structural brain changes using a combination of high-resolution structural and diffusion tensor imaging in pediatric bipolar disorder (PBD) with psychosis (presence of delusions and/or hallucinations). We recruited 15 patients and 20 euthymic age- and gender-matched healthy controls. All subjects underwent high-resolution structural and diffusion tensor imaging. Voxel-based morphometry (VBM), tract-based spatial statistics (TBSS), and probabilistic tractography were used to analyse magnetic resonance imaging data. The PBD subjects had normal overall intelligence with specific impairments in working memory, executive function, language function, and verbal memory. Reduced gray matter (GM) density was found in the left orbitofrontal cortex, left pars triangularis, right premotor cortex, occipital cortex, right occipital fusiform gyrus, and right crus of the cerebellum. TBSS analysis showed reduced fractional anisotropy (FA) in the anterior corpus callosum. Probabilistic tractography from this cluster showed that this region of the corpus callosum is connected with the prefrontal cortices, including those regions whose density is decreased in PBD. In addition, FA change was correlated with verbal memory and working memory, while more widespread reductions in GM density correlated with working memory, executive function, language function, and verbal memory. The findings suggest widespread cortical changes as well as specific involvement of interhemispheric prefrontal tracts in PBD, which may reflect delayed myelination in these tracts. © 2011 John Wiley and Sons A/S.

  13. Metaphor Identification in Large Texts Corpora

    PubMed Central

    Neuman, Yair; Assaf, Dan; Cohen, Yohai; Last, Mark; Argamon, Shlomo; Howard, Newton; Frieder, Ophir

    2013-01-01

    Identifying metaphorical language-use (e.g., sweet child) is one of the challenges facing natural language processing. This paper describes three novel algorithms for automatic metaphor identification. The algorithms are variations of the same core algorithm. We evaluate the algorithms on two corpora of Reuters and the New York Times articles. The paper presents the most comprehensive study of metaphor identification in terms of scope of metaphorical phrases and annotated corpora size. Algorithms’ performance in identifying linguistic phrases as metaphorical or literal has been compared to human judgment. Overall, the algorithms outperform the state-of-the-art algorithm with 71% precision and 27% averaged improvement in prediction over the base-rate of metaphors in the corpus. PMID:23658625

  14. Proposed Framework for the Evaluation of Standalone Corpora Processing Systems: An Application to Arabic Corpora

    PubMed Central

    Al-Thubaity, Abdulmohsen; Alqifari, Reem

    2014-01-01

    Despite the accessibility of numerous online corpora, students and researchers engaged in the fields of Natural Language Processing (NLP), corpus linguistics, and language learning and teaching may encounter situations in which they need to develop their own corpora. Several commercial and free standalone corpora processing systems are available to process such corpora. In this study, we first propose a framework for the evaluation of standalone corpora processing systems and then use it to evaluate seven freely available systems. The proposed framework considers the usability, functionality, and performance of the evaluated systems while taking into consideration their suitability for Arabic corpora. While the results show that most of the evaluated systems exhibited comparable usability scores, the scores for functionality and performance were substantially different with respect to support for the Arabic language and N-grams profile generation. The results of our evaluation will help potential users of the evaluated systems to choose the system that best meets their needs. More importantly, the results will help the developers of the evaluated systems to enhance their systems and developers of new corpora processing systems by providing them with a reference framework. PMID:25610910

  15. Proposed framework for the evaluation of standalone corpora processing systems: an application to Arabic corpora.

    PubMed

    Al-Thubaity, Abdulmohsen; Al-Khalifa, Hend; Alqifari, Reem; Almazrua, Manal

    2014-01-01

    Despite the accessibility of numerous online corpora, students and researchers engaged in the fields of Natural Language Processing (NLP), corpus linguistics, and language learning and teaching may encounter situations in which they need to develop their own corpora. Several commercial and free standalone corpora processing systems are available to process such corpora. In this study, we first propose a framework for the evaluation of standalone corpora processing systems and then use it to evaluate seven freely available systems. The proposed framework considers the usability, functionality, and performance of the evaluated systems while taking into consideration their suitability for Arabic corpora. While the results show that most of the evaluated systems exhibited comparable usability scores, the scores for functionality and performance were substantially different with respect to support for the Arabic language and N-grams profile generation. The results of our evaluation will help potential users of the evaluated systems to choose the system that best meets their needs. More importantly, the results will help the developers of the evaluated systems to enhance their systems and developers of new corpora processing systems by providing them with a reference framework.

  16. Language Learning and the Raising of Cultural Awareness through Internet Telephony: A Case Study

    ERIC Educational Resources Information Center

    Polisca, Elena

    2011-01-01

    This article seeks to assess the impact of V-Pal (Virtual Partnerships for All Languages) on the student language learning experience within a conventional UK higher education (HE) curriculum. V-Pal is an innovative computer-mediated language scheme, based on a reciprocal, distance-learning language project, run by the University of Manchester in…

  17. Task-Based Language Learning: Old Approach, New Style. A New Lesson to Learn (Aprendizaje basado en tareas: un antiguo enfoque, un nuevo estilo. Una nueva lección para aprender)

    ERIC Educational Resources Information Center

    Rodríguez-Bonces, Mónica; Rodríguez-Bonces, Jeisson

    2010-01-01

    This paper provides an overview of Task-Based Language Learning (TBL) and its use in the teaching and learning of foreign languages. It begins by defining the concept of TBL, followed by a presentation of its framework and implications, and finally, a lesson plan based on TBL. The article presents an additional stage to be considered when planning…

  18. Criteria for Evaluating a Game-Based CALL Platform

    ERIC Educational Resources Information Center

    Ní Chiaráin, Neasa; Ní Chasaide, Ailbhe

    2017-01-01

    Game-based Computer-Assisted Language Learning (CALL) is an area that currently warrants attention, as task-based, interactive, multimodal games increasingly show promise for language learning. This area is inherently multidisciplinary--theories from second language acquisition, games, and psychology must be explored and relevant concepts from…

  19. Synonym extraction and abbreviation expansion with ensembles of semantic spaces.

    PubMed

    Henriksson, Aron; Moen, Hans; Skeppstedt, Maria; Daudaravičius, Vidas; Duneld, Martin

    2014-02-05

    Terminologies that account for variation in language use by linking synonyms and abbreviations to their corresponding concept are important enablers of high-quality information extraction from medical texts. Due to the use of specialized sub-languages in the medical domain, manual construction of semantic resources that accurately reflect language use is both costly and challenging, often resulting in low coverage. Although models of distributional semantics applied to large corpora provide a potential means of supporting development of such resources, their ability to isolate synonymy from other semantic relations is limited. Their application in the clinical domain has also only recently begun to be explored. Combining distributional models and applying them to different types of corpora may lead to enhanced performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs. A combination of two distributional models - Random Indexing and Random Permutation - employed in conjunction with a single corpus outperforms using either of the models in isolation. Furthermore, combining semantic spaces induced from different types of corpora - a corpus of clinical text and a corpus of medical journal articles - further improves results, outperforming a combination of semantic spaces induced from a single source, as well as a single semantic space induced from the conjoint corpus. A combination strategy that simply sums the cosine similarity scores of candidate terms is generally the most profitable out of the ones explored. Finally, applying simple post-processing filtering rules yields substantial performance gains on the tasks of extracting abbreviation-expansion pairs, but not synonyms. The best results, measured as recall in a list of ten candidate terms, for the three tasks are: 0.39 for abbreviations to long forms, 0.33 for long forms to abbreviations, and 0.47 for synonyms. 
    This study demonstrates that ensembles of semantic spaces can yield improved performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs. This notion, which merits further exploration, allows different distributional models - with different model parameters - and different types of corpora to be combined, potentially allowing enhanced performance to be obtained on a wide range of natural language processing tasks.
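
    The combination strategy described above (summing the cosine similarity scores of candidate terms across semantic spaces) and the evaluation measure (recall in a list of ten candidates) can be sketched as follows. This is an illustration under stated assumptions: the two-dimensional vectors, the two toy spaces, and the function names are hypothetical, and no Random Indexing or Random Permutation model is actually built here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(query, spaces, top_k=10):
    """Rank terms by the sum of their cosine similarity to `query` over all spaces."""
    totals = {}
    for space in spaces:
        if query not in space:
            continue  # this space has no vector for the query term
        for term, vector in space.items():
            if term != query:
                totals[term] = totals.get(term, 0.0) + cosine(space[query], vector)
    return sorted(totals, key=totals.get, reverse=True)[:top_k]


def recall_at_k(ranked, gold):
    """Fraction of gold terms that appear in the ranked candidate list."""
    return len(set(ranked) & set(gold)) / len(gold) if gold else 0.0


# Hypothetical semantic spaces: term -> vector (values are made up).
clinical_space = {"mi": [1.0, 0.1], "myocardial infarction": [0.9, 0.2], "aspirin": [0.1, 1.0]}
journal_space = {"mi": [0.8, 0.3], "myocardial infarction": [0.7, 0.4], "aspirin": [0.2, 0.9]}

ranked = rank_candidates("mi", [clinical_space, journal_space])
recall = recall_at_k(ranked, {"myocardial infarction"})
```

    Summing raw scores lets a term that is moderately similar in both spaces outrank a term that is highly similar in only one, which is one plausible reading of why a simple sum can help when the corpora differ in type.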

  20. Synonym extraction and abbreviation expansion with ensembles of semantic spaces

    PubMed Central

    2014-01-01

    Background: Terminologies that account for variation in language use by linking synonyms and abbreviations to their corresponding concept are important enablers of high-quality information extraction from medical texts. Due to the use of specialized sub-languages in the medical domain, manual construction of semantic resources that accurately reflect language use is both costly and challenging, often resulting in low coverage. Although models of distributional semantics applied to large corpora provide a potential means of supporting development of such resources, their ability to isolate synonymy from other semantic relations is limited. Their application in the clinical domain has also only recently begun to be explored. Combining distributional models and applying them to different types of corpora may lead to enhanced performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs. Results: A combination of two distributional models – Random Indexing and Random Permutation – employed in conjunction with a single corpus outperforms using either of the models in isolation. Furthermore, combining semantic spaces induced from different types of corpora – a corpus of clinical text and a corpus of medical journal articles – further improves results, outperforming a combination of semantic spaces induced from a single source, as well as a single semantic space induced from the conjoint corpus. A combination strategy that simply sums the cosine similarity scores of candidate terms is generally the most profitable out of the ones explored. Finally, applying simple post-processing filtering rules yields substantial performance gains on the tasks of extracting abbreviation-expansion pairs, but not synonyms. The best results, measured as recall in a list of ten candidate terms, for the three tasks are: 0.39 for abbreviations to long forms, 0.33 for long forms to abbreviations, and 0.47 for synonyms.
    Conclusions: This study demonstrates that ensembles of semantic spaces can yield improved performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs. This notion, which merits further exploration, allows different distributional models – with different model parameters – and different types of corpora to be combined, potentially allowing enhanced performance to be obtained on a wide range of natural language processing tasks. PMID:24499679

  1. Enculturating Seamless Language Learning through Artifact Creation and Social Interaction Process

    ERIC Educational Resources Information Center

    Wong, Lung-Hsiang; Chai, Ching Sing; Aw, Guat Poh; King, Ronnel B.

    2015-01-01

    This paper reports a design-based research (DBR) cycle of MyCLOUD (My Chinese ubiquitOUs learning Days). MyCLOUD is a seamless language learning model that addresses identified limitations of conventional Chinese language teaching, such as the decontextualized and unauthentic learning processes that usually hinder reflection and deep learning.…

  2. The Complexity of Language Learning

    ERIC Educational Resources Information Center

    Nelson, Charles

    2011-01-01

    This paper takes a complexity theory approach to looking at language learning, an approach that investigates how language learners adapt to and interact with people and their environment. Based on interviews with four graduate students, it shows how complexity theory can help us understand both the situatedness of language learning and also…

  3. Student Modeling and Ab Initio Language Learning.

    ERIC Educational Resources Information Center

    Heift, Trude; Schulze, Mathias

    2003-01-01

    Provides examples of student modeling techniques that have been employed in computer-assisted language learning over the past decade. Describes two systems for learning German: "German Tutor" and "Geroline." Shows how a student model can support computerized adaptive language testing for diagnostic purposes in a Web-based language learning…

  4. Students' Motivation towards Computer Use in EFL Learning

    ERIC Educational Resources Information Center

    Genc, Gulten; Aydin, Selami

    2010-01-01

    It has been widely recognized that language instruction that integrates technology has become popular, and has had a tremendous impact on language learning process whereas learners are expected to be more motivated in a web-based Computer assisted language learning program, and improve their comprehensive language ability. Thus, the present paper…

  5. Informal Language Learning Setting: Technology or Social Interaction?

    ERIC Educational Resources Information Center

    Bahrani, Taher; Sim, Tam Shu

    2012-01-01

    Based on the informal language learning theory, language learning can occur outside the classroom setting unconsciously and incidentally through interaction with the native speakers or exposure to authentic language input through technology. However, an EFL context lacks the social interaction which naturally occurs in an ESL context. To explore…

  6. My Personal Mobile Language Learning Environment: An Exploration and Classification of Language Learning Possibilities Using the iPhone

    ERIC Educational Resources Information Center

    Perifanou, Maria A.

    2011-01-01

    Mobile devices can motivate learners through moving language learning from predominantly classroom-based contexts into contexts that are free from time and space. The increasing development of new applications can offer valuable support to the language learning process and can provide a basis for a new self regulated and personal approach to…

  7. Co-development of manner and path concepts in language, action, and eye-gaze behavior.

    PubMed

    Lohan, Katrin S; Griffiths, Sascha S; Sciutti, Alessandra; Partmann, Tim C; Rohlfing, Katharina J

    2014-07-01

    In order for artificial intelligent systems to interact naturally with human users, they need to be able to learn from human instructions when actions should be imitated. Human tutoring will typically consist of action demonstrations accompanied by speech. In the following, the characteristics of human tutoring during action demonstration will be examined. A special focus will be put on the distinction between two kinds of motion events: path-oriented actions and manner-oriented actions. Such a distinction is inspired by the literature pertaining to cognitive linguistics, which indicates that the human conceptual system can distinguish these two distinct types of motion. These two kinds of actions are described in language by more path-oriented or more manner-oriented utterances. In path-oriented utterances, the source, trajectory, or goal is emphasized, whereas in manner-oriented utterances the medium, velocity, or means of motion are highlighted. We examined a video corpus of adult-child interactions comprised of three age groups of children (pre-lexical, early lexical, and lexical) and two different tasks, one emphasizing manner more strongly and one emphasizing path more strongly. We analyzed the language and motion of the caregiver and the gazing behavior of the child to highlight the differences between the tutoring and the acquisition of the manner and path concepts. The results suggest that age is an important factor in the development of these action categories. The analysis of this corpus has also been exploited to develop an intelligent robotic behavior (the tutoring spotter system) able to emulate children's behaviors in a tutoring situation, with the aim of evoking in human subjects a natural and effective behavior in teaching to a robot.
    The findings related to the development of manner and path concepts have been used to implement new effective feedback strategies in the tutoring spotter system, which should provide improvements in human-robot interaction. Copyright © 2014 Cognitive Science Society, Inc.

  8. Impact of corpus domain for sentiment classification: An evaluation study using supervised machine learning techniques

    NASA Astrophysics Data System (ADS)

    Karsi, Redouane; Zaim, Mounia; El Alami, Jamila

    2017-07-01

    Thanks to the development of the internet, a large community now has the possibility to communicate and express its opinions and preferences through multiple media such as blogs, forums, social networks and e-commerce sites. Today, it is becoming clearer that opinions published on the web are a very valuable source for decision-making, so a rapidly growing field of research called “sentiment analysis” has emerged to address the problem of automatically determining the polarity (positive, negative, neutral, …) of textual opinions. People expressing themselves in a particular domain often use domain-specific language expressions; thus, building a classifier that performs well in different domains is a challenging problem. The purpose of this paper is to evaluate the impact of domain on sentiment classification when using machine learning techniques. In our study, three popular machine learning techniques, Support Vector Machines (SVM), Naive Bayes and K nearest neighbors (KNN), were applied to datasets collected from different domains. Experimental results show that Support Vector Machines outperform the other classifiers in all domains, since they achieved at least 74.75% accuracy with a standard deviation of 4.08.

  9. Learning Orthographic Structure With Sequential Generative Neural Networks.

    PubMed

    Testolin, Alberto; Stoianov, Ivilin; Sperduti, Alessandro; Zorzi, Marco

    2016-04-01

    Learning the structure of event sequences is a ubiquitous problem in cognition and particularly in language. One possible solution is to learn a probabilistic generative model of sequences that allows making predictions about upcoming events. Though appealing from a neurobiological standpoint, this approach is typically not pursued in connectionist modeling. Here, we investigated a sequential version of the restricted Boltzmann machine (RBM), a stochastic recurrent neural network that extracts high-order structure from sensory data through unsupervised generative learning and can encode contextual information in the form of internal, distributed representations. We assessed whether this type of network can extract the orthographic structure of English monosyllables by learning a generative model of the letter sequences forming a word training corpus. We show that the network learned an accurate probabilistic model of English graphotactics, which can be used to make predictions about the letter following a given context as well as to autonomously generate high-quality pseudowords. The model was compared to an extended version of simple recurrent networks, augmented with a stochastic process that allows autonomous generation of sequences, and to non-connectionist probabilistic models (n-grams and hidden Markov models). We conclude that sequential RBMs and stochastic simple recurrent networks are promising candidates for modeling cognition in the temporal domain. Copyright © 2015 Cognitive Science Society, Inc.
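
    The abstract above names letter n-grams as one of the non-connectionist baselines. As a rough illustration of that baseline only (not of the sequential RBM, and not the authors' implementation), the following minimal sketch trains a letter bigram model, predicts the letter following a given context, and samples a pseudoword. The toy word list, the boundary markers, and all function names are assumptions made for this example.

```python
import random
from collections import defaultdict

START, END = "^", "$"  # hypothetical word-boundary markers

def train_bigram(words):
    """Count letter-to-letter transitions, including word boundaries."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in words:
        symbols = [START] + list(word) + [END]
        for prev, nxt in zip(symbols, symbols[1:]):
            counts[prev][nxt] += 1
    return counts

def next_letter_probs(counts, prev):
    """Relative-frequency estimate of P(next letter | previous letter)."""
    total = sum(counts[prev].values())
    if total == 0:  # context never seen in training
        return {}
    return {letter: n / total for letter, n in counts[prev].items()}

def generate(counts, rng, max_len=10):
    """Sample a pseudoword letter by letter until END is drawn."""
    letters, prev = [], START
    while len(letters) < max_len:
        probs = next_letter_probs(counts, prev)
        nxt = rng.choices(list(probs), weights=list(probs.values()))[0]
        if nxt == END:
            break
        letters.append(nxt)
        prev = nxt
    return "".join(letters)

# Toy training corpus (an assumption; the study used English monosyllables).
toy_corpus = ["cat", "can", "cot", "tan", "ton", "nat"]
model = train_bigram(toy_corpus)
probs_after_c = next_letter_probs(model, "c")   # distribution over letters after "c"
pseudoword = generate(model, random.Random(0))  # one sampled letter string
```

    A bigram model conditions on a single preceding letter; longer n-grams extend the context window in the same way, at the cost of sparser counts.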

  10. Becoming Little Scientists: Technologically-Enhanced Project-Based Language Learning

    ERIC Educational Resources Information Center

    Dooly, Melinda; Sadler, Randall

    2016-01-01

    This article outlines research into innovative language teaching practices that make optimal use of technology and Computer-Mediated Communication (CMC) for an integrated approach to Project-Based Learning. It is based on data compiled during a 10-week language project that employed videoconferencing and "machinima" (short video clips…

  11. Using Teacher-Developed Corpora in the CBI Classroom

    ERIC Educational Resources Information Center

    Salsbury, Tom; Crummer, Crista

    2008-01-01

    This article argues for the use of teacher-generated corpora in content-based courses. Using a content course for engineering and architecture students as an example, the article explains how a corpus consisting of texts from textbooks and journal articles helped students learn grammar, vocabulary, and writing. The article explains how the corpus…

  12. Emotion computing using Word Mover's Distance features based on Ren_CECps.

    PubMed

    Ren, Fuji; Liu, Ning

    2018-01-01

    In this paper, we propose an emotion-separated method (SeTF·IDF) to assign the emotion labels of sentences with different values, which has a better visual effect compared with the values represented by TF·IDF in the visualization of the multi-label Chinese emotional corpus Ren_CECps. Inspired by the enormous improvement of the visualization map propelled by the changed distances among the sentences, we are the first group to utilize the Word Mover's Distance (WMD) algorithm as a way of feature representation in Chinese text emotion classification. Our experiments show that, in both the 80% training / 20% testing and the 50% training / 50% testing experiments on Ren_CECps, WMD features get the best F1 scores and show a greater increase compared with feature vectors of the same dimension obtained by the dimension-reduction TF·IDF method. Comparison experiments on an English corpus also show the efficiency of WMD features in the cross-language field.

  13. "Lösen Sie Schachtelsätze Möglichst Auf": The Impact of Editorial Guidelines on Sentence Splitting in German Business Article Translations

    ERIC Educational Resources Information Center

    Bisiada, Mario

    2016-01-01

    Sentence splitting is assumed to occur mainly in translations from languages that prefer a hierarchical discourse structure, such as German, to languages that prefer an incremental structure. This article challenges that assumption by presenting findings from a diachronic corpus study of English-German business article translations, which shows…

  14. A Corpus of Young Learners' English in the Baltic Region--Texts for Studies on Sustainable Development

    ERIC Educational Resources Information Center

    Sundh, Stellan

    2016-01-01

    In order to reach far in the work for sustainable development, communication in foreign languages prior to strategic decisions is required from international partners. In this communication English has become the lingua franca. Even though the use of EFL (English as a foreign language) is widely spread, it is clear that in some geographical…

  15. The Frequency and Distribution of the X-Finite Verb Clauses in the Second Book of the Psalter

    ERIC Educational Resources Information Center

    Stewart, Joshua E.

    2014-01-01

    The Hebrew verbal system is riddled with difficulties that have caused Hebraists much consternation for centuries. The study of the Hebrew verb system is complicated further because no living informant of the language exists. Without a living witness of the language, uncovering the nuances has been limited to the corpus of written material…

  16. Towards Implicational Scales for Use in Chicano English Composition. Papers in Southwest English 1: Research Techniques and Prospects.

    ERIC Educational Resources Information Center

    Hoffer, Bates

    Dialect analysis should follow the procedure for analysis of a new language: collection of a corpus of words, stories, and sentences and identifying structural features of phonology, morphology, syntax, and lexicon. Contrastive analysis between standard English and the native language is used and the ethnic dialect of English is described and…

  17. The Big Picture: A Meta-Analysis of Program Effectiveness Research on English Language Learners

    ERIC Educational Resources Information Center

    Rolstad, Kellie; Mahoney, Kate; Glass, Gene V.

    2005-01-01

    This article presents a meta-analysis of program effectiveness research on English language learners. The study includes a corpus of 17 studies conducted since Willig's earlier meta-analysis and uses Glass, McGaw, and Smith's strategy of including as many studies as possible in the analysis rather than excluding some on the basis of a priori…

  18. Online Domains of Language Use: Second Language Learners' Experiences of Virtual Community and Foreignness

    ERIC Educational Resources Information Center

    Pasfield-Neofitou, Sarah

    2011-01-01

    This paper examines the use of CMC in both Japanese and English dominated "domains" by Australian learners of Japanese. The natural, social online communication of 12 Australian university students with 18 of their Japanese contacts was collected for a period of up to four years, resulting in a corpus of approximately 2,000 instances of…

  19. Extracting laboratory test information from biomedical text

    PubMed Central

    Kang, Yanna Shen; Kayaalp, Mehmet

    2013-01-01

    Background: No previous study reported the efficacy of current natural language processing (NLP) methods for extracting laboratory test information from narrative documents. This study investigates the pathology informatics question of how accurately such information can be extracted from text with the current tools and techniques, especially machine learning and symbolic NLP methods. The study data came from a text corpus maintained by the U.S. Food and Drug Administration, containing a rich set of information on laboratory tests and test devices. Methods: The authors developed a symbolic information extraction (SIE) system to extract device and test specific information about four types of laboratory test entities: Specimens, analytes, units of measures and detection limits. They compared the performance of SIE and three prominent machine learning based NLP systems, LingPipe, GATE and BANNER, each implementing a distinct supervised machine learning method, hidden Markov models, support vector machines and conditional random fields, respectively. Results: Machine learning systems recognized laboratory test entities with moderately high recall, but low precision rates. Their recall rates were relatively higher when the number of distinct entity values (e.g., the spectrum of specimens) was very limited or when lexical morphology of the entity was distinctive (as in units of measures), yet SIE outperformed them with statistically significant margins on extracting specimen, analyte and detection limit information in both precision and F-measure. Its high recall performance was statistically significant on analyte information extraction. Conclusions: Despite its shortcomings against machine learning methods, a well-tailored symbolic system may better discern relevancy among a pile of information of the same type and may outperform a machine learning system by tapping into lexically non-local contextual information such as the document structure. PMID:24083058

  20. Bridges to Swaziland: Using Task-Based Learning and Computer-Mediated Instruction to Improve English Language Teaching and Learning

    ERIC Educational Resources Information Center

    Pierson, Susan Jacques

    2015-01-01

    One way to provide high quality instruction for underserved English Language Learners around the world is to combine Task-Based English Language Learning with Computer-Assisted Instruction. As part of an ongoing project, "Bridges to Swaziland," these approaches have been implemented in a determined effort to improve the ESL program for…

  1. Data-Driven Learning of Speech Acts Based on Corpora of DVD Subtitles

    ERIC Educational Resources Information Center

    Kitao, S. Kathleen; Kitao, Kenji

    2013-01-01

    Data-driven learning (DDL) is an inductive approach to language learning in which students study examples of authentic language and use them to find patterns of language use. This inductive approach to learning has the advantages of being learner-centered, encouraging hypothesis testing and learner autonomy, and helping develop learning skills.…

  2. Co-Creation Learning Procedures: Comparing Interactive Language Lessons for Deaf and Hearing Students.

    PubMed

    Hosono, Naotsune; Inoue, Hiromitsu; Tomita, Yutaka

    2017-01-01

    This paper discusses co-creation learning procedures of second language lessons for deaf students, and sign language lessons by a deaf lecturer. The analyses focus on the learning procedure and resulting assessment, considering the disability. Questionnaire results indicate that ICT-based co-creative learning technologies are effective and efficient and promote spontaneous learning motivation.

  3. From shared contexts to syntactic categories: The role of distributional information in learning linguistic form-classes

    PubMed Central

    Reeder, Patricia A.; Newport, Elissa L.; Aslin, Richard N.

    2012-01-01

    A fundamental component of language acquisition involves organizing words into grammatical categories. Previous literature has suggested a number of ways in which this categorization task might be accomplished. Here we ask whether the patterning of the words in a corpus of linguistic input (distributional information) is sufficient, along with a small set of learning biases, to extract these underlying structural categories. In a series of experiments, we show that learners can acquire linguistic form-classes, generalizing from instances of the distributional contexts of individual words in the exposure set to the full range of contexts for all the words in the set. Crucially, we explore how several specific distributional variables enable learners to form a category of lexical items and generalize to novel words, yet also allow for exceptions that maintain lexical specificity. We suggest that learners are sensitive to the contexts of individual words, the overlaps among contexts across words, the non-overlap of contexts (or systematic gaps in information), and the size of the exposure set. We also ask how learners determine the category membership of a new word for which there is very sparse contextual information. We find that, when there are strong category cues and robust category learning of other words, adults readily generalize the distributional properties of the learned category to a new word that shares just one context with the other category members. However, as the distributional cues regarding the category become sparser and contain more consistent gaps, learners show more conservatism in generalizing distributional properties to the novel word. Taken together, these results show that learners are highly systematic in their use of the distributional properties of the input corpus, using them in a principled way to determine when to generalize and when to preserve lexical specificity. PMID:23089290

  4. From shared contexts to syntactic categories: the role of distributional information in learning linguistic form-classes.

    PubMed

    Reeder, Patricia A; Newport, Elissa L; Aslin, Richard N

    2013-02-01

    A fundamental component of language acquisition involves organizing words into grammatical categories. Previous literature has suggested a number of ways in which this categorization task might be accomplished. Here we ask whether the patterning of the words in a corpus of linguistic input (distributional information) is sufficient, along with a small set of learning biases, to extract these underlying structural categories. In a series of experiments, we show that learners can acquire linguistic form-classes, generalizing from instances of the distributional contexts of individual words in the exposure set to the full range of contexts for all the words in the set. Crucially, we explore how several specific distributional variables enable learners to form a category of lexical items and generalize to novel words, yet also allow for exceptions that maintain lexical specificity. We suggest that learners are sensitive to the contexts of individual words, the overlaps among contexts across words, the non-overlap of contexts (or systematic gaps in information), and the size of the exposure set. We also ask how learners determine the category membership of a new word for which there is very sparse contextual information. We find that, when there are strong category cues and robust category learning of other words, adults readily generalize the distributional properties of the learned category to a new word that shares just one context with the other category members. However, as the distributional cues regarding the category become sparser and contain more consistent gaps, learners show more conservatism in generalizing distributional properties to the novel word. Taken together, these results show that learners are highly systematic in their use of the distributional properties of the input corpus, using them in a principled way to determine when to generalize and when to preserve lexical specificity. Copyright © 2012 Elsevier Inc. All rights reserved.

  5. Learning Additional Languages as Hierarchical Probabilistic Inference: Insights From First Language Processing.

    PubMed

    Pajak, Bozena; Fine, Alex B; Kleinschmidt, Dave F; Jaeger, T Florian

    2016-12-01

    We present a framework of second and additional language (L2/Ln) acquisition motivated by recent work on socio-indexical knowledge in first language (L1) processing. The distribution of linguistic categories covaries with socio-indexical variables (e.g., talker identity, gender, dialects). We summarize evidence that implicit probabilistic knowledge of this covariance is critical to L1 processing, and propose that L2/Ln learning uses the same type of socio-indexical information to probabilistically infer latent hierarchical structure over previously learned and new languages. This structure guides the acquisition of new languages based on their inferred place within that hierarchy, and is itself continuously revised based on new input from any language. This proposal unifies L1 processing and L2/Ln acquisition as probabilistic inference under uncertainty over socio-indexical structure. It also offers a new perspective on crosslinguistic influences during L2/Ln learning, accommodating gradient and continued transfer (both negative and positive) from previously learned to novel languages, and vice versa.

  6. Learning Additional Languages as Hierarchical Probabilistic Inference: Insights From First Language Processing

    PubMed Central

    Pajak, Bozena; Fine, Alex B.; Kleinschmidt, Dave F.; Jaeger, T. Florian

    2015-01-01

    We present a framework of second and additional language (L2/Ln) acquisition motivated by recent work on socio-indexical knowledge in first language (L1) processing. The distribution of linguistic categories covaries with socio-indexical variables (e.g., talker identity, gender, dialects). We summarize evidence that implicit probabilistic knowledge of this covariance is critical to L1 processing, and propose that L2/Ln learning uses the same type of socio-indexical information to probabilistically infer latent hierarchical structure over previously learned and new languages. This structure guides the acquisition of new languages based on their inferred place within that hierarchy, and is itself continuously revised based on new input from any language. This proposal unifies L1 processing and L2/Ln acquisition as probabilistic inference under uncertainty over socio-indexical structure. It also offers a new perspective on crosslinguistic influences during L2/Ln learning, accommodating gradient and continued transfer (both negative and positive) from previously learned to novel languages, and vice versa. PMID:28348442

  7. Language Learning of Gifted Individuals: A Content Analysis Study

    ERIC Educational Resources Information Center

    Gokaydin, Beria; Baglama, Basak; Uzunboylu, Huseyin

    2017-01-01

    This study aims to carry out a content analysis of the studies on language learning of gifted individuals and determine the trends in this field. Articles on language learning of gifted individuals published in the Scopus database were examined based on certain criteria including type of publication, year of publication, language, research…

  8. Web-Based Language Learning Perception and Personality Characteristics of University Students

    ERIC Educational Resources Information Center

    Mirzaee, Meisam; Gharibeh, Sajjad Gharibeh

    2016-01-01

    The significance of learners' personality in language learning/teaching contexts has often been cited in literature but few studies have scrutinized the role it can play in technology-oriented language classes. In modern language teaching/learning contexts, personality differences are important and should be taken into account. This study…

  9. Finding Important Terms for Patients in Their Electronic Health Records: A Learning-to-Rank Approach Using Expert Annotations

    PubMed Central

    Zheng, Jiaping; Yu, Hong

    2016-01-01

    Background: Many health organizations allow patients to access their own electronic health record (EHR) notes through online patient portals as a way to enhance patient-centered care. However, EHR notes are typically long and contain abundant medical jargon that can be difficult for patients to understand. In addition, many medical terms in patients’ notes are not directly related to their health care needs. One way to help patients better comprehend their own notes is to reduce information overload and help them focus on medical terms that matter most to them. Interventions can then be developed by giving them targeted education to improve their EHR comprehension and the quality of care. Objective: We aimed to develop a supervised natural language processing (NLP) system called Finding impOrtant medical Concepts most Useful to patientS (FOCUS) that automatically identifies and ranks medical terms in EHR notes based on their importance to the patients. Methods: First, we built an expert-annotated corpus. For each EHR note, 2 physicians independently identified medical terms important to the patient. Using the physicians’ agreement as the gold standard, we developed and evaluated FOCUS. FOCUS first identifies candidate terms from each EHR note using MetaMap and then ranks the terms using a support vector machine-based learn-to-rank algorithm. We explored rich learning features, including distributed word representation, Unified Medical Language System semantic type, topic features, and features derived from consumer health vocabulary. We compared FOCUS with 2 strong baseline NLP systems. Results: Physicians annotated 90 EHR notes and identified a mean of 9 (SD 5) important terms per note. The Cohen’s kappa annotation agreement was .51. The 10-fold cross-validation results show that FOCUS achieved an area under the receiver operating characteristic curve (AUC-ROC) of 0.940 for ranking candidate terms from EHR notes to identify important terms. When including term identification, the performance of FOCUS for identifying important terms from EHR notes was 0.866 AUC-ROC. Both performance scores significantly exceeded the corresponding baseline system scores (P<.001). Rich learning features contributed to FOCUS’s performance substantially. Conclusions: FOCUS can automatically rank terms from EHR notes based on their importance to patients. It may help develop future interventions that improve quality of care. PMID:27903489

  10. Finding Important Terms for Patients in Their Electronic Health Records: A Learning-to-Rank Approach Using Expert Annotations.

    PubMed

    Chen, Jinying; Zheng, Jiaping; Yu, Hong

    2016-11-30

    Many health organizations allow patients to access their own electronic health record (EHR) notes through online patient portals as a way to enhance patient-centered care. However, EHR notes are typically long and contain abundant medical jargon that can be difficult for patients to understand. In addition, many medical terms in patients' notes are not directly related to their health care needs. One way to help patients better comprehend their own notes is to reduce information overload and help them focus on medical terms that matter most to them. Interventions can then be developed by giving them targeted education to improve their EHR comprehension and the quality of care. We aimed to develop a supervised natural language processing (NLP) system called Finding impOrtant medical Concepts most Useful to patientS (FOCUS) that automatically identifies and ranks medical terms in EHR notes based on their importance to the patients. First, we built an expert-annotated corpus. For each EHR note, 2 physicians independently identified medical terms important to the patient. Using the physicians' agreement as the gold standard, we developed and evaluated FOCUS. FOCUS first identifies candidate terms from each EHR note using MetaMap and then ranks the terms using a support vector machine-based learn-to-rank algorithm. We explored rich learning features, including distributed word representation, Unified Medical Language System semantic type, topic features, and features derived from consumer health vocabulary. We compared FOCUS with 2 strong baseline NLP systems. Physicians annotated 90 EHR notes and identified a mean of 9 (SD 5) important terms per note. The Cohen's kappa annotation agreement was .51. The 10-fold cross-validation results show that FOCUS achieved an area under the receiver operating characteristic curve (AUC-ROC) of 0.940 for ranking candidate terms from EHR notes to identify important terms. When including term identification, the performance of FOCUS for identifying important terms from EHR notes was 0.866 AUC-ROC. Both performance scores significantly exceeded the corresponding baseline system scores (P<.001). Rich learning features contributed to FOCUS's performance substantially. FOCUS can automatically rank terms from EHR notes based on their importance to patients. It may help develop future interventions that improve quality of care. ©Jinying Chen, Jiaping Zheng, Hong Yu. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 30.11.2016.
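
    The abstract above reports inter-annotator agreement as Cohen's kappa. For readers unfamiliar with the statistic, the following is a minimal sketch of how kappa is computed for two annotators, using the standard definition kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is agreement expected by chance. The label lists are invented toy data, not the study's annotations, and the function name is an assumption for this example.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)

# Toy data (hypothetical): 1 = term marked important, 0 = not important.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
kappa = cohens_kappa(annotator_1, annotator_2)
```

    In this toy case the annotators agree on 8 of 10 items (p_o = 0.8) and chance agreement is 0.52, so kappa is 0.28 / 0.48, or roughly 0.58.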

  11. TwiMed: Twitter and PubMed Comparable Corpus of Drugs, Diseases, Symptoms, and Their Relations

    PubMed Central

    Miyao, Yusuke; Collier, Nigel

    2017-01-01

    Background: Work on pharmacovigilance systems using texts from PubMed and Twitter typically targets different elements and uses different annotation guidelines, resulting in a scenario where there is no comparable set of documents from both Twitter and PubMed annotated in the same manner. Objective: This study aimed to provide a comparable corpus of texts from PubMed and Twitter that can be used to study drug reports from these two sources of information, allowing researchers in the area of pharmacovigilance using natural language processing (NLP) to perform experiments to better understand the similarities and differences between drug reports in Twitter and PubMed. Methods: We produced a corpus comprising 1000 tweets and 1000 PubMed sentences selected using the same strategy and annotated at entity level by the same experts (pharmacists) using the same set of guidelines. Results: The resulting corpus, annotated by two pharmacists, comprises semantically correct annotations for a set of drugs, diseases, and symptoms. This corpus contains the annotations for 3144 entities, 2749 relations, and 5003 attributes. Conclusions: We present a corpus that is unique in its characteristics as this is the first corpus for pharmacovigilance curated from Twitter messages and PubMed sentences using the same data selection and annotation strategies. We believe this corpus will be of particular interest for researchers willing to compare results from pharmacovigilance systems (eg, classifiers and named entity recognition systems) when using data from Twitter and from PubMed. We hope that given the comprehensive set of drug names and the annotated entities and relations, this corpus becomes a standard resource to compare results from different pharmacovigilance studies in the area of NLP. PMID:28468748

  12. TwiMed: Twitter and PubMed Comparable Corpus of Drugs, Diseases, Symptoms, and Their Relations.

    PubMed

    Alvaro, Nestor; Miyao, Yusuke; Collier, Nigel

    2017-05-03

    Work on pharmacovigilance systems using texts from PubMed and Twitter typically targets different elements and uses different annotation guidelines, resulting in a scenario where there is no comparable set of documents from both Twitter and PubMed annotated in the same manner. This study aimed to provide a comparable corpus of texts from PubMed and Twitter that can be used to study drug reports from these two sources of information, allowing researchers in the area of pharmacovigilance using natural language processing (NLP) to perform experiments to better understand the similarities and differences between drug reports in Twitter and PubMed. We produced a corpus comprising 1000 tweets and 1000 PubMed sentences selected using the same strategy and annotated at entity level by the same experts (pharmacists) using the same set of guidelines. The resulting corpus, annotated by two pharmacists, comprises semantically correct annotations for a set of drugs, diseases, and symptoms. This corpus contains the annotations for 3144 entities, 2749 relations, and 5003 attributes. We present a corpus that is unique in its characteristics as this is the first corpus for pharmacovigilance curated from Twitter messages and PubMed sentences using the same data selection and annotation strategies. We believe this corpus will be of particular interest for researchers willing to compare results from pharmacovigilance systems (eg, classifiers and named entity recognition systems) when using data from Twitter and from PubMed. We hope that given the comprehensive set of drug names and the annotated entities and relations, this corpus becomes a standard resource to compare results from different pharmacovigilance studies in the area of NLP. ©Nestor Alvaro, Yusuke Miyao, Nigel Collier. Originally published in JMIR Public Health and Surveillance (http://publichealth.jmir.org), 03.05.2017.

  13. Selected Topics from LVCSR Research for Asian Languages at Tokyo Tech

    NASA Astrophysics Data System (ADS)

    Furui, Sadaoki

    This paper presents our recent work in regard to building Large Vocabulary Continuous Speech Recognition (LVCSR) systems for the Thai, Indonesian, and Chinese languages. For Thai, since there is no word boundary in the written form, we have proposed a new method for automatically creating word-like units from a text corpus, and applied topic and speaking style adaptation to the language model to recognize spoken-style utterances. For Indonesian, we have applied proper noun-specific adaptation to acoustic modeling, and rule-based English-to-Indonesian phoneme mapping to solve the problem of large variation in proper noun and English word pronunciation in a spoken-query information retrieval system. In spoken Chinese, long organization names are frequently abbreviated, and abbreviated utterances cannot be recognized if the abbreviations are not included in the dictionary. We have proposed a new method for automatically generating Chinese abbreviations, and by expanding the vocabulary using the generated abbreviations, we have significantly improved the performance of spoken query-based search.

  14. Language acquisition is model-based rather than model-free.

    PubMed

    Wang, Felix Hao; Mintz, Toben H

    2016-01-01

    Christiansen & Chater (C&C) propose that learning language is learning to process language. However, we believe that the general-purpose prediction mechanism they propose is insufficient to account for many phenomena in language acquisition. We argue from theoretical considerations and empirical evidence that many acquisition tasks are model-based, and that different acquisition tasks require different, specialized models.

  15. Concept annotation in the CRAFT corpus.

    PubMed

    Bada, Michael; Eckert, Miriam; Evans, Donald; Garcia, Kristin; Shipley, Krista; Sitnikov, Dmitry; Baumgartner, William A; Cohen, K Bretonnel; Verspoor, Karin; Blake, Judith A; Hunter, Lawrence E

    2012-07-09

    Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement. As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. 
    The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.

  16. Concept annotation in the CRAFT corpus

    PubMed Central

    2012-01-01

    Background: Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. Results: This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement. Conclusions: As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml. PMID:22776079

  17. Intonation and dialog context as constraints for speech recognition.

    PubMed

    Taylor, P; King, S; Isard, S; Wright, H

    1998-01-01

    This paper describes a way of using intonation and dialog context to improve the performance of an automatic speech recognition (ASR) system. Our experiments were run on the DCIEM Maptask corpus, a corpus of spontaneous task-oriented dialog speech. This corpus has been tagged according to a dialog analysis scheme that assigns each utterance to one of 12 "move types," such as "acknowledge," "query-yes/no" or "instruct." Most ASR systems use a bigram language model to constrain the possible sequences of words that might be recognized. Here we use a separate bigram language model for each move type. We show that when the "correct" move-specific language model is used for each utterance in the test set, the word error rate of the recognizer drops. Of course when the recognizer is run on previously unseen data, it cannot know in advance what move type the speaker has just produced. To determine the move type we use an intonation model combined with a dialog model that puts constraints on possible sequences of move types, as well as the speech recognizer likelihoods for the different move-specific models. In the full recognition system, the combination of automatic move type recognition with the move specific language models reduces the overall word error rate by a small but significant amount when compared with a baseline system that does not take intonation or dialog acts into account. Interestingly, the word error improvement is restricted to "initiating" move types, where word recognition is important. In "response" move types, where the important information is conveyed by the move type itself--for example, positive versus negative response--there is no word error improvement, but recognition of the response types themselves is good. The paper discusses the intonation model, the language models, and the dialog model in detail and describes the architecture in which they are combined.
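
    The idea of keeping one bigram language model per move type can be illustrated with a small Python sketch. It is not the authors' recognizer: the add-one smoothing, the shared vocabulary, the toy utterances, and the names BigramLM, train_move_models, and best_move_type are all assumptions made for illustration. It also picks a move type from language-model likelihood alone, whereas the abstract describes combining recognizer likelihoods with an intonation model and a dialog model.

```python
import math
from collections import Counter


class BigramLM:
    """Bigram model with add-one smoothing (smoothing choice is an assumption)."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size  # shared across all move-type models
        self.bigrams = Counter()      # counts of (previous word, word)
        self.contexts = Counter()     # counts of previous word

    def train(self, sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            self.bigrams[(prev, word)] += 1
            self.contexts[prev] += 1

    def logprob(self, sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        return sum(
            math.log((self.bigrams[(p, w)] + 1)
                     / (self.contexts[p] + self.vocab_size))
            for p, w in zip(tokens, tokens[1:])
        )


def train_move_models(tagged_utterances):
    """Train one bigram model per move type over a shared vocabulary."""
    vocab = {"</s>", "<unk>"}
    for _, text in tagged_utterances:
        vocab.update(text.split())
    models = {}
    for move_type, text in tagged_utterances:
        if move_type not in models:
            models[move_type] = BigramLM(len(vocab))
        models[move_type].train(text)
    return models


def best_move_type(models, text):
    """Pick the move type whose model gives the text the highest likelihood."""
    return max(models, key=lambda move: models[move].logprob(text))


# Toy utterances invented for illustration; not taken from the Maptask corpus.
data = [
    ("instruct", "go left at the bridge"),
    ("instruct", "go right at the lake"),
    ("acknowledge", "okay"),
    ("acknowledge", "okay right"),
    ("query-yes/no", "is it left of the lake"),
]
models = train_move_models(data)
```

    A shared vocabulary size is used in every model so that a move type with few training words does not receive inflated smoothed probabilities for unseen text.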

  18. Applications of Task-Based Learning in TESOL

    ERIC Educational Resources Information Center

    Shehadeh, Ali, Ed.; Coombe, Christine, Ed.

    2010-01-01

    Why are many teachers around the world moving toward task-based learning (TBL)? This shift is based on the strong belief that TBL facilitates second language acquisition and makes second language learning and teaching more principled and effective. Based on insights gained from using tasks as research tools, this volume shows how teachers can use…

  19. A Randomized Field Trial of the Fast ForWord Language Computer-Based Training Program

    ERIC Educational Resources Information Center

    Borman, Geoffrey D.; Benson, James G.; Overman, Laura

    2009-01-01

    This article describes an independent assessment of the Fast ForWord Language computer-based training program developed by Scientific Learning Corporation. Previous laboratory research involving children with language-based learning impairments showed strong effects on their abilities to recognize brief and fast sequences of nonspeech and speech…

  20. Historical Astrolexicography and Old Publications

    NASA Astrophysics Data System (ADS)

    Mahoney, Terry J.

    I describe how the principles of lexicography have been applied in limited ways in astronomy and look at the revision work under way for the third edition of the Oxford English Dictionary, which, when completed, will contain the widest and most detailed coverage of the astronomical lexicon in the English language. Finally, I argue the need for a dedicated historical dictionary of astronomy based rigorously on a corpus of quotations from sources published in English from the beginnings of written English to the present day.

  1. Oral Interaction in Task-Based EFL Learning: The Use of the L1 as a Cognitive Tool

    ERIC Educational Resources Information Center

    de la Colina, Ana Alegria; Mayo, Maria del Pilar Garcia

    2009-01-01

    The role of the first language (L1) in the learning of a second language (L2) has been widely studied as a source of cross-linguistic influence from the native system (Gass and Selinker, "Language Transfer in Language Learning," John Benjamins, 1992). Yet, this perspective provides no room for an understanding of language as a cognitive tool…

  2. The Effect of Corpus-Based Instruction on Pragmatic Routines

    ERIC Educational Resources Information Center

    Bardovi-Harlig, Kathleen; Mossman, Sabrina; Su, Yunwen

    2017-01-01

    This study compares the effect of using corpus-based materials and activities for the instruction of pragmatic routines under two conditions: implementing direct corpus searches by learners during classroom instruction and working with teacher-developed corpus-based materials. The outcome is compared to a repeated-test control group. Pragmatic…

  3. Nonlinguistic vocalizations from online amateur videos for emotion research: A validated corpus.

    PubMed

    Anikin, Andrey; Persson, Tomas

    2017-04-01

    This study introduces a corpus of 260 naturalistic human nonlinguistic vocalizations representing nine emotions: amusement, anger, disgust, effort, fear, joy, pain, pleasure, and sadness. The recognition accuracy in a rating task varied greatly per emotion, from <40% for joy and pain, to >70% for amusement, pleasure, fear, and sadness. In contrast, the raters' linguistic-cultural group had no effect on recognition accuracy: The predominantly English-language corpus was classified with similar accuracies by participants from Brazil, Russia, Sweden, and the UK/USA. Supervised random forest models classified the sounds as accurately as the human raters. The best acoustic predictors of emotion were pitch, harmonicity, and the spacing and regularity of syllables. This corpus of ecologically valid emotional vocalizations can be filtered to include only sounds with high recognition rates, in order to study reactions to emotional stimuli of known perceptual types (reception side), or can be used in its entirety to study the association between affective states and vocal expressions (production side).

  4. Digital Game-Based Learning: What's Literacy Got to Do With It?

    ERIC Educational Resources Information Center

    Spires, Hiller A.

    2015-01-01

    Just as literacy practices are contextualized in social situations and relationships, game players establish shared language and understandings within a game; in essence, they gain fluency in specialized languages. This commentary explores the importance of digital game-based learning for schooling, the relationship between game-based learning,…

  5. Leveraging Mobile Games for Place-Based Language Learning

    ERIC Educational Resources Information Center

    Holden, Christopher L.; Sykes, Julie M.

    2011-01-01

    This paper builds on the emerging body of research aimed at exploring the educational potential of mobile technologies, specifically, how to leverage place-based, augmented reality mobile games for language learning. Mentira is the first place-based, augmented reality mobile game for learning Spanish in a local neighborhood in the Southwestern…

  6. Output-Based Instruction, Learning Styles and Vocabulary Learning in the EFL Context of Iran

    ERIC Educational Resources Information Center

    Rastegar, Behnaz; Safari, Fatemeh

    2017-01-01

    Language learners' productive role in teaching and learning processes has recently been the focus of attention. Therefore, this study aimed at investigating the effect of oral vs. written output-based instruction on English as a foreign language (EFL) learners' vocabulary learning with a focus on reflective vs. impulsive learning styles. To this…

  7. Automatic recognition of topic-classified relations between prostate cancer and genes using MEDLINE abstracts

    PubMed Central

    Chun, Hong-Woo; Tsuruoka, Yoshimasa; Kim, Jin-Dong; Shiba, Rie; Nagata, Naoki; Hishiki, Teruyoshi; Tsujii, Jun'ichi

    2006-01-01

    Background: Automatic recognition of relations between a specific disease term and its relevant genes or protein terms is an important practice of bioinformatics. Considering the utility of the results of this approach, we identified prostate cancer and gene terms with the ID tags of public biomedical databases. Moreover, considering that genetics experts will use our results, we classified them based on six topics that can be used to analyze the type of prostate cancers, genes, and their relations. Methods: We developed a maximum entropy-based named entity recognizer and a relation recognizer and applied them to a corpus-based approach. We collected prostate cancer-related abstracts from MEDLINE, and constructed an annotated corpus of gene and prostate cancer relations based on six topics by biologists. We used it to train the maximum entropy-based named entity recognizer and relation recognizer. Results: Topic-classified relation recognition achieved 92.1% precision for the relation (an increase of 11.0% from that obtained in a baseline experiment). For all topics, the precision was between 67.6 and 88.1%. Conclusion: A series of experimental results revealed two important findings: a carefully designed relation recognition system using named entity recognition can improve the performance of relation recognition, and topic-classified relation recognition can be effectively addressed through a corpus-based approach using manual annotation and machine learning techniques. PMID:17134477

  8. Automatic recognition of topic-classified relations between prostate cancer and genes using MEDLINE abstracts.

    PubMed

    Chun, Hong-Woo; Tsuruoka, Yoshimasa; Kim, Jin-Dong; Shiba, Rie; Nagata, Naoki; Hishiki, Teruyoshi; Tsujii, Jun'ichi

    2006-11-24

    Automatic recognition of relations between a specific disease term and its relevant genes or protein terms is an important practice of bioinformatics. Considering the utility of the results of this approach, we identified prostate cancer and gene terms with the ID tags of public biomedical databases. Moreover, considering that genetics experts will use our results, we classified them based on six topics that can be used to analyze the type of prostate cancers, genes, and their relations. We developed a maximum entropy-based named entity recognizer and a relation recognizer and applied them to a corpus-based approach. We collected prostate cancer-related abstracts from MEDLINE, and constructed an annotated corpus of gene and prostate cancer relations based on six topics by biologists. We used it to train the maximum entropy-based named entity recognizer and relation recognizer. Topic-classified relation recognition achieved 92.1% precision for the relation (an increase of 11.0% from that obtained in a baseline experiment). For all topics, the precision was between 67.6 and 88.1%. A series of experimental results revealed two important findings: a carefully designed relation recognition system using named entity recognition can improve the performance of relation recognition, and topic-classified relation recognition can be effectively addressed through a corpus-based approach using manual annotation and machine learning techniques.

  9. Utilizing Descriptive Statements from the Biodiversity Heritage Library to Expand the Hymenoptera Anatomy Ontology

    PubMed Central

    Seltmann, Katja C.; Pénzes, Zsolt; Yoder, Matthew J.; Bertone, Matthew A.; Deans, Andrew R.

    2013-01-01

    Hymenoptera, the insect order that includes sawflies, bees, wasps, and ants, exhibits an incredible diversity of phenotypes, with over 145,000 species described in a corpus of textual knowledge since Carolus Linnaeus. In the absence of specialized training, often spanning decades, however, these articles can be challenging to decipher. Much of the vocabulary is domain-specific (e.g., Hymenoptera biology), historically without a comprehensive glossary, and contains much homonymous and synonymous terminology. The Hymenoptera Anatomy Ontology was developed to surmount this challenge and to aid future communication related to hymenopteran anatomy, as well as provide support for domain experts so they may actively benefit from the anatomy ontology development. As part of HAO development, an active learning, dictionary-based, natural language recognition tool was implemented to facilitate Hymenoptera anatomy term discovery in literature. We present this tool, referred to as the ‘Proofer’, as part of an iterative approach to growing phenotype-relevant ontologies, regardless of domain. The process of ontology development results in a critical mass of terms that is applied as a filter to the source collection of articles in order to reveal term occurrence and biases in natural language species descriptions. Our results indicate that taxonomists use domain-specific terminology that follows taxonomic specialization, particularly at superfamily and family level groupings and that the developed Proofer tool is effective for term discovery, facilitating ontology construction. PMID:23441153
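
    Dictionary-based term recognition of the general kind mentioned above can be sketched in a few lines of Python. This is not the Proofer tool, whose implementation is not described here: the greedy longest-match strategy, the function name find_terms, the max_len parameter, and the toy dictionary and text are assumptions for illustration only.

```python
import re
from collections import Counter


def find_terms(text, dictionary, max_len=3):
    """Count dictionary terms in text using greedy longest-match lookup.

    `dictionary` maps a lowercase surface form (possibly multiword) to a
    preferred label, so synonyms are counted under a single term.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = Counter()
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in dictionary:
                hits[dictionary[candidate]] += 1
                i += n
                break
        else:
            i += 1  # no dictionary term starts at this token
    return hits


# Toy dictionary and description, invented for illustration.
dictionary = {
    "fore wing": "fore wing",
    "forewing": "fore wing",
    "mesosoma": "mesosoma",
    "alitrunk": "mesosoma",
}
text = ("Forewing hyaline; alitrunk smooth. "
        "Fore wing with two cells, mesosoma punctate.")
hits = find_terms(text, dictionary)
```

    Running the same lookup over a whole collection of descriptions and grouping the counts by taxon would give a rough picture of term occurrence, which is the kind of filtering step the abstract describes.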

  10. Utilizing descriptive statements from the biodiversity heritage library to expand the Hymenoptera Anatomy Ontology.

    PubMed

    Seltmann, Katja C; Pénzes, Zsolt; Yoder, Matthew J; Bertone, Matthew A; Deans, Andrew R

    2013-01-01

    Hymenoptera, the insect order that includes sawflies, bees, wasps, and ants, exhibits an incredible diversity of phenotypes, with over 145,000 species described in a corpus of textual knowledge since Carolus Linnaeus. In the absence of specialized training, often spanning decades, however, these articles can be challenging to decipher. Much of the vocabulary is domain-specific (e.g., Hymenoptera biology), historically without a comprehensive glossary, and contains much homonymous and synonymous terminology. The Hymenoptera Anatomy Ontology was developed to surmount this challenge and to aid future communication related to hymenopteran anatomy, as well as provide support for domain experts so they may actively benefit from the anatomy ontology development. As part of HAO development, an active learning, dictionary-based, natural language recognition tool was implemented to facilitate Hymenoptera anatomy term discovery in literature. We present this tool, referred to as the 'Proofer', as part of an iterative approach to growing phenotype-relevant ontologies, regardless of domain. The process of ontology development results in a critical mass of terms that is applied as a filter to the source collection of articles in order to reveal term occurrence and biases in natural language species descriptions. Our results indicate that taxonomists use domain-specific terminology that follows taxonomic specialization, particularly at superfamily and family level groupings and that the developed Proofer tool is effective for term discovery, facilitating ontology construction.

  11. Acquisition of locative utterances in Norwegian: structure-building via lexical learning.

    PubMed

    Mitrofanova, Natalia; Westergaard, Marit

    2018-03-15

    This paper focuses on the acquisition of locative prepositional phrases in L1 Norwegian. We report on two production experiments with children acquiring Norwegian as their first language and compare the results to similar experiments conducted with Russian children. The results of the experiments show that Norwegian children at age 2 regularly produce locative utterances lacking overt prepositions, with the rate of preposition omission decreasing significantly by age 3. Furthermore, our results suggest that phonologically strong and semantically unambiguous locative items appear earlier in Norwegian children's utterances than their phonologically weak and semantically ambiguous counterparts. This conclusion is confirmed by a corpus study. We argue that our results are best captured by the Underspecified P Hypothesis (UPH; Mitrofanova, 2017), which assumes that, at early stages of grammatical development, the underlying structure of locative utterances is underspecified, with more complex functional representations emerging gradually based on the input. This approach predicts that the rate of acquisition in the domain of locative PPs should be influenced by the lexical properties of individual language-specific grammatical elements (such as frequency, morphological complexity, phonological salience, or semantic ambiguity). Our data from child Norwegian show that this prediction is borne out. Specifically, the results of our study suggest that phonologically more salient and semantically unambiguous items are mastered earlier than their ambiguous and phonologically less salient counterparts, despite the higher frequency of the latter in the input (Clahsen et al., 1996).

  12. The "Anchor" Method: Principle and Practice.

    ERIC Educational Resources Information Center

    Selgin, Paul

    This report discusses the "anchor" language learning method that is based upon derivation rather than construction, using Italian as an example of a language to be learned. This method borrows from the natural process of language learning as it asks the student to remember whole expressions that serve as vehicles for learning both words…

  13. The Genre of Instructor Feedback in Doctoral Programs: A Corpus Linguistic Analysis

    ERIC Educational Resources Information Center

    Walters, Kelley Jo; Henry, Patricia; Vinella, Michael; Wells, Steve; Shaw, Melanie; Miller, James

    2015-01-01

    Providing transparent written feedback to doctoral students is essential to the learning process and preparation for the capstone. The purpose of this study was to conduct a qualitative exploration of faculty feedback on benchmark written assignments across multiple, online doctoral programs. The Corpus for this analysis included 236 doctoral…

  14. Lexical Analysis of the Verb "COOK" and Learning Vocabulary: A Corpus Study

    ERIC Educational Resources Information Center

    Priyono

    2011-01-01

    English verbs have built-in properties that determine how they behave syntactically and generate the associated meaning. With these inherent properties, some verbs can occur only in certain syntactic structures and others only in different ones. The observation of the verb "COOK" using an English corpus has revealed its lexical properties…

  15. Motivation, students' needs and learning outcomes: a hybrid game-based app for enhanced language learning.

    PubMed

    Berns, Anke; Isla-Montes, José-Luis; Palomo-Duarte, Manuel; Dodero, Juan-Manuel

    2016-01-01

    In the context of European Higher Education, students face an increasing focus on independent, individual learning, at the expense of face-to-face interaction. Hence learners are, all too often, not provided with enough opportunities to negotiate in the target language. The current case study aims to address this reality by going beyond conventional approaches to provide students with a hybrid game-based app, combining individual and collaborative learning opportunities. The 4-week study was carried out with 104 German language students (A1.2 CEFR) who had previously been enrolled in a first-semester A1.1 level course at a Spanish university. The VocabTrainerA1 app, designed specifically for this study, harnesses the synergy of combining individual learning tasks and a collaborative murder mystery game in a hybrid level-based architecture. By doing so, the app provides learners with opportunities to apply their language skills to real-life-like communication. The purpose of the study was twofold: on one hand we aimed to measure learner motivation, perceived usefulness and added value of hybrid game-based apps; on the other, we sought to determine their impact on language learning. To this end, we conducted focus group interviews and an anonymous Technology Acceptance Model survey (TAM). In addition, students took a pre-test and a post-test. Scores from both tests were compared with the results obtained in first-semester conventional writing tasks, with a view to measuring learning outcomes. The study provides qualitative and quantitative data supporting our initial hypotheses. Our findings suggest that hybrid game-based apps like VocabTrainerA1, which seamlessly combine individual and collaborative learning tasks, motivate learners, stimulate perceived usefulness and added value, and better meet the language learning needs of today's digital natives. In terms of acceptance, outcomes and sustainability, the data indicate that hybrid game-based apps significantly improve proficiency and hence are indeed effective tools for enhanced language learning.

  16. Personality Traits as Predictors of the Social English Language Learning Strategies

    ERIC Educational Resources Information Center

    Fazeli, Seyed Hossein

    2012-01-01

    The present study aims to find out the role of personality traits in predicting the use of the Social English Language Learning Strategies (SELLSs) for learners of English as a foreign language. Four instruments were used, which were Adapted Inventory for Social English Language Learning Strategies based on Social category of Strategy Inventory…

  17. The Impact of Personality Traits on the Affective Category of English Language Learning Strategies

    ERIC Educational Resources Information Center

    Fazeli, Seyed Hossein

    2011-01-01

    The present study aims at discovering the impact of personality traits in predicting the use of the Affective English Language Learning Strategies (AELLSs) for learners of English as a foreign language. Four instruments were used, which were Adapted Inventory for Affective English Language Learning Strategies based on Affective category of…

  18. Web-Based English Language Learning

    ERIC Educational Resources Information Center

    Sarica, Gulcin Nagehan; Cavus, Nadire

    2008-01-01

    Knowledge of another language is an advantage, and it lets people look at the world, and in particular at the world's cultures, with a broader perspective. Learning English as a second language is the process by which students learn it in addition to their native language. Today, the internet is an important part of our lives, as is English. For this…

  19. A European Languages Virtual Network Proposal

    NASA Astrophysics Data System (ADS)

    García-Peñalvo, Francisco José; González-González, Juan Carlos; Murray, Maria

    ELVIN (European Languages Virtual Network) is a European Union (EU) Lifelong Learning Programme Project aimed at creating an informal social network to support and facilitate language learning. The ELVIN project aims to research and develop the connection between social networks, professional profiles and language learning in an informal educational context. At the core of the ELVIN project, there will be a web 2.0 social networking platform that connects employees/students for language practice based on their own professional/academic needs and abilities, using all relevant technologies. The ELVIN remit involves the examination of both methodological and technological issues inherent in achieving a social-based learning platform that provides the user with their own customized Personal Learning Environment for EU language acquisition. ELVIN started in November 2009 and this paper presents the project aims and objectives as well as the development and implementation of the web platform.

  20. A rights-based approach to science literacy using local languages: Contextualising inquiry-based learning in Africa

    NASA Astrophysics Data System (ADS)

    Babaci-Wilhite, Zehlia

    2017-06-01

    This article addresses the importance of teaching and learning science in local languages. The author argues that acknowledging local knowledge and using local languages in science education while emphasising inquiry-based learning improve teaching and learning science. She frames her arguments with the theory of inquiry, which draws on perspectives of both dominant and non-dominant cultures with a focus on science literacy as a human right. She first examines key assumptions about knowledge which inform mainstream educational research and practice. She then argues for an emphasis on contextualised learning as a right in education. This means accounting for contextualised knowledge and resisting the current trend towards de-contextualisation of curricula. This trend is reflected in Zanzibar's recent curriculum reform, in which English replaced Kiswahili as the language of instruction (LOI) in the last two years of primary school. The author's own research during the initial stage of the change (2010-2015) revealed that the effect has in fact proven to be counterproductive, with educational quality deteriorating further rather than improving. Arguing that language is essential to inquiry-based learning, she introduces a new didactic model which integrates alternative assumptions about the value of local knowledge and local languages in the teaching and learning of science subjects. In practical terms, the model is designed to address key science concepts through multiple modalities - "do it, say it, read it, write it" - a "hands-on" experiential combination which, she posits, may form a new platform for innovation based on a unique mix of local and global knowledge, and facilitate genuine science literacy. She provides examples from cutting-edge educational research and practice that illustrate this new model of teaching and learning science. 
    This model has the potential to improve learning while supporting local languages and culture, giving local languages their rightful place in all aspects of education.

  1. Effects of Locus of Control and Learner-Control on Web-Based Language Learning

    ERIC Educational Resources Information Center

    Chang, Mei-Mei; Ho, Chiung-Mei

    2009-01-01

    The study explored the effects of students' locus of control and types of control over instruction on their self-efficacy and performance in a web-based language learning environment. A web-based interactive instructional program focusing on the comprehension of news articles for English language learners was developed in two versions: learner-…

  2. To What Extent Do Native and Non-Native Writers Make Use of Collocations?

    ERIC Educational Resources Information Center

    Durrant, Philip; Schmitt, Norbert

    2009-01-01

    Usage-based models claim that first language learning is based on the frequency-based analysis of memorised phrases. It is not clear though, whether adult second language learning works in the same way. It has been claimed that non-native language lacks idiomatic formulas, suggesting that learners neglect phrases, focusing instead on orthographic…

  3. A FAQ-Based e-Learning Environment to Support Japanese Language Learning

    ERIC Educational Resources Information Center

    Liu, Yuqin; Yin, Chengjiu; Ogata, Hiroaki; Qiao, Guojun; Yano, Yoneo

    2011-01-01

    In traditional classes, having many questions from learners is important because these questions indicate difficult points for learners and for teachers. This paper proposes a FAQ-based e-Learning environment to support Japanese language learning that focuses on learner questions. This knowledge sharing system enables learners to interact and…

  4. A Study on the Developmental Characteristics of Adverbial Conjuncts by Chinese Non-English Majors

    ERIC Educational Resources Information Center

    Junmei, Jiang

    2015-01-01

    Based on the Chinese Learner English Corpus, the present study seeks to investigate the developmental characteristics of the use of adverbial conjuncts. The results show that at different learning stages non-English majors use all kinds of adverbial conjuncts, but their occurrence frequencies are quite different; the enumerative adverbials are…

  5. The LIS Corpus Project: A Discussion of Sociolinguistic Variation in the Lexicon

    ERIC Educational Resources Information Center

    Geraci, Carlo; Battaglia, Katia; Cardinaletti, Anna; Cecchetto, Carlo; Donati, Caterina; Giudice, Serena; Mereghetti, Emiliano

    2011-01-01

    Following a well-established tradition going back to the 1980s (cf. Volterra 1987/2004), the authors use the name Lingua dei Segni Italiana (Italian Sign Language [LIS]) for the language used by Italian deaf people (and by Swiss deaf people living in the Ticino canton). LIS is becoming more and more visible, and its status as a minority language…

  6. Linguistic Models, Acquisition Theories, and Learner Corpora: Morphological Productivity in SLA Research Exemplified by Complex Verbs in German

    ERIC Educational Resources Information Center

    Lüdeling, Anke; Hirschmann, Hagen; Shadrova, Anna

    2017-01-01

    The present study analyzes morphological productivity for complex verbs in second language acquisition by analyzing a corpus of German as a Foreign Language (GFL). It shows that advanced learners of GFL use prefix and particle verbs relatively frequently and productively but less so than native speakers do and discusses these findings in the light…

  7. The Effectiveness of the Game-Based Learning System for the Improvement of American Sign Language Using Kinect

    ERIC Educational Resources Information Center

    Kamnardsiri, Teerawat; Hongsit, Ler-on; Khuwuthyakorn, Pattaraporn; Wongta, Noppon

    2017-01-01

    This paper investigated students' achievement for learning American Sign Language (ASL), using two different methods. There were two groups of samples. The first experimental group (Group A) was the game-based learning for ASL, using Kinect. The second control learning group (Group B) was the traditional face-to-face learning method, generally…

  8. C[superscript 4] (C Quad): Development of the Application for Language Learning Based on Social and Cognitive Presences

    ERIC Educational Resources Information Center

    Yamada, Masanori; Goda, Yoshiko; Matsukawa, Hideya; Hata, Kojiro; Yasunami, Seisuke

    2013-01-01

    This research aims to develop collaborative language learning systems based on social and cognitive presence for learning settings out of class, and evaluate their effects on learning attitude and performance. The main purpose of this system is focusing on the building of a learning community, therefore the Community of Inquiry (CoI) framework…

  9. Construction and Evaluation of an Integrated Formal/Informal Learning Environment for Foreign Language Learning across Real and Virtual Spaces

    ERIC Educational Resources Information Center

    Waragai, Ikumi; Ohta, Tatsuya; Kurabayashi, Shuichi; Kiyoki, Yasushi; Sato, Yukiko; Brückner, Stefan

    2017-01-01

    This paper presents the prototype of a foreign language learning space, based on the construction of an integrated formal/informal learning environment. Against the background of the continued innovation of information technology that places conventional learning styles and educational methods into new contexts based on new value-standards,…

  10. Second Language Learning; Investigating the Classroom Context.

    ERIC Educational Resources Information Center

    Mitchell, Rosamond

    1989-01-01

    Reviews a number of second-language (L2) classroom-based research projects undertaken at the University of Stirling in Scotland. It is argued that a full understanding of classroom-based L2 learning requires the integration of sociolinguistic studies of the classroom context with psycholinguistic studies of second language acquisition. (Author/VWL)

  11. Students' Motivation toward Computer-Based Language Learning

    ERIC Educational Resources Information Center

    Genc, Gulten; Aydin, Selami

    2011-01-01

    The present article examined some factors affecting the motivation level of the preparatory school students in using a web-based computer-assisted language-learning course. The sample group of the study consisted of 126 English-as-a-foreign-language learners at a preparatory school of a state university. After performing statistical analyses…

  12. Tracking Immanent Language Learning Behavior Over Time in Task-Based Classroom Work

    ERIC Educational Resources Information Center

    Kunitz, Silvia; Marian, Klara Skogmyr

    2017-01-01

    In this study, the authors explore how classroom tasks that are commonly used in task-based language teaching (TBLT) are achieved as observable aspects of "local educational order" (Hester & Francis, 2000) through observable and immanently social classroom behaviors. They focus specifically on students' language learning behaviors,…

  13. Digital Game-Based Language Learning in Foreign Language Teacher Education

    ERIC Educational Resources Information Center

    Alyaz, Yunus; Genc, Zubeyde Sinem

    2016-01-01

    New technologies including digital game-based language learning have increasingly received attention. However, their implementation is far from expected and desired levels due to technical, instructional, financial and sociological barriers. Previous studies suggest that there is a strong need to establish courses in order to support adaptation of…

  14. Impact of Contextuality on Mobile Learning Acceptance: An Empirical Study Based on a Language Learning App

    ERIC Educational Resources Information Center

    Böhm, Stephan; Constantine, Georges Philip

    2016-01-01

    Purpose: This paper aims to focus on contextualized features for mobile language learning apps. The scope of this paper is to explore students' perceptions of contextualized mobile language learning. Design/Methodology/Approach: An extended Technology Acceptance Model was developed to analyze the effect of contextual app features on students'…

  15. Effect of Phonetic Association on Learning Vocabulary in Foreign Language

    ERIC Educational Resources Information Center

    Bozavli, Ebubekir

    2017-01-01

    Word is one of the most important components of a natural language. Speech is meaningful because of the meanings of words. Vocabulary acquired in one's mother tongue is learned consciously in a foreign language in non-native settings. Learning vocabulary in a system based on grammar is generally neglected or learned in conventional ways. This…

  16. [A study on English loan words in French plastic surgery].

    PubMed

    Hansson, E; Tegelberg, E

    2014-10-01

    The French language is less and less used as an international scientific language and many French researchers publish their work in English. Nowadays, Annales de Chirurgie Plastique Esthétique is the only international plastic surgical journal published completely in French. The use of English loan words in French plastic surgery has never been studied. The aim of this study was to describe the frequency and types of English loan words in French plastic surgery. A corpus consisting of all the articles in a number of Annales de Chirurgie Plastique Esthétique, chosen by default, was created. The frequency of English loan words was calculated and the types of words were analysed. The corpus contains 367 (0.8%) English loan words. Most of them are non-integrated loan words and calques. The majority of the plastic surgical loan words describe surgical techniques. The French plastic surgical language seems to be influenced by English. The usage of loan words does not always follow the recommendations and the usage is sometimes ambiguous. Copyright © 2014 Elsevier Masson SAS. All rights reserved.

  17. Individualized Teaching and Autonomous Learning: Developing EFL Learners' CLA in a Web-Based Language Skills Training System

    ERIC Educational Resources Information Center

    Lu, Zhihong; Wen, Fuan; Li, Ping

    2012-01-01

    Teaching listening and speaking in English in China has been given top priority on the post-secondary level. This has led to the question of how learners develop communicative language ability (CLA) effectively in computer-assisted language learning (CALL) environments. The authors demonstrate a self-developed language skill learning system with…

  18. Language comprehension in nonspeaking children with severe cerebral palsy: Neuroanatomical substrate?

    PubMed

    Geytenbeek, Joke J; Oostrom, Kim J; Harlaar, Laurike; Becher, Jules G; Knol, Dirk L; Barkhof, Frederik; Pinto, Pedro S; Vermeulen, R Jeroen

    2015-09-01

    To identify relations between brain abnormalities and spoken language comprehension, MRI characteristics of 80 nonspeaking children with severe CP were examined. MRI scans were analysed for patterns of brain abnormalities and scored for specific MRI measures: white matter (WM) areas; size of lateral ventricles, WM abnormality/reduction, cysts, subarachnoid space, corpus callosum thinning and grey matter (GM) areas; cortical GM abnormalities, thalamus, putamen, globus pallidus and nucleus caudatus and cerebellar abnormalities. Language comprehension was assessed with a new validated instrument (C-BiLLT). MRI scans of 35 children were classified as a basal ganglia necrosis (BGN) pattern, with damage to central GM areas; in 60% of these children damage to WM areas was also found. MRI scans of 13 children were classified as periventricular leukomalacia (PVL) with little concomitant damage to central GM areas, 13 as malformations and 19 as miscellaneous. Language comprehension was best in children with BGN, followed by malformations and miscellaneous, and was poorest in PVL. Linear regression modelling per pattern group (malformations excluded), with MRI measures as independent variables, revealed that corpus callosum thinning in BGN and parieto-occipital WM reduction in PVL were the most important explanatory factors for poor language comprehension. No MRI measures explained outcomes in language comprehension in the miscellaneous group. Comprehension of spoken language differs between MRI patterns of severe CP. In children with BGN and PVL, differences in language comprehension performance are attributed to damage in the WM areas. Language comprehension was most affected in children with WM lesions in the subcortical and then periventricular areas, most characteristic for children with PVL. Copyright © 2015 European Paediatric Neurology Society. Published by Elsevier Ltd. All rights reserved.

  19. Informatics technology mimics ecology: dense, mutualistic collaboration networks are associated with higher publication rates.

    PubMed

    Sorani, Marco D

    2012-01-01

    Information technology (IT) adoption enables biomedical research. Publications are an accepted measure of research output, and network models can describe the collaborative nature of publication. In particular, ecological networks can serve as analogies for publication and technology adoption. We constructed network models of adoption of bioinformatics programming languages and health IT (HIT) from the literature. We selected seven programming languages and four types of HIT. We performed PubMed searches to identify publications since 2001. We calculated summary statistics and analyzed spatiotemporal relationships. Then, we assessed ecological models of specialization, cooperativity, competition, evolution, biodiversity, and stability associated with publications. Adoption of HIT has been variable, while scripting languages have experienced rapid adoption. Hospital systems had the largest HIT research corpus, while Perl had the largest language corpus. Scripting languages represented the largest connected network components. The relationship between edges and nodes was linear, though Bioconductor had more edges than expected and Perl had fewer. Spatiotemporal relationships were weak. Most languages shared a bioinformatics specialization and appeared mutualistic or competitive. HIT specializations varied. Specialization was highest for Bioconductor and radiology systems. Specialization and cooperativity were positively correlated among languages but negatively correlated among HIT. Rates of language evolution were similar. Biodiversity among languages grew in the first half of the decade and stabilized, while diversity among HIT was variable but flat. Compared with publications in 2001, correlation with publications one year later was positive while correlation after ten years was weak and negative. Adoption of new technologies can be unpredictable. Spatiotemporal relationships facilitate adoption but are not sufficient. As with ecosystems, dense, mutualistic, specialized co-habitation is associated with faster growth. There are rapidly changing trends in external technological and macroeconomic influences. We propose that a better understanding of how technologies are adopted can facilitate their development.

  20. Semi-Supervised Recurrent Neural Network for Adverse Drug Reaction mention extraction.

    PubMed

    Gupta, Shashank; Pawar, Sachin; Ramrakhiyani, Nitin; Palshikar, Girish Keshav; Varma, Vasudeva

    2018-06-13

    Social media is a useful platform to share health-related information due to its vast reach. This makes it a good candidate for public-health monitoring tasks, specifically for pharmacovigilance. We study the problem of extraction of Adverse-Drug-Reaction (ADR) mentions from social media, particularly from Twitter. Medical information extraction from social media is challenging, mainly due to the short and highly informal nature of the text, as compared to more technical and formal medical reports. Current methods in ADR mention extraction rely on supervised learning methods, which suffer from the labeled data scarcity problem. The state-of-the-art method uses deep neural networks, specifically a class of Recurrent Neural Network (RNN), the Long Short-Term Memory (LSTM) network. Deep neural networks, due to their large number of free parameters, rely heavily on large annotated corpora for learning the end task. But in the real world, it is hard to get large labeled data, mainly due to the heavy cost associated with manual annotation. To this end, we propose a novel semi-supervised learning based RNN model, which can leverage the unlabeled data present in abundance on social media. Through experiments we demonstrate the effectiveness of our method, achieving state-of-the-art performance in ADR mention extraction. In this study, we tackle the problem of labeled data scarcity for Adverse Drug Reaction mention extraction from social media and propose a novel semi-supervised learning based method which can leverage the large unlabeled corpus available in abundance on the web. Through empirical study, we demonstrate that our proposed method outperforms a fully supervised learning based baseline, which relies on a large manually annotated corpus for good performance.
